Breakpoints and Halting in Distributed Systems
Presented by
Abhishek Saxena
CS 739 Distributed Systems
Spring 2002
References
• Detecting Relational Global Predicates in Distributed Systems by Alexander I. Tomlinson and Vijay K. Garg, 1993
• Breakpoints and Halting in Distributed Programs by Barton P. Miller and Jong-Deok Choi, 1992
• Restoring Consistent Global States of Distributed Computations by Goldberg et al., 1991
Presentation Layout
• Introduction
• Motivation
• Halting in Distributed Systems
• Detecting Breakpoints for:
  – Conjunctive/Disjunctive/Linked Predicates
  – Relational Predicates
• Applications to Research
• Relevance to papers read
• Conclusions
Introduction
• General problems of:
  – Halting distributed programs
  – Detecting breakpoints
  – Validating resource conflicts
  – Recording, restoration and replay of program sequences
Motivation
• Why halt?
  – Interactive debugging
  – Issues in distributed systems:
    • No single global notion of time
    • Unpredictable communication delays
    • How can a command be issued to all processes at the same instant?
    • How can a command reach all processes simultaneously?
Halting
• Two pertinent questions:
  – How to halt a distributed program?
    • Halting Algorithm
  – When to halt?
    • Breakpoint Detection
Halting Algorithm
• Extends Chandy & Lamport’s algorithm
• Sending rule:
  – Increment last_halt_id
  – Send a halt marker containing this value on all outgoing channels
• Receiving rule:
  – Compare the marker's halt_id with the local last_halt_id & update it
  – Send halt markers in the same way as the sender
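The sending and receiving rules can be sketched in Python as follows. This is a minimal single-threaded simulation under stated assumptions, not the algorithm as published: the `Process` class, the `run_halt` helper and the FIFO `deque` standing in for the channels are hypothetical names introduced for illustration. Only `last_halt_id` and the halt marker come from the slide, and the "newer marker wins" comparison is one reading of the receiving rule.

```python
from collections import deque


class Process:
    """One process in the simulated system (hypothetical helper class)."""

    def __init__(self, name):
        self.name = name
        self.last_halt_id = 0   # highest halt id seen so far
        self.halted = False
        self.outgoing = []      # processes reachable over outgoing channels

    def initiate_halt(self, network):
        # Sending rule: increment last_halt_id, then send markers.
        self.last_halt_id += 1
        self.halted = True
        self._send_markers(network)

    def receive_marker(self, halt_id, network):
        # Receiving rule: act only on a marker newer than the last one seen.
        if halt_id > self.last_halt_id:
            self.last_halt_id = halt_id
            self.halted = True
            self._send_markers(network)

    def _send_markers(self, network):
        for dest in self.outgoing:
            network.append((dest, self.last_halt_id))


def run_halt(initiator):
    """Deliver halt markers until none are in flight."""
    network = deque()
    initiator.initiate_halt(network)
    while network:
        dest, halt_id = network.popleft()
        dest.receive_marker(halt_id, network)


p, q, r, s = (Process(n) for n in "PQRS")
p.outgoing = [q, r]
q.outgoing = [p]
r.outgoing = [q]
s.outgoing = [p]        # S can send to P but has no incoming channel
run_halt(p)
print([(x.name, x.halted) for x in (p, q, r, s)])
```

In this toy topology P, Q and R halt, while S keeps running because no channel carries a marker to it.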
The Halting Algorithm
[Figure: sending process P, receiving process Q, and processes R, S, T and U, with halt markers propagating along the channels between them.]
The Halting Algorithm
• Intuitive extension to Chandy & Lamport’s Algorithm [1]
• Leads to a globally consistent state since:
  – Process states are the same as the recorded process states in [1]
  – Undelivered messages are the same as the recorded channel states in [1]
Problems with this Algorithm
• Processes that interact infrequently with the rest of the computation
  – Long halting time
• Acyclic network connections
  – Example: producer P sends to consumer Q over a one-way communication channel, so a halt marker sent by Q can never reach P
A Solution…
• Centralized debugger process:
[Figure: debugger process d connected to processes p and q.]
Problems with this solution
• Communication overheads
• Possible change in execution of program
• Complex to build
Detecting Breakpoints
• Breakpoints & Predicates
• Predicate satisfaction = breakpoint detection
• A system of distributed processes needs:
  – Simple predicates
  – Disjunctive predicates
  – Linked predicates…interesting!
  – Conjunctive predicates…very interesting!
Simple Predicates
• Encapsulate single process behavior
• Detect simple events:
  – Entered procedure
  – Message sent / received
  – Channel created / destroyed
  – Process created / destroyed
Disjunctive predicates
• Form:
DP ::= SP [ ∪ SP ]*
• Satisfied when any SP is satisfied
• Initiate halting when DP is true
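As a rough illustration, a disjunctive predicate can be modelled as an "any of" over simple predicates. The event records and the names `make_disjunctive`, `entered_foo` and `message_sent` below are hypothetical; the slides do not prescribe a representation.

```python
def make_disjunctive(simple_predicates):
    """Return a DP that holds when any of its simple predicates holds."""
    def dp(event):
        return any(sp(event) for sp in simple_predicates)
    return dp


# Hypothetical simple predicates over event records.
entered_foo = lambda e: e["kind"] == "enter" and e["name"] == "foo"
message_sent = lambda e: e["kind"] == "send"

dp = make_disjunctive([entered_foo, message_sent])
events = [
    {"kind": "recv", "name": "m1"},
    {"kind": "send", "name": "m2"},
]
# The first event that satisfies the DP is where halting would be initiated.
first_hit = next((e for e in events if dp(e)), None)
print(first_hit)
```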
Linked Predicates
• Specify sequences of events
• Form:
LP ::= DP [ -> DP ]*
• Debugger process sends the LP {DP1 -> ...} to the processes involved in DP1
• When DP1 is satisfied, strip off DP1 & send the remaining LP to the processes involved in DP2
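The strip-and-forward idea can be sketched as below. This is a simplification under stated assumptions: a single `LinkedPredicate` object (a hypothetical name) watches one merged event trace, whereas the slide forwards the stripped LP between the processes involved in each DP. Events are plain strings chosen for illustration.

```python
class LinkedPredicate:
    """LP ::= DP [ -> DP ]*, held as a list of DPs still to be satisfied."""

    def __init__(self, dps):
        self.remaining = list(dps)
        self.halting = False

    def observe(self, event):
        """Feed one event; return True once the whole LP has been satisfied."""
        if self.remaining and self.remaining[0](event):
            self.remaining.pop(0)          # strip off the satisfied DP
            if not self.remaining:
                self.halting = True        # last DP satisfied: start halting
        return self.halting


# Hypothetical DPs: DP1 is a send at P, DP2 is a receive at Q or R.
dp1 = lambda e: e == "P:send"
dp2 = lambda e: e in ("Q:recv", "R:recv")

lp = LinkedPredicate([dp1, dp2])
trace = ["Q:recv", "P:send", "R:recv"]
results = [lp.observe(e) for e in trace]
print(results)
```

The early `Q:recv` is ignored because DP2 is not yet armed; only a DP2 event seen after DP1 completes the sequence.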
Linked predicates’ implementation
[Figure: the debugger process sends DP1->DP2 to the processes involved in DP1; once DP1 is satisfied, DP2 is sent on to the processes involved in DP2, which start halting when DP2 is satisfied. Processes shown: P, Q, R, S and T.]
Conjunctive Predicates
• Form:
CP ::= SP [ ∩ SP ]*
• Hardest to detect!
• No single time reference across machines
• Interpretation based on virtual time:
  – Consider processes P1, P2 with virtual time axes T1, T2
  – Define
    SCP = { (t1, t2) | t1 ∈ T1, t2 ∈ T2, SP1(t1) ∩ SP2(t2) }
Conjunctive predicates
• Split SCP into:
  – Ordered-SCP:
    { (t1, t2) | (t1, t2) ∈ SCP, ((SP1)i -> (SP2)j) ∪ ((SP2)i -> (SP1)j) }
  – Unordered-SCP:
    { (t1, t2) | (t1, t2) ∈ SCP, (t1, t2) ∉ ordered-SCP }
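One way to make the split concrete is to stamp each satisfaction of SP1 and SP2 with a vector timestamp and test the happened-before relation between them. The use of vector clocks here is an assumption (the slides only speak of virtual time), and `happened_before` and `split_scp` are hypothetical helper names.

```python
def happened_before(a, b):
    """True if vector timestamp a precedes b (a -> b)."""
    return a != b and all(x <= y for x, y in zip(a, b))


def split_scp(times1, times2):
    """Split SCP pairs into ordered and unordered pairs."""
    ordered, unordered = [], []
    for t1 in times1:
        for t2 in times2:
            if happened_before(t1, t2) or happened_before(t2, t1):
                ordered.append((t1, t2))
            else:
                unordered.append((t1, t2))
    return ordered, unordered


# Hypothetical timestamps at which SP1 held on P1 and SP2 held on P2.
sp1_times = [(1, 0), (3, 2)]
sp2_times = [(0, 1), (2, 2)]
ordered, unordered = split_scp(sp1_times, sp2_times)
print(ordered)
print(unordered)
```

Here only the pair ((1, 0), (0, 1)) is unordered: neither timestamp precedes the other, so the two satisfactions are concurrent.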
Conjunctive Predicates
[Figure: virtual time axes with points t11, t12, t13 and t21, t22, t23, marking one ordered-SCP pair and one unordered-SCP pair.]
Conjunctive Predicates
• Detecting unordered-SCP events is difficult
• Requires:
  – Global information gathering process
  – Time delay!
  – Cannot preserve meaningful process states
Detecting Relational Global Predicates
• Resource conflict validation problems are undetectable by the earlier predicate classes
• Form:
  ( x0 + … + xn > C )
  – xi: resource usage at Pi
  – C: total resource available
• Cannot be decomposed into the earlier classes of predicates
How to detect such predicates?
• Two algorithms:
  – Decentralized: runs concurrently
  – Centralized: decoupled from the target program
Model & Notation
• Partial ordering on S = { S0, …, Sn } where, Si <= Sj, for 0 <= i,j <= n
• Happens-before relation: “->”
• pred.u.i: intuitively, the state just preceding u in process i
• succ.u.i: the state just succeeding u in process i
Concurrent States & Intervals
[Figure: space-time diagram of processes P and Q showing numbered local states, deterministic and non-deterministic events, a state interval and a receive interval.]
Concurrent Intervals
[Figure: processes P0 and P1 with the points (0, lo0), (0, i), (0, hi0) on P0 and (1, lo1), (1, j), (1, hi1) on P1; arrows denote the pred relation, and one element is labelled KEY.]
Concurrent Intervals
• Intervals (0, i) & (1, j) are concurrent iff KEY exists in P0 or P1 such that
  lo0 < i <= hi0 & lo1 < j <= hi1,
  where lo0, lo1, hi0, hi1 are as defined by the previous diagram
Overview of algorithms
• Gather information
  – What?
  – How?
• Consider 2 processes P0 & P1
• Gather concurrent interval sequences:
  – { lo0 to hi0 } at P0 & { lo1 to hi1 } at P1
• Check resource violations at all possible pairs of states in these sequences!!
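The brute-force check described in the last bullet can be sketched as follows, assuming the resource usage of each local state is available as a list indexed by state number and that the concurrent ranges follow the bounds lo < i <= hi used for concurrent intervals. `find_violations` and the sample values are hypothetical.

```python
def find_violations(x0, x1, lo0, hi0, lo1, hi1, capacity):
    """Return every state pair (i, j) with x0[i] + x1[j] > capacity.

    i ranges over lo0 < i <= hi0 and j over lo1 < j <= hi1.
    """
    return [
        (i, j)
        for i in range(lo0 + 1, hi0 + 1)
        for j in range(lo1 + 1, hi1 + 1)
        if x0[i] + x1[j] > capacity
    ]


# Hypothetical resource usage per local state at P0 and P1.
x0 = [1, 4, 2, 6]
x1 = [0, 3, 5]
violations = find_violations(x0, x1, lo0=0, hi0=2, lo1=0, hi1=2, capacity=8)
print(violations)
```

Checking every pair costs time proportional to the product of the two range lengths, which is the cost the summarised per-interval values in the later slides are meant to avoid.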
Algorithms contd…
• Representation of
  (0, lo0) (0, hi0)
  (1, lo1) (1, hi1)
  as a 2x2 matrix clock
• Row i of Pi’s matrix clock = Pi’s vector clock
• Current interval at Pk = (k, Mk[ , ])
• Row k of Mk…pred() of current interval at Pk
• Row i<>k…pred.pred() of current interval at Pk
Maintaining Matrix Clocks
• Initialize
  – Initialize matrix to 0
  – If k=0 or k=1, Mk[k, k]++
• Send message tagged with Mk[., .]; increment Mk[k, k] for k = 0 or 1
• Upon message receive, update matrix clock; increment Mk[k, k];
  – Mk[k, ] = diagonal(Mk)
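The three rules can be sketched for two processes as below. The `MatrixClock` class is a hypothetical helper, and the "update matrix clock" step is read here as an element-wise maximum of every row other than row k with the corresponding row of the received tag; that reading is an assumption, chosen because it is consistent with the values in the Matrix Clock Example.

```python
class MatrixClock:
    """Matrix clock M_k for process k (sketch for n processes, n = 2 here)."""

    def __init__(self, k, n=2):
        self.k = k
        self.m = [[0] * n for _ in range(n)]  # initialize matrix to 0
        self.m[k][k] += 1                     # then increment M_k[k][k]

    def send(self):
        """Return the tag for an outgoing message, then start a new interval."""
        tag = [row[:] for row in self.m]
        self.m[self.k][self.k] += 1
        return tag

    def receive(self, tag):
        """Merge a received tag into the local clock (assumed max-merge)."""
        n = len(self.m)
        for i in range(n):
            if i != self.k:
                self.m[i] = [max(a, b) for a, b in zip(self.m[i], tag[i])]
        self.m[self.k][self.k] += 1
        # Row k becomes the diagonal of the updated matrix.
        self.m[self.k] = [self.m[i][i] for i in range(n)]


p0, p1 = MatrixClock(0), MatrixClock(1)
tag_a = p1.send()       # P1 -> P0
p0.receive(tag_a)
tag_b = p0.send()       # P0 -> P1
p1.receive(tag_b)
print(p0.m, p1.m)
```

With one message from P1 to P0 followed by one from P0 to P1, P0 ends at [[3, 1], [0, 1]] and P1 at [[2, 1], [2, 3]].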
Matrix Clock Example
[Figure: matrix clocks written as [row 0; row 1].
P0: [1 0; 0 0] → [2 1; 0 1] → [3 1; 0 1]
P1: [0 0; 0 1] → [0 0; 0 2] → [2 1; 2 3]
Messages: P1 to P0 tagged [0 0; 0 1]; P0 to P1 tagged [2 1; 0 1].]
32
Decentralized Algorithm
• Consider process P0
• Upon receiving a message, evaluate lo0, lo1, hi0, hi1
• Find the min value of resource (x) at P0
• Send debug message (min_x0, lo1, hi1) to P1
• P1 detects the predicate:
(min_x0 + min_x1 > C)
Overheads & Complexity at P0
• Message overheads:
  – Debug messages: (# of receive intervals at P0) * sizeof(3 integers)
  – Application messages: sizeof(4 integers)
• Memory:
  – # intervals at P0; min_x for each interval
• Computation:
  – (# intervals at P0) * (# debug messages sent + received)
Centralized Algorithm
• Checker process runs concurrently or post-mortem
• Consider the latter: processes P0 & P1
  – Processes keep trace files containing:
    • min_x for each interval
    • an array of {lo0, lo1, hi0, hi1} for each interval
  – The checker runs a check algorithm
    • Builds heaps by inserting the min_x values for all concurrent interval sequences at P0 & P1
    • Uses these heap-tops to detect the predicate
Overheads & Complexity for P0
• Memory:
  – 4 integers for the matrix clock at each application process
• Computation:
  – Monitor local variables
  – Rest offloaded to checker
  – O(R0 + M0 log M0 + M1 log M1)
    where R0 & M0 = # receive intervals & total intervals at P0
Major Practical Problems
• Complexity reduced from exponential to O(n log n), but still…
• Large overheads even for 2 processes
• Lots of messages!
• Lots of memory space!
• Lots of computation!
Applications to Research
• Development of distributed debugging environment
  – Recording of execution sequences
  – Rollback
  – Replay
  – Exploration of new execution scenarios
• Command of mission-control distributed systems
Relevance to Papers Read
• The S/Net’s Linda kernel:
  – Debugging distributed tuple space
  – Detecting race conditions, deadlocks, probe effects
• Chandy & Lamport’s paper explores the detection of stable predicates and Garg’s paper explores unstable predicate detection
Conclusions
• Distributed debugging still challenging
• No efficient algorithm
• Hard to do away with overheads
• Need for efficient event monitoring & manipulation tools
• Message sequence chart generators
• Program flow analysis for more independent program splitting