dcatch: automatically detecting distributed concurrency ...shanlu/paper/dcatch-final.pdfmachine1...
TRANSCRIPT
1
DCatch: Automatically Detecting Distributed Concurrency Bugs
in Cloud Systems
Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li,
Shan Lu, Haryadi Gunawi, and Chen Tian*
*
2
Cloud systems
3
Cloud systems
4
Distributed concurrency bugs (DCbugs)
5
Distributed concurrency bugs (DCbugs)
• Unexpected timing among distributed operations
6
Distributed concurrency bugs (DCbugs)
• Unexpected timing among distributed operations
• Example
BA C
MapReduce-3274
7
Distributed concurrency bugs (DCbugs)
• Unexpected timing among distributed operations
• Example
BA C BA C
MapReduce-3274
hang
8
DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
– 26% failures caused by non-deterministic [1]
– 6% software bugs in clouds system [2]
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
9
DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
• Difficult to avoid, expose and diagnose
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
10
DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
• Difficult to avoid, expose and diagnose
“That is one monster of a race!”
Hadoop Map/Reduce / MAPREDUCE-3274
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
11
DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
• Difficult to avoid, expose and diagnose
“That is one monster of a race!”
Hadoop Map/Reduce / MAPREDUCE-3274
“There isn’t a week going by without new bugs about races.”
HBase / HBASE-4397
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
12
DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
• Difficult to avoid, expose and diagnose
“That is one monster of a race!”
Hadoop Map/Reduce / MAPREDUCE-3274
“There isn’t a week going by without new bugs about races.”
HBase / HBASE-4397
“We have already fix many cases, however it seems exist many other [racing] cases.”
HBase / HBASE-6147
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
13
DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
• Difficult to avoid, expose and diagnose
“That is one monster of a race!”
Hadoop Map/Reduce / MAPREDUCE-3274
“There isn’t a week going by without new bugs about races.”
HBase / HBASE-4397
“We have already found and fix many cases, however it seems exist many other cases.”
HBase / HBASE-6147
“This has become quite messy, sigh.”
Hadoop Map/Reduce / MAPREDUCE-4819
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
14
DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
• Difficult to avoid, expose and diagnose
“That is one monster of a race!”
Hadoop Map/Reduce / MAPREDUCE-3274
“There isn’t a week going by without new bugs about races.”
HBase / HBASE-4397
“We have already found and fix many cases, however it seems exist many other cases.”
HBase / HBASE-6147
“This has become quite messy, sigh.”
Hadoop Map/Reduce / MAPREDUCE-4819
“Great catch, Sid! Apologies for missing the race condition.”
Hadoop Map/Reduce / MAPREDUCE-4099
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
15
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
• Difficult to avoid, expose and diagnose
“That is one monster of a race!”
Hadoop Map/Reduce / MAPREDUCE-3274
“There isn’t a week going by without new bugs about races.”
HBase / HBASE-4397
“We have already found and fix many cases, however it seems exist many other cases.”
HBase / HBASE-6147
“This has become quite messy, sigh.”
Hadoop Map/Reduce / MAPREDUCE-4819
“Great catch, Sid! Apologies for missing the race condition.”
Hadoop Map/Reduce / MAPREDUCE-4099
“We [prefer] debug crashes instead of hanging jobs.”
Hadoop Map/Reduce / MAPREDUCE-3634
16
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14[2] Gunawi. What Bugs Live in the Cloud?. In SoCC’14[3] Leesatapornwongsa. TaxDC. In ASPLOS’16
DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
• Difficult to avoid, expose and diagnose
“That is one monster of a race!”
Hadoop Map/Reduce / MAPREDUCE-3274
“There isn’t a week going by without new bugs about races.”
HBase / HBASE-4397
“We have already found and fix many cases, however it seems exist many other cases.”
HBase / HBASE-6147
“This has become quite messy, sigh.”
Hadoop Map/Reduce / MAPREDUCE-4819
“Great catch, Sid! Apologies for missing the race condition.”
Hadoop Map/Reduce / MAPREDUCE-4099
“We [prefer] debug crashes instead of hanging jobs.”
Hadoop Map/Reduce / MAPREDUCE-3634
Can we detect DCbugs before they manifest?
17
Previous work
• Model checking
– Work on abstracted models
– Face state-space explosion issue
18
Our idea
• Follow the philosophy of traditional concurrency bug detection
19
Our idea
• Follow the philosophy of traditional concurrency bug detection
Machine 2Machine 1 Machine 3 Machine 4
20
Our idea
• Follow the philosophy of traditional concurrency bug detection
Machine 2Machine 1 Machine 3 Machine 4
21
Our idea
• Follow the philosophy of traditional concurrency bug detection
Machine 2Machine 1 Machine 3 Machine 4
22
Example
BA C
23
Example
BA C
B
//UnReg thread
void unReg(jID){
jMap.remove(jID);
....
}
//RPC thread
Task getTask(jID){
...
return jMap.get(jID);
}
24
Local concurrency bug detection
25
Local concurrency bug detection
26
Local concurrency bug detection
Is the problem solved?
27
Local concurrency bug detection
Trace
T1 T2 T3
28
Local concurrency bug detection
Trace
T1 T2 T3
C1
C1: How to handle the hugeamount of mem accesses?
Challenges
29
Local concurrency bug detection
.
.
.
Trace HB
T1 T2 T3
C1
C1: How to handle the hugeamount of mem accesses?
Challenges
30
Local concurrency bug detection
.
.
.
Trace HB
T1 T2 T3
C1
C1: How to handle the hugeamount of mem accesses?
Challenges
rw
31
Local concurrency bug detection
.
.
.
Trace HB
T1 T2 T3
C1
C2
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
Challenges
rw
32
Local concurrency bug detection
.
.
.
Trace TriageHB
T1 T2 T3 .
.
.
C1
C2
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
Challenges
assert(r)
rw
rw
33
Local concurrency bug detection
.
.
.
Trace TriageHB
T1 T2 T3 .
.
.
C1
C2
C3
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
Challenges
assert(r)
rw
rw
34
Local concurrency bug detection
.
.
.
T1 T2 T3 .
.
.
.
.
.
Trace Triage TriggerHB
C1
C2
C3
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
Challenges
+sleeprw
rw
assert(r)
35
Local concurrency bug detection
.
.
.
Trace Triage TriggerHB
T1 T2 T3 .
.
.
.
.
.
C1
C2
C3
C4
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
+sleeprw
assert(r)
rw
36
Contribution
• A comprehensive HB Model for distributed systems
37
Contribution
Trace Triage TriggerHB
C1
C2
C3
C4
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-before model?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
ChallengesSolved by
DCatch
• A comprehensive HB Model for distributed systems
• DCatch tool detects DCbugs from correct runs
38
Contribution
• A comprehensive HB Model for distributed systems
• DCatch tool detects DCbugs from correct runs
• Evaluate on 4 systems
• Report 32 DCbugs, with 20 of them being truly harmful
39
Outline
• Motivation
• DCatch Happens-before Model
• DCatch tool
• Evaluation
• Conclusion
40
DCatch Happens-before Model
HMaster
Thd
w
41
DCatch Happens-before Model
Thd
HMaster
Thd
w
42
DCatch Happens-before Model
ThdThd
HMaster HRegionServer
Thd
w
43
DCatch Happens-before Model
Thd Event threadThd
HMaster
e
HRegionServer
Thd
w
44
DCatch Happens-before Model
Thd Event threadThd
HMaster
e
HRegionServer
Thd
ZK Coordinator
w
45
DCatch Happens-before Model
ThdThd Event threadThd
HMaster
e
HRegionServer
Thd
ZK Coordinator
w
46
DCatch Happens-before Model
ThdThd Event threadThd
HMaster
e
HRegionServer
Thd
ZK Coordinator
w
r
47
DCatch Happens-before Model
ThdThd Event threadThd
HMaster
e
HRegionServer
Thd
ZK Coordinator
w
r
Where is HB model for distributed systems?
48
DCatch Happens-before Model
ThdThd Eve thdThd
HMaster HRegionServer
Thd
ZK Coordinator
w
r
49
DCatch Happens-before Model
Dist.Loc.
ThdThd Eve thdThd
HMaster HRegionServer
Thd
ZK Coordinator
w
r
Dist.
50
DCatch Happens-before Model
Dist.Loc.
Async.
Sync.ThdThd Eve thdThd
HMaster HRegionServer
Thd
ZK Coordinator
w
r
Dist.Async.
51
DCatch Happens-before Model
Stand.
Dist.Loc.Cust.
Async.
Sync.ThdThd Eve thdThd
HMaster HRegionServer
Thd
ZK Coordinator
w
r
Dist.Async.
Cust.
52
Distributed rules
DistributedLocal
StandardCustom
Async.
Sync.
53
Distributed rule #1
• Logical time clock (Leslie Lamport, 1978)
Send
Recv
Machine1 Machine2
Standard
Asy
nch
.
Customize
Syn
ch.
Socket
54
Distributed rule #1
• Logical time clock (Leslie Lamport, 1978)
Send
Recv
Machine1 Machine2
Standard
Asy
nch
.
Customize
Syn
ch.
Socket
55
Distributed rule #2
RPC-call
RPC-begin
Machine1 Machine2
RPC-rtRPC-end
Standard
Asy
nch
.
Customize
Syn
ch.
RPC
Socket
56
Distributed rule #2
RPC-call
RPC-begin
Machine1 Machine2
RPC-rtRPC-end
Standard
Asy
nch
.
Customize
Syn
ch.
RPC
Socket
waiting
57
Distributed rule #2
RPC-call
RPC-begin
Machine1 Machine2
RPC-rtRPC-end
Standard
Asy
nch
.
Customize
Syn
ch.
RPC
Socket
waiting
58
Distributed rule #3
//Thread1
flag = True;
//Thread
while(!flag){
}
...
In multi-threaded systems:
Standard
Asy
nch
.
Customize
Syn
ch.
Socket
RPC
Dist. while-loop
59
Distributed rule #3
//Thread1
flag = True;
//Thread
while(!flag){
}
...
In multi-threaded systems:
Standard
Asy
nch
.
Customize
Syn
ch.
Socket
RPC
Dist. while-loop
60
Distributed rule #3
//Thread1
flag = True;
//Thread
while(!flag){
}
...
In multi-threaded systems:
Standard
Asy
nch
.
Customize
Syn
ch.
Socket
RPC
Dist. while-loop
In distributed systems:
61
Distributed rule #3
//Thread1
flag = True;
//Thread
while(!getFlag()){
}
...
//Thread2
bool getFlag(){
return flag;
}
Machine BMachine A
In distributed systems:
Standard
Asy
nch
.
Customize
Syn
ch.
Socket
RPC
Dist. while-loop
62
Distributed rule #3
//Thread1
flag = True;
//Thread
while(!getFlag()){
}
...
//Thread2
bool getFlag(){
return flag;
}
Machine BMachine A
In distributed systems:
Standard
Asy
nch
.
Customize
Syn
ch.
Socket
RPC
Dist. while-loop
63
Distributed rule #4
ThdThd Eve thdThdThd
ZK Coordinator
w
r
Standard
Asy
nch
.
Customize
Syn
ch.
Socket
RPC
ZooKeeper Service
64
Distributed rule #4
Standard
Asy
nch
.
Customize
Syn
ch.
ZooKeeper Service
Socket
RPCThdThd Eve thdThdThd
ZK Coordinator
w
r
65
Distributed rules
StandardCustomized
Asy
nch
ron
ou
sSy
nch
ron
ou
sRPCDist. While-loop
SocketZookeeper Service
66
Local rules
DistributedLocal
67
Local rules
StandardCustomized
Asy
nch
ron
ou
sSy
nch
ron
ou
s
Event-relatedn/a
68
Local rules
StandardCustomized
Asy
nch
ron
ou
sSy
nch
ron
ou
s
Event-related
Thread fork/joinWhile-loop
n/a
69
Local rules
StandardCustomized
Asy
nch
ron
ou
sSy
nch
ron
ou
sThread fork/joinWhile-loop
Event-relatedn/a
70
Outline
• Motivation
• DCatch Happens-before Model
• DCatch tool
• Evaluation
• Conclusion
71
Triage TriggerTrace HB
72
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Selective tracing: only mem accesses inEvent/message handlers and their callee.
73
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Local Distributed
[1] Raychev. Effective Race Detection for Event-Driven Programs. In OOPSLA’13
74
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Machine B
//RPC thread
Task getTask(jID){
...
return jMap.get(jID);
}
//UnReg thread
void unReg(jID){
jMap.remove(jID);
....
}
75
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Machine BMachine A
//RPC thread
Task getTask(jID){
...
return jMap.get(jID);
}
//UnReg thread
void unReg(jID){
jMap.remove(jID);
....
}
while(!getTask(jID)){
}
76
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Machine BMachine A
//RPC thread
Task getTask(jID){
...
return jMap.get(jID);
}
//UnReg thread
void unReg(jID){
jMap.remove(jID);
....
}
while(!getTask(jID)){
}
77
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Machine BMachine A
//RPC thread
Task getTask(jID){
...
return jMap.get(jID);
}
//UnReg thread
void unReg(jID){
jMap.remove(jID);
....
}
while(!getTask(jID)){
}
hang
78
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Thd ThdEvent thread
e2
e1
w
r
79
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Thd ThdEvent thread
e2
e1
w
r +sleep
80
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Thd ThdEvent thread
e2
e1
w
r
+sleep
81
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Thd
Machine A
Thd
Machine C
Thd ThdEvent thread
e2
e1
w
r
Machine B
82
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Thd
Machine A
Thd
Machine C
Thd ThdEvent thread
e2
e1
w
r
Machine B
+sleep
+sleep
83
C1: How to handle the hugeamount of mem accesses?
C2: What’s the happens-beforemodel?
C3: How to estimate thedistributed impact of a race?
C4: How to trigger withdistributed time manipulation?
Challenges
Triage TriggerTrace HB
Thd
Machine A
Thd
Machine C
Thd ThdEvent thread
e2
e1
w
r
Machine B
+sleep
+sleep+sleep
+sleep
84
Outline
• Motivation
• DCatch Happens-before Model
• DCatch tool
• Evaluation
• Conclusion
85
• Benchmarks:
– 7 real-world DCbugs from TaxDC[1]
– 4 distributed systems
Methodology
[1] Leesatapornwongsa. TaxDC. In ASPLOS’16
86
Overall resultsBugID Detected? #. Bugs #. Benign #. false-pos
CA-1011 ✔ 3 0 0
HB-4539 ✔ 3 0 1
HB-4729 ✔ 4 1 0
MR-3274
✔ 2 0 4
MR-4637
✔ 1 2 4
ZK-1144 ✔ 5 1 1
ZK-1270 ✔ 6 2 0
Total 20 5 7
87
Overall resultsBugID Detected? #. Bugs #. Benign #. false-pos
CA-1011 ✔ 3 0 0
HB-4539 ✔ 3 0 1
HB-4729 ✔ 4 1 0
MR-3274
✔ 2 0 4
MR-4637
✔ 1 2 4
ZK-1144 ✔ 5 1 1
ZK-1270 ✔ 6 2 0
Total 20 5 7
88
Overall resultsBugID Detected? #. Bugs #. Benign #. false-pos
CA-1011 ✔ 3 0 0
HB-4539 ✔ 3 0 1
HB-4729 ✔ 4 1 0
MR-3274
✔ 2 0 4
MR-4637
✔ 1 2 4
ZK-1144 ✔ 5 1 1
ZK-1270 ✔ 6 2 0
Total 20 5 7
= 12 + 8
89
Overall resultsBugID Detected? #. Bugs #. Benign #. false-pos
CA-1011 ✔ 3 0 0
HB-4539 ✔ 3 0 1
HB-4729 ✔ 4 1 0
MR-3274
✔ 2 0 4
MR-4637
✔ 1 2 4
ZK-1144 ✔ 5 1 1
ZK-1270 ✔ 6 2 0
Total 20 5 7
90
Other results in our paper
• Performance overhead
• Trace compositions
• HB model impact
– False-positive
– False-negatives
• …
91
Outline
• Motivation
• DCatch Happens-before Model
• DCatch tool
• Evaluation
• Conclusion
92
Conclusion
• A HB Model for distributed systems
• DCatch detects DCbugs from correct runs with low false positive rates.
Trace Triage TriggerHB
C1
C2
C3
C4
Local Distributed
93
Thank you!Q&A