
1

DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems

Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li, Shan Lu, Haryadi Gunawi, and Chen Tian*

*

2

Cloud systems


4

Distributed concurrency bugs (DCbugs)

• Unexpected timing among distributed operations

• Example: MapReduce-3274

[Figure: nodes A, B, and C; the job ends in a hang]

8

DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

– 26% of failures are caused by non-determinism [1]

– 6% of software bugs in cloud systems [2]

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14

[2] Gunawi. What Bugs Live in the Cloud? In SoCC’14

[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

“There isn’t a week going by without new bugs about races.”

HBase / HBASE-4397

“We have already fix many cases, however it seems exist many other [racing] cases.”

HBase / HBASE-6147

“This has become quite messy, sigh.”

“Great catch, Sid! Apologies for missing the race condition.”

“We [prefer] debug crashes instead of hanging jobs.”

Can we detect DCbugs before they manifest?

17

Previous work

• Model checking

– Work on abstracted models

– Face state-space explosion issue

18

Our idea

• Follow the philosophy of traditional concurrency bug detection

[Figure: Machine 1, Machine 2, Machine 3, Machine 4]


22

Example

[Figure: nodes A, B, and C; the code below runs on node B]

//UnReg thread
void unReg(jID){
  jMap.remove(jID);
  ...
}

//RPC thread
Task getTask(jID){
  ...
  return jMap.get(jID);
}
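The two handlers above touch the same `jMap` entry, so the value `getTask` returns depends on which one runs first. A minimal sketch of that dependency follows. It replays the two calls in both orders on a single thread, so it is deterministic. The `JobTable` class, the `register` helper, and the use of `String` in place of `Task` are assumptions made for illustration and are not taken from MapReduce.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the job table on node B; method names follow the slide.
class JobTable {
    private final Map<String, String> jMap = new HashMap<>();

    // Hypothetical helper that creates the entry the two handlers race on.
    void register(String jID, String task) {
        jMap.put(jID, task);
    }

    // UnReg thread: removes the job entry.
    void unReg(String jID) {
        jMap.remove(jID);
    }

    // RPC thread: looks the job up; returns null once the entry is gone.
    String getTask(String jID) {
        return jMap.get(jID);
    }
}

class RaceDemo {
    // Replays the two operations in a fixed order and returns what getTask saw.
    static String run(boolean unRegFirst) {
        JobTable table = new JobTable();
        table.register("job1", "task1");
        if (unRegFirst) {
            table.unReg("job1");
            return table.getTask("job1");
        }
        String task = table.getTask("job1");
        table.unReg("job1");
        return task;
    }

    public static void main(String[] args) {
        System.out.println("getTask then unReg: " + run(false));
        System.out.println("unReg then getTask: " + run(true));
    }
}
```

In the second order the lookup comes back empty, which is the situation a caller has to be prepared for.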

24

Local concurrency bug detection


Is the problem solved?

27


Challenges

[Figure: threads T1, T2, T3 pass through four stages: Trace, HB, Triage, Trigger; annotations r, w, assert(r), +sleep]

C1 (Trace): How to handle the huge amount of memory accesses?

C2 (HB): What’s the happens-before model?

C3 (Triage): How to estimate the distributed impact of a race?

C4 (Trigger): How to trigger with distributed time manipulation?

36

Contribution

• A comprehensive HB Model for distributed systems

• DCatch tool detects DCbugs from correct runs

– Challenges C1 to C4 solved by DCatch

• Evaluate on 4 systems

• Report 32 DCbugs, with 20 of them being truly harmful

39

Outline

• Motivation

• DCatch Happens-before Model

• DCatch tool

• Evaluation

• Conclusion

40

DCatch Happens-before Model

[Figure: HBase example with threads (Thd) and an event thread handling event e, spread across HMaster, HRegionServer, and the ZK Coordinator; one access is a write (w) and another is a read (r)]

Where is the HB model for distributed systems?

[Figure: the same example annotated along three axes: Distributed (Dist.) vs. Local (Loc.), Asynchronous (Async.) vs. Synchronous (Sync.), Standard (Stand.) vs. Custom (Cust.)]

52

Distributed rules

[Figure: rule taxonomy: Distributed vs. Local, Standard vs. Custom, Async. vs. Sync.]

53

Distributed rule #1

• Logical time clock (Leslie Lamport, 1978)

• Socket (standard, asynchronous): Send on Machine1 happens before Recv on Machine2
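Rule #1 follows Lamport's logical clocks: a receive is always ordered after the matching send. A minimal sketch of that clock is shown here. The class and method names are assumptions made for illustration and are not DCatch's implementation.

```java
// Minimal Lamport logical clock; one instance per machine.
class LamportClock {
    private long time = 0;

    // Local event or message send: advance the clock and return the timestamp.
    long tick() {
        return ++time;
    }

    // Message receive: move past both the local time and the sender's timestamp.
    long receive(long sentAt) {
        time = Math.max(time, sentAt) + 1;
        return time;
    }
}

class LamportDemo {
    // Returns {sendTimestamp, recvTimestamp} for one message from Machine1 to Machine2.
    static long[] sendAndReceive() {
        LamportClock machine1 = new LamportClock();
        LamportClock machine2 = new LamportClock();
        machine1.tick();                          // some local work on Machine1
        long sent = machine1.tick();              // Send
        long received = machine2.receive(sent);   // Recv
        return new long[] { sent, received };
    }

    public static void main(String[] args) {
        long[] ts = sendAndReceive();
        System.out.println("send=" + ts[0] + " recv=" + ts[1]);
    }
}
```

Because `receive` takes the maximum before incrementing, the receive timestamp is strictly larger than the send timestamp regardless of how far behind the receiver's clock was.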

55

Distributed rule #2

• RPC (standard, synchronous)

[Figure: RPC-call on Machine1 happens before RPC-begin on Machine2; RPC-end on Machine2 happens before RPC-rt on Machine1; the caller on Machine1 is waiting in between]
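One way to read the RPC rule is as two edges in a happens-before graph: call to begin, and end to return. Two accesses are then race candidates when neither can reach the other along such edges. The sketch below assumes a plain adjacency-list graph with string node names. `HbGraph`, `HbDemo`, and the node names are hypothetical and are not taken from DCatch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical happens-before graph: nodes are operation names, edges are HB relations.
class HbGraph {
    private final Map<String, List<String>> edges = new HashMap<>();

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // True if a non-empty path of HB edges leads from 'from' to 'to'.
    boolean happensBefore(String from, String to) {
        Deque<String> work = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        work.push(from);
        while (!work.isEmpty()) {
            String cur = work.pop();
            for (String next : edges.getOrDefault(cur, Collections.emptyList())) {
                if (next.equals(to)) {
                    return true;
                }
                if (seen.add(next)) {
                    work.push(next);
                }
            }
        }
        return false;
    }

    // Two operations are concurrent when neither is ordered before the other.
    boolean concurrent(String a, String b) {
        return !happensBefore(a, b) && !happensBefore(b, a);
    }
}

class HbDemo {
    static HbGraph build() {
        HbGraph g = new HbGraph();
        g.addEdge("w", "rpcCall");         // program order on Machine1
        g.addEdge("rpcCall", "rpcBegin");  // RPC rule: call before begin
        g.addEdge("rpcBegin", "r1");       // program order inside the handler
        g.addEdge("r1", "rpcEnd");         // program order inside the handler
        g.addEdge("rpcEnd", "rpcReturn");  // RPC rule: end before return
        g.addEdge("otherThreadStart", "r2"); // a read on an unrelated thread
        return g;
    }

    public static void main(String[] args) {
        HbGraph g = build();
        System.out.println("w before r1: " + g.happensBefore("w", "r1"));
        System.out.println("w and r2 concurrent: " + g.concurrent("w", "r2"));
    }
}
```

Here `w` is ordered before `r1` through the RPC edges, while `w` and `r2` have no path between them and would be reported as a candidate pair.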

58

Distributed rule #3

• Dist. while-loop (customized, synchronous)

In multi-threaded systems:

//Thread1
flag = True;

//Thread
while(!flag){
}
...

In distributed systems:

//Thread1
flag = True;

//Thread
while(!getFlag()){
}
...

//Thread2
bool getFlag(){
  return flag;
}

[Figure: the three threads are split across Machine A and Machine B]

63

Distributed rule #4

• ZooKeeper Service (customized, asynchronous)

[Figure: threads (Thd) and an event thread (Eve thd) around the ZK Coordinator, with a write (w) and a read (r)]

65

Distributed rules

              Standard   Customized
Synchronous   RPC        Dist. while-loop
Asynchronous  Socket     ZooKeeper Service

66

Local rules

              Standard           Customized
Synchronous   Thread fork/join   While-loop
Asynchronous  Event-related      n/a

70

Outline

• Motivation

• DCatch Happens-before Model
• DCatch tool

• Evaluation

• Conclusion

71

[Figure: DCatch tool stages: Trace, HB, Triage, Trigger]

72





Challenges


Selective tracing: only memory accesses in event/message handlers and their callees.
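A minimal sketch of the filtering idea: keep an access only if it was made while an event or message handler (or one of its callees) was running. The `Access` record and its `inHandler` flag are assumptions made for illustration. How DCatch decides that an access belongs to a handler is not shown on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical trace record for one memory access.
class Access {
    final String location;    // e.g. a field or map name
    final boolean isWrite;
    final boolean inHandler;  // assumed flag: made inside a handler or its callees

    Access(String location, boolean isWrite, boolean inHandler) {
        this.location = location;
        this.isWrite = isWrite;
        this.inHandler = inHandler;
    }
}

class SelectiveTracer {
    // Keeps only the accesses made under an event/message handler.
    static List<Access> select(List<Access> all) {
        List<Access> kept = new ArrayList<>();
        for (Access a : all) {
            if (a.inHandler) {
                kept.add(a);
            }
        }
        return kept;
    }

    // Small fixed input used by main and by the checks.
    static List<Access> sample() {
        return Arrays.asList(
            new Access("jMap", true, true),      // write inside the UnReg handler
            new Access("jMap", false, true),     // read inside the RPC handler
            new Access("localTmp", true, false)  // unrelated access, not traced
        );
    }

    public static void main(String[] args) {
        System.out.println("traced " + select(sample()).size() + " of " + sample().size());
    }
}
```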

73





Challenges


Local Distributed

[1] Raychev. Effective Race Detection for Event-Driven Programs. In OOPSLA’13

74





Challenges

Machine B

//RPC thread
Task getTask(jID){
  ...
  return jMap.get(jID);
}

//UnReg thread
void unReg(jID){
  jMap.remove(jID);
  ...
}

Machine A

while(!getTask(jID)){
}

Outcome: hang

78





Challenges

[Figure: within one machine, two threads (Thd) and an event thread handling events e1 and e2; a write (w) and a read (r) are the racing accesses, and +sleep marks where a delay is inserted]

[Figure: the same accesses on Machine B, with threads on Machine A and Machine C involved; +sleep marks several candidate points for inserting a delay across machines]
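The `+sleep` annotations suggest steering a race toward one order by delaying one side. A minimal sketch of that idea on a single machine follows, reusing the `jMap` example. `TriggerDemo`, its parameters, and the 300 ms delay are assumptions made for illustration. A sleep only makes one order very likely and does not guarantee it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

class TriggerDemo {
    // Sleeps for the given time; restores the interrupt flag if interrupted.
    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs remove and get on two threads, delaying each by the given amount,
    // and returns the value the get observed (null if the entry was already removed).
    static String run(long delayRemoveMs, long delayGetMs) {
        ConcurrentHashMap<String, String> jMap = new ConcurrentHashMap<>();
        jMap.put("job1", "task1");
        AtomicReference<String> seen = new AtomicReference<>();

        Thread unReg = new Thread(() -> {
            pause(delayRemoveMs);
            jMap.remove("job1");
        });
        Thread rpc = new Thread(() -> {
            pause(delayGetMs);
            seen.set(jMap.get("job1"));
        });

        unReg.start();
        rpc.start();
        try {
            unReg.join();
            rpc.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("delay remove: get saw " + run(300, 0));
        System.out.println("delay get:    get saw " + run(0, 300));
    }
}
```

Delaying the remove lets the lookup succeed, while delaying the lookup makes it come back empty, which is the order that needs to be exercised to confirm the bug.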

84

Outline

• Motivation

• DCatch Happens-before Model
• DCatch tool

• Evaluation

• Conclusion

85

Methodology

• Benchmarks:

– 7 real-world DCbugs from TaxDC [1]

– 4 distributed systems

[1] Leesatapornwongsa. TaxDC. In ASPLOS’16

86

Overall results

BugID     Detected?   # Bugs   # Benign   # False-pos
CA-1011   ✔           3        0          0
HB-4539   ✔           3        0          1
HB-4729   ✔           4        1          0
MR-3274   ✔           2        0          4
MR-4637   ✔           1        2          4
ZK-1144   ✔           5        1          1
ZK-1270   ✔           6        2          0
Total                 20       5          7

Total # Bugs: 20 = 12 + 8
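As a quick check on the Total row, the three categories add up to the 32 reports mentioned on the Contribution slide. The sketch below only redoes that arithmetic from the totals printed on the slide. The class and method names are hypothetical.

```java
class ReportBreakdown {
    // Sum of the three report categories.
    static int totalReports(int bugs, int benign, int falsePos) {
        return bugs + benign + falsePos;
    }

    // Share of 'part' in 'whole', as a percentage.
    static double percent(int part, int whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        int bugs = 20, benign = 5, falsePos = 7;   // Total row of the table
        int total = totalReports(bugs, benign, falsePos);
        System.out.println("reports: " + total);
        System.out.println("harmful: " + percent(bugs, total) + "%");
        System.out.println("false positives: " + percent(falsePos, total) + "%");
    }
}
```

With the slide's totals this gives 32 reports, of which 62.5% are harmful bugs and 21.875% are false positives.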

90

Other results in our paper

• Performance overhead

• Trace compositions

• HB model impact

– False positives

– False negatives

• …

91

Outline

• Motivation

• DCatch Happens-before Model
• DCatch tool

• Evaluation

• Conclusion

92

Conclusion

• An HB model for distributed systems

• DCatch detects DCbugs from correct runs with low false positive rates.

[Figure: challenges C1 to C4; Local and Distributed HB rules]

93

Thank you! Q&A
