diagnosing and fixing concurrency bugs credits to dr. guoliang jin, computer science, nc state...
TRANSCRIPT
Diagnosing and FixingConcurrency Bugs
Credits to Dr. Guoliang Jin, Computer Science, NC STATE
Presented by Tao Wang
2
We need reliable software People’s daily life now depends on reliable software
Software companies spend lots of resources on debugging More than 50% effort on finding and fixing bugs Around $300 billion per year
Concurrency bugs hurt It is an increasingly parallel world
Concurrency bugs in history
3
Multi-threaded program Concurrent programs under the shared-memory model
Programs execute multiple interacting threads in parallel Threads communicate via shared memory Shared-memory accesses should be well-synchronized
Multicore chip
core1
cache
thread1
core2
cache
thread2
core3
cache
thread3
core4
cache
thread4
shared memory
4
HugeInterleavingspace
An example of concurrency bug
Thread 1if (ptr != NULL) { ptr->field = 1;}
Thread 2
ptr = NULL;
Thread 1
if (ptr != NULL) { ptr->field = 1;}
Thread 2ptr = NULL;
The interleaving space
5
Bad interleavings
Previous research focuses on finding
Thread 1if (ptr != NULL) {
ptr->field = 1;}
Thread 2
ptr = NULL;
Thread 1if (ptr != NULL) {
ptr->field = 1;}
Thread 2
ptr = NULL;
Segmentation Fault
Software quality does not improve until bugs are fixed
Manual concurrency bug fixing is time-consuming: 73 days on average error-prone: 39% patches are buggy in the first release
CFix: automated concurrency-bug fixing [PLDI’11*, OSDI’12] Program behaves correctly if bad interleavings do not occur Fix concurrency bugs by disabling bad interleavings
Bug fixing
6
*SIGPLAN:“one of the first papers to attack the problem of automated bug fixing”
HugeInterleavingspace
Bad interleavings
Bad interleavings
Bad interleavingsDisabled
The interleaving space (again)
lead to production-run failures
7
Bad interleavings
Bad interleavings
Bad interleavings
Disabled
Failures still happen in production runs The reason behind failure needs to be understood
Tools dealing with production runs demand low overhead Diagnostic information needs to be informative
Production-run concurrency-bug failure diagnosis Design new monitoring schemes and sampling strategies
CCI: a pure software solution [OOPSLA’10]
PBI, LXR: hardware-assisted solutions [ASPLOS’13 & 14]
Failure diagnosis
8
My work on concurrency bugs
[ASPLOS’11]
Production-RunFailure Diagnosis:
CCI/PBI/LXR[OOPSLA’10,
ASPLOS’13 & 14]9
[PLDI’11*, OSDI’12]*Received a SIGPLAN
CACM nomination
Bug Detection and software testing:
ConSeq
AutomatedConcurrency-Bug
Fixing: CFix
Outline Motivation and Overview Automated Concurrency-Bug Fixing
The problem and idea Overview Internals of CFix Evaluation and summary
10
What is the correct behavior? Usually requires developers’ knowledge
How to get the correct behavior? Correct program states under bug-triggering inputs No change to program states under other inputs
Automated fixing is difficult
11
Description:SymptomTriggering condition…
Patch:CorrectnessPerformanceSimplicity
�?
What is the correct behavior? The program state is correct as long as the buggy interleaving does not occur
How to get the correct behavior? Only need to disable failure-inducing interleavings Can leverage well-defined synchronization operations
CFix’ insights
12
Description:SymptomTriggering condition…
Patch:CorrectnessPerformanceSimplicity
?
Description:SymptomTriggering condition…
Description:Interleavings that lead to software failure
13
Patch:CorrectnessPerformanceSimplicity
?atomicity violation detectorsParkASPLOS’09,FlanaganPOPL’04,LuASPLOS’06,ChewEuroSys’10
order violation detectorsZhangASPLOS’10,LuciaMICRO’09,YuISCA’09,GaoASPLOS’11
data race detectorsSenPLDI’08,SavageTOCS’97,YuSOSP’05,EricksonOSDI’10,KasikciASPLOS’10
abnormal data flow detectorsZhangASPLOS’11,ShiOOPSLA’10
p rc
A B
Wb
RWg
I1
I2
How to get a general solution that generates good patches?
. . .
. . .Patched binaryPatched binary
Patched binaryPatched binary
Merged binary
. . .Selected binary Selected binary
Mutual exclusionOrder
Mutual exclusionOrder
Final patched binary14
Description:Interleavings that lead to software failure
Patch:CorrectnessPerformanceSimplicity
CFix
Run-time Support
PatchMerging
Patch Testing & Selection
Synchronization Enforcement
Fix-Strategy Design
Source codeBug reports
Fix-strategy design: what to fix
Challenges: Huge variety of bugs
15
Run-time Support
Fix-Strategy Design
SynchronizationEnforcement
PatchMerging
Patch Testing& Selection
Why these two? Real-world concurrency bug characteristics study[SHAN
ASPLOS’08]: 97% either atomicity violation or order violation Either can be fixed by mutual exclusion or order enforcement
Two types of Concurrency bugs
16
Atomicity violation
Order violation
Fix-strategy design: how to fix
Challenges: Inaccurate root cause
17
Run-time Support
Fix-Strategy Design
SynchronizationEnforcement
PatchMerging
Patch Testing& Selection
atomicity-violation
Thread 1
if (ptr != NULL) {
ptr->field = 1;}
ptr = NULL;
Thread 2
18
P
C
R
Fix-strategy for atomicity-voilation
Thread 1
if (ptr != NULL) {
ptr->field = 1;}
ptr = NULL;
Thread 2
19
CFix: fix-strategy design
Challenges: Inaccurate root cause Huge variety of bugsSolution: A combination of
mutual exclusion & order relationship enforcement
20
Run-time Support
Fix-Strategy Design
SynchronizationEnforcement
PatchMerging
Patch Testing& Selection
Fix-strategies Overview
OV DetectorAV Detector Race Detector DU Detector
I1
I2
A B
p rc
Wb
RWg
21
CFix: synchronization enforcement
Challenges: Correctness Performance simplicitySolution: Mutual exclusion
enforcement: AFix [PLDI’11] Order relationship
enforcement: OFix [OSDI’12]
22
Run-time Support
Fix-Strategy Design
SynchronizationEnforcement
PatchMerging
Patch Testing& Selection
Input: three statements (p, c, r) with contexts
Idea: making the code region from p to c be mutually exclusive with r
Atomicity violation in Fixing
23
Thread 1if (ptr != NULL) {
ptr->field = 1;}
Thread 2
ptr = NULL; rp
c
Approach: lock
Goal: Correctness: paired lock acquisition and release operations Performance: Make the critical section as small as possible
Mutual exclusion enforcement: AFix
p
c
r
24
A naïve solution Add lock on edges reaching p Add unlock on edges leaving c
Potential new bugs Could lock without unlock Could unlock without lock etc.
A naïve solution
p
c
p
c
p
c
p
c
25
Assume p and c are in the same function f
Step 1: find protected nodes in critical section
Step 2: add lock operations
unprotected node protected node
protected node unprotected node
Avoid those potential bugs mentioned
The AFix solution
p
c
26
p and c adjustment when they are in different functions Observation: people put lock and unlock in one function Find the longest common prefix of p’s and c’s stack traces Adjust p and c accordingly
Put r into a critical section Do nothing if we can reach r from the p–c critical section
Lock type: Lock with timeout: if critical section has blocking operations Reentrant lock: if recursion is possible within critical section
Subtle details
27
use read
initialization
destroy
OFix: two order relationshipsAi
A BAj
…… ?
firstA-BallA-B
A1
B
An
…
A1
B
An
…28
Approach: condition variable and flag Insert signal operations in A-threads Insert wait operation before B
Rules A-thread signals exactly once when it will not execute more A A-thread signals as soon as possible B proceeds when each A-thread has signaled
OFix allA-B enforcement
29
OFix allA-B enforcement: A sideHow to identify the last A instance in one thread
A
. . .; for (. . .) . . . ; // A . . .;
Each thread that executes A exactly once as soon as it can execute no more A
30
OFix allA-B enforcement: A sideHow to identify the last thread that executes A
void main() { for (. . .) thread_create(thr_main); . . .;} void ofix_signal() {
mutex_lock(L);
--;
if ( == 0) cond_broadcast(con); mutex_unlock(L);}
void thr_main() { for (. . .) . . . ; // A . . .;}
counter for signal threads
=1
++
thread_create A
31
Safe to execute only when is 0
Give up if OFix knows that it introduces new deadlock Timed wait-operation to mask potential deadlocks
OFix allA-B enforcement: B side
B
void ofix_wait() { mutex_lock(L); if ( != 0) cond_timedwait(con, L, t); mutex_unlock(L);}
32
Basic enforcement
When A may not execute Add a safety-net of signal with allA-B algorithm
OFix firstA-B
BA
33
CFix: patch testing & selection
Challenge: Multi-thread software
testingSolution: CFix-patch oriented testing
34
Run-time Support
Fix-Strategy Design
SynchronizationEnforcement
PatchMerging
Patch Testing& Selection
Patch testing principles Two ideas:
No exhaustive testing, but patch oriented testing Leverage existing testing techniques, with extra heuristics
The work-flow Step 1 Prune incorrect patches
• Patches causing failures due to wrong fix strategies, etc Step 2 Prune slow patches Step 3 Prune complicated patches
35
Run once without external perturbation Reject if there is a time-out or failure
Patches fixing wrong root cause Make software to fail deterministically
Thread 1
ptr->field = 1;
ptr->field = 1;
Thread 2
ptr = NULL;
36
Implicit bad patch A failure in patch_b implies a failure in patch_a
If patch_a is less restrictive than patch_b
Helpful to prune patch_a Traditional testing may not find the failure in patch_a
aMutual Exclusion
b cOrder Relationships
37
Challenge: One single programming
mistake usually leads to multiple bug reports
Solution: Heuristics to merge patches
CFix: patch merging
38
Run-time Support
Fix-Strategy Design
SynchronizationEnforcement
PatchMerging
Patch Testing& Selection
c1
r1
p1
p2
c2, r2
void buf_write() { int tmp = buf_len + str_len; if (tmp > MAX) return;
memcpy(buf[buf_len], str, str_len); buf_len = tmp;}
An example with multiple reports
p1
c1 p2r1 c2, r2
Too many lock/unlock operations Potential new deadlocks May hurt performance and simplicity
39
Related patch: a case of AFix Merge if p, c, or r is in some other patch’s critical sections
lock(L1)p1
lock(L2)p2c1
unlock(L1)c2
unlock(L2)
lock(L1)r1
unlock(L1)
lock(L2)r2
unlock(L2)
lock(L1)p1
p2c1
c2unlock(L1)
lock(L1)r2
unlock(L1)
40
c1
r1
p1
p2
c2,r2
void buf_write() { int tmp = buf_len + str_len; if (tmp > MAX) { return; }
memcpy(buf[buf_len], str, str_len); buf_len = tmp; }
The merged patch for the example
p1
c1 p2r1 c2, r2
c1,p2
c2,r1,r2
p1
41
To understand whether there is a deadlock underlying time-out
Low-overhead, and suitable for production runs
CFix: run-time support
42
Run-time Support
Fix-Strategy Design
SynchronizationEnforcement
PatchMerging
Patch Testing& Selection
Evaluation methodologyAPP.
PBZIP2
x264
FFT
HTTrack
Mozilla-1
transmission
ZSNES
Apache
MySQL-1
MySQL-2
Mozilla-2
Cherokee
Mozilla-3
AV Detector OV Detector RA Detector DU Detector
43
Evaluation resultAV Detector OV Detector RA Detector DU Detector
ü ü ü
ü
ü ü ü ü
ü ü ü
ü ü
ü ü ü ü
ü
ü ü
ü ü ü
ü ü û
ü ü
ü ü ü
ü ü ü
# of Ops
5
7
5
2
2
2
3
3
5
9
3
2
5
APP.
PBZIP2
x264
FFT
HTTrack
Mozilla-1
transmission
ZSNES
Apache
MySQL-1
MySQL-2
Mozilla-2
Cherokee
Mozilla-344
Summary Software reliability is critical Fixing Concurrency bugs is costly and error-prone CFix uses some heuristics, with good results in practice
A combination of mutual exclusion and order enforcement Use testing to select the best patch Fix root cause without requiring detectors to report it Small overhead and good simplicity
45
Questions ?
Thank you
46