Handling Production-Run Concurrency-Bug Failures
Shan Lu
University of Chicago
Karu Sankaralingam
University of Wisconsin - Madison
Reliability is crucial
• Production-run failures are costly
• Reliability affects other aspects of parallel systems
“We don’t use thread-level parallelism because that is difficult to get correct”
Reliability is challenging
• Concurrency bugs
– Synchronization problems in multi-threaded software
• Concurrency bugs widely exist in production
– In-house testing is ineffective for concurrency bugs
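How to handle production-run failures caused by concurrency bugs?
MySQL example 1 (Thread 1 and Thread 2):
    one thread:        if (proc) { tmp = *proc; }
    the other thread:  proc = NULL;
MySQL example 2 (Thread 1 and Thread 2):
    one thread:        _state = mThd->state;
    the other thread:  mThd = CreateThd();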
Rollback recovery
[Figure: Thread 1, Thread 2, Thread 3]
Problems of the traditional approach
[Figure: Thread 1, Thread 2, Thread 3]
Bug examples
Do we need to roll back all threads?
Do we need memory checkpoint?
ConAir: Featherweight Concurrency Bug Recovery Via Single-Threaded Idempotent Execution [ASPLOS13]
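MySQL example 1 (Thread 1 and Thread 2):
    one thread:        if (proc) { tmp = *proc; }
    the other thread:  proc = NULL;
MySQL example 2 (Thread 1 and Thread 2):
    one thread:        _state = mThd->state;
    the other thread:  mThd = CreateThd();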
ConAir system
[Figure: Thread 1, Thread 2, Thread 3]
• Guarantees no change to program semantics
• No change to OS/Hardware
• No prior bug knowledge required
• Negligible overhead (<0.2%)
• Works for 16 out of 26 real-world bugs
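A minimal sketch of the single-threaded recovery idea, applied to the first MySQL example. This is an illustration under stated assumptions, not ConAir's actual implementation: the names `proc_info`, `interleave_hook`, `thread2_clear_proc`, `read_proc_with_recovery`, and the `MAX_RETRIES` bound are all hypothetical, and the hook only simulates the other thread's `proc = NULL` deterministically. Only the failing thread re-executes a short region that writes no shared state, so no other thread is rolled back and no memory checkpoint is taken.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared structure in the slide's example. */
struct proc_info { int value; };

/* Shared pointer that another thread may set to NULL at any time. */
static _Atomic(struct proc_info *) proc;

/* Hypothetical hook that simulates the other thread running inside the
 * window between the check and the dereference. NULL means no interleaving. */
static void (*interleave_hook)(void);

/* Simulated "proc = NULL;" from the other thread. */
static void thread2_clear_proc(void)
{
    atomic_store(&proc, NULL);
}

enum { MAX_RETRIES = 8 };  /* assumed bound, not taken from the paper */

/* Returns 1 and stores the value in *tmp if proc was read safely,
 * 0 if proc was NULL at the check (branch not taken),
 * -1 if recovery gave up. *retries counts re-executions of the region. */
int read_proc_with_recovery(int *tmp, int *retries)
{
    *retries = 0;
    for (;;) {
        /* Start of the idempotent region: nothing shared is written
         * between here and the dereference, so re-running it is safe. */
        if (atomic_load(&proc) == NULL)
            return 0;

        if (interleave_hook)
            interleave_hook();  /* the other thread may run here */

        /* Failure site: test the pointer right before dereferencing it
         * instead of crashing, and roll back only this thread. */
        struct proc_info *p = atomic_load(&proc);
        if (p != NULL) {
            *tmp = p->value;
            return 1;
        }
        if (++*retries > MAX_RETRIES)
            return -1;
    }
}
```

With `interleave_hook` set to `thread2_clear_proc` and `proc` initially non-NULL, the first pass reaches the failure site with a NULL pointer, the region is re-executed once, and the re-run check then skips the dereference.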
The problems
• What if the error propagation distance is long?
• What if the failure thread was already too slow?
MySQL example (Thread 1 and Thread 2):
    one thread:        tmp = *ptr;
    the other thread:  free(ptr);
Proactive prevention
[Figure: Thread 1, Thread 2, Thread 3]
Bug examples
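MySQL example 1 (Thread 1 and Thread 2):
    one thread:        if (proc) { tmp = *proc; }
    the other thread:  proc = NULL;
MySQL example 2 (Thread 1 and Thread 2):
    one thread:        _state = mThd->state;
    the other thread:  mThd = CreateThd();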
Do we need to perturb multiple threads?
Do we perturb at a random place?
AI: A Lightweight System for Tolerating Concurrency Bugs [FSE14] (ACM SIGSOFT Distinguished Paper Award)
A simple and generic scheme
• The best perturbation point is right before a memory access that has an incorrect/abnormal remote predecessor
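A minimal sketch of how such a perturbation point could be recognized, using the `mThd` example. It is a simplification under stated assumptions, not the AI system's implementation: the instruction IDs, thread IDs, the `allowed` table, and the functions `train`, `record_access`, and `should_perturb` are all hypothetical. The sketch assumes that training runs record which remote predecessors (the latest access by a different thread) were seen for each instruction, and that at run time an access whose remote predecessor was never seen in training should be delayed. It only returns the decision; actually stalling the thread is left out.

```c
#include <stdbool.h>

/* Hypothetical instruction IDs for the mThd example. */
enum {
    INSTR_NONE = 0,        /* no earlier access to the variable */
    INSTR_CREATE_THD = 1,  /* mThd = CreateThd(); */
    INSTR_READ_STATE = 2,  /* _state = mThd->state; */
    INSTR_COUNT
};

/* Hypothetical thread IDs. */
enum { T_NONE = -1, T_CREATOR = 1, T_READER = 2 };

/* allowed[i][p] is true if training runs observed instruction i with
 * p as its remote predecessor (assumed to be learned beforehand). */
static bool allowed[INSTR_COUNT][INSTR_COUNT];

/* Latest access to the shared variable: which thread, which instruction. */
struct last_access { int thread; int instr; };

/* Record one predecessor observed during a training run. */
void train(int instr, int remote_pred)
{
    allowed[instr][remote_pred] = true;
}

/* Remember that `thread` just executed `instr` on the variable. */
void record_access(struct last_access *last, int thread, int instr)
{
    last->thread = thread;
    last->instr = instr;
}

/* Returns true if `thread` should be delayed right before `instr`. */
bool should_perturb(const struct last_access *last, int thread, int instr)
{
    /* Simplification: only the single latest access is tracked. If it
     * came from this same thread, there is nothing remote to check. */
    if (last->thread == thread)
        return false;
    /* Delay the access if its remote predecessor was never seen in training. */
    return !allowed[instr][last->instr];
}
```

Starting from `{ T_NONE, INSTR_NONE }` and training only the pair (`INSTR_READ_STATE`, `INSTR_CREATE_THD`), the reader is delayed if it arrives before the creator has written `mThd`, and proceeds once `record_access` has logged the creator's write.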
AI system
• Guarantees no change to program semantics
• No change to OS/Hardware
• Training required (ConAir: no prior bug knowledge required)
• Overhead of 1% ~ 10X (ConAir: negligible, <0.2%)
• Works for 35 out of 35 real-world bugs (ConAir: 16 out of 26)
ConAir vs. AI
• Can we combine ConAir and AI?
Aspect | ConAir | AI
Performance | Great | Poor when there are intensive shared-memory accesses
Functionality | Poor when the failure thread is too slow; poor when error propagation is long | Great, but requires training
Summary & Other efforts
• Reactive approach to production-run failures
– ConAir: Featherweight Concurrency Bug Recovery Via Single-Threaded Idempotent Execution [ASPLOS13]
• Proactive approach to production-run failures
– AI: A Lightweight System for Tolerating Concurrency Bugs [FSE14] (ACM SIGSOFT Distinguished Paper Award)
• How much can developers contribute?
– What change history tells us about thread synchronization [FSE15]
• How much can hardware contribute?
Thank XPS!