WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales
Bowen Zhou, Jonathan Too, Milind Kulkarni, Saurabh Bagchi
Purdue University


WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales

Bowen Zhou, Jonathan Too, Milind Kulkarni, Saurabh Bagchi

Purdue University

2

Ever Changing Behavior of Software

• Software has to be adaptive to accommodate different platforms, inputs, and configurations.

• As a side effect, the manifestation of a bug may depend on a particular platform, input, or configuration.

3

Ever Changing Behavior of Software

4

Software Development Process

Develop a new feature and its unit tests

Test the new feature on a local machine

Push the feature into production systems

Break production systems

Roll back the feature

Not tested in production systems!!!

5

Bugs in Production Runs

• Properties
– Remain unnoticed when the application is tested on the developer's workstation
– Break production systems when the application is running on a cluster and/or serving real user requests

• Examples
– Configuration error
– Integer overflow

Scale-Dependent Bugs

7

Modeling Program Behavior for Finding Bugs

• Dubbed statistical debugging [Bronevetsky DSN '10] [Mirgorodskiy SC '06] [Chilimbi ICSE '09] [Liblit PLDI '03]
– Represents program behavior as a set of features that can be measured at runtime
– Builds a model to describe and predict the features based on data collected from many runs
– Detects abnormal features that deviate from the model's prediction beyond a certain threshold

Does not account for scale-induced variation in program behavior

9

Modeling Scale-dependent Behavior

[Figure: # of times a loop executes vs. run #, for training runs and production runs]

Is there a bug in one of the production runs?

10

Modeling Scale-dependent Behavior

[Figure: # of times a loop executes vs. scale, for training runs and production runs]

Accounting for scale makes trends clear, errors at large scales obvious

11

Modeling Scale-dependent Behavior

• Our Previous Work
– Vrisha [HPDC '11]: builds a collective model over all features of a program to detect bugs at any feature
– Abhranta [HotDep '12]: tweaks Vrisha's model to allow per-feature bug detection and localization

They have limitations...

13

Modeling Scale-dependent Behavior

• Big gap in scale
– e.g. training runs on up to 128 nodes, production runs on 1024 nodes

• Noisy features
– Too many false positives render the model useless

14

Reconstructing Scale-dependent Behavior: the WuKong way

• Covers a wide range of program features
• Predicts the expected value of each feature separately in a large-scale run
• Prunes unpredictable features to improve localization quality
• Provides a shortlist of suspicious features in its localization roadmap

15

The Workflow

[Diagram: each training run executes the APP under PIN, producing a (SCALE, FEATURE) record; records from RUN 1..N train a SCALE-to-FEATURE MODEL; for a production run, the model's predicted features are compared ("= ?") against the observed features]

16

Feature Collection

17

Features considered by WuKong

void foo(int a) {
  if (a > 0) {
  } else {
  }
  if (a > 100) {
    int i = 0;
    while (i < a) {
      if (i % 2 == 0) {
      }
      ++i;
    }
  }
}

18

Features considered by WuKong

void foo(int a) {
1:  if (a > 0) {
    } else {
    }
2:  if (a > 100) {
      int i = 0;
3:    while (i < a) {
4:      if (i % 2 == 0) {
        }
        ++i;
      }
    }
}

(Each numbered conditional, 1 through 4, is tracked as a feature.)

19

Modeling

20

Predict Feature from Scale

• X ~ vector of scale parameters X1...XN

• Y ~ number of times a particular feature occurs

• The model to predict Y from X: a regression Ŷ = f(X1, ..., XN), fit per feature from the training runs

• Compute the prediction error: E = |Y − Ŷ| / Ŷ

22

Bug Localization

23

Locate Buggy Features

• First, we need to know if the production run is buggy, by doing detection as follows: flag the run if, for any feature i, E_i > M · E_i^max, where E_i is the error of feature i in the production run, M is a constant parameter, and E_i^max is the max error of feature i in all training runs

• If there is a bug in this run, we can start looking at the prediction error of each feature:
– Rank all features by their prediction error to provide a localization roadmap that contains the top N features

24

Improve Localization Quality by Feature Pruning

25

Noisy Feature Pruning

• Some features cannot be effectively predicted by the above model
– Random
– Not determined by scale
– Discontinuous

• The trade-off
– Keeping these features would pollute the diagnosis by pushing real faults down the list
– Removing them could miss some faults, if a fault happens to be in such a feature

26

Noisy Feature Pruning

• How to remove them? For each feature:
1. Do a cross validation with the training runs
2. Remove the feature if it triggers greater-than-100% prediction error in more than (100-x)% of the training runs

• The parameter x > 0 tolerates outliers in the training runs

27

Evaluation

• Fault injection in Sequoia AMG2006
– Up to 1024 processes
– Randomly selected conditionals flipped

• Two case studies
– Integer overflow in an MPI library
– Deadlock in a P2P file-sharing application

29

Fault Injection Study

• Fault
– Injected at process 0
– Randomly picks a feature to flip

• Data
– Training (w/o fault): 110 runs, 8-128 processes
– Production (w/ fault): 100 runs, 1024 processes

30

Fault Injection Study

• Results
– Total: 100
– Non-crashing: 57
– Detected: 53
– Located: 49

Successfully localized: 92.5% (49 of the 53 detected)

31

Evaluation

• Fault injection in Sequoia AMG2006
– Up to 1024 processes
– Randomly selected conditionals flipped

• Two case studies
– Integer overflow in an MPI library
– Deadlock in a P2P file-sharing application

33

Case Study: A Deadlock in Transmission’s DHT Implementation

34

Case Study: A Deadlock in Transmission’s DHT Implementation

35

Case Study: A Deadlock in Transmission’s DHT Implementation

Feature 53, 66

36

Conclusion

• Debugging scale-dependent program behavior is a difficult and important problem

• WuKong incorporates the scale of a run into a predictive model for each individual program feature, enabling accurate bug diagnosis

• We demonstrated the effectiveness of WuKong through a large-scale fault injection study and two case studies of real bugs

37

Q&A

[email protected]

38

Backup

39

Runtime Overhead

Geometric Mean: 11.4%