PROMISE 2011: "Does Measuring Code Change Improve Fault Prediction?"



DESCRIPTION

Promise 2011: "Does Measuring Code Change Improve Fault Prediction?" Robert Bell, Thomas Ostrand and Elaine Weyuker.

TRANSCRIPT

Page 1: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

© 2007 AT&T Knowledge Ventures. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Knowledge Ventures.

Code Change and Fault Prediction Tom Ostrand, Robert Bell, Elaine Weyuker AT&T Labs – Research Florham Park, NJ, USA

PROMISE 2011

Banff, Alberta, September 20-21, 2011

Page 2: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Overview

•Do measures of code change or churn provide useful input to fault prediction models?

•Standard model

•Base models

•Churn-augmented models

Page 3: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

The Standard Model

• Underlying statistical model

• Negative binomial regression

• Output (dependent) variable

• Predicted fault count in each file of release n

• Predictor (independent) variables

• KLOC (n)

• Previous faults (n-1)

• Previous changes (n-1, n-2)

• File age (number of releases)

• File type (C,C++,java,sql,make,sh,perl,...)
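A minimal sketch of how a regression with a log link, such as negative binomial regression, turns the predictors above into a predicted fault count. Every coefficient value, the file-type effects, and the function name are hypothetical and chosen only for illustration; a real model estimates them from earlier releases. Using log(KLOC) rather than raw KLOC is also an assumption here.

```python
import math

# Hypothetical coefficients, for illustration only. A fitted model would
# estimate these from the fault and change history of earlier releases.
COEFFS = {
    "intercept": -2.0,
    "log_kloc": 0.8,
    "prev_faults": 0.3,
    "prev_changes_n1": 0.2,
    "prev_changes_n2": 0.1,
    "age": -0.1,
}

# Hypothetical per-file-type offsets; unknown types fall back to 0.0.
FILE_TYPE_EFFECT = {"java": 0.0, "c": 0.2, "sql": -0.3}


def predicted_fault_count(kloc, prev_faults, prev_changes_n1,
                          prev_changes_n2, age, file_type):
    """Expected fault count under a log link: exp(linear predictor)."""
    linear = (
        COEFFS["intercept"]
        + COEFFS["log_kloc"] * math.log(kloc)
        + COEFFS["prev_faults"] * prev_faults
        + COEFFS["prev_changes_n1"] * prev_changes_n1
        + COEFFS["prev_changes_n2"] * prev_changes_n2
        + COEFFS["age"] * age
        + FILE_TYPE_EFFECT.get(file_type, 0.0)
    )
    return math.exp(linear)
```

Only the ordering of the predicted counts matters for ranking files, so the sketch leaves out the dispersion parameter that a negative binomial fit also estimates.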

Page 4: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Evaluating prediction models

• Model produces ranking of files in a release, from predicted most faults to fewest faults

• Choose cutoff point in ranking, X%

• Yield = percent of all faults in the release that are in the first X% of the ranked files

We’ve usually evaluated models at a 20% cutoff.

• Fault-percentile average (FPA) is the average yield over all values of X
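The two measures above can be sketched as follows. The function names are hypothetical, the cutoff is rounded up to a whole number of files, and FPA is computed here as the mean yield over every possible cutoff position (top 1 file, top 2 files, and so on), which is one reading of "average yield over all values of X". Ties in predicted counts are broken by input order, which is an assumption.

```python
from math import ceil


def rank_files(predicted):
    """Indices of files ordered from most to fewest predicted faults."""
    return sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)


def yield_at_cutoff(predicted, actual, cutoff_percent):
    """Percent of all actual faults found in the top cutoff_percent of files."""
    total = sum(actual)
    if total == 0:
        return 0.0
    order = rank_files(predicted)
    n_top = ceil(len(order) * cutoff_percent / 100)
    return 100.0 * sum(actual[i] for i in order[:n_top]) / total


def fault_percentile_average(predicted, actual):
    """Mean yield (in percent) over all cutoffs: top 1 file, top 2 files, ..."""
    total = sum(actual)
    if total == 0 or not actual:
        return 0.0
    found = 0
    yields = []
    for i in rank_files(predicted):
        found += actual[i]          # faults covered by the files ranked so far
        yields.append(found / total)
    return 100.0 * sum(yields) / len(yields)


# Toy data: five files with predicted and actual fault counts.
predicted = [5.0, 0.1, 2.0, 0.5, 0.2]
actual = [4, 0, 3, 1, 0]
print(yield_at_cutoff(predicted, actual, 20))        # 50.0
print(fault_percentile_average(predicted, actual))   # 87.5
```

In the toy data the top 20% is a single file holding 4 of the 8 faults, so the yield is 50%, while FPA rewards the model for also placing the remaining faulty files near the top of the ranking.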

Page 5: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

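Prediction Results from the Standard Model

[Bar chart. Series: percent of faults in top 20% of files; FPA. Vertical axis: 0 to 100. Data labels in the extracted text: 83, 83, 75, 81, 93, 76, 91, 87, 88, 93, 88, 93, 92. The mapping of labels to bars is not recoverable.]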

Page 6: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Measures of Code Change

•Changed/not changed

•Number of changes during a release

•Number of lines added

•Number of lines deleted

•Number of lines modified

•Relative churn (line changes/LOC)
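As a rough illustration, the measures listed above can be derived from per-file counts for one release. The `FileChurn` class and its field names are hypothetical, and using the file's LOC in that release as the denominator for relative churn is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FileChurn:
    """Hypothetical per-file change record for a single release."""
    loc: int             # lines of code in the file (assumed denominator)
    changes: int         # number of changes (check-ins) during the release
    lines_added: int
    lines_deleted: int
    lines_modified: int

    @property
    def changed(self) -> bool:
        """Binary changed / not changed indicator."""
        return self.changes > 0

    @property
    def total_churn(self) -> int:
        """Adds + deletes + mods, in lines."""
        return self.lines_added + self.lines_deleted + self.lines_modified

    @property
    def relative_churn(self) -> float:
        """Line changes divided by LOC; 0.0 for an empty file."""
        return self.total_churn / self.loc if self.loc else 0.0


f = FileChurn(loc=200, changes=3, lines_added=30, lines_deleted=5, lines_modified=15)
print(f.changed, f.total_churn, f.relative_churn)  # True 50 0.25
```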

Page 7: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Two Subject Systems

Large provisioning system

• 18 releases, 5 year lifespan

• 6 programming languages:

• Java (60%), C, C++, SQL, SQL-C, SQL-C++

• 3000+ files

• 1.5 million LOC

• Average of 395 faults/release

Page 8: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Two Subject Systems

Utility, data aggregation system

• 18 releases, 5 year lifespan

• >10 programming languages:

• Java (77%), Perl, xml, sh, ...

• 800 files

• 280K LOC

• Average of 90 faults/release

Page 9: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Distribution of files, averages over all releases

Percent of Files: Provisioning

• New: 6.8%

• Changed: 11.0%

• Unchanged: 82.2%

Percent of Files: Utility

• New: 1.6%

• Changed: 15.1%

• Unchanged: 84.4%

Page 10: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Where do faults occur?

Distribution of faults over files

Faults/file: Provisioning

• New: 0.24

• Changed: 0.80

• Unchanged: 0.02

Faults/file: Utility

• New: 0.12

• Changed: 0.82

• Unchanged: (no value in the extracted text)

Page 11: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Provisioning system faults per file, by release

[Line chart: Faults per File, by Change Status and Release. Horizontal axis: Release, 1 to 18. Vertical axis: Faults per File, 0 to 2.5. Series: New (Mean=0.24), Unchanged (Mean=0.02), Changed (Mean=0.80).]

Page 12: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Utility system faults per file, by release

[Line chart: Faults per File, by Change Status and Release. Horizontal axis: Release, 1 to 18. Vertical axis: Faults per file, 0 to 3. Series: New (Mean=0.09), Unchanged (Mean=0.002), Changed (Mean=0.92).]

Page 13: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Potential predictor combinations

• Added lines only

• Deleted lines only

• Modified lines only

• Adds & Deletes

• Adds & Mods

• Deletes & Mods

• Adds & Deletes & Mods

• Relative values: changed lines/LOC
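A small sketch of how a changed file could be assigned to one of the combinations above from its line counts, and how the combinations could then be tallied. The function name, the label strings, and the sample data are hypothetical.

```python
from collections import Counter


def change_combination(adds, deletes, mods):
    """Label a file by which kinds of line change it received in a release."""
    kinds = []
    if mods > 0:
        kinds.append("Mods")
    if deletes > 0:
        kinds.append("Deletes")
    if adds > 0:
        kinds.append("Adds")
    return " & ".join(kinds) if kinds else "Unchanged"


# Hypothetical (adds, deletes, mods) line counts for four files.
files = [(10, 0, 0), (3, 0, 2), (1, 1, 1), (0, 0, 0)]
tally = Counter(change_combination(a, d, m) for a, d, m in files)
print(tally["Mods & Adds"])  # 1
```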

Page 14: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Distribution of change combinations, all check-ins, all releases: Provisioning system

Number of Files

• Mods: 683

• Deletes: 296

• Adds: 597

• Mods & Deletes: 168

• Mods & Adds: 1894

• Deletes & Adds: 126

• M & D & A: 2625

Page 15: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Average lines touched for each combination of changes

Average Lines touched

• Mods: 4

• Deletes: 5

• Adds: 21

• Mods & Deletes: 23

• Mods & Adds: 37

• Deletes & Adds: 21

• M & D & A: 210

Page 16: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Faults per file, changed files only: Provisioning system

Faults per File

• Mods: 0.19

• Deletes: 0.04

• Adds: 0.3

• Mods & Deletes: 0.36

• Mods & Adds: 0.55

• Deletes & Adds: 0.5

• M & D & A: 1.38

Page 17: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Fault prediction models

•Univariate models

•Base model: log(KLOC), File age, File type

•Augmented models:

• Previous Changes

• Previous {Adds / Deletes / Mods}

• Previous Adds + Deletes + Modifications

• Previous {Adds / Deletes / Mods} / LOC (relative churn)

• Previous Developers

Page 18: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Fault-percentile averages for univariate predictor models: Provisioning system (best result from raw variable, square root, fourth root)

[Bar chart: FPA, univariate models. Axis: 0% to 100%. Bars, in the order listed: log(KLOC), Prior Changes, Prior Adds+Deletes+Mods, Prior Developers, Prior Lines Added, Prior Lines Modified, Prior Changed, Prior Faults, Prior Lines Deleted, Language, Age, Standard Model. Bar values are not present in the extracted text.]

Page 19: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Base Model 1

• KLOC

• File age (number of releases)

• File type (C,C++,java,sql,make,sh,perl,...)

Page 20: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Base Model 1, and added variables

• Base model 1

• KLOC

• File age (number of releases)

• File type (C,C++,java,sql,make,sh,perl,...)

[Bar chart: Mean FPA, Provisioning System. Axis: 89 to 94. Bars, in the order listed: Base 1, prev-prev changes, prev-deletes, prev-mods, prev-changed, prev-adds, prev-developers, (prev-adds,dels,mods)/LOC, prev-adds,dels,mods, prev-changes, Standard Model. Bar values are not present in the extracted text.]

[Bar chart: Mean FPA, Utility System. Axis: 87 to 93. Bars, in the order listed: Base 1, prev-prev changes, prev-deletes, prev-mods, prev-changed, prev-adds, prev-developers, prev-adds,dels,mods, prev-changes, Standard Model. Bar values are not present in the extracted text.]

Page 21: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Base Model 2

• KLOC

• File age (number of releases)

• File type (C,C++,java,sql,make,sh,perl,...)

• (Previous changes)^(1/2), i.e. the square root of the number of previous changes
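A minimal sketch of assembling one file's predictor row for a model like Base Model 2. The function name and argument layout are hypothetical. Taking the log of KLOC and encoding file type as 0/1 indicator columns with the first type as the baseline are assumptions, not details stated on the slide.

```python
import math


def base_model_2_features(kloc, age, file_type, prev_changes, file_types):
    """Return [log(KLOC), age, sqrt(previous changes), file-type indicators...]."""
    row = [math.log(kloc), float(age), math.sqrt(prev_changes)]
    # One indicator column per file type, skipping the first (baseline) type.
    row += [1.0 if file_type == t else 0.0 for t in file_types[1:]]
    return row


types = ["java", "c", "sql"]  # hypothetical list of file types
print(base_model_2_features(1.0, 3, "c", 4, types))  # [0.0, 3.0, 2.0, 1.0, 0.0]
```

The square root compresses large change counts, so a file with 16 previous changes contributes 4 rather than 16 to the linear predictor.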

Page 22: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Base Model 2, and added variables

• Base model 2

• KLOC

• File age (number of releases)

• File type (C,C++,java,sql,make,sh,perl,...)

• (Previous changes)^(1/2), i.e. the square root of the number of previous changes

[Bar chart: Mean FPA, Provisioning System. Axis: 93.2 to 93.55. Bars, in the order listed: Base 2, prev-changed, prev-deletes, (prev-adds,dels,mods)/LOC, prev-developers, prev-mods, prev-adds, prev-adds,dels,mods, prev-prev changes. Bar values are not present in the extracted text.]

Page 23: Promise 2011: "Does Measuring Code Change Improve Fault Prediction?"

Summary

• Churn can be an effective aid for improving fault prediction

• {Adds+Deletes+Mods} improves the accuracy of a model that doesn’t include any change information

BUT

• a simple count of prior changes slightly outperforms {Adds+Deletes+Mods}

• Prior changed is nearly as good as either, when added to a model without change info

• Lines added is the most effective single predictor

• Lines deleted is the least effective single predictor

• Relative churn is no better than absolute churn for predicting total fault count