
Machine Learning Project Final Report: Bug Localization using Classification for Behavior Graph

Fuyao Zhao
School of Computer Science
Carnegie Mellon University
[email protected]

Abstract

Bug localization is a widely studied problem in program analysis, and several studies have introduced machine learning techniques such as graph classification to aid the analysis of software. These methods typically generate two sets of graphs from different runs; knowing which runs produced correct outputs, a graph classifier can be built and its performance evaluated. If checkpoints are generated for each program, partial graphs can be built at each checkpoint. By examining the boost in classification performance at each checkpoint, a set of bug-like functions can be reported to help the programmer identify and fix them. In this project, the above idea is implemented. In addition, a weighted subgraph model of the behavior graph is built, and a technique is proposed to reduce the number of checkpoints and increase the consistency of the results, improving both the classification performance and the localization performance.

1 Introduction

Debugging is a painful process for every programmer, and as software grows in size it becomes harder and harder to find the bugs in a program. Many approaches have been proposed for computer-aided debugging and testing, and some studies [6, 5, 2, 1] integrate machine learning techniques to identify buggy lines. Brun and Ernst [1] use dynamic invariant detection to generate program properties and build a fault-invariant classifier. Cheng, Lo, and Zhou [2] mine the top-k discriminative graphs to find bug information. Liu et al. [5] also take a graph mining approach, focusing on backtracing non-crashing bugs such as logic errors; their framework can be divided into two parts:

1. Graph classification:

   (a) Generate graphs from different executions of the program.
   (b) Mine the graph set to find closed frequent subgraphs, used as features in addition to edges.
   (c) Assign values to these features for each graph.
   (d) Build an SVM classifier and perform cross-validation.

2. Bug localization:

   (a) For each function, assign two checkpoints; at each checkpoint, generate a set of partial graphs representing the stages of the program runs, where each partial graph corresponds to a test case running up to the checkpoint.


   (b) For each set of partial graphs, build a graph classifier as in part 1 and evaluate its performance.

   (c) By examining the classification performance boost after each function is executed, report a set of possible buggy functions.

In the rest of the report, Section 2 discusses the method used in each step in detail, including the weighted graph model and checkpoint reduction. Section 3 presents a series of experiments, analyzing which methods are better and how to choose parameters. We conclude our study in Section 4.

2 Method

2.1 Graph Generation

[5] uses behavior graphs to describe the runs. A behavior graph is shown in Figure 1.a; it contains two parts: the call flow graph (CFG) in Figure 1.b and the transition graph in Figure 1.c. Functions are represented as nodes, solid arrows are calls, and dashed arrows are transitions.

(Figure 1 panels (a)-(c); node legend: 1: makepat, 2: esc, 3: addstr, 4: getccl, 5: dodash, 6: in_set_2, 7: stclose.)

Figure 1: Behavior Graph, Call Flow Graph and Transition Graph.

However, we should note that transition edges are not independent of call edges. For example, if the call edge (1, 3) in Figure 1.b did not exist, then the transition edge (4, 3) in Figure 1.c would not exist either. So the call graph may sometimes be more useful than the behavior graph, since it contains the majority of the program's information and generates fewer features, which means we may need fewer training examples. Although we use the behavior graph in this project, Section 3 makes an empirical comparison between the two kinds of graphs: the behavior graph has better classification results, but the CFG yields fewer features.
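To make the construction concrete, here is a minimal sketch of how a behavior graph could be assembled from a run trace. The Event format, the field names, and the single-invocation simplification are illustrative assumptions, not the instrumentation actually used in this project.

    from collections import namedtuple

    # kind is "call" or "return"; caller and callee are function names.
    Event = namedtuple("Event", ["kind", "caller", "callee"])

    def behavior_graph(trace):
        """Build (call_edges, transition_edges) from a list of Events.

        Call edge (u, v): u calls v (solid arrows in Figure 1).
        Transition edge (v, w): w is called immediately after v returns,
        where v and w share the same caller (dashed arrows in Figure 1).
        For simplicity this assumes each caller is invoked only once.
        """
        call_edges, trans_edges = set(), set()
        last_returned = {}  # caller -> callee that most recently returned to it
        for ev in trace:
            if ev.kind == "call":
                call_edges.add((ev.caller, ev.callee))
                if ev.caller in last_returned:
                    trans_edges.add((last_returned.pop(ev.caller), ev.callee))
            else:  # "return"
                last_returned[ev.caller] = ev.callee
        return call_edges, trans_edges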

2.2 Subgraph Extraction

Let G′ ⊆ G denote that G′ is a subgraph of G, let D be the dataset of graphs, and let support(g) denote the frequency with which g appears in D. A subgraph g is frequent if support(g) > threshold, and a frequent subgraph is closed if there is no supergraph g′ with support(g′) = support(g). We also require subgraphs to be connected. By a naive depth-first search, we can obtain the set of closed frequent subgraphs. Figure 2 shows a dataset of behavior graphs, and Figure 3 shows two subgraphs, of which the first is a closed frequent subgraph and the second is not.
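Because nodes in a behavior graph are uniquely labelled by function name, checking whether one graph is a subgraph of another reduces to edge-set containment, which keeps the naive search simple. The sketch below, under that assumption and with a hypothetical data layout (each graph a frozenset of directed edges), grows connected frequent edge sets by DFS and then filters for closedness; it is not the report's exact miner.

    def support(edges, dataset):
        # Fraction of graphs in the dataset whose edge set contains `edges`.
        return sum(edges <= g for g in dataset) / len(dataset)

    def connected(edges):
        # The edge set must form a single component (direction ignored).
        nodes = {v for e in edges for v in e}
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend(w for a, b in edges if u in (a, b)
                         for w in (a, b) if w != u)
        return seen == nodes

    def closed_frequent_subgraphs(dataset, min_sup):
        # DFS growth over connected edge subsets, pruned by the
        # anti-monotonicity of support.
        all_edges = sorted(set().union(*dataset))
        found = {}
        def grow(current):
            for e in all_edges:
                cand = frozenset(current | {e})
                if e in current or cand in found or not connected(cand):
                    continue
                sup = support(cand, dataset)
                if sup >= min_sup:
                    found[cand] = sup
                    grow(set(cand))
        for e in all_edges:
            single = frozenset({e})
            sup = support(single, dataset)
            if sup >= min_sup:
                found[single] = sup
                grow({e})
        # Keep only closed subgraphs: no frequent supergraph of equal support.
        return {g: s for g, s in found.items()
                if not any(g < h and s == t for h, t in found.items())}

On the three-graph dataset of Figure 2 with min_sup around 66.6%, this would recover subgraphs like those in Figure 3 and keep only the first as closed.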



Figure 2: Behavior Graph dataset.


Figure 3: Frequent subgraphs.

2.3 Graph Classification

In [5], all edges are used as features. If a graph has a particular edge, the corresponding feature value is 1, otherwise 0. Subgraph features are defined in a similar way.

In this project, a weighted graph model is used instead. For each edge e = (u, v) in a graph g, its weight is

    w(e) = c_e / Σ_{e′∈g} c_{e′},

where c_e is defined as the number of times u calls v. The weight of a subgraph g′ of g is defined as

    w(g′) = Σ_{e∈g′} w(e).

This weight is meaningful because it can be seen as a measure of how important g′ is within g. Section 3 gives a comparison of the two models.
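A small sketch of this feature construction follows, assuming (hypothetically) that each run is summarized as a dict mapping edges to call counts c_e and that the mined subgraphs are frozensets of edges.

    def edge_weights(call_counts):
        # w(e) = c_e normalized by the total call count in the graph.
        total = sum(call_counts.values())
        return {e: c / total for e, c in call_counts.items()}

    def feature_vector(call_counts, subgraphs):
        # One feature per mined subgraph g': w(g') = sum of w(e) over its
        # edges if the run's graph contains g', else 0 (generalizing the
        # 0/1 edge encoding of [5]).
        w = edge_weights(call_counts)
        return [sum(w[e] for e in sg) if all(e in w for e in sg) else 0.0
                for sg in subgraphs]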

To evaluate classification performance, accuracy alone is not enough, since we have few incorrect runs. We use recall and precision, combined into the F-Score, defined as:

    recall = #successfully classified incorrect runs / #incorrect runs

    precision = #successfully classified incorrect runs / #all runs classified as incorrect

    F-Score = 2 · precision · recall / (precision + recall)
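For reference, a direct implementation of these three measures, treating incorrect runs as the positive class (a sketch; the label values and the guards against empty denominators are our own conventions):

    def f_score(true_labels, predicted, positive="incorrect"):
        # Recall, precision and F-Score with incorrect runs as positives.
        pairs = list(zip(true_labels, predicted))
        tp = sum(t == positive and p == positive for t, p in pairs)
        fn = sum(t == positive and p != positive for t, p in pairs)
        fp = sum(t != positive and p == positive for t, p in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        return 2 * precision * recall / denom if denom else 0.0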

2.4 Checkpoint Generation

In [5], each function F_i has two checkpoints, B_in^i and B_out^i, corresponding to the entrance and the exit of F_i. Let P_in^i be the classification performance at B_in^i, and P_out^i the performance at B_out^i. A set of graphs can be generated at each checkpoint, representing the test cases running up to that checkpoint.

Therefore, given a graph of n nodes, we would have to generate 2n graph sets and run graph classification on each. In this project, a more efficient method is proposed. Let F_j be the first function called by F_i, and F_k the last function called by F_i; we define

    P_in^i = P_in^j,    P_out^i = P_out^k.

We call a function node that calls no other function an ender; note that an ender may have multiple callers. We then only need to generate a few checkpoints, each uniquely identified by a (caller, ender) pair. By walking over the behavior graph of an incorrect run, we can assign a classification score to each checkpoint. The advantage is not just the reduced number of checkpoints: it also reduces inconsistency in the results as much as possible. For example, we can never have P_in^j < P_in^i where F_i calls F_j, whereas this could happen in the original method due to classifier inaccuracy. Moreover, in the original method, P_in^i − P_out^i is always 0 when F_i is an ender, even though an ender might itself be buggy. Figure 4 shows how checkpoints are generated from the behavior graph representation of the program.

(Figure 4 panels: (a) nodes {1, 2}; (b) nodes {1, 3, 2}; (c) nodes {1, 3, 4, 2}; (d) nodes {1, 3, 4, 5, 2}.)

Figure 4: Checkpoints generated from Figure 1. The four checkpoints are generated by depth-first traversal of the behavior graph: each time we walk down to an ender, we generate the subgraph containing all nodes visited so far.
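A compact sketch of this traversal, assuming (hypothetically) that the call structure is given as a mapping from each function to its callees in call order:

    def checkpoints(callees, root="main"):
        # DFS over call edges; whenever an ender (a function with no
        # callees) is reached, emit a (caller, ender) checkpoint whose
        # graph contains every node visited so far.
        visited, points = [], []
        def dfs(f, caller):
            if f not in visited:
                visited.append(f)
            if not callees.get(f):
                points.append(((caller, f), list(visited)))
            for g in callees.get(f, []):
                dfs(g, f)
        dfs(root, None)
        return points

Applied to the behavior graph of Figure 1, this would produce growing node sets like the four panels of Figure 4.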

2.5 Bug-like Function Detection

We now define a function F_i to be bug-like if P_out^i − P_in^i > θ. This can be interpreted as: if the classification performance is boosted by more than θ after the function executes, then the function is bug-like. This gives a set of bug-like functions, and if we choose an incorrect run, they can be lined up along the graph to form a backtrace, as sketched after this section.

It should be noted that this is the most difficult part of the project, because it depends heavily on the graph classification part. If the classifier's performance is not as high as desired, the bug localization results may be meaningless. In addition, the number of test examples differs between checkpoints, since some nodes exist in some graphs but not in others; consequently, checkpoints that should have high classification performance may fail to achieve it for lack of training data, and a later checkpoint may score lower than an earlier one. However, we have already reduced such inconsistency as much as possible by reducing the number of checkpoints.
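The detection rule itself is a one-liner over the per-checkpoint scores; the sketch below also walks caller links to line the flagged functions up into a backtrace. The data shapes and the single-caller simplification are assumptions for illustration.

    def bug_like(perf, theta=0.10):
        # perf maps function -> (P_in, P_out); flag the boosted functions.
        return [f for f, (p_in, p_out) in perf.items() if p_out - p_in > theta]

    def backtrace(caller_of, start):
        # Follow caller links from a flagged function up to the root,
        # assuming one caller per function for simplicity.
        path = [start]
        while path[-1] in caller_of:
            path.append(caller_of[path[-1]])
        return list(reversed(path))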

3 Experiments

We use the replace program from the Siemens suite as test data. It is a regular expression matching and replacement program and is widely used in program analysis. It has one correct source file and 32 versions containing different bugs, along with 5542 input files. We first run the standard program to obtain a set of correct outputs, then run each other version and compare its outputs with the standard ones. The details of the dataset are shown in Table 1.

Table 1: Dataset, versions 1 ∼ 6.

            correct runs   incorrect runs   correct graphs   incorrect graphs
version 1   5478           64               300              24
version 2   5507           35               238              10
version 3   5414           128              300              12
version 4   5401           141              300              17
version 5   5280           262              302              39
version 6   5459           83               299              5

We use gprof [3] to generate the execution information of each run. For example, we compile the program with

    gcc replace.c -o replace -pg

run it on an input

    ./replace '-?' 'a&' < input1

and generate the run information with

    gprof -b replace gmon.out
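Labelling the thousands of runs can be scripted around these commands; below is a sketch of the output-comparison step, where the per-test-case argument lists, file paths, and helper names are hypothetical.

    import subprocess

    def run(binary, argv, input_path):
        # Execute one test case and capture its stdout.
        with open(input_path) as f:
            return subprocess.run([binary, *argv], stdin=f,
                                  capture_output=True, text=True).stdout

    def label_runs(correct_bin, buggy_bin, cases):
        # cases: list of (argv, input_file) pairs, one per test case.
        # A run is incorrect iff its output differs from the oracle's.
        labels = {}
        for argv, inp in cases:
            same = run(buggy_bin, argv, inp) == run(correct_bin, argv, inp)
            labels[(tuple(argv), inp)] = "correct" if same else "incorrect"
        return labels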

In the classification stage, we use SVMlight [4] with a linear kernel. 5-fold cross-validation is used to evaluate the performance of the classifiers.

3.1 Threshold Selection

The threshold for subgraph extraction is an important parameter. A low threshold means more features, but then more examples may be needed; a high threshold means fewer features, but the classifier may not perform well if too few features are provided. Table 2 shows the experimental results for different thresholds on version 4.

Table 2: Performance for different thresholds.

Threshold   F-Score   Time
0.17        0.17      104.97s
0.33        0.18      19.73s
0.50        0.20      5.88s
0.67        0.44      1.84s
0.83        0.42      1.31s
0.95        0.42      1.13s

3.2 Call Flow Graph vs. Behavior Graph

Table 3 shows the comparison between the behavior graph and the CFG (threshold = 0.65). The behavior graph has a slight advantage over the CFG, especially when incorrect runs are scarce. However, the CFG may still be useful in some classification applications, since it needs fewer features.

Table 3: Behavior Graph vs. CFG. The result for version 2 is pessimistic, since all the incorrect graphs are also in the set of correct graphs.

            Behavior Graph        Call Flow Graph
            F-Score   Features    F-Score   Features
version 1   0.44      103         0.26      57
version 2   0.00      103         0.00      55
version 3   0.37      105         0.48      57
version 4   0.49      105         0.35      57
version 5   0.31      105         0.04      58
version 6   0.17      105         0.00      57

3.3 Weighted Model vs. Unweighted Model

From Table 4 we can see that the weighted model has a higher F-Score than the unweighted model in most versions. Furthermore, the weighted model converges much faster in the SVM.

Table 4: Weighted Model vs. Unweighted Model.

            Weighted           Unweighted
            F-Score   Time     F-Score   Time
version 1   0.44      1.96s    0.27      30.02s
version 2   0.00      0.83s    0.00      7.53s
version 3   0.37      0.22s    0.39      3.02s
version 4   0.49      0.80s    0.35      7.13s
version 5   0.31      3.13s    0.11      72.20s
version 6   0.17      0.16s    0.00      2.00s

3.4 Checkpoint Reduction

The checkpoint reduction trick cuts the number of checkpoints we have to examine. Since classification can be a very time-consuming process, fewer checkpoints save time over the whole procedure.

Table 5: Total checkpoints with checkpoint reduction vs. without checkpoint reduction.

            With CR   Without CR
version 1   16        40
version 2   15        40
version 3   16        40
version 4   16        40
version 5   16        40
version 6   16        40

3.5 Bug localization

Figure 5 shows an incorrect run of version 4, in which the bug appears at line 494 of the program:

    if ((m >= 0) && (/* lastm BUG! */ i != m))

All non-enders are marked with their [P_in, P_out]. If we choose θ = 0.10, the bug tracing procedure is as follows:

• Find all functions with P_out − P_in > 0.10.

• Backtrace these functions in the graph.

We then find two suspicious sequences of functions:

• main → getpat → makepat

• main → change → subline → amatch → patsize → in_pat_set

Thus, the scope for finding the bug is successfully narrowed down, making it easier for the programmer to find the bug in the program.

However, the result of our experiment is not perfect because of the limitations of our graph classifier, which is why P_out < P_in for some functions. The example here is one of the best cases across all the experiments; in some other cases, the performance boost at different checkpoints may not make sense because the final F-Score is itself so low.

(Figure 5 node annotations, reading [P_in, P_out]: main [0, 0.37]; getpat [0, 0.11]; getsub [0.11, 0.17]; change [0.17, 0.37]; makepat [0.00, 0.11]; makesub [0.11, 0.17]; omatch [0.39, 0.38]; amatch [0.17, 0.38]; patsize [0.17, 0.38]; subline [0.17, 0.33]; dodash, getccl, and stclose [0.00, 0.00]; the enders addstr, esc, in_set_2, in_pat_set, locate, putsub, and getline are unannotated.)

Figure 5: Tracing the bug in an incorrect run.


4 Conclusion

In this project, we first generate behavior graphs from program runs and label each graph by comparing the run's output to the standard output. We then classify the graphs using an SVM with a linear kernel under different models. We have shown that the weighted model is superior to the unweighted model because it reflects how important a subgraph is to its supergraph. We have also shown that the behavior graph performs better than the CFG in most cases, though not decisively, so the CFG may still be useful since it needs far fewer training examples.

The most important part of this project is using the boost in graph classification performance to locate bugs. We have shown how to reduce the number of checkpoints and how to backtrack the bug-like functions. However, due to the limited performance of our graph classifier, the bug localization algorithm does not perform well on all program runs.

To improve the classification performance, one option is to use more sophisticated graph mining techniques [7]. Since the number of negative examples in our dataset is so low, other evaluation measures, such as precision at the highest recall [5], may be better suited than ours. In addition, using other SVM kernels may give better results.

References

[1] Yuriy Brun and Michael D. Ernst. Finding latent code errors via machine learning over program executions. In ICSE '04: Proceedings of the 26th International Conference on Software Engineering, pages 480–490, Edinburgh, Scotland, May 26–28, 2004.

[2] Hong Cheng, David Lo, Yang Zhou, Xiaoyin Wang, and Xifeng Yan. Identifying bug signatures using discriminative graph mining. In ISSTA '09: Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, pages 141–152, New York, NY, USA, 2009. ACM.

[3] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: a call graph execution profiler. SIGPLAN Notices, 39:49–57, April 2004.

[4] Thorsten Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, Cambridge, MA, 1999.

[5] Chao Liu, Xifeng Yan, Hwanjo Yu, Jiawei Han, and Philip S. Yu. Mining behavior graphs for backtrace of noncrashing bugs. In Proceedings of the 2005 SIAM International Conference on Data Mining (SDM '05), 2005.

[6] Tao Xie, Suresh Thummalapenta, David Lo, and Chao Liu. Data mining for software engineering. Computer, 42(8):55–62, 2009.

[7] Xifeng Yan, Hong Cheng, Jiawei Han, and Philip S. Yu. Mining significant graph patterns by leap search. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 433–444, New York, NY, USA, 2008. ACM.
