Predicting Fault-Prone Files using Machine Learning


Post on 08-Apr-2017






<ul><li><p>Predicting Fault-Prone Files using Machine Learning: A Java Open Source Case Study</p><p>Priya Krishnan Sundararajan, Ole J. Mengshoel, SureshBabu Rajasekaran, Guido A. Ciollaro and Xinyao Hu</p><p>Carnegie Mellon University, Silicon Valley</p><p>{priya.sundararajan, ole.mengshoel,},, guido</p><p>Abstract</p><p>In this paper, we study 9 open source Java projects, 10 classical software metrics, 6 machine learning algorithms, and the FindBugs software. These Java projects contain a total of 18,985 files and 3 million lines of code. We used the machine learning approaches to estimate the posterior probability of a file being buggy. Our main finding is that classification using decision trees, Naive Bayes, and Bayesian networks performs best.</p><p>1 Introduction</p><p>Bugs account for 40% of system failures (Marcus and Stern 2000). The correctness and dependability of software systems are very important in mission-critical systems. In the past, the way to find bugs in software was by manual inspection or by extensive testing. Nowadays, there are many tool-based approaches (Rutar, Almazan, and Foster 2004), which can be broadly classified as dynamic analysis and static analysis techniques.</p><p>Dynamic analysis techniques discover properties by monitoring program executions for particular inputs; standard testing is the most commonly used form of dynamic analysis (Dillig, Dillig, and Aiken 2010). Static analysis tools (Balakrishnan and Reps 2010) do not require code execution, and the code is tested for all possible inputs. FindBugs (Hovemeyer and Pugh 2004) is a specific example of such a tool, which we use as an Oracle. All these tools are based on detailed analysis of source code files and often incur computational cost or require substantial setup time, especially for complex software, models and algorithms. Some techniques, including most model checkers (Corbett et al. 2000) and theorem provers (Flanagan et al. 
2002), work on mathematical abstractions of source code, which are difficult or time-consuming to develop.</p><p>The purpose of this work is different from and complementary to much previous work on software quality assurance. Instead of attempting to pinpoint the exact location of a known or potential bug, our goal is to establish, by means of machine learning, the "smell" of each file that makes up a software project. Stated more formally, we use the output of a machine learning classifier, which takes a feature vector for each source code file as input, to rank the files based on their probability of being buggy. The lift curves that are induced from the ranking are drawn based on the predictions of the Oracle, which gives the correct class; thus incorrect predictions can easily be identified. We explore methods to predict the number of buggy files, so that the machine learning classifiers can be stopped as soon as the expected number of buggy files is identified. By this analysis we wish not only to investigate first the files that are most likely to be fault-prone, but also to stop a more detailed analysis as soon as the expected number of buggy files has been processed. We call this methodology Strategic Bug Finding (SBF).</p><p>Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved.</p><p>In experiments, we use six machine learning algorithms, namely Naive Bayes, Bayesian network, decision tree (C4.5), radial basis function (RBF), simple logistic and zeroR. We analyze nine open source Java projects using these ML algorithms, as implemented in the Weka tool (Hall et al. 2009). Our main experimental result is that Bayesian networks and decision trees perform better for most of the projects.</p><p>The rest of this article is organized as follows: In section 2 we discuss previous work in the prediction of buggy files. The machine learning algorithms used in our approach are discussed in section 3. 
In section 4 we present our framework and the algorithms that we have used. In section 5, experimental results and the lift charts are shown. Section 6 concludes the paper and presents future work.</p><p>2 Background</p><p>Software metrics are used, for example, to provide measurements of the source code's architecture and data flow. Taking software metrics as features, machine learning classifiers can be used to predict the probability of a file being buggy. This fundamental connection between software quality and machine learning has been explored by several researchers in the past. There are studies that focus on predicting the fault-proneness of software based on structural properties of object-oriented code (Arisholm, Briand, and Johannessen 2010). Code change history, along with dependency metrics, has been used to predict fault-prone modules (Nagappan and Ball 2007) (Kim, Whitehead, and Zhang 2008). A number of process measures in addition to structural measures have also been used (Weyuker, Ostrand, and Bell 2007). It has been argued that performance measures should always take into account the percentage of source code predicted as defective, at least for unit testing and code reviews (Mende and Koschke 2009). The cost of predicting a source file as defective during manual testing is proportional to the size of the source code; this is often neglected (Arisholm, Briand, and Fuglerud 2007).</p></li><li><p>The quality and quantity of training data (software metrics) have a direct impact on the learning process, as different machine learning methods have different criteria regarding training data. Some methods require large amounts of data, other methods are very sensitive to the quality of data, and still others need both training data and a domain theory (Witten and Frank 2005). 
An interesting comparison of ant colony optimization versus well-known techniques like C4.5, support vector machine (SVM), logistic regression, K-nearest neighbour, RIPPER and majority vote has been performed (Vandecruys et al. 2008). In terms of accuracy, C4.5 was the best technique. An experimental evaluation of neural networks versus SVMs on a simple binary classification problem (i.e. buggy vs. correct) gave these results (Gondra 2008): the accuracy was 87.4% when using SVM compared to 72.61% when using neural networks. The cost-effectiveness and classification accuracy (precision, recall, ROC) of eight data mining techniques have also been evaluated (Arisholm, Briand, and Fuglerud 2007). The C4.5 method was found to outperform the others, including SVM and neural networks. Our results also confirm that C4.5 performs better than other machine learning classifiers. One way to graphically evaluate the performance of the models is to use lift charts, also known as Alberg diagrams (Ohlsson and Alberg 1996).</p><p>3 Machine Learning</p><p>Intuitively, predicting fault-prone files is a simple classification problem: label each file instance as buggy or correct. We can solve this classification problem by learning a classifier that discriminates between the files in the project based on the metrics of the file. A set of labeled training examples is given to the learner, and the learned classifier is then evaluated on a set of test instances.</p><p>The application of machine learning techniques to software bug finding problems has become increasingly popular in recent years. But selecting the best machine learning classifier for a given software project is not possible except in trivial cases. For large software projects, we do not understand the software metrics that affect the performance of a classifier well enough to make predictions with confidence. Several machine learning techniques can be used. 
We use the Naive Bayes, decision tree, Bayesian network, radial basis function, zeroR and simple logistic classifiers. We evaluate the performance of these classifiers using lift charts.</p><p>We use the C4.5 decision tree classifier (Quinlan 1993), which works by recursively partitioning the dataset. In our context, each leaf of a decision tree corresponds to a subset of the available data set (based on the software metrics for the subset of files), and its probability distribution can be used for prediction when all the conditions leading to that leaf are met. The C4.5 decision tree classifier performed better than other modeling techniques like SVM and neural networks (Arisholm, Briand, and Fuglerud 2007).</p><p>In terms of predicting the defects in software, Naive Bayes outperforms a wide range of other methods (Turhan and Bener 2009), (Tao and Wei-hua 2010). Naive Bayes assumes the conditional independence of attributes. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class (Frank et al. 2000). It is therefore interesting to see how it performs on the software bug prediction task under the independence assumption. The Naive Bayes classifier (John and Langley 1995) is used.</p><p>In reality, since these metrics measure various aspects of the same file in a software project, individual metric attributes tend to be highly correlated with each other (known as multicollinearity) (Dick et al. 2004). The Bayesian network classifier facilitates the study of intra-relationships between software metrics and allows one to learn about causal relationships. 
We use the Bayesian network classifier (Bouckaert 2004): the training set is used to build the Bayesian network structure using the learning algorithms, and the predictions on the test set, which separate buggy files from non-buggy ones, are made using the inference algorithms.</p><p>We use the simple logistic classifier (Landwehr, Hall, and Frank 2005) for building linear logistic regression models. The zeroR classifier determines the most common class and tests how well the class can be predicted without considering other attributes. It is used as a lower bound on performance. We use a normalized Gaussian radial basis function network (Buhmann 2003).</p><p>4 Framework and Algorithms</p><p>We consider a set of software projects \(P = \{p_1, p_2, \ldots\}\). Consider one particular project \(p\). It contains a set of (source code) files \(F = \{f_1, f_2, \ldots\}\), where each \(f_i\) is expressed in a programming language. For each (source code) file, multiple metrics are computed. In other words, we have a set of metric functions \(G = \{g_1, g_2, \ldots, g_m\}\). Specifically, a metric function \(g \in G\) is a function from the set of all possible programs into the natural numbers \(\mathbb{N}\). The machine learning algorithms are represented as \(L = \{l_1, l_2, \ldots\}\).</p><p>Using the above set-up, we obtain for each file \(f_i\) a case \((E, c)\), where \(E = (g_1(f_i), g_2(f_i), g_3(f_i), \ldots, g_m(f_i))\), that also contains a discrete bug classification \(c \in C = \{c_1, c_2, \ldots\}\). In the simplest case, \(C = \{0, 1\}\), where 0 means non-buggy while 1 means buggy. Here, \(C\) is computed using a bug-finding algorithm or the Oracle, such as FindBugs. We generate the training set \(E\) for a subset of the projects, excluding the test project. This process is shown in Algorithm 1.</p><p>Algorithm 2 shows the test set generation and the learning phase. Each probabilistic classifier computes a posterior \(\Pr(C \mid E)\), where \(E\) is a case (the set of metrics for the file) but without the classification. Consider a project having \(n\) cases \(E\), of which \(m\) are buggy and \(n - m\) are correct. 
Using a classifier, we can compute \(\Pr(C \mid E)\) for each case. The probabilities are sorted in decreasing order, so that the cases \(E\) with a higher probability of being buggy come first. Algorithms 1 and 2 are run for each project to identify the buggy files in each of them.</p></li><li><pre>Algorithm 1 Training Set
P ← {p_1, p_2, ..., p_M}
F ← {f_1, f_2, ..., f_N}
G ← {g_1, g_2, ..., g_m}
E ← (g_1(f_i), g_2(f_i), g_3(f_i), ..., g_m(f_i), c)
for i = 1 to M in P and i ≠ k do
    for j = 1 to N in F do
        E_ij ← (g_1(f_j), g_2(f_j), ..., g_m(f_j))
        c_ij ← Oracle(E_ij)
        T_training ← T_training ∪ {(E_ij, c_ij)}
    end for
end for</pre><pre>Algorithm 2 Test Set - Project k
L ← {l_1, l_2, ..., l_O}
for l = 1 to O in L do
    for j = 1 to N in F do
        E_kj ← (g_1(f_j), g_2(f_j), ..., g_m(f_j))
        c, Pr(c | E_kj) ← l_k(T_training, E_kj)
        Predicted_test ← Predicted_test ∪ {(E_kj, c, Pr(c | E_kj))}
    end for
end for</pre><p>Lift Curve Analysis</p><p>Lift curves are used to plot the number of source code files on the x-axis against the number of buggy files on the y-axis. These curves are then used as in previous works, where prediction is based on ranking classes with the highest fault prediction probability (Briand and Wüst 2002). We expect the curve to increase sharply in the beginning and then reach a plateau. The fault-prone files are caught while the curve is rising. Testing can be performed up to this stage and stopped when the slope of the curve flattens; alternatively, the expected number of buggy files can be calculated using maximum likelihood estimation.</p><p>Consider a set of software projects¹ \(P = \{p_1, p_2, \ldots\}\). Suppose we wish to analyze one particular project \(p_k\). We will create a training set², \(T_{training}\), using the metrics from projects \(\{p_1, p_2, \ldots, p_M\}\), excluding the ones from \(p_k\). We will use the metrics from project \(p_k\) as our test set, \(T_{test}\). Both the test set and the training set will have their prediction class labeled for each file. This is done using the Oracle.</p><p>The test set predictions are used here to evaluate the predictions of the machine learning classifier. 
The machine learning classifier gives the probability of bugginess given the software metrics, \(\Pr(c \mid E_{ik})\), for each file \(i\) in the test project \(k\). The files are sorted in decreasing order of this probability, so that the files most likely to be buggy come first. While drawing the lift charts, we consider the cumulative count of the actual predictions made by the Oracle for each file instance. We do this counting in the order produced by sorting on the output of the machine learning classifier. For each buggy file \(i\) we add one, and for each non-buggy file we add zero. The formal mathematical definition of the lift curve function \(g(i)\) is shown below:</p><p>\[ g(i+1) = g(i) + \delta(i+1), \qquad g(0) = 0 \qquad (1) \]</p><p>\[ \delta(i) = \begin{cases} 0 \text{ if } \Pr(c \mid E_{ik}) &lt; 0.5 \text{ and } \mathrm{Oracle}(E_{ik}) = \text{correct}, \\ 1 \text{ if } \Pr(c \mid E_{ik}) \geq 0.5 \text{ and } \mathrm{Oracle}(E_{ik}) = \text{buggy} \end{cases} \]</p><p>¹ Note that this could be a very large number of projects. ² The training set could be a subset of the projects.</p><p>Thus only the ranking is taken from the classifier, and the lift curves are generated based on the predictions of the Oracle. In the early stage, we expect both the classifier and the Oracle to predict only buggy files. Thus the lift curve is expected to rise steeply in the initial stage. If there is a mismatch at this stage, then the lift curve will flatten out, showing the weakness of the classifier. But we expect the lift curve to flatten out after a certain period, when there are no more buggy files to predict.</p><p>Figure 1: Lift Chart</p><p>Consider a project containing 1000 files, of which 200 are buggy. As shown in Figure 1, the ideal situation would be that all bugs are found within the first 200 files, as shown by the best-case curve. On the other hand, the worst situation would be that all bugs are found within the last 200 files (worst-case curve). By sorting files by their probability of being buggy, a result similar to the average-case curve would be obtained, depending on the accuracy of the ML algorithm. 
The better a classifier is trained, the closer to the best-case curve the result of using it will be.</p><p>One major...</p></li></ul>
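<p>The lift curve construction described in the Lift Curve Analysis section can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper uses Weka), and the names lift_curve, best_case and worst_case are hypothetical. It assumes each file is given as a pair of the classifier's probability of being buggy and the Oracle's label (1 for buggy, 0 for correct), and it follows the prose rule of adding one for every file the Oracle marks as buggy, in the classifier's ranking order.</p>

```python
from typing import List, Tuple


def lift_curve(cases: List[Tuple[float, int]]) -> List[int]:
    """Cumulative count of Oracle-buggy files in classifier ranking order.

    cases holds (probability_of_buggy, oracle_label) pairs, with
    oracle_label equal to 1 for buggy and 0 for correct (an assumed encoding).
    """
    # Rank files by decreasing predicted probability of being buggy.
    ranked = sorted(cases, key=lambda case: case[0], reverse=True)
    curve = [0]  # g(0) = 0
    for _, oracle_label in ranked:
        # g(i + 1) = g(i) + 1 if the Oracle says buggy, else g(i) + 0.
        curve.append(curve[-1] + oracle_label)
    return curve


def best_case(n_files: int, n_buggy: int) -> List[int]:
    """Ideal curve: every buggy file is ranked ahead of every correct file."""
    return [min(i, n_buggy) for i in range(n_files + 1)]


def worst_case(n_files: int, n_buggy: int) -> List[int]:
    """Worst curve: every buggy file is ranked after every correct file."""
    return [max(0, i - (n_files - n_buggy)) for i in range(n_files + 1)]


if __name__ == "__main__":
    # Five hypothetical files: (classifier probability, Oracle label).
    example = [(0.9, 1), (0.2, 0), (0.7, 0), (0.6, 1), (0.1, 0)]
    print(lift_curve(example))
    # The 1000-file, 200-buggy project used in the Figure 1 discussion.
    print(best_case(1000, 200)[200], worst_case(1000, 200)[800])
```

<p>For the 1000-file example, best_case reaches 200 after the first 200 files, while worst_case stays at zero until file 800. A real classifier's curve from lift_curve lies between these two bounds.</p>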