

Source: users.encs.concordia.ca/~eshihab/pubs/Rosen_FSE2015Tool.pdf

Commit Guru: Analytics and Risk Prediction of Software Commits

Christoffer Rosen, Ben Grawi
Department of Software Engineering
Rochester Institute of Technology
Rochester, NY, USA
{cbr4830, bjg1568}@rit.edu

Emad Shihab
Department of Computer Science and Software Engineering
Concordia University
Montreal, QC, Canada
[email protected]

ABSTRACT

Software quality is one of the most important research sub-areas of software engineering. Hence, a plethora of research has focused on the prediction of software quality. Much of the software analytics and prediction work has proposed metrics, models and novel approaches that can predict quality with high levels of accuracy. However, adoption of such techniques remains low; one of the reasons for this low adoption of current analytics and prediction techniques is the lack of actionable and publicly available tools.

We present Commit Guru, a language-agnostic analytics and prediction tool that identifies and predicts risky software commits. Commit Guru is publicly available and is able to mine any Git SCM repository. Analytics are generated at both the project and commit levels. In addition, Commit Guru automatically identifies risky (i.e., bug-inducing) commits and builds a prediction model that assesses the likelihood of a recent commit introducing a bug in the future. Finally, to facilitate future research in the area, users of Commit Guru can download the data for any project processed by Commit Guru with a single click. Several large open source projects have been successfully processed using Commit Guru. Commit Guru is available online at commit.guru. Our source code is also released freely under the MIT license.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

General Terms

Software Quality Analysis

Keywords

Software Analytics, Software Metrics, Risky Software Commits, Software Prediction

1. INTRODUCTION

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2015 ACM ESEC/FSE'15 ...$15.00.

The increasing importance and complexity of software systems in our daily lives make their quality a critical, yet extremely difficult, issue to address. Estimates by the US National Institute of Standards and Technology (NIST) stated that software faults and failures cost the US economy $59.5 billion a year [1].

Therefore, a large amount of recent software engineering research has focused on software analytics and the prediction of software quality, where code and/or repository data (i.e., recorded data about the development process) is used to provide software analytics and predict where defects might appear in the future (e.g., [2, 3]). In fact, a recent literature review shows that in the past decade more than 100 papers were published on software defect prediction alone [4]. In addition to risk prediction, there has also been growing research concentrated on conflict resolution (e.g., [5, 6]).

Nevertheless, the adoption of software prediction in practice has been a challenge [7-9]. One of the reasons for this limited adoption is the lack of tools that incorporate the state-of-the-art analytics and prediction techniques [9].

To address this limitation, we present a tool called Commit Guru that provides commit-level analytics and predictions. Commit Guru builds on the state-of-the-art analytics and prediction work to provide developers and managers with the risk levels of their commits. To derive the risk values, Commit Guru analyzes and presents a number of metrics (presented in our earlier work [10, 11]) that help explain how the risk value is determined. Commit Guru is fully automated, updates daily and is capable of analyzing any publicly accessible git repository. Finally, Commit Guru provides a full data dump of all the commits, along with their associated metrics, to facilitate research efforts in the area of software quality analytics and prediction.

2. A WALKTHROUGH OF COMMIT GURU

The main goal of Commit Guru is to provide developers and managers with analytics and predictions about their risky commits. Similar to prior work, we consider a risky commit to be one that introduces a bug in the future [11, 12]. Commit Guru can be used by developers, managers and gatekeepers to identify commits that have introduced bugs. An overview of Commit Guru is shown in Figure 1. Commit Guru is composed of a backend, which is responsible for all of the analysis of the commits, and a front-end, which is responsible for the visualization of the analyzed data. To illustrate how the backend and front-end work together to support Commit Guru, we provide a brief walkthrough in this section.

2.1 Front-end of Commit Guru

When a user visits commit.guru, they first see the homepage shown in Figure 2. From the homepage, a user can view repositories that have already been analyzed or analyze a new repository.



Figure 1: Overview of Commit Guru

To analyze a git repository, a user enters the URL of a publicly accessible Git repository and (optionally) their e-mail address (numbered 1 in Figure 2). The e-mail address is only used to notify the user when their repository has been fully analyzed. By default, all submitted repositories are viewable by all visitors of Commit Guru; however, a user can opt out of making their analyzed repository publicly viewable. A user also has the option of creating an account to keep track of all of his or her repositories.

Once a repository has been fully analyzed, we present it in the recent repositories section (numbered 2 in Figure 2). Any visitor of Commit Guru can click on the repository and quickly get an overview of different analytics about the repository. These include:

• Project commit statistics, which show the total number of commits and the percentage of commits that are risky (and not risky).

• Median metric values, which show the median values for the risky and non-risky commits. For example, if the median number of lines of code (LOC) added is 100 lines for risky commits and 50 lines for non-risky commits, then they are presented as such. The values for the risky commits are shown in red, whereas the values for the non-risky commits are shown in green. We also measure whether the difference in medians between the risky and non-risky commits is statistically significant. Statistically significant differences are highlighted with an asterisk '*' next to the metric's name.

In addition to viewing the different analytics at the project level, a user can view commit-level analytics. Clicking on the 'commits' tab (labeled number 1 in Figure 3) allows the user to view all of the individual commits of the project. In this tab, there are two views: historical and predictive data. A user can switch views by modifying the display dropdown (labeled number 2 in Figure 3). The historical data view shows all commits in the repository and highlights in red those determined to be bug-inducing (i.e., risky). These commits, once expanded, provide a link to the commit that fixed the bug they introduced. Users can also apply a number of filters, e.g., filter commits by a specific developer or order the commits based on their age. The predictive data view shows only the commits made in the last three months. In this view, we leverage a regression model to evaluate the riskiness of each commit.

In both the predictive and historical data views, a user can click on an individual commit to view specific metric values (labeled number 4 in Figure 3). For each commit, we present the value of each metric for that specific commit. If the value of the metric is below the median of the non-risky commits, it is coloured in green; if it is above the median of the risky commits, it is coloured in red; and if it falls between the two medians, it is coloured in yellow. Users can also apply a number of filters, e.g., filter commits by a specific developer or order the commits based on some criteria. Lastly, a user can click on the 'options' tab (labeled 3 in Figure 3) to see when the repository was ingested and last analyzed. Furthermore, a user can obtain a CSV file containing all of the commits and their corresponding metrics.
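The three-way colouring rule described above can be sketched as a small function. This is our own simplification, not Commit Guru's actual code, and the function and parameter names are hypothetical:

```python
# Hypothetical sketch of the commit-view colour rule: a metric value is
# compared against the medians of the non-risky and risky commit sets.

def colour_for_metric(value, median_non_risky, median_risky):
    """Return 'green', 'yellow', or 'red' for a commit's metric value."""
    if value < median_non_risky:
        return "green"    # below the non-risky median: low risk
    if value > median_risky:
        return "red"      # above the risky median: high risk
    return "yellow"       # between the two medians: medium risk
```

For example, with a non-risky median of 50 added lines and a risky median of 100, a commit adding 10 lines would be shown in green and one adding 120 lines in red.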

2.2 Backend of Commit Guru

To accomplish the aforementioned analysis, Commit Guru's backend performs three sequential steps: ingestion, analysis (to calculate analytics and determine historical risky commits), and prediction (of potentially risky commits). The "Commit Guru Manager" tracks the statuses of all projects and adds the appropriate tasks (e.g., ingestion or analysis) to a work queue. A thread pool object maintains a pool of worker threads and assigns tasks from the work queue to available workers. This allows the system to perform multiple tasks in parallel in the background. Below, we describe in a nutshell how the backend of Commit Guru works.
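The manager/work-queue/thread-pool arrangement described above can be sketched with the Python standard library. This is a minimal illustration under our own assumptions (function names and task shapes are hypothetical), not the tool's real implementation:

```python
# Sketch: a manager enqueues tasks; a thread pool services the queue.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

work_queue = Queue()

def worker(task):
    """A worker thread processes one task (ingestion, analysis, ...)."""
    kind, repo = task
    # ... the real ingestion or analysis work would happen here ...
    return f"{kind} finished for {repo}"

def manager_add_task(kind, repo):
    """The manager observes a status change and enqueues the next task."""
    work_queue.put((kind, repo))

def run_pending(pool):
    """Dispatch every queued task to an available worker thread."""
    futures = []
    while not work_queue.empty():
        futures.append(pool.submit(worker, work_queue.get()))
    return [f.result() for f in futures]
```

Because tasks are independent, several repositories can be ingested or analyzed concurrently, which is the point of the pool.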

When a user requests a new repository to be analyzed, the front-end creates a new row in a table called repositories. The repositories table stores the name, location URL, user's email, and creation date, and assigns the repository the "Waiting to be Ingested" status. The "Commit Guru Manager" observes this and adds an ingestion task to the work queue. Our thread pool then pops this task from the work queue and dispatches an available worker thread to start processing it. In this case, the worker starts ingesting the code changes in the repository and classifying them (more details about the classification are provided in Section 2.2.1).
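A hypothetical sqlite3 sketch of the repositories table and the front-end's insert; Commit Guru's real schema, column names, and database engine may differ:

```python
# Illustrative schema for the `repositories` table described in the text.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE repositories (
        id         INTEGER PRIMARY KEY,
        name       TEXT,
        url        TEXT,
        email      TEXT,
        created_at TEXT,
        status     TEXT
    )""")

def submit_repository(conn, name, url, email):
    """Front-end step: a new row starts in the 'Waiting to be Ingested' state."""
    conn.execute(
        "INSERT INTO repositories (name, url, email, created_at, status) "
        "VALUES (?, ?, ?, date('now'), ?)",
        (name, url, email, "Waiting to be Ingested"))
```

The manager then only has to poll this table for rows whose status names the next task to run.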

Once a worker thread has finished the ingestion task, it changes the repository's status in the table to "Waiting to be Analyzed" and saves the ingestion date in the repositories table. The "Commit Guru Manager" notices this and adds a new analysis task to the work queue. A free worker is dispatched and begins linking the fix-inducing commits to the bug-inducing commits. The median values of the metrics for both risky and non-risky commits are also calculated (more details about the analysis step are provided in Section 2.2.2).

When analysis is done, the worker changes the status of the repository to "Waiting to Build Model". The "Commit Guru Manager" again sees this and creates a new thread to build the regression model using the 'R' statistical package. After the models have been built, it stores the regression model's coefficients in a metrics table (more details about the prediction of risky commits are provided in Section 2.2.3). It also creates a complete CSV data dump of all the code changes with their code change measures and whether they were bug-inducing. This file is saved on disk to make it accessible for download by users. Next, it saves the date of the dump into a column in the repositories table. We create new data dumps for each repository every month. As the "Commit Guru Manager" knows the last time a repository was ingested, it will also add tasks to pull new changes every day and update the median values and the model's coefficients.

Figure 2: Commit Guru's Home Page

Figure 3: Commit View

Finally, when all of this is done, the "Commit Guru Manager" changes the repository's status to "Analyzed". The user is also notified via e-mail, using Google's SMTP server with a Gmail account we created for Commit Guru, that the repository has been fully analyzed. The user can now view the riskiness of the project's code commits and download data dumps of this data.
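The repository lifecycle described across these steps can be summarized as a small transition table. The structure below is our own simplification; the status strings follow the text:

```python
# Repository status lifecycle: each completed task advances the repository
# to its next stage.
TRANSITIONS = {
    "Waiting to be Ingested": "Waiting to be Analyzed",
    "Waiting to be Analyzed": "Waiting to Build Model",
    "Waiting to Build Model": "Analyzed",
}

def advance(status):
    """Move a repository to the next stage once its current task finishes."""
    return TRANSITIONS.get(status, status)
```

"Analyzed" is a resting state: the daily update tasks re-run ingestion and analysis but the repository remains available to users throughout.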

2.2.1 Ingestion and Classification

As stated earlier, the first phase a project enters is the ingestion and classification phase. For each new project, we download its repository and revision history, and store all of the project's data locally on the Commit Guru server. Then, we parse each commit in the revision log and record each of the 13 change-level metrics (presented in our earlier work [11]) for each commit. After extracting the commit metrics, we semantically analyze the log message written by the developer in order to classify the type of commit (e.g., corrective, feature addition). We leveraged the prior work of Hindle et al. [13] and use the list of keywords shown in Table 1 to classify commits. If the commit log contains one of the keywords in our defined categories, the commit is labeled as belonging to the corresponding classification. While it is possible for a commit to belong to several categories, we limit each commit to one category, assigned in the following priority order: corrective, feature addition, non functional, perfective, and preventive. For example, if a developer both fixes an issue and adds a new feature, the commit would be labeled only as a corrective change. A similar approach was used by the authors of the SZZ algorithm [14] to determine fixing commits.

Table 1: Words and Categories Used to Classify Commits [13]

Category          Associated Words                               Explanation
Corrective        bug, fix, wrong, error, fail, problem, patch   Processing failure
Feature Addition  new, add, requirement, initial, create         Implementing a new requirement
Merge             merge                                          Merging new commits
Non Functional    doc                                            Requirements not dealing with implementations
Perfective        clean, better                                  Improving performance
Preventive        test, junit, coverage, assert                  Testing for defects
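The keyword classifier described above can be sketched as follows. This is our own illustration of the scheme in Table 1 (a naive substring match over the lowercased message), not the tool's actual code; the placement of "merge" in the priority order is our assumption, since the text lists only the other five categories:

```python
# Keyword-based commit classification after Hindle et al. [13]; categories
# are checked in priority order, so each commit gets at most one label.
CATEGORIES = [
    ("corrective",       ["bug", "fix", "wrong", "error", "fail", "problem", "patch"]),
    ("feature addition", ["new", "add", "requirement", "initial", "create"]),
    ("merge",            ["merge"]),
    ("non functional",   ["doc"]),
    ("perfective",       ["clean", "better"]),
    ("preventive",       ["test", "junit", "coverage", "assert"]),
]

def classify_commit(message):
    """Return the first matching category for a commit log message, if any."""
    text = message.lower()
    for category, keywords in CATEGORIES:
        if any(k in text for k in keywords):
            return category
    return None
```

A message such as "Fix crash and add new feature" matches both the corrective and feature addition word lists, but the priority order labels it corrective only.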

2.2.2 Analysis and Determining Risky Commits

Next, we link the risky (i.e., bug-inducing) commits to the commits that fix them. We link these fixing commits to the risky commits by performing a diff to find which lines of code were modified. Using the annotate/blame function of the SCM tool, we can discover in which previous commits those lines of code were modified. These commits are then marked as potential locations where the bug was introduced, i.e., potential risky commits.
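The linking step can be sketched with the blame lookup abstracted away. The function and parameter names here are hypothetical (the real tool shells out to the SCM's blame command); the point is the mapping from a fixing commit's modified lines to the set of commits flagged as potentially bug-inducing:

```python
# Sketch of SZZ-style linking: lines removed by a fixing commit are traced,
# via blame on the fix's parent, to the commits that last touched them.

def link_bug_inducing(modified_lines, blame):
    """Return the set of potentially bug-inducing commit ids.

    modified_lines: iterable of (file_name, line_number) pairs from the diff
    blame: callable (file_name, line_number) -> commit id on the fix's parent
    """
    return {blame(f, n) for f, n in modified_lines}
```

In practice `blame` would wrap something like `git blame <fix>^ -L n,n -- file`; here it is injected so the linking logic can be exercised on its own.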

To perform the diffs on each fixing commit in a project, we pipe the output of the diff into bash and echo each line back with our own delimiters at the beginning and end of each line. We do this because it is not sufficient to simply separate lines by the newline character, since a line of code can contain arbitrary newline characters. We split this output into regions, one per modified file. The first line of a region contains information about the file name, which we collect. We then check whether the file name's extension is in our list, compiled from Wikipedia, of 144 source code and script file extensions [15]. We do not consider other types of files because a change in a document file is unlikely to be the source of a bug. Including other file types may also confuse the algorithm, as it would treat code churn and documentation changes as the same type of event. Additionally, it slows down our algorithm significantly for projects that include large change logs in their updates, as the algorithm has to mark each modified line individually.

Next, we look for our delimiter to find the line that contains information about the lines of code modified and the code diff. Each region can have multiple delimiters, and we refer to these as sections of the region. We initially divide the region into two parts: the first contains the file information and the second contains the information about the modified lines of code followed by the code chunk. We split the second part once more, separating the information about the lines of code from the code chunk. From the information about the modified lines of code, we extract the line number at which the diff begins and mark it as the current line number. We then examine the code diff: if a line begins with a '-' character, we can be certain it was a code modification, so we record the line number and its corresponding file name and increment the current line number. We do this because code additions that did not exist in previous versions of the file may shift the line numbers. After this has been completed for all sections in the region, we perform the SCM blame/annotate function on all modified lines of code, for their corresponding files, on the fixing commit's parent. This returns the identifiers of the commits that previously modified those lines. We determine these commits to have introduced the bugs corrected by the fixing commit and mark them. Once we have successfully linked a fixing commit to a bug-inducing commit, we mark the fixing commit as successfully linked so as not to redo this work every time the repository is analyzed. Only new fixing commits need to be linked, as we have already found which change(s) introduced the bugs fixed by previously linked commits.
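The line-tracking logic above (find the hunk start, record removed lines, and keep the old-file line counter in step) can be sketched over plain unified diff output. Commit Guru's real parser uses its own delimiters piped through bash, as described; this simplified stdlib version parses standard `git diff` text instead and skips the extension filter:

```python
import re

def removed_lines(diff_text):
    """Collect (file, old line number) pairs for lines a unified diff removes.

    These are the lines the blame step then looks up on the fixing commit's
    parent. Simplified sketch: assumes well-formed unified diff output.
    """
    pairs, current_file, line_no = [], None, 0
    for line in diff_text.splitlines():
        if line.startswith("--- "):
            current_file = line[4:]
            if current_file.startswith("a/"):
                current_file = current_file[2:]
        elif line.startswith("@@"):
            # hunk header "@@ -start,count +start,count @@":
            # reset the old-file line counter
            line_no = int(re.match(r"@@ -(\d+)", line).group(1))
        elif line.startswith("-"):
            pairs.append((current_file, line_no))
            line_no += 1
        elif not line.startswith("+"):
            line_no += 1  # context lines advance the old-file counter too
    return pairs
```

Only '-' lines and context lines advance the old-file counter; added '+' lines do not exist in the parent, which is exactly why the counter must skip them before blame is run.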

Once the fixing and the risky commits are labeled, we calculate the 13 commit-level metrics derived in our prior work [11]. For each metric, we calculate its median value for two sets of commits: risky and non-risky. We also perform a Wilcoxon test to ensure that the difference in medians is statistically significant (i.e., p-value < 0.05). Then, we use the median values to present the project-level analytics and to give the user some perspective on where each commit falls relative to the sets of risky and non-risky commits. For example, if the risky commits have a median LOC added of 100 lines, whereas the non-risky set has a median of 20 lines, then a commit with 10 added lines is mostly safe (as far as this one metric is concerned).
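The per-metric computation can be sketched with the standard library. The rank-sum implementation below is our own stand-in for the Wilcoxon test named above (normal approximation, no tie correction), so treat it as an illustration rather than production statistics; the sample data is invented:

```python
# Medians per commit group plus a significance star, as in the project view.
from statistics import NormalDist, median

def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) p-value, normal approx."""
    labeled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    r_a = sum(i + 1 for i, (_, grp) in enumerate(labeled) if grp == 0)
    n1, n2 = len(a), len(b)
    u = r_a - n1 * (n1 + 1) / 2          # Mann-Whitney U for group a
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Invented example: LOC added for risky vs. non-risky commits
risky_loc = [120, 90, 150, 200, 80, 130, 95, 160, 110, 140]
non_risky_loc = [10, 25, 5, 40, 15, 30, 20, 35, 12, 18]
medians = (median(risky_loc), median(non_risky_loc))
label = "LOC added" + ("*" if rank_sum_p(risky_loc, non_risky_loc) < 0.05 else "")
```

With the two groups fully separated as above, the difference is significant and the metric name would be starred in the UI.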

2.2.3 Building Predictive Models

One main criterion/assumption that needs to be satisfied for SZZ (and our algorithm that determines risky commits) to work is that enough time has elapsed since a commit for it to have had the chance to be fixed. Therefore, we also build a predictive model that trains on historical data and makes predictions for commits performed in the last three-month period.

Our prediction model is a logistic regression model, although in theory we could apply any other model. We build the prediction model incrementally by adding one metric to the model at a time. If the metric is statistically significant and does not cause any previously added metric to lose significance, it is added to the model. We repeat this for all 13 metrics in the following order: lines added, lines deleted, lines total, no. subsystems, no. directories, no. files, no. developers, age, no. unique changes, experience, recent experience, subsystem experience, and entropy. Once the model is built, we use its coefficients to predict the risk of the commits made in the last three months. When a user submits a repository to be analyzed, they can see the predicted risk of any commit made in the last three months using the predictive view of Commit Guru.
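The forward, one-at-a-time selection described above can be sketched with the model fit abstracted away. Here `p_values` is a hypothetical stand-in for refitting the logistic regression (the real tool uses R) and reading a p-value per included metric; injecting it lets the selection logic itself be exercised:

```python
# Sketch of the incremental metric selection used to build the model.
METRIC_ORDER = [
    "lines added", "lines deleted", "lines total", "no. subsystems",
    "no. directories", "no. files", "no. developers", "age",
    "no. unique changes", "experience", "recent experience",
    "subsystem experience", "entropy",
]

def select_metrics(p_values, alpha=0.05):
    """Add each metric in order; keep it only if it is significant and does
    not make any previously kept metric insignificant."""
    kept = []
    for metric in METRIC_ORDER:
        candidate = kept + [metric]
        pvals = p_values(candidate)  # one p-value per metric in `candidate`
        if all(p < alpha for p in pvals):
            kept = candidate
    return kept
```

Because every candidate model is refit with all previously kept metrics, a newly added metric that destroys an earlier metric's significance is rejected, which keeps the final model parsimonious.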

3. CONCLUSION

In this paper, we present our tool, Commit Guru, which performs analytics and predicts risky software commits. Commit Guru uses the 13 metrics introduced in our prior work [11] to produce its analytics and make its predictions. Commit Guru is publicly available on the web at http://www.commit.guru. Practitioners can use Commit Guru to identify recent commits that are more likely to contain bugs and to better understand the overall quality of a software project. This can be useful for prioritizing verification activities such as code inspections on the riskiest changes, as well as for seeing which change metrics might be good indicators of bugs. Researchers can use Commit Guru to obtain, from any Git-based open source repository, a large collection of commits and their quality metrics for future research in this area.

4. REFERENCES

[1] "The economic impacts of inadequate infrastructure for software testing." http://www.nist.gov/director/planning/upload/report02-3.pdf.

[2] T. Zimmermann, R. Premraj, and A. Zeller, "Predicting defects for Eclipse," in PROMISE '07: Proceedings of the Third International Workshop on Predictor Models in Software Engineering, 2007, pp. 1–7.

[3] R. Moser, W. Pedrycz, and G. Succi, "A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction," in Proceedings of the 30th International Conference on Software Engineering, ser. ICSE '08, 2008, pp. 181–190.

[4] E. Shihab, "An exploration of challenges limiting pragmatic software defect prediction," PhD Thesis, Queen's University, 2012.

[5] Y. Brun, R. Holmes, M. D. Ernst, and D. Notkin, "Crystal: Precise and unobtrusive conflict warnings," in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011, pp. 444–447.

[6] M. L. Guimarães and A. R. Silva, "Improving early detection of software merge conflicts," in Proceedings of the 34th International Conference on Software Engineering, 2012, pp. 342–352.

[7] M. W. Godfrey, A. E. Hassan, J. Herbsleb, G. C. Murphy, M. Robillard, P. Devanbu, A. Mockus, D. E. Perry, and D. Notkin, "Future of mining software archives: A roundtable," IEEE Software, vol. 26, no. 1, pp. 67–70, Jan.–Feb. 2009.

[8] A. Hassan, "The road ahead for mining software repositories," in Frontiers of Software Maintenance, Oct. 2008, pp. 48–57.

[9] E. Shihab, "Pragmatic prioritization of software quality assurance efforts," in Proceedings of the 33rd International Conference on Software Engineering, ser. ICSE '11, 2011, pp. 1106–1109.

[10] E. Shihab, A. E. Hassan, B. Adams, and Z. M. Jiang, "An industrial study on the risk of software changes," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, ser. FSE '12, 2012, pp. 62:1–62:11.

[11] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi, "A large-scale empirical study of just-in-time quality assurance," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, 2013.

[12] S. Kim, E. J. Whitehead, and Y. Zhang, "Classifying software changes: Clean or buggy?" IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, 2008.

[13] A. Hindle, D. M. German, and R. Holt, "What do large commits tell us?: A taxonomical study of large commits," in Proceedings of the 2008 International Working Conference on Mining Software Repositories, ser. MSR '08, 2008, pp. 99–108.

[14] J. Sliwerski, T. Zimmermann, and A. Zeller, "When do changes induce fixes?" ACM SIGSOFT Software Engineering Notes, vol. 30, no. 4, pp. 1–5, 2005.

[15] "List of file formats," http://en.wikipedia.org/wiki/List_of_file_formats, Jun. 2015.



APPENDIX

A. SOURCE CODE AVAILABILITY

Our source code is freely available under the MIT license. The back-end of the Commit Guru system can be viewed and downloaded at https://github.com/CommitAnalyzingService/CAS_CodeRepoAnalyzer. The front-end can be found at https://github.com/CommitAnalyzingService/CAS_Web.

B. COMMIT GURU DEMONSTRATION

To demonstrate Commit Guru in action, we walk through mining and analyzing the code changes of the open source project pip, a popular tool for installing Python packages. The project's source code is publicly available on GitHub, a popular place for developers to store projects that use git as their version control system. Pip can be found on GitHub at https://github.com/pypa/pip. It has 3,944 code changes with 195 contributors and currently has 86 pull requests.

B.1 Walk Through Analyzing Pip

Adding Pip: We first get the URL of pip's git repository. This is all we need to mine the project's code changes and see which of them are risky through the Commit Guru web tool. We navigate to commit.guru with our web browser. Here, we can input pip's git repository URL and our email address as shown in Figure 4. After we click "Start Now", the repository starts to be processed by Commit Guru.

Overall Commit Statistics: When Commit Guru has finished processing a repository, it sends an e-mail notification to the owner, if an address was provided. The repository also shows up by default in the recent repositories list on the home page, unless this was disabled. When we click on the link in the e-mail, or on the repository in the recent repositories table on the home page, we see the overall commit statistics as shown in Figure 5. Here, we can quickly see that 20% of code changes are marked as bug-inducing (number 2 in Figure 5). Additionally, we can view the median values of each of the commit metrics (number 3 in Figure 5) for the changes that were bug-inducing (colored in red) or not (colored in green). Metrics with an asterisk ('*') indicate that the differences between the bug-inducing and non-bug-inducing commits are statistically significant. For pip, we see that all metrics are statistically significant. Interestingly, we also learn that the more developers that touched a file in this project, the less likely it was to contain a bug. Other useful information can also be found on this page, particularly for researchers. Hovering over the blue question mark next to a metric name will display a popup describing it.
For instance, hovering over the subsystem developer experience metric shows a popup explaining that this metric represents the number of commits the developer has made in the past to the subsystems touched by the current commit (number 4 in Figure 5).

Viewing Code Changes: On the overview page, there is a navigation bar for switching between the different pages of the repository (labeled as 1 in Figure 5). After clicking on the "Commits" tab in the repository navigation bar, we see a list of the commits in this repository. By default, it shows a historical view, as shown in number 1 in Figure 6. Commits colored in red are bug-inducing (number 3 in Figure 6). On the left side of each commit, we also show its classification (if available). It is also possible to filter for specific commits by clicking on the filter drop-down menu (number 2 in Figure 6) and specifying the commit hash, commit message, author e-mail, or commit classification.

Finding Risky Commits: Changing the display from historical to predictive data in the drop-down on the commits page (number 1 in Figure 6) changes the view to show all code changes from the past three months and how likely each is to contain a bug. There is an additional element on the left side of each commit (number 1 in Figure 7): a percentage indicating the probability that the commit contains a bug. It is color-coded accordingly: green indicates low risk, yellow indicates medium risk, and red indicates high risk. Note that in the predictive data view, a red commit indicates that our regression algorithm considers it likely to contain a bug, whereas in the historical data view red indicates that the commit is known to be bug-inducing. Here, we can browse through all the recent code changes and easily inspect those that are most likely to introduce bugs by sorting on high risk level (number 2 in Figure 7). For example, we can see that commit 17352765f0 is likely to introduce a new bug, making it a good target for additional verification activities (e.g., code review).

Inspecting Individual Commits: In both the historical and predictive data displays, it is possible to inspect an individual commit by clicking on it, as shown in Figure 8.
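The per-commit percentage in the predictive view can be illustrated with a small sketch. Commit Guru fits its own regression model to each repository's history; the coefficients, metric names, and commit identifiers below are hypothetical stand-ins, used only to show how change metrics can map to a probability and how commits can then be sorted by risk.

```python
import math

# Hypothetical logistic-regression coefficients; Commit Guru learns its own
# from the repository's historical bug-inducing vs. clean commits.
WEIGHTS = {"lines_added": 0.004, "files_touched": 0.25, "dev_experience": -0.02}
BIAS = -2.0

def risk_probability(metrics):
    """Map a commit's change metrics to a bug probability via the logistic function."""
    score = BIAS + sum(WEIGHTS[m] * metrics.get(m, 0.0) for m in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))

commits = [
    {"id": "commit-a", "lines_added": 500, "files_touched": 8, "dev_experience": 3},
    {"id": "commit-b", "lines_added": 20, "files_touched": 1, "dev_experience": 40},
]
# Sort most-risky first, as when sorting the predictive view by risk level.
ranked = sorted(commits, key=risk_probability, reverse=True)
```

Note how the negative weight on developer experience mirrors the observation above that, for pip, changes by more experienced developers were less likely to be buggy.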
This expands the box, showing all the individual code-change metrics (number 1 in Figure 8), the commit message (number 2 in Figure 8), the names of the updated files (number 3 in Figure 8), an option to give feedback on whether this was useful (number 4 in Figure 8), and, if applicable, a link to the commit that fixed a bug this commit contained (number 5 in Figure 8). Each metric is color-coded according to its risk factor based on our median model. If the metric is less than the average of the non-buggy commits, it is colored green (low risk). If it is between the averages of the buggy and non-buggy commits, it is colored yellow (medium risk). Finally, if it is equal to or higher than the average of the buggy commits, it is colored red (high risk). This is useful for quickly understanding why a certain commit was flagged as risky in the predictive data view, and for seeing a summary of which files were changed and by whom.

Downloading Data Dumps: In the repository navigation bar, clicking on "Options" takes us to a view that allows us to download a data dump of the repository's code changes, as shown in Figure 9. It also shows the last time Commit Guru downloaded code changes from the repository, the last time they were analyzed, and when the data dump was produced. Clicking on the "Download dump (.csv)" button (number 1 in Figure 9) starts the download of a CSV file containing all code changes, together with their change measures and whether each is marked as containing a bug. Figure 10 shows an example of the contents of the data dump for the pip project.
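The metric color scheme described above can be written directly as a small function (the function and parameter names are ours; the thresholds are exactly the ones given in the text, assuming metrics where higher values indicate more risk):

```python
def metric_color(value, clean_avg, buggy_avg):
    """Color one commit metric against the averages of non-buggy and buggy commits."""
    if value < clean_avg:
        return "green"   # below the non-buggy average: low risk
    if value < buggy_avg:
        return "yellow"  # between the two averages: medium risk
    return "red"         # at or above the buggy average: high risk

# e.g., "lines added": non-buggy commits average 10, buggy commits average 80.
color = metric_color(45, clean_avg=10, buggy_avg=80)  # -> "yellow"
```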


Commit Guru: Analytics and Risk Prediction of Software Commits (users.encs.concordia.ca/~eshihab/pubs/Rosen_FSE2015Tool.pdf)

Figure 4: Adding a Repository

Figure 5: Commit Statistics


Figure 6: Commits - Historical Display


Figure 7: Commits - Predictive Data Display


Figure 8: Inspecting an individual commit

Figure 9: Downloading Data Dump


Figure 10: Data Dump Contents
