

Empirical Software Engineering
https://doi.org/10.1007/s10664-019-09690-0

Extracting and studying the Logging-Code-Issue-Introducing changes in Java-based large-scale open source software systems

Boyuan Chen 1 · Zhen Ming (Jack) Jiang 1

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
Execution logs, which are generated by logging code, are widely used in modern software projects for tasks like monitoring, debugging, and remote issue resolution. Ineffective logging would cause confusion, lack of information during problem diagnosis, or even system crash. However, it is challenging to develop and maintain logging code, as it inter-mixes with the feature code. Furthermore, unlike feature code, it is very challenging to verify the correctness of logging code. Currently developers usually rely on their intuition when performing their logging activities. There are no well established logging guidelines in research and practice. In this paper, we intend to derive such guidelines through mining the historical logging code changes. In particular, we have extracted and studied the Logging-Code-Issue-Introducing (LCII) changes in six popular large-scale Java-based open source software systems. Preliminary studies on this dataset show that: (1) both co-changed and independently changed logging code changes can contain fixes to the LCII changes; (2) the complexity of fixes to LCII changes is similar to regular logging code updates; (3) it takes longer for developers to fix logging code issues than regular bugs; and (4) the state-of-the-art logging code issue detection tools can only detect a small fraction (3%) of the LCII changes. This highlights the urgent need for this area of research and the importance of such a dataset.

Keywords Empirical studies · Logging code · Software analytics

1 Introduction

Execution logs, which are usually readily available for large-scale software systems, have been widely used in practice for a variety of tasks (e.g., system monitoring (Shang et al. 2014a), problem debugging (Yuan et al. 2012a), remote issue resolution (Oliner et al. 2012), test analysis (Jiang et al. 2009), and business decision making (Barik et al. 2016)).

Communicated by: Lin Tan

Boyuan Chen
[email protected]

Zhen Ming (Jack) Jiang
[email protected]

1 Software Construction, AnaLytics and Evaluation (SCALE) lab, York University, Toronto, ON, Canada


Execution logs are generated by executing the logging code (e.g., Logger.info("User " + userName + " logged in")) that developers have inserted into the source code. There are typically four types of components in a snippet of logging code: a logging object, a verbosity level, static texts, and dynamic contents. In the above example, the logging object is Logger, the verbosity level is info, the static texts are User and logged in, and the dynamic content is userName.

It is very challenging to develop and maintain high quality logging code for constantly evolving systems for the following two reasons: (1) Management: software logging is a cross-cutting concern, which tangles with the feature code (Kiczales et al. 1997). Although there are language extensions (e.g., AspectJ 2016) to support better management of logging code, many industrial and open source systems still choose to inter-mix logging code with feature code (Chen and Jiang 2016; Pecchia et al. 2015; Yuan et al. 2012b). (2) Verification: unlike feature code or other types of cross-cutting concerns (e.g., exception handling or configuration), whose correctness can be verified via software testing, it is very challenging to verify the correctness of the logging code. Figure 1 shows one such issue in the logging code. In the Hadoop DFSClient source code, the variable name was changed from LEASE_SOFTLIMIT_PERIOD to LEASE_HARDLIMIT_PERIOD in version 4078. However, the developer forgot to update the static text from soft-limit to hard-limit. Such issues are very hard to detect using existing software verification techniques, short of careful manual code reviews. Developers have to rely on their intuition to compose, review, and update logging code. Most of the existing issues in logging code (e.g., inconsistent or out-dated static texts, and wrong verbosity levels) are discovered and fixed manually (Chen and Jiang 2016; Yuan et al. 2012b).

Most of the existing research on logging code focuses on "where-to-log" (a.k.a., suggesting where to add logging code) (Ding et al. 2015; Fu et al. 2014; Zhu et al. 2015) and "what-to-log" (a.k.a., providing sufficient information in the logging code) (Kabinna et al. 2016; Shang et al. 2014b; Yuan et al. 2011). There are very few works tackling the problem of "how-to-log" (a.k.a., developing and maintaining high quality logging code). Low quality logging code can hinder program understanding (Shang et al. 2014b), cause performance slow-down (HBASE-10470 2018a), or even system crashes (HBASE-750 2016c). Unlike other software engineering processes (e.g., refactoring (Fowler et al. 1999) and release management (Humble and Farley 2010)), there are no well-established logging practices in industry (Fu et al. 2014; Pecchia et al. 2015). Our previous work (Chen and Jiang 2017) is the first study which characterizes and detects anti-patterns (common issues) in logging code by manually examining a sample of the logging code changes from three open source projects. Six anti-patterns in the logging code (ALC) were identified. The majority (72%) of the reported ALC instances, on the most recent releases of the ten open source systems, have been accepted or fixed by their developers. This clearly demonstrates the need for this area of research. However, one of the main obstacles facing researchers is the lack of an available dataset which contains the Logging-Code-Issue-Introducing (LCII) changes, as many issues in the logging code are generally not documented in the commit logs or in the bug reports. An LCII change is analogous to a bug introducing change (Kim et al. 2006). It is a type of code change which will lead to future changes (e.g., changing the static texts for clarification or lowering the verbosity levels to reduce the runtime overhead) to the corresponding logging code. For example, in Fig. 1, the logging code change at version 4078 is the LCII change. This change introduced a bug in the logging code. The fix to this LCII change is at version 5956.


DFSClient.java from Hadoop (HDFS-5800)

V 4078: LOG.warn("Failed to renew lease for " + clientName + " for " + (elapsed / 1000) + " seconds (>= soft-limit =" + (HdfsConstants.LEASE_HARDLIMIT_PERIOD / 1000) + " seconds.) " + "Closing all files being written ...", e)

V 5956: LOG.warn("Failed to renew lease for " + clientName + " for " + (elapsed / 1000) + " seconds (>= hard-limit =" + (HdfsConstants.LEASE_HARDLIMIT_PERIOD / 1000) + " seconds.) " + "Closing all files being written ...", e)

Fig. 1 An example of an issue in the logging code (HDFS-5800 2018d)

Logging code changes which are co-evolved with feature code are not included as LCII changes.

In this paper, we have developed a general approach to extracting the LCII changes by mining the projects' historical data. Our approach analyzes the development history (historical code changes and bug reports) of a software system and outputs a list of LCII changes for further analysis. We have performed a few preliminary studies on the resulting LCII change dataset and presented some open problems in this area. The contributions of this paper are as follows:

1. Compared to Chen and Jiang (2017), using our new approach, the resulting LCII changes are more complete. In Chen and Jiang (2017), the authors assumed that only the independently changed logging code changes (a.k.a., the logging code changes which are not co-changed with any feature code changes) may contain fixes to the LCII changes. In this paper, we have found that this assumption is invalid, as some fixes to the LCII changes may require feature code changes as well.

Thus, rather than only focusing on the independently changed logging code, we extract the LCII changes from all the logging code changes.

2. Instead of manually identifying LCII changes as in Chen and Jiang (2017), we have developed an adapted version of the SZZ algorithm (Sliwerski et al. 2005), called LCC-SZZ (Logging Code Change-based SZZ), to automatically locate the LCII changes from their fixes. By leveraging this algorithm, we can extract the LCII changes among various code revisions.

3. To ease replication and encourage further research in the area of "how-to-log", we have provided a large-scale benchmarking dataset, which contains 8,748 LCII changes from six large-scale open source systems, each consisting of six to ten years of development history (The replication package 2018). Such a dataset, which is the first of its kind to the authors' knowledge, can be very useful for interested researchers to develop new techniques to characterize and detect ALCs or to derive coding guidelines on effective software logging.

4. We have conducted some preliminary studies on the extracted LCII change dataset and reported four main findings: (1) both co-changed and independently changed logging code can contain fixes to LCII changes; (2) the complexity of the LCII changes and other logging code changes is similar; (3) it takes significantly longer to fix a logging code issue than a regular bug; and (4) although none of the existing techniques for detecting logging code issues performs well (< 3% recall), their detection results complement each other. Our findings clearly indicate the need for this area of research and demonstrate the usefulness of our approach and the provided dataset.


1.1 Paper Organization

The rest of the paper is organized as follows. Section 2 gives an overview of the extraction process for the LCII changes. Sections 3, 4, and 5 illustrate the three phases of our extraction approach. Section 6 describes our preliminary studies on the extracted dataset. Section 7 presents the related work. Section 8 discusses the threats to validity. Section 9 concludes the paper and describes some of the future work.

2 An Overview of our Approach to Extracting the LCII Changes

In this section, we will explain our approach to extracting the LCII changes from the software development history. As shown in Fig. 2, our approach consists of the following three phases:

1. Phase 1 - Data gathering and data preparation (Section 3): this is the data gathering and pre-processing phase. First, the versions of each source file are extracted from the source code repositories. Then, all the bug reports are downloaded for the studied project. Fine-grained code changes (e.g., function update, variable renaming, etc.) are identified between adjacent versions of the same files. Finally, heuristics are applied to automatically identify the changes which are related to the logging code.

[Figure 2 summarizes the overall process. Phase 1 - data gathering and data preparation: historical versions are extracted from the code repository, ChangeDistiller produces fine-grained code changes, and heuristics isolate the logging code changes. Phase 2 - extraction of the fixes to the LCII changes: dependency analysis separates co-changed from independently changed logging code updates, which are integrated with issue reports and manually verified to yield the fixes to the LCII changes. Phase 3 - extraction of the LCII changes: LCC-SZZ traces these fixes back through the evolutionary data to the LCII changes.]

Fig. 2 Overall process



2. Phase 2 - Extraction of the fixes to the LCII changes (Section 4): in order to identify the LCII changes, we first need to identify their fixes among all the logging code changes. In this phase, we extract the fixes to the LCII changes by carefully examining both the independently changed and co-changed logging code changes.

3. Phase 3 - Extraction of the LCII changes (Section 5): unlike the feature code changes, in which each changed line can be traced back to a previous version, the process of identifying the LCII changes is different. As the logging code is tangled with the feature code, one line of logging code changes can be related to multiple feature code changes. For example, in one code commit, developers may update the static texts (due to method renaming) as well as the method invocations (due to changes in the method signatures) in one single line of the logging code. Hence, in this phase, we have developed an adapted version of the SZZ algorithm, called LCC-SZZ, to automatically identify LCII changes.

The next three sections (Sections 3, 4, and 5) will explain the above three phases in detail.

3 Phase 1 - Data Gathering and Data Preparation

We will first briefly describe the studied projects. Then we will explain our process to extract the logging code changes and bug reports.

3.1 Studied Projects

In this paper, we focus our study on six popular Java-based open source software projects: Elasticsearch, Hadoop, HBase, Hibernate, Geronimo, and Wildfly. Table 1 provides an overview of the studied projects. These projects are from different application domains. The selected projects are either server-side or supporting component-based projects, since a previous study (Chen and Jiang 2016) showed that software logging is more pervasive and more actively maintained in these two categories than in client-side projects. Each studied project has one or more bug repositories (GitHub or Jira), as shown in the third column of Table 1. As shown in the fourth column of the table, all the selected projects have a relatively long development history, ranging from six to ten years. The fifth column (LCC) shows the total number of logging code changes throughout the development history.

Table 1 Studied projects

Project        Description                    Bug repository    Code history                LCC

Elasticsearch  Distributed analysis engine    GitHub            (2010-02-08, 2017-03-30)    2,781
Hadoop         Distributed compute platform   Jira              (2009-05-19, 2017-08-22)    2,652
HBase          Distributed database           Jira              (2007-04-03, 2017-05-05)    3,638
Hibernate      ORM framework                  Jira              (2007-06-29, 2017-07-25)    2,619
Geronimo       Server runtime                 Jira              (2003-08-07, 2013-03-21)    1,019
Wildfly        Application server             Jira, GitHub      (2010-05-28, 2017-03-30)    3,401


LCC stands for Logging Code Changes. It refers to the code changes which are directly related to software logging. Addition/replacement/removal of a variable inside a debug statement, changes to the verbosity levels, and adding or removing static texts are three examples of LCCs. Here we only focus on direct changes to the logging code. For example, if only the condition outside a logging statement is changed, such a change is not considered an LCC, as it may be related to some feature code changes. As we can see, the logging code in all six projects is actively maintained: each contains thousands of logging code changes.

There are two different types of data that we need to extract from the historical repositories in order to identify the LCII changes: the logging code changes and the issue reports.

3.2 Logging Code Changes

All six studied projects use GitHub as their source code repositories. We wrote a script to automatically extract the code version history from the master branch. We only focused on the code changes committed to the master branch, as the logging code changes committed there had been carefully reviewed. The extracted data includes every version of every source code file, along with the meta information of each code version (e.g., the commit hash, the commit date, the authors, etc.). For every file, we then sorted these code versions by their commit timestamps. For example, if the file Foo.java was changed in two commits (hash:f4b214 and hash:a7cc6b), which correspond to the first and the second commits, it would result in two extracted versions named Foo_v1.java and Foo_v2.java.
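The paper does not include the extraction script itself. The following is a minimal sketch of this step using the JGit library; the class name, the output naming scheme, and the choice of JGit over plain git commands are our assumptions.

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.treewalk.TreeWalk;

public class VersionExtractor {
    // Writes Foo_v1.java, Foo_v2.java, ... ordered by commit timestamp,
    // and prints the meta information (hash, date, author) of each version.
    public static void extractVersions(String repoDir, String filePath, Path outDir)
            throws Exception {
        try (Git git = Git.open(new File(repoDir))) {
            Repository repo = git.getRepository();
            List<RevCommit> commits = new ArrayList<>();
            for (RevCommit c : git.log().addPath(filePath).call()) {
                commits.add(c); // log() returns commits newest first
            }
            Collections.reverse(commits); // oldest first -> v1, v2, ...
            int version = 1;
            for (RevCommit commit : commits) {
                try (TreeWalk walk = TreeWalk.forPath(repo, filePath, commit.getTree())) {
                    if (walk == null) continue; // file absent in this commit
                    byte[] content = repo.open(walk.getObjectId(0)).getBytes();
                    String base = new File(filePath).getName().replace(".java", "");
                    Files.write(outDir.resolve(base + "_v" + version++ + ".java"), content);
                    System.out.printf("%s %d %s%n", commit.getName(),
                            commit.getCommitTime(), commit.getAuthorIdent().getName());
                }
            }
        }
    }
}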

We ran ChangeDistiller (CD) (Fluri et al. 2007) to get the source code level changes between two adjacent versions. CD parses two adjacent versions (e.g., Foo_v1.java and Foo_v2.java) of the same file into Abstract Syntax Trees (ASTs) and compares the ASTs using a tree differencing algorithm. The output of CD contains a list of fine-grained code changes from four categories: insertion, deletion, update, and move. Since our focus is on the changes to the existing logging code, we only analyzed the updated code changes, like updates to a particular method invocation or updates to a variable assignment.
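For reference, the sketch below shows the usage pattern of ChangeDistiller's public API as documented in the project README, applied to two adjacent versions. Restricting the output to STATEMENT_UPDATE mirrors the focus on updates described above, although the exact set of change types the authors kept is not spelled out in the paper.

import java.io.File;
import java.util.List;
import ch.uzh.ifi.seal.changedistiller.ChangeDistiller;
import ch.uzh.ifi.seal.changedistiller.ChangeDistiller.Language;
import ch.uzh.ifi.seal.changedistiller.distilling.FileDistiller;
import ch.uzh.ifi.seal.changedistiller.model.classifiers.ChangeType;
import ch.uzh.ifi.seal.changedistiller.model.entities.SourceCodeChange;

public class FineGrainedDiff {
    // Distills AST-level changes between two adjacent versions of the same file.
    public static void diff(File olderVersion, File newerVersion) {
        FileDistiller distiller = ChangeDistiller.createFileDistiller(Language.JAVA);
        distiller.extractClassifiedSourceCodeChanges(olderVersion, newerVersion);
        List<SourceCodeChange> changes = distiller.getSourceCodeChanges();
        if (changes == null) return; // no structural changes detected
        for (SourceCodeChange change : changes) {
            // Only updates to existing statements matter for this study.
            if (change.getChangeType() == ChangeType.STATEMENT_UPDATE) {
                System.out.println(change.getChangedEntity().getUniqueName());
            }
        }
    }
}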

Once we obtained the source code level changes, we applied keyword-based heuristics to filter out the non-logging code changes. We searched for commit messages which include words like "log", "trace", and "debug". We ruled out the changes which contained mismatched words like "dialog", "login", etc. We also excluded the logging code which did not print any messages (e.g., logging verbosity level setting statements like log.setLevel(org.apache.log4j.Level.toLevel(level))). Such keyword-based heuristics have been used in many of the previous studies (e.g., Chen and Jiang 2016, 2017; Fu et al. 2014; Shang et al. 2015; Yuan et al. 2012b) to identify logging code changes with high precision.
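A hedged sketch of such keyword heuristics follows; the keyword lists and the substring matching are illustrative, not the authors' exact rules, and the same test can be applied to either a changed statement or a commit message.

import java.util.regex.Pattern;

public class KeywordHeuristics {
    // Include keywords indicating logging; matched as substrings so that
    // the exclusion list below is meaningful (e.g. "dialog" contains "log").
    private static final Pattern INCLUDE =
            Pattern.compile("log|trace|debug", Pattern.CASE_INSENSITIVE);
    private static final Pattern EXCLUDE =
            Pattern.compile("dialog|login|catalog", Pattern.CASE_INSENSITIVE);
    // Logging code that does not print a message, e.g. level configuration calls.
    private static final Pattern NON_PRINTING = Pattern.compile("\\.setLevel\\s*\\(");

    public static boolean looksLikeLoggingChange(String text) {
        return INCLUDE.matcher(text).find()
                && !EXCLUDE.matcher(text).find()
                && !NON_PRINTING.matcher(text).find();
    }
}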

3.3 Issue Reports

The studied projects use two kinds of bug tracking systems: Jira and GitHub. For Jira-based projects (Hadoop, HBase, Hibernate, Geronimo, and Wildfly), we followed an approach similar to Chen and Jiang (2016) to crawl the Jira reports using the provided Jira APIs. For Elasticsearch and Wildfly, their issues are managed through GitHub in the form of pull requests and GitHub issues. We used the public APIs to crawl the related GitHub data.
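As an illustration, the public Jira REST search endpoint can be paged through as sketched below; the JQL query, page size, and stopping condition are assumptions, and crawling the GitHub pull requests and issues would follow the same pattern against the GitHub REST API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IssueCrawler {
    // Pages through the Jira search API for one project and keeps the raw JSON.
    public static void crawl(String project) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        int startAt = 0;
        int pageSize = 50;
        while (true) {
            String url = "https://issues.apache.org/jira/rest/api/2/search"
                    + "?jql=project=" + project
                    + "&startAt=" + startAt + "&maxResults=" + pageSize;
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            String json = response.body();
            // A real crawler would parse the "issues" array and persist each report;
            // here we simply stop when a page comes back empty.
            if (json.contains("\"issues\":[]")) break;
            startAt += pageSize;
        }
    }
}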

Page 7: ExtractingandstudyingtheLogging-Code-Issue ...chenfsd/resources/emse2019...by mining the projects’ historical data. Our approach analyzes the development history (historical code

Empirical Software Engineering

4 Phase 2 - Extraction of the Fixes to the LCII Changes

We have categorized the logging code changes into the following two categories:

– co-changed logging code, in which the logging code is updated together with the corresponding feature code. In Chen and Jiang (2017), there is an assumption that the fixes to the LCII changes only exist in the independently changed logging code. It is not clear whether such an assumption is valid or not.

– independently changed logging code, which refers to the logging code changes that are not classified as co-changed logging code changes. Unlike the co-changed logging code changes, which are generally updated along with the feature code, independently changed logging code is usually about fixing LCII changes. However, it is not clear whether all the independently changed logging code is actually used to fix the LCII changes.

In the previous study (Chen and Jiang 2016), we identified the following eight scenarios of the co-changed logging code changes: (1) co-change with condition expressions, (2) co-change with variable declarations, (3) co-change with feature methods, (4) co-change with class attributes, (5) co-change with variable assignments, (6) co-change with string method invocations, (7) co-change with method parameters, and (8) co-change with exception conditions. We encoded these rules to automatically identify the co-changed logging code changes; a sketch of one such rule follows. The remaining logging code changes belong to the independently changed logging code changes.
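The paper does not reproduce the encoded rules verbatim. The sketch below only illustrates the shape of one such rule (co-change with variable declarations); the FeatureChange type and the simple substring test are hypothetical simplifications.

import java.util.List;

public class CoChangeClassifier {
    // Hypothetical representation of one feature code change in the same
    // commit: an identifier whose declaration or name was touched.
    public static class FeatureChange {
        public final String newIdentifier;
        public FeatureChange(String newIdentifier) { this.newIdentifier = newIdentifier; }
    }

    // A logging code change is classified as co-changed when an identifier
    // modified by the feature code in the same commit appears in the new
    // version of the logging statement.
    public static boolean isCoChanged(String newLoggingStatement,
                                      List<FeatureChange> featureChangesInCommit) {
        for (FeatureChange fc : featureChangesInCommit) {
            if (newLoggingStatement.contains(fc.newIdentifier)) {
                return true; // e.g., a renamed variable now used in the log call
            }
        }
        return false; // otherwise treated as an independently changed logging code change
    }
}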

In the rest of this section, we will explain our process to extract fixes to the LCII changes by mining the independently changed and co-changed logging code changes. During this process, we will validate the above two assumptions.

4.1 Extracting Fixes to the LCII Changes from the Co-changed Logging Code Changes

We extract fixes to the LCII changes from the co-changed logging code changes by leveraging the information contained in the filed bug reports. Usually, developers would include the bug report IDs in the commit logs for traceability (Bird et al. 2009) and code review (Rigby et al. 2008). For example, in Hadoop, there is a commit with the commit log stating the following: "HADOOP-15198. Correct the spelling in CopyFilter.java. Contributed by Mukul Kumar Singh.". This commit refers to the bug ID HADOOP-15198, whose details can be found by searching this Jira ID online (https://issues.apache.org/jira/browse/HADOOP-15198). Hence, we extracted the bug IDs mentioned in the commit logs to link to the corresponding Jira issues.
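Extracting such issue keys from commit logs is a simple pattern match; a minimal sketch, where the regular expression for Jira-style keys (upper-case project key, dash, digits) is our assumption:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IssueIdExtractor {
    // Jira keys look like "HADOOP-15198": a project key, a dash, and digits.
    private static final Pattern JIRA_KEY = Pattern.compile("\\b([A-Z][A-Z0-9]+-\\d+)\\b");

    public static Set<String> extractIssueIds(String commitMessage) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher m = JIRA_KEY.matcher(commitMessage);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids; // e.g., ["HADOOP-15198"] for the commit log quoted above
    }
}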

Then we performed keyword-based filtering to only include the bug reports which addressed logging issues, by searching the subjects of the issue reports with keywords like "logging", "logs", and "logging code".

We did not search the same keywords in the bug description and the comments sections. Many of the reported issues include logs or logging code in their bug descriptions or comments as an artifact to help developers understand or reproduce the reported issues. Thus, searching through these two sections would cause too much noise in the resulting dataset. For example, the Hadoop issue (HADOOP-12666 2018a) is about "Support Microsoft Azure Data Lake" and in the discussion, someone pasted a code snippet which contained logging code. However, this issue report is obviously not related to issues in logging code.


Server.java from Hadoop (HADOOP-7358)

V 1413: LOG.debug(getName() + ": responding to #" + call.id + " from " + call.connection);

V 1630: LOG.debug(getName() + ": responding to #" + call.callId + " from " + call.connection);

Fig. 3 An example of the co-changed logging code, which is linked to a log-related bug report. However, it is not a fix to an LCII change

Hence, to avoid too many false positives, we chose to only match those keywords in the title of the bug reports.

The six studied projects contain a total of 7,092 co-changed logging code changes. There are 553 (8%) co-changed logging code changes which are linked to log-related issue reports. For each of these changes, we manually categorized their change types based on their intentions and whether they belong to the fixes to the LCII changes. There are 130 logging code changes which are not related to the LCII changes. Figure 3 shows one such example from the Server.java file in the Hadoop project. This code change is linked to the Hadoop issue (HADOOP-7358 2018b). The attribute of the call object was changed from id to callId. However, the bug report was about improving "log levels when exceptions caught in RPC handler", which is not related to this logging code change. After filtering out such irrelevant changes, we ended up with 423 co-changed logging code changes which are fixes to the LCII changes.

4.2 Extracting Fixes to the LCII Changes from the Independently Changed Logging Code Changes

There are a total of 9,018 independently changed logging code changes. Due to their sheer size, we can only manually examine a few sampled instances. We randomly selected 369 independently changed logging code changes for manual investigation. This corresponds to a confidence level of 95% with a confidence interval of 5%.

During our analysis, we found one scenario of the independently changed logging code changes which are not fixes to the LCII changes. This type of logging code change modifies the printing format (e.g., adding or removing spaces) without changing the actual contents. Figure 4 shows one such example. The only change for that logging statement in the new revision was adding a space after the word client in the static texts. There were no changes in any of the four logging components (logging library, verbosity level, static texts, or dynamic information). Hence, we automatically filtered out all the printing format changes from all the independently changed logging code changes using a script; a sketch of this filter is shown below. We ended up with 8,325 (92%) independently changed logging code changes, which are fixes to the LCII changes.
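The filtering script is not published in the text; the following minimal sketch assumes that a printing format change is exactly a whitespace-only difference between the old and new statements, which matches the space-insertion example in Fig. 4.

public class FormatOnlyChangeFilter {
    // A change is "printing format only" if the two statements are identical
    // once all whitespace is removed (none of the four components changed)
    // while the raw texts still differ.
    public static boolean isFormatOnlyChange(String oldStatement, String newStatement) {
        return normalize(oldStatement).equals(normalize(newStatement))
                && !oldStatement.equals(newStatement);
    }

    private static String normalize(String statement) {
        return statement.replaceAll("\\s+", "");
    }
}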

We have further investigated the relation between the independently changed logging code changes and the reported log-related issues. We have located the issue IDs referenced in the commit logs of these changes and tried to verify whether these issue reports contained any log-related keywords in their titles.

RpcProgramNfs3.java from Hadoop (HDFS-7423)

V 8351: LOG.debug("GETATTR for fileId: " + handle.getFileId() + " client:" + remoteAddress);

V 8428: LOG.debug("GETATTR for fileId: " + handle.getFileId() + " client: " + remoteAddress);

Fig. 4 An example of the printing format change


It turns out that only 10.4% of the independently changed logging code changes have corresponding log-related issue reports. This verifies our argument in Section 1 that many issues in the logging code are generally not documented in the commit logs or in the bug reports.

5 Phase 3 - Extraction of the LCII Changes

In the previous phase, we have generated a dataset which contains the fixes to the LCII changes. In this phase, we will extract the code commits containing the LCII changes.

5.1 Our Approach to Extracting the LCII Changes

We used an adapted approach of the SZZ algorithm (Sliwerski et al. 2005) to automatically identify the LCII changes from their fixes. The SZZ algorithm (Sliwerski et al. 2005) is an approach which automatically identifies bug introducing changes from their fixes. It first tags code changes as bug fixing changes by searching the bug reports. Then it traces through the code revision history to find when the changed code was introduced (the bug introducing changes). There are many follow-up works that further enhance the effectiveness of this approach (Davies et al. 2014; Kim et al. 2006; Sliwerski et al. 2005). However, all these approaches are focused on identifying bugs in the feature code; directly applying these approaches to the fixes to LCII changes may lead to incomplete or incorrect results. Different from the feature code changes, one line of logging code changes can be related to multiple previous lines of the logging code changes.

We will illustrate the problem of using the SZZ algorithm to locate the LCII changes using the example shown in Fig. 5.

Vn represents the version of the file. The file was changed through V1 to V40, while the logging code was introduced at V1 and subsequently changed at V20 and V40. The change at V40 is a fix to an LCII change, since it corrected the verbosity level from info to fatal in order to be consistent with the text Fatal error. In addition, in the same change, the developer corrected a typo from adddress to address in the static texts. By using the SZZ algorithm, we find that the problematic logging code (at V39) was introduced at V20. Therefore, we considered the LCII change to be at V20. However, after examining the entire code revision history related to this logging code snippet, we noticed that the logging code at V20 is not the only LCII change.

There are two problems associated with the logging code at V39: the verbosity level and the static texts. The typo in the static texts was introduced at V20, and the verbosity level info was introduced when the logging code was initially added (V1). Hence, for fix version V40, there are two LCII change versions: V1 and V20.

V 1: Log.info("Fatal error occur in execution: " + ipAddress); (Logging code first introduced)
V 20: Log.info("Fatal error occur in execution, adddress: " + ipComplexAddress); (Update info and typo introduced)
V 40: Log.fatal("Fatal error occur in execution, address: " + ipComplexAddress); (Fix level and typo)

Fig. 5 An example of the logging code changes


The main problem with the original SZZ algorithm is that it considers the logging code as one entity instead of treating the various components (logging object, verbosity level, static texts, and dynamic contents) of the logging code separately. To cope with this problem, we have developed an adapted version of the SZZ algorithm, called LCC-SZZ. There are two inputs to the LCC-SZZ algorithm: (1) a list containing the historical revisions of a particular snippet of logging code, and (2) the versions at which the logging code was fixed to resolve the LCII change. The output of the LCC-SZZ algorithm is the version(s) of the LCII change(s). The pseudo code of this algorithm is shown in Algorithm 1.

The whole process contains two steps: (1) formulating a list of component chains; and (2) finding the version(s) of the LCII change(s) for each fixed component.

In step 1, LCC-SZZ breaks down each version of that logging code snippet into various components. Each component has a type and a string representation. The type can be logging library, verbosity level, static texts, or dynamic information (e.g., variables, method invocations, etc.). The string representation is the value of that component. For example, the string representation of a variable is the variable name. We then track the historical changes for each component and formalize them as component chains. Therefore, for a list of logging code, we have multiple component chains and group them together as the component chain list. This step is done by the procedure CHAIN_FORMULATION. In the beginning, the component chain list is created as an empty list on line 2. From line 3 to 6, the components for each version of that logging code snippet are extracted. Then this extracted data is transformed through the CHAIN_FORMULATION procedure. The inputs for CHAIN_FORMULATION are the extracted components from the logging code and the component chain list. The for loop on line 15 is used to iterate through all components to match them with the existing component chains. The for loop on line 17 is used to iterate through all the component chains to see if the current component can be matched to any of them. As shown on line 18, the component needs to have the same type as the component chain. If the component type is logging library or verbosity level, it will be put into the corresponding component chain (shown from line 19 to 22), as each logging code snippet only contains one logging library and one verbosity level. Other types of components are analyzed using their string representations. The old and the new string representations of the component are checked for similarity from line 24 to 29 based on the two following criteria: (1) whether the string edit distance between the two component string representations is less than 3; or (2) whether the length of the longest common substring of the two component string representations is larger than 3. We chose the threshold value of 3 through a trial and error process. If either one of the above criteria is true, the two representations are considered to be similar, and the current component is inserted into this chain as a new node. If no match is found, we initialize a new component chain with this component as the head of the chain. This process is implemented on line 33. This component chain is then added to the component chain list. In the end, for a list of logging code, we have formulated a list of component chains (i.e., the component chain list). In our example shown in Fig. 5, the logging code list contains three lines of logging code: V1, V20, and V40. There are two fix versions: V20 and V40. There are four component chains in the component chain list for this example. The component chain for the verbosity level is: "info(V1) ← info(V20) ← fatal(V40)". The component chain for the variable is: "ipAddress(V1) ← ipComplexAddress(V20) ← ipComplexAddress(V40)". All four resulting component chains are shown in Fig. 6.
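The similarity test on lines 24 to 29 of Algorithm 1 can be implemented with two standard dynamic programs. The sketch below assumes "longest substring" means the longest common substring; for the paper's example, editDistance("ipAddress", "ipComplexAddress") is 7 but their longest common substring ("Address", length 7) exceeds 3, so the two representations end up in the same chain.

public class ComponentSimilarity {
    // similar_string: similar if edit distance < 3 or longest common
    // substring length > 3 (threshold 3 as chosen by the authors).
    public static boolean similarString(String a, String b) {
        return editDistance(a, b) < 3 || longestCommonSubstring(a, b) > 3;
    }

    // Classic Levenshtein distance.
    private static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Length of the longest contiguous substring shared by both strings.
    private static int longestCommonSubstring(String a, String b) {
        int best = 0;
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    best = Math.max(best, len[i][j]);
                }
            }
        }
        return best;
    }
}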


Algorithm 1 The pseudo code of the LCC-SZZ algorithm.

Input: logging_code_list, fix_version_list

1: procedure LCC_SZZ(logging_code_list, fix_version_list)
2:     component_chain_list ← empty list
3:     for logging_code in logging_code_list do
4:         extracted_components ← extract_components(logging_code)
5:         CHAIN_FORMULATION(extracted_components, component_chain_list)
6:     end for
7:     for fix_version in fix_version_list do
8:         for component_chain in component_chain_list do
9:             EXTRACT_ISSUE_INTRODUCING_VERSION(component_chain, fix_version)
10:        end for
11:    end for
12: end procedure
13:
14: procedure CHAIN_FORMULATION(extracted_components, component_chain_list)
15:    for component in extracted_components do
16:        find_chain ← false
17:        for component_chain in component_chain_list do
18:            if component.type = component_chain.type then
19:                if component.type = LEVEL or component.type = LIB then
20:                    component_chain.add(component)
21:                    find_chain ← true
22:                    break
23:                end if
24:                latest_component ← component_chain.latest
25:                if similar_string(component, latest_component) = true then
26:                    component_chain.add(component)
27:                    find_chain ← true
28:                    break
29:                end if
30:            end if
31:        end for
32:        if find_chain = false then
33:            init_new_chain(component, component_chain_list)
34:        end if
35:    end for
36: end procedure
37:
38: procedure EXTRACT_ISSUE_INTRODUCING_VERSION(component_chain, fix_version)
39:    for component in component_chain do
40:        if component.version = fix_version then
41:            issue_component ← component.previous
42:            issue_component_content ← issue_component.content
43:            if issue_component_content != component.content then
44:                while issue_component.content = issue_component_content do
45:                    if issue_component.previous = null or issue_component.previous.content != issue_component_content then
46:                        break
47:                    end if
48:                    issue_component ← issue_component.previous
49:                end while
50:            end if
51:            print issue_component.version
52:        end if
53:    end for
54: end procedure


[Figure 6 shows the four resulting component chains for the example in Fig. 5. Logging library: Log(V1) ← Log(V20) ← Log(V40). Verbosity level: info(V1) ← info(V20) ← fatal(V40). Static texts: "Fatal error …"(V1) ← "Fatal error …, adddress"(V20) ← "Fatal error …, address"(V40). Dynamic contents: ipAddress(V1) ← ipComplexAddress(V20) ← ipComplexAddress(V40).]

Fig. 6 The resulting component chains

After the formulation of the component chain list, the next step (a.k.a., step 2) is to find the issue introducing version given a specified fix version. This process is explained in the procedure EXTRACT_ISSUE_INTRODUCING_VERSION starting from line 38. All the components which are changed in the fix version are retrieved. Then the components from the version immediately preceding the fix version are picked, as they contain the issues. This step is implemented from line 40 to 43. For the components with issues in the previous version, the search continues backwards until the first version in which the issue appeared is found. This process is implemented through the while loop on line 44. In our example, the fix versions are V40 and V20. In fix version V40, both the verbosity level and the static texts are changed. Hence, the previous version (V20) is examined. In V20, the verbosity level is info, whose first appearance is in the initial version of this logging code snippet, V1. Similarly, at V40, the spelling of adddress is changed to address, and the first appearance of this issue is at V20. Therefore, for fix version V40, there are two issue introducing versions: V1 and V20. In fix version V20, the variable ipAddress is changed to ipComplexAddress. The issue introducing version is V1. Hence, the LCII change versions for fix version V40 are V1 and V20. The LCII change version for fix version V20 is V1.
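A compact sketch of this backward walk over a single component chain follows; the ChainNode type and its field names are illustrative, and a null return encodes "this component was not changed by the fix".

public class IssueIntroducingVersionFinder {
    // A node in a component chain; "previous" points toward older versions.
    public static class ChainNode {
        String content;     // string representation of the component
        int version;        // file version in which this state appeared
        ChainNode previous; // older node, or null at the head of the chain
    }

    // Given the node changed at the fix version, return the version that first
    // introduced the issued content (the LCII change version for this component).
    public static Integer findIntroducingVersion(ChainNode fixNode) {
        ChainNode issue = fixNode.previous;
        if (issue == null || issue.content.equals(fixNode.content)) {
            return null; // component was not changed by the fix
        }
        String issuedContent = issue.content;
        while (issue.previous != null && issue.previous.content.equals(issuedContent)) {
            issue = issue.previous; // walk back to the first appearance of the issue
        }
        return issue.version; // e.g., V1 for the verbosity level, V20 for the typo
    }
}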

5.2 Evaluation

To evaluate and compare the effectiveness of the LCC-SZZ and the SZZ algorithms, we used an approach similar to da Costa et al. (2017). First, we applied both algorithms on the fixes to the LCII changes. Then we compared the results of the two algorithms from the following three dimensions: (1) the disagreement ratio between the results from the two algorithms; (2) the earliest appearance of the LCII changes; and (3) manual verification. The first dimension is a general estimation of how these two algorithms differ. The second dimension is concerned with the discrepancies between the two algorithms when compared to the estimates given by the development team. The third dimension estimates the accuracy by comparing against a human oracle.

Page 13: ExtractingandstudyingtheLogging-Code-Issue ...chenfsd/resources/emse2019...by mining the projects’ historical data. Our approach analyzes the development history (historical code

Empirical Software Engineering

Table 2 Difference ratio of computed introducing versions by SZZ and LCC-SZZ

Project        Same            Different       Total

Elasticsearch  758 (90.5%)     80 (9.5%)       838
Hadoop         905 (87.8%)     125 (12.2%)     1,030
HBase          1,276 (83.2%)   257 (16.8%)     1,533
Hibernate      1,657 (83.5%)   328 (16.5%)     1,985
Geronimo       489 (91.2%)     47 (8.8%)       536
Wildfly        2,553 (90.3%)   273 (9.7%)      2,826
Total          7,638 (87.3%)   1,110 (12.7%)   8,748

5.2.1 The Disagreement Ratio Between SZZ and LCC-SZZ

We calculated the disagreement ratio between the introducing versions generated by SZZ and LCC-SZZ. The results are shown in Table 2. In total, 12.7% of the results are different. Among the six studied projects, HBase has the largest difference ratio (16.8%) and Geronimo has the lowest difference ratio (8.8%).

5.2.2 The Earliest Appearance of the LCII Changes

The evaluation on the earliest LCII changes compares the results from the SZZ and LCC-SZZ algorithms with the estimates provided by the development community. This evaluation dimension is not meant to tell whether SZZ/LCC-SZZ is absolutely correct. Rather, it aims to point out the obviously incorrect outputted changes. Within each bug report, there is a field called "affected-version" showing the versions of the project that the logging issue impacted. For example, in HDFS, a component of the Hadoop system, the Jira issue (HDFS-4122 2018c) shows that the affected HDFS versions are 1.0.0 and 2.0.0-alpha. As the issue introducing date cannot be after the earliest affected version, we can use such a dataset to evaluate the SZZ/LCC-SZZ results. We consider the results from SZZ/LCC-SZZ to be incorrect if the resulting date reported by either algorithm is after the earliest affected version date. Note that if LCC-SZZ yields multiple version results, we use the date of the earliest version. For example, if the date of the earliest affected version is January 1, 2012 and the LCII change outputted by the SZZ algorithm is dated October 1, 2012, we consider this output of the SZZ algorithm to be incorrect. However, if the resulting date is before the earliest affected version, we cannot judge the correctness of the outputted changes. Table 3 shows the details.

Table 3 Comparing the time from the earliest bug appearance and the LCII code commit timestamp

Project     After affected version (SZZ)    After affected version (LCC-SZZ)    Bug reports with affected version    # of bug reports

Hadoop      41                              29                                  286                                  407
HBase       27                              25                                  87                                   397
Hibernate   2                               2                                   141                                  332
Geronimo    0                               0                                   21                                   32
Total       70 (13%)                        56 (10%)                            535                                  1,168


For Hadoop, HBase, Hibernate, and Geronimo, less than half (535/1,168 × 100% = 46%) of the bug reports have valid data in the "affected-version" field. Using the SZZ algorithm, 13% of the computed versions are after the earliest affected versions (a.k.a., incorrectly identified LCII changes). Using the LCC-SZZ algorithm, this number reduces to 10%, which is a 30% improvement compared to the original SZZ algorithm. The majority of these improvements come from Hadoop, which also happens to contain the largest number of bug reports with valid data in the "affected-version" field.
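The validation rule itself reduces to a date comparison; a minimal sketch, reusing the worked example from the text:

import java.time.LocalDate;

public class AffectedVersionCheck {
    // A computed introducing change is flagged as incorrect when its commit date
    // falls after the release date of the earliest affected version in the bug
    // report; an earlier date is inconclusive rather than confirmed correct.
    public static boolean isIncorrect(LocalDate introducingCommitDate,
                                      LocalDate earliestAffectedVersionDate) {
        return introducingCommitDate.isAfter(earliestAffectedVersionDate);
    }

    public static void main(String[] args) {
        // Affected version dated January 1, 2012; SZZ output committed
        // October 1, 2012 -> flagged as incorrect.
        System.out.println(isIncorrect(
                LocalDate.of(2012, 10, 1), LocalDate.of(2012, 1, 1))); // true
    }
}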

5.2.3 Manual Verification

To further evaluate the results, we did a stratified sampling on all the fixes to the LCII changes, and went through the logging code change history to find the commits where the logging code snippets became issues. In this way, we generated a manually verified oracle dataset for the correctly identified LCII changes.

We calculated the difference ratio between the oracle and the results from the SZZ/LCC-SZZ algorithms. The results are shown in Table 4. In total, we manually examined 368 fixes to the LCII changes and derived the oracle for the code commits containing the LCII changes. 85% of the SZZ detection results agree with the oracle, while 93% of the LCC-SZZ results agree with the oracle. All of the LCII changes correctly identified by the SZZ algorithm are also correctly identified in the LCC-SZZ results.

For the cases where both LCC-SZZ and SZZ failed, we manually examined a few of them. We found that the misclassified instances are mainly related to logging style changes. Figure 7 shows an example. The logging code style was changed from string concatenation to format printing. The logging code had used the string concatenation style since its introduction at version 658. Therefore, version 658 should be the LCII change version. However, there is a change from "Index" to "index" (fixing a typo) from version 658 to 715. Both SZZ and LCC-SZZ mistakenly label 715 as the version containing the LCII change.

Summary: our evaluation results show that: (1) 13.6% of the identified LCII changes obtained from the LCC-SZZ algorithm are different from the original SZZ algorithm; (2) when evaluating under the earliest appearance of the LCII changes, there are more incorrectly flagged results (3%) by the SZZ algorithm compared to the LCC-SZZ algorithm; (3) the LCC-SZZ algorithm achieves 93% accuracy when compared to the oracle, which is an 8% improvement compared to the results from the SZZ algorithm.

Table 4 Consistency compared to manual oracles

Project        SZZ    LCC-SZZ    Total

Elasticsearch  25     29         35
Hadoop         37     41         43
HBase          53     60         64
Hibernate      75     76         84
Geronimo       20     22         23
Wildfly        105    115        119
Total          315    343        368


MetaDataService.java in ElasticSearch

V 658: logger.debug("Index [" + index + "]: Update mapping [" + type + "] (dynamic) with source [" + updatedMappingSource + "]")

V 715: logger.debug("index [" + index + "]: Update mapping [" + type + "] (dynamic) with source [" + updatedMappingSource + "]")

V 736: logger.debug("[{}] update mapping [{}] (dynamic) with source [{}]", index, type, updatedMappingSource)

Fig. 7 An example where both SZZ and LCC-SZZ failed

6 Preliminary Studies

In the previous sections (Sections 3, 4 and 5), we have explained our approach to extracting the LCII changes and evaluated the quality of the resulting dataset (93% accuracy based on manual verification). In this section, we will perform a few preliminary studies on this dataset to highlight the usefulness of such data and discuss some of the open problems. We want to study the characteristics of LCII changes by characterizing the intentions behind the fixes to the LCII changes (RQ1), studying the complexity of their fixes (RQ2), the duration to address these issues (RQ3), and the effectiveness of existing automated approaches to detecting logging code issues (RQ4). For each research question, we will describe our extraction process for the required dataset, explain the data analysis process, and discuss its findings and implications.

6.1 RQ1: What are the Intentions Behind the Fixes to the LCII Changes?

In this RQ, we will characterize the intentions behind the fixes to the LCII changes. We do this along two dimensions: (1) what-to-log vs. how-to-log, and (2) co-changed logging code changes vs. independently changed logging code changes.

6.1.1 Data Extraction

Table 5 shows the breakdown of the total number of logging code changes and the number of fixes to the LCII changes. TL shows the total number of logging code changes; IND shows the number of independently changed logging code changes; IND FIX shows the number of IND which are fixes to LCII changes; CO shows the number of co-changed logging code changes; CO FIX shows the number of CO which are fixes to the LCII changes; and T FIX shows the total number of fixes to LCII changes.

Table 5 Summary of the fixes to LCII changes

Project        TL      IND    IND FIX    CO     CO FIX    T FIX

Elasticsearch  2,781   1,004  818        1,777  20        838
Hadoop         2,652   1,121  913        1,531  117       1,030
HBase          3,638   1,497  1,413      2,141  120       1,533
Hibernate      2,619   1,958  1,921      661    64        1,985
Geronimo       1,019   652    527        367    9         536
Wildfly        3,401   2,786  2,733      615    93        2,826
Total          16,110  9,018  8,325      7,092  423       8,748


For example, in Hadoop, out of 2,652 logging code changes, 1,030 (38.8%) are fixes to LCII changes. Out of these 1,030 fixes, 913 (88.6%) are independently changed logging code changes and 117 (11.4%) are co-changed logging code changes. Most of the independently changed logging code changes (92.3%) are fixes to the LCII changes. However, when we look at the co-changed logging code changes, the majority of them (94.0%) are co-changes with the feature code instead of fixes to the LCII changes. Among the total 8,748 fixes to LCII changes, the majority (95.0%) are independently changed logging code changes.

6.1.2 An Overview of the Data Analysis Process

Since there are only 423 co-changed logging code changes which are fixes to the LCII changes, we manually studied all of them. On the other hand, there are many more independently changed logging code changes than co-changed logging code changes (8,325 vs. 423). It would take too much time for us to examine all the instances manually. Hence, we only examined 367 randomly sampled instances. This sample size corresponds to a confidence level of 95% with a confidence interval of 5%. When selecting the sample instances for manual inspection, we applied the stratified sampling technique (Chen and Jiang 2016) to ensure that representative samples are selected and studied from each project. Using the stratified sampling approach, the portion of sampled logging code changes from each project is equal to the relative weight of the total number of independently changed logging code changes. For example, among the total 8,325 independently changed logging code changes from the six studied projects, 913 are from Hadoop. Thus, 40 (a.k.a., 367 × 913/8,325) of the manually studied independently changed code changes were selected from Hadoop. The detailed breakdown of our characterization results is shown in Table 6.

Table 6 Our manual characterization results on intentions behind the fixes to LCII changes

Category     Rationale                             Co-changed    Independently

What-to-log  Adding More Information (AMI)         64            24
             CLArification (CLA)                   84            25
             Fixing Language Problems (FLP)        14            38
             Avoid Excessive Logging (AEL)         38            15
How-to-log   Checking Nullable Variables (CNV)     0             1
             Removing Object Casting (ROC)         0             1
             Refactoring Logging Code (RLC)        0             1
             Changing Output Format (COF)          0             1
             Updating Logging Style (ULS)          82            255
             Adding Logging Guards (ALG)           114           0
             Mess-up Commits in VCS (MCV)          0             6
Others       -                                     27            0
Total        -                                     423           367


In total, we have characterized 4 intentions in the what-to-log category and 7 intentions in the how-to-log category. Among co-changed logging code changes, 27% (114/423 × 100%) are "Adding Logging Guards (ALG)", which is the most common intention. Among independently changed logging code changes, the majority of the fixes (255/367 × 100% = 70%) are for "Updating Logging Style (ULS)". There are also other intentions. However, since the number of these intentions is small, we grouped them into a category called "Others". Below, we explain each characterized intention using real-world code examples.

6.1.3 Detailed Characterization of the Intentions Behind the LCII Changes

Here we will explain our detailed characterization of the intentions behind the LCII changes in the following three categories: what-to-log, how-to-log, and others.

Figure 8 shows real-world code examples for each intention behind the fixes to the LCII changes in the what-to-log category:

– Adding More Information (AMI): extra information is added to the logging code for the purpose of providing additional contextual information. Both co-changed and independently changed logging code changes can have this intention. Figure 8a shows an example of a co-changed logging code change. A method getClientIdAuditPrefix is added in the newer version in order to provide additional runtime context (HBASE-8754 2018d): "this (client IP/port of the balancer invoker) is a critical piece of admin functionality, we should log the IP for it at least ...". In an independently changed logging code change, shown in Fig. 8b, this.serverName was added to provide more information about the SplitLogWorker.

– CLArification (CLA): to clarify the runtime context, some of the outdated or imprecise dynamic information (e.g., local variables or class attributes) is fixed. Both co-changed and independently changed logging code changes can have this intention. Figure 8c shows an example of the co-changed logging code changes. The variable numEdits is updated to numEditsThisLog to output a more accurate number of edits required for this logging context (HDFS-1073 2018a).

Figure 8d shows an example of the independently changed logging code changes. The variable AMConstants.SPECULATOR_CLASS was replaced by MRJobConfig.MR_AM_JOB_SPECULATOR to better reflect the reported error.

– Fixing Language Problems (FLP): the logging code is updated to fix typos or grammar mistakes in texts or dynamic information (e.g., class attributes, local variables, method names, and class names). Both co-changed and independently changed logging code changes can have this intention. Figure 8e shows an example of the co-changed logging code changes. The class attribute AUTHZ_SUCCESSFULL_FOR was misspelled, as reported in HADOOP-8347 (2018c), and was corrected in the next revision. Figure 8f shows an example of the independently changed logging code changes. The word attrubute in the static texts was misspelled and corrected to attribute in the next version.

– Avoid Excessive Logging (AEL): the logging code is updated to avoid excessive logging to reduce the runtime overhead. Both co-changed and independently changed logging code changes can have this intention. Figure 8g shows an example of the co-changed logging code changes. The verbosity level of the logging code was changed from error to debug and a logging guard was also added. This was to reduce the amount of logging to lessen the runtime overhead and the effort to analyze the log files (HBASE-12539 2018b). Figure 8h shows an example of the independently changed logging code changes. The variable and the text tableName were deleted since the variable location already contained this information.


Fig. 8 The intentions behind various fixes to the LCII changes, which are related to what-to-log. For each intention, we have included real-world code examples



Figure 9 shows real-world code examples for each intention behind the fixes to the LCII changes in the how-to-log category:

– Adding Logging Guards (ALG): condition statements are added to ensure that logs are only generated under certain scenarios. Only co-changed logging code changes have this intention. Figure 9a shows one such example. The log guard LOG.isTraceEnabled() is added to ensure the corresponding logs only get generated when trace level logging is enabled in the configuration settings (HHH-6732 2018). Such logging code changes are considered fixes to the LCII changes, since these changes can improve software performance by preventing the unnecessary creation of strings.

Fig. 9 The intentions behind various fixes to the LCII changes, which are related to how-to-log. For each intention, we have included real-world code examples


– Updating Logging Style (ULS): the logging code is updated due to changes in the logging library/method/API. Both co-changed and independently changed logging code changes can have this intention. Figure 9b shows an example of the co-changed logging code changes. The new version of the logging code uses a log-specific API (methodInvocationFailed) to generate logs, as explained in the pull request (PR5906 2018) of the Wildfly project. This style is commonly used in some of the third-party logging libraries like JBoss (JBoss Logging 2018). Figure 9c shows an example of the independently changed logging code changes. The invoked method was changed from trace to tracev. The latter method requires the parameter to be formatted as {0} instead of string concatenation. Such changes can improve the readability of logging code and therefore ease future maintenance activities on the logging code.

– Checking Null Variable (CNV): the logging code is updated to check whether a variable is null to prevent runtime failure (see the sketch after this list). Only independently changed logging code changes have this intention. Figure 9d shows one such example. A null check of the variable hri was added to avoid a null pointer exception. Such changes improve the robustness of the logging code by preventing runtime failures caused by null pointer exceptions.

– Changing Object Cast (COC): the logging code is updated to change the object cast of a variable. Only independently changed logging code changes have this intention. Figure 9e shows one such example. An explicit cast was added to the variable addresses to improve the formatting of the output string. Such changes can improve the readability of the logging code and therefore ease future maintenance activities on it.

– Refactoring Logging Code (RLC): the logging code is updated for refactoring purposes. Only independently changed logging code changes have this intention. Figure 9f shows one such example. For the variable shexec, both Arrays.toString(shexec.getExecString()) and shexec.toString() would output the same string. Similar to the above fixes, such changes can improve the readability of the logging code and therefore ease future maintenance activities on it.

– Changing Output Format (COF): the logging code is updated to correct malformed output. Only independently changed logging code changes have this intention. Figure 9g shows one such example. The return type of info.getRegionName() is a byte array, which is not human readable when printed directly. Hence, it was wrapped with the method Bytes.toString() to improve the readability and maintainability of the logging code.

– Mess-up Commits in VCS (MCV): the logging code is updated due to mess-up commits in version control systems (VCS). Only independently changed logging code changes have this intention. Figure 9h shows one such example. The static text region was deleted at first and then added back. This is because developers first merged commits from another branch and later reverted the change. Such changes are mainly logging code maintenance activities; they may or may not improve the quality of the logging code.
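To make the guard, style, and null-check intentions above more concrete, the following minimal sketches illustrate each of them in isolation. They are our own illustrations rather than code from the studied projects: the logger setup and the names regionName, elapsedMs, tableName, and hri are hypothetical.

A log guard (ALG), assuming an SLF4J-style logger, skips message construction unless the corresponding level is enabled:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class GuardedLoggingExample {
        private static final Logger LOG = LoggerFactory.getLogger(GuardedLoggingExample.class);

        void process(String regionName, long elapsedMs) {
            // Without a guard, the message string is concatenated even when TRACE is off.
            LOG.trace("Processed region " + regionName + " in " + elapsedMs + " ms");

            // With a guard, the concatenation is skipped unless TRACE is enabled.
            if (LOG.isTraceEnabled()) {
                LOG.trace("Processed region " + regionName + " in " + elapsedMs + " ms");
            }
        }
    }

A logging style update (ULS), assuming the JBoss Logging API, replaces string concatenation with the tracev method and a MessageFormat-style {0} placeholder:

    import org.jboss.logging.Logger;

    class LoggingStyleExample {
        private static final Logger LOG = Logger.getLogger(LoggingStyleExample.class);

        void lookup(String tableName) {
            // Old style: string concatenation builds the message eagerly.
            LOG.trace("Looking up table " + tableName);

            // New style: a format string with a positional parameter.
            LOG.tracev("Looking up table {0}", tableName);
        }
    }

A null-variable check (CNV) keeps the logging call itself from crashing the surrounding method:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class NullCheckExample {
        private static final Logger LOG = LoggerFactory.getLogger(NullCheckExample.class);

        void report(Object hri) {
            // Calling hri.toString() would throw a NullPointerException when hri is null.
            if (hri != null) {
                LOG.info("Updating readers for region " + hri.toString());
            } else {
                LOG.info("Updating readers: region info unavailable");
            }
        }
    }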

Some fixes to the LCII changes are due to other reasons. Figure 10 shows one such example. The system functionality has been changed from renaming to moving, as stated in the bug report (HDFS-11448 2018b). Thus, the logging texts are updated as well. However, the numbers of instances for such intentions are very small. Hence, we grouped them together into one category ("Others").


Fig. 10 An example of fixes to LCII changes due to other reasons

6.1.4 Breakdown for Fixes to LCII Changes per Project

We further break down the characterization of the fixes to the LCII changes for each of the studied projects. The results are shown in Table 7.

In general, ULS is the biggest group in three out of six projects (Hibernate, Geronimo, Wildfly). CLA is the biggest group in Elasticsearch (27%) and HBase (24%). In Hadoop, the biggest group is ALG. When we look at the intentions in the what-to-log category, CLA is the biggest category in five out of the six studied projects and AMI is the second biggest. This shows that developers tend to clarify or add additional information when they fix logging code issues related to what-to-log. In how-to-log, ULS is the biggest category in four out of six projects. The second biggest intention is ALG. There are fewer than 1% MCV instances, which are due to merging issues in the version control systems.

Table 7 Our manual characterization results on the intentions behind the fixes to the LCII changes

Category     Rationale  Elasticsearch  Hadoop  HBase  Hibernate  Geronimo  Wildfly
What-to-log  AMI        8              29      42     0          8         1
             CLA        15             30      44     9          1         10
             FLP        10             14      19     2          1         6
             AEL        12             15      24     0          1         1
How-to-log   CNV        0              0       1      0          0         0
             COC        1              0       0      0          0         0
             RLC        0              1       0      0          0         0
             COF        0              0       1      0          0         0
             ULS        10             23      13     77         19        195
             ALG        0              38      17     57         2         0
             MCV        0              2       2      1          0         1
Others       -          0              5       22     0          0         0
Total        -          56             157     185    146        32        214


6.1.5 Summary

Findings: Contrary to our previous study (Chen and Jiang 2017), both co-changed and independently changed logging code changes can contain fixes to the LCII changes. The majority of the fixes to the LCII changes are from the independently changed logging code changes. In total, there are eleven different intentions behind the fixes to the LCII changes: four are related to what-to-log, and seven are related to how-to-log. Among them, "Clarification" and "Updating Logging Style" are the top two intentions.

Implications: Our current approach to characterizing the fixes to the LCII changes is done manually. This is very time consuming and prevents us from studying larger datasets. Advanced text analysis approaches like topic modeling or natural language processing can be helpful to automatically cluster related code fixes and provide summarizations.

6.2 RQ2: Are the Fixes to the LCII Changes more Complex than Other Logging Code Changes?

In the previous RQ, we characterized the intentions behind different fixes to the LCII changes. In this RQ, we intend to compare the complexity of the fixes to the LCII changes with that of other logging code changes. Since there are no existing metrics available to quantify the complexity of a logging code change, we have defined a complexity metric (the number of changed components of the logging code) to characterize how complex a logging code change is. Each logging code snippet contains four components: the logging library, the verbosity level, static texts, and dynamic invocations. A logging code change is more complex if more types of components are changed together. The most complex logging code change is one in which all the components of a logging code snippet have been changed. On the other hand, some logging code changes only involve component position changes without actual content changes. For example, the logging code snippet log.info("text" + var); could be changed to log.info("text", var);, since none of the components' contents have been changed. For such changes, the value of the complexity metric is "0".
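The metric can be computed mechanically once a logging statement has been decomposed into its four components. The following is a minimal sketch of that computation (our own simplification, not the authors' implementation; in practice the components would be extracted with an AST differencing tool such as ChangeDistiller):

    import java.util.Objects;

    /** A logging statement decomposed into its four components. */
    record LogSnippet(String library, String level, String staticText, String dynamicInvocations) {}

    class ChangeComplexity {
        /** Counts how many component types differ between the old and new logging code. */
        static int complexity(LogSnippet before, LogSnippet after) {
            int changed = 0;
            if (!Objects.equals(before.library(), after.library())) changed++;
            if (!Objects.equals(before.level(), after.level())) changed++;
            if (!Objects.equals(before.staticText(), after.staticText())) changed++;
            if (!Objects.equals(before.dynamicInvocations(), after.dynamicInvocations())) changed++;
            return changed; // 0 = only positions changed; 4 = every component changed
        }
    }

Under this sketch, the log.info example above yields complexity 0, since neither the static text nor the dynamic variable changes content, only position.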

6.2.1 Data Extraction

For each logging code change, we tracked whether any logging components have been changed, using heuristics similar to our previous works (Chen and Jiang 2016, 2017), and calculated the change complexity metric defined above.

6.2.2 Data Analysis

Table 8 shows the breakdown of different components for the fixes to the LCII changes and other logging code changes. Generally, there are large differences in three out of four components. 27.5% of the fixes to the LCII changes contain logging library changes, while only 5.1% of the other logging code changes do. The proportions of verbosity level changes in the fixes to the LCII changes and in other logging code changes are 46.4% versus 8.2%. While the fixes to the LCII changes contain more logging library changes and verbosity level changes, only 14.4% of them contain dynamic content changes. On the other hand, 78.6% of other logging code changes contain dynamic content changes. For static text changes, the fixes to the LCII changes and other logging code changes are similar (42.9% versus 47.1%).


Table 8 Number of individual logging component changes

Project              Lib            Level          Static         Dynamic        Total
Elasticsearch LCII   35 (4.2%)      132 (15.8%)    603 (72.0%)    236 (28.2%)    838
Elasticsearch Other  18 (0.9%)      47 (2.4%)      933 (48.0%)    1,633 (84.0%)  1,943
Hadoop LCII          182 (17.7%)    360 (35.0%)    479 (46.5%)    240 (23.3%)    1,030
Hadoop Other         28 (1.7%)      53 (3.3%)      738 (45.5%)    1,253 (77.3%)  1,622
HBase LCII           122 (8.0%)     407 (26.5%)    875 (57.1%)    461 (30.1%)    1,533
HBase Other          11 (0.5%)      48 (2.3%)      847 (40.2%)    1,858 (88.3%)  2,105
Hibernate LCII       742 (37.4%)    1,292 (65.1%)  1,089 (54.9%)  84 (4.2%)      1,985
Hibernate Other      251 (39.6%)    346 (54.6%)    443 (69.9%)    296 (46.7%)    634
Geronimo LCII        123 (22.9%)    291 (54.3%)    141 (26.3%)    63 (11.8%)     536
Geronimo Other       18 (3.7%)      24 (5.0%)      269 (55.7%)    293 (60.7%)    483
Wildfly LCII         1,206 (42.7%)  1,575 (55.7%)  565 (20.0%)    177 (6.3%)     2,826
Wildfly Other        48 (8.3%)      87 (15.1%)     236 (41.0%)    453 (78.8%)    575
Total LCII           2,410 (27.5%)  4,057 (46.4%)  3,752 (42.9%)  1,261 (14.4%)  8,748
Total Other          374 (5.1%)     605 (8.2%)     3,466 (47.1%)  5,786 (78.6%)  7,362

For the fixes to the LCII changes, verbosity level changes rank first, while dynamic content changes rank first in other logging code changes. Static text changes rank second in both types of logging code changes.

Table 9 shows the complexity of the logging code changes for each project. The majority of logging code changes are "1 type" changes (72.9% of the fixes to the LCII changes and 66.8% of other logging code changes). The second largest category is the "2 types" logging code changes, at 21.9% and 27.9% for the fixes to the LCII changes and other logging code changes, respectively. The "3 types" and "4 types" logging code changes only make up 4.8% and 5.3% of the fixes to the LCII changes and other logging code changes, respectively. Only 0.5% of the fixes to the LCII changes are categorized as "0 type" changes. Hence, in terms of change complexity, the fixes to the LCII changes and other logging code changes are similar.

We further examined the two most frequently occurring complexity categories: the "1 type" and "2 types" changes.

The results for the "1 type" changes are shown in Table 10. For each component, there are two columns: the column on the left shows the count of co-changed logging code changes which change that component, and the column on the right shows the count of independently changed logging code changes which change that component. For example, there are 58 "1 type" co-changed logging code changes which change static texts, and 1,661 such independently changed logging code changes. Among all components, changes related to verbosity levels are the most common, accounting for 36.8%. None of the co-changed logging code changes are related to the logging library or the verbosity level, which is a big difference from the independently changed logging code changes.

The detailed results for the "2 types" changes are shown in Table 11. Changes related to static texts and verbosity levels are the two most common: 84.5% of the changes modify static texts and 68.1% of the changes modify verbosity levels. The independently changed and co-changed logging code changes have similar distributions.


Table 9 The complexity of logging code changes grouped by each project

Project              0 type     1 type         2 types        3 types      4 types    Total
Elasticsearch LCII   1 (0.1%)   668 (79.7%)    169 (20.2%)    0 (0.0%)     0 (0.0%)   838
Elasticsearch Other  0 (0.0%)   1,287 (66.2%)  625 (32.2%)    30 (1.5%)    1 (0.1%)   1,943
Hadoop LCII          9 (0.9%)   799 (77.6%)    205 (19.9%)    16 (1.6%)    1 (0.1%)   1,030
Hadoop Other         0 (0.0%)   1,209 (74.5%)  381 (23.5%)    27 (1.7%)    5 (0.3%)   1,622
HBase LCII           1 (0.1%)   1,217 (79.4%)  298 (19.4%)    16 (1.0%)    1 (0.1%)   1,533
HBase Other          0 (0.0%)   1,477 (70.2%)  598 (28.4%)    29 (1.4%)    1 (0.0%)   2,105
Hibernate LCII       3 (0.2%)   952 (48.0%)    838 (42.2%)    189 (9.5%)   3 (0.2%)   1,985
Hibernate Other      0 (0.0%)   181 (28.5%)    219 (34.5%)    219 (34.5%)  15 (2.4%)  634
Geronimo LCII        15 (2.8%)  431 (80.4%)    84 (15.7%)     5 (0.9%)     1 (0.2%)   536
Geronimo Other       0 (0.0%)   372 (77.0%)    104 (21.5%)    4 (0.8%)     3 (0.6%)   483
Wildfly LCII         13 (0.5%)  2,307 (81.6%)  318 (11.3%)    172 (6.1%)   16 (0.6%)  2,826
Wildfly Other        0 (0.0%)   393 (68.3%)    125 (21.7%)    47 (8.2%)    10 (1.7%)  575
Total LCII           42 (0.5%)  6,374 (72.9%)  1,912 (21.9%)  398 (4.5%)   22 (0.3%)  8,748
Total Other          0 (0.0%)   4,919 (66.8%)  2,052 (27.9%)  356 (4.8%)   35 (0.5%)  7,362

Findings: the majority of the fixes to the LCII changes involve changing the verbosity level, while the majority of the other logging code changes involve changing dynamic contents. The two types of logging code changes (fixes to the LCII changes and other logging code changes) are similar in terms of change complexity. The majority of both types of logging code changes only have one or two component changes (a.k.a. "1 type" and "2 types" logging code changes). Among the "1 type" logging code changes, the most common changes are verbosity level changes, followed by the logging library and static text changes tied at second place. Among the "2 types" logging code changes, the top two most common changes are changes in the verbosity level and static texts.

Implications: correcting logging code issues only requires changing one or two components in most cases, with a particular focus on the verbosity levels and the static texts. Although there are existing automated approaches (e.g., Chen and Jiang 2017; Yuan et al. 2011) to helping developers determine the appropriate verbosity level for each logging code snippet, there are no techniques focusing on detecting and improving the static texts component.

6.3 RQ3: How Long does it Take to Fix an LCII Change?

In this RQ, we measure how long it takes to resolve an LCII change, and we compare the resolution time between LCII changes and reported bugs.

6.3.1 Data Extraction

We first extracted the bug creation date and the bug resolution date from the bug reports in order to compute the bug resolution time. To compute the resolution time for each LCII change, we computed the time difference between the LCII change and its fix.
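As a minimal sketch (our own illustration, not the authors' tooling), both durations reduce to the same timestamp arithmetic; for an LCII change, introducedAt would be the commit time of the LCII change and fixedAt the commit time of its fix:

    import java.time.Duration;
    import java.time.Instant;

    class ResolutionTime {
        /** Resolution time in days between an introducing and a fixing event. */
        static double resolutionDays(Instant introducedAt, Instant fixedAt) {
            return Duration.between(introducedAt, fixedAt).toMillis() / 86_400_000.0;
        }
    }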


Table 10 Component changes for the "1 type" changes

Project        Lib (Co / Ind)     Level (Co / Ind)   Text (Co / Ind)           Dynamic (Co / Ind)       Total
Elasticsearch  0 / 30             0 / 103            2 / 441                   8 / 84                   668
Hadoop         0 / 120            0 / 261            42 / 273                  23 / 80                  799
HBase          0 / 69             0 / 312            10 / 601                  50 / 175                 1,217
Hibernate      0 / 530            0 / 297            1 / 70                    4 / 50                   952
Geronimo       0 / 71             0 / 231            2 / 97                    5 / 25                   431
Wildfly        0 / 899            0 / 1,144          1 / 179                   43 / 41                  2,307
Total          0 / 1,719 (27.0%)  0 / 2,348 (36.8%)  58 (0.9%) / 1,661 (26.1%)  133 (2.1%) / 454 (7.1%)  6,374


Table 11 Component changes for the "2 types" changes

Project        Lib (Co / Ind)          Level (Co / Ind)           Static (Co / Ind)           Dynamic (Co / Ind)        Total
Elasticsearch  0 / 5                   0 / 29                     9 / 151                     9 / 135                   169
Hadoop         0 / 55                  18 / 65                    38 / 109                    28 / 97                   205
HBase          0 / 51                  14 / 65                    44 / 203                    30 / 189                  298
Hibernate      0 / 33                  55 / 748                   54 / 772                    1 / 13                    838
Geronimo       0 / 49                  0 / 54                     0 / 36                      0 / 29                    84
Wildfly        11 / 124                2 / 251                    14 / 185                    15 / 34                   318
Total          11 (0.6%) / 317 (16.6%)  89 (4.7%) / 1,212 (63.4%)  159 (8.3%) / 1,456 (76.2%)  83 (4.3%) / 497 (26.0%)  1,912


Table 12 The number of fixes to the LCII changes and the number of bug reports in each project

Project        # of LCII fixes  # of bug reports
Elasticsearch  838              N/A
Hadoop         1,030            31,420
HBase          1,533            15,486
Hibernate      1,985            9,009
Geronimo       536              5,594
Wildfly        2,826            14,469

6.3.2 Data Analysis

Table 12 shows the number of fixes to the LCII changes and the number of regular bug reports for each studied project. There are many more reported bugs than fixes to the LCII changes. For example, in Hadoop, there are 1,030 fixes to the LCII changes but 31,420 bug reports (almost 30 times more). Elasticsearch does not have an issue tracking system, so we only show its number of fixes to the LCII changes.

Table 13 shows the median resolution time for both LCII changes and regular bugs in each project. For example, in Hadoop, the median resolution time for LCII changes is 210.9 days, whereas it is 15.5 days for regular bugs. For all five studied projects, the median resolution time of the LCII changes is longer than that of regular bugs. In particular, the median resolution time can be as long as 379.2 days for the fixes to the LCII changes (Wildfly), while the longest median resolution time for regular bugs is only 30.9 days (Hibernate).

Figure 11 visually compares the distribution of the resolution time for the fixes to the LCII changes with the duration for fixing regular bugs. Each plot is a beanplot (Kampstra 2008). The left part of each plot shows the distribution of the durations for fixing the LCII changes, while the right part shows the distribution of the bug resolution time. The vertical scale is in units of the natural logarithm of days. Among the five studied projects, the left parts of the plots are always higher (i.e., contain higher extreme points) than the right parts.

To statistically compare the resolution time for LCII changes and regular bugs across all five projects, we performed the non-parametric Wilcoxon rank-sum test (WRS). Table 13 shows the results. The two-sided WRS test shows that the resolution time of LCII changes is significantly different from that of regular bugs (p < 0.001) in all five projects. To assess the magnitude of the differences between the resolution times for LCII changes and regular bugs, we have also calculated the effect sizes using Cliff's Delta for all five projects in Table 13.

Table 13 Comparing the resolution time between the LCII changes and regular bugs

Project    LCII   Bugs  p-value for WRS  Cliff's Delta (d)
Hadoop     210.9  15.5  <0.001           −0.54 (large)
HBase      128.6  6.5   <0.001           −0.5 (large)
Hibernate  80.4   30.9  <0.001           −0.13 (negligible)
Geronimo   96.9   14.0  <0.001           −0.33 (small)
Wildfly    379.2  18.9  <0.001           −0.63 (large)

The unit of resolution time for both the LCII changes and regular bugs is days


Fig. 11 Comparing the resolution time of LCII and regular bugs (beanplots of LCII vs. Bug for Hadoop, HBase, Hibernate, Geronimo, and Wildfly; vertical axis in ln(days))

The strength of the effects and the corresponding ranges of Cliff's Delta (d) values (Romano et al. 2006) are defined as follows:

$$\text{effect size} = \begin{cases} \text{negligible} & \text{if } |d| \le 0.147 \\ \text{small} & \text{if } 0.147 < |d| \le 0.33 \\ \text{medium} & \text{if } 0.33 < |d| \le 0.474 \\ \text{large} & \text{if } 0.474 < |d| \end{cases}$$
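Cliff's Delta itself is straightforward to compute from the two samples being compared. Below is a minimal sketch (our own illustration, not the statistical package used in the study) that counts pairwise wins and losses and maps |d| to the bands above:

    class CliffsDelta {
        /** Cliff's delta: (#{x > y} - #{x < y}) / (|xs| * |ys|). */
        static double delta(double[] xs, double[] ys) {
            long greater = 0, less = 0;
            for (double x : xs) {
                for (double y : ys) {
                    if (x > y) greater++;
                    else if (x < y) less++;
                }
            }
            return (greater - less) / (double) ((long) xs.length * ys.length);
        }

        /** Maps |d| to the effect-size bands of Romano et al. (2006). */
        static String magnitude(double d) {
            double a = Math.abs(d);
            if (a <= 0.147) return "negligible";
            if (a <= 0.33) return "small";
            if (a <= 0.474) return "medium";
            return "large";
        }
    }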

Our results show that the effect sizes for three out of five projects (Hadoop, HBase, and Wildfly) are large, whereas the remaining projects have negligible (Hibernate) or small (Geronimo) effect sizes. However, the actual rationales behind the longer resolution time for logging code issues are not clear, as there are both high-priority and low-priority issues among regular bugs and logging code issues. In the future, we plan to investigate this further by surveying the developers or the domain experts of these projects.

We further compared the resolution time between the LCII changes fixed by co-changed logging code changes and those fixed by independently changed logging code changes, using a method similar to the one described above. The results are shown in Table 14. Each row corresponds to one project. For example, in Hadoop, the median resolution time of the co-changed LCII changes is 210.9 days, while it is 216.8 days for the independently changed LCII changes. For Elasticsearch, HBase, and Hibernate, the median resolution time of the co-changed LCII changes is longer. The differences between the two types of LCII changes are only statistically significant in Hibernate and Wildfly (p < 0.001). The magnitude of the differences is either small or negligible for five out of six projects.

We also studied the distribution of the resolution time with respect to the complexity of the LCII changes. The results are visualized in Fig. 12, which contains 2 to 5 violin plots for each project. Some plots are missing because there are not enough instances. Take Elasticsearch for example: there are not enough instances of the "0 type", "3 types", and "4 types" changes.


Table 14 Comparing the resolution time between the co-changed LCII changes and independently changed LCII changes

Project        Co-changed  Independently  p-value for WRS  Cliff's Delta (d)
Elasticsearch  103.9       84.1           0.04             −0.3 (small)
Hadoop         210.9       216.8          0.73             0.02 (negligible)
HBase          183.5       126.1          0.14             −0.08 (negligible)
Hibernate      163.2       80.3           <0.001           −0.72 (large)
Geronimo       21.3        96.9           0.27             0.22 (small)
Wildfly        219.3       397.3          <0.001           0.20 (small)

The unit of resolution time for both the co-changed and independently changed LCII changes is days

Hence, there are no corresponding plots for this project. The vertical scale for all the plots is in units of the natural logarithm of days. As the violin plots show, the distribution differs from project to project. This demonstrates that the complexity of the fixes to the LCII changes does not necessarily impact the resolution time.

Fig. 12 Comparing the resolution time of different types of LCII changes (violin plots per project — Elasticsearch, Hadoop, HBase, Hibernate, Geronimo, Wildfly — over the change types 0 to 4; vertical axis: bug resolution time in ln(days))


6.3.3 Summary

Findings: the resolution times of the LCII changes and regular bugs are statistically different in all studied projects. The median resolution time of the LCII changes is much longer than that of regular bugs, and the magnitude of the differences is large in three out of the five studied projects. Within the LCII changes, there is no clear statistical difference between the types of changes (co-changed vs. independently changed) or the complexity (1 type vs. 2 types vs. 3 types vs. 4 types) of the changes.

Implications: although software logging is widely used in many contexts (e.g., debugging, runtime monitoring, business decision making, and security), the correctness of the logging code cannot be easily verified using conventional software verification techniques. The issues in the logging code can take much longer to be detected and fixed than regular bugs, as most of the existing issues in the logging code can only be detected and fixed manually. Simple logging code changes may take as much time as complex logging code changes. Automated approaches to validating and improving the quality of the logging code are becoming an increasingly important research area.

6.4 RQ4: Are State-of-the-art Code Detectors Able to Detect Logging Code with Issues?

In this RQ, we study the effectiveness of the state-of-the-art techniques at detecting issues in the logging code. There are two techniques in the existing research literature attempting to flag issues in the logging code:

– Rule-based Static Analysis (LCAnalyzer): this technique was proposed by Chen and Jiang (2017). It scans through the source code searching for instances of six anti-patterns using a set of rules.

– Code Clones (Cloning): this technique was proposed by Yuan et al. (2012b). It uses a code clone detection tool to find all the clone groups and checks whether the logging code snippets in each clone group have the same verbosity level. If the levels are inconsistent, at least one of them has an incorrect verbosity level (a sketch of this heuristic follows this list).
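The following minimal sketch illustrates the Cloning heuristic, assuming the clone groups have already been extracted by a clone detector; the LogStatement record and the method names are our own, not from the original tool:

    import java.util.List;
    import java.util.stream.Collectors;

    final class CloneLevelChecker {
        /** A logging statement found inside a cloned code region. */
        record LogStatement(String file, int line, String level) {}

        /** Returns the clone groups whose logging statements use more than one
            verbosity level; at least one statement in each such group is suspicious. */
        static List<List<LogStatement>> findInconsistent(List<List<LogStatement>> cloneGroups) {
            return cloneGroups.stream()
                    .filter(group -> group.stream()
                            .map(LogStatement::level)
                            .collect(Collectors.toSet())
                            .size() > 1)
                    .collect(Collectors.toList());
        }
    }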

6.4.1 Data Extraction

Our extracted dataset contains all the LCII changes and their fixes throughout the entire development history. However, both techniques listed above need to be applied to the entire source code of one release of each project, since LCAnalyzer needs to extract code dependency information and Cloning needs to scan all the source code to find code clones. Therefore, we selected bi-monthly snapshots of the studied projects and ran both techniques on them.

We extracted bi-monthly snapshots for all the studied projects. For each snapshot, we computed the number of existing logging code issues using our dataset. For each logging code issue, we kept track of the time/version at which the issue was introduced and fixed, so that we can track the number of logging code issues in each snapshot by checking its timestamp against the various commits: a logging code issue exists in a snapshot if it was introduced before the release time of the snapshot and fixed after the release time of the snapshot.
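This membership test reduces to a simple interval check against the snapshot's release time; a minimal sketch (our own illustration, with hypothetical parameter names):

    import java.time.Instant;

    final class SnapshotFilter {
        /** An issue is present in a snapshot if it was introduced before the
            snapshot's release time and only fixed after that release time. */
        static boolean existsInSnapshot(Instant introducedAt, Instant fixedAt, Instant releaseAt) {
            return introducedAt.isBefore(releaseAt) && fixedAt.isAfter(releaseAt);
        }
    }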

To avoid repeated counts, we counted each unique logging code issue once. For example, if a piece of logging code with an issue exists in two snapshots and is detected by the above


techniques, we only count it once. Across all the snapshots, we computed the total number of unique logging code issues and aggregated the total number detected by the two approaches.

In order to further characterize the capability of these two detection techniques, we classified the logging code with issues into four categories: (1) issues in the logging library, (2) issues in the verbosity level, (3) issues in the static text, and (4) issues in the dynamic contents. Note that Cloning can only detect issues in the verbosity level. We obtained LCAnalyzer from the Replication Package for the LCAnalyzer work (2018) and implemented the Cloning technique ourselves.

6.4.2 Data Analysis

The detection results of LCAnalyzer and Cloning are shown in Fig. 13. We calculated the recall of the detection results as $\frac{\text{Number of unique LCII detected}}{\text{Number of all unique LCII}} \times 100\%$. In total, only 2.1% of the logging code issues have been detected by LCAnalyzer, and only 0.1% of them have been detected by the Cloning technique. We then split the issues based on their problematic components. For example, 0.3% of the verbosity level related issues are detected by the Cloning technique while 1.5% of them are detected by LCAnalyzer.

In general, LCAnalyzer performs best on issues related to dynamic contents, yet even there it is only able to detect 5.1% of the issues across the total of 60 snapshots. The worst performance of LCAnalyzer is 1.0% recall, for issues related to the logging library. The Cloning technique, by design, can only detect verbosity level related issues, yet it is still outperformed there by LCAnalyzer (0.3% vs. 1.5%).

When we compared the correctly detected instances from the two studied tools, we found that the two techniques complement each other: all the logging code issues found by the Cloning technique are not detected by LCAnalyzer, and vice versa. To demonstrate the differences between these two approaches, we show two real-world examples in Fig. 14. LCAnalyzer can only detect the issue at the top, whereas Cloning can only detect

Fig. 13 Comparing the recall of the detection results for the two studied techniques (recall per component — Lib, Level, Text, Dynamic, Total — for LCAnalyzer and Cloning; vertical axis: 0.0% to 10.0%)


Fig. 14 Examples of the detected issues in the logging code from the two studied techniques

LCAnalyzer (ResourceMgrDelegate.java in Hadoop):

    V 1255: LOG.info("DEBUG --- getStagingAreaDir: dir=" + path);
    V 1256: LOG.debug("getStagingAreaDir: dir=" + path);

Code Clone (UniformSizeInputFormat.java in Hadoop):

    if (currentSplitSize + srcFileStatus.getLen() > nBytesPerSplit
        && lastPosition != 0) {
      ...
      if (LOG.isDebugEnabled())
        LOG.debug("Creating split : " + split + ", bytes in split: " + currentSplitSize);
      ...
      if (LOG.isDebugEnabled())
        LOG.info("Creating split : " + split + ", bytes in split: " + currentSplitSize);

the issue at the bottom. LCAnalyzer can detect the issue at the top because the verbosity level and the static content are inconsistent: the static text indicates that the logging code is for debugging purposes, while the verbosity level is info. In the fixed version, the verbosity level is changed to debug and the text DEBUG is deleted. For the issue at the bottom, the two logging code snippets are considered clones, but the verbosity level of one is debug whereas the other is info; therefore, the info level should be changed to debug. However, even when combining the power of these two techniques, most of the issues in logging code remain undetected.

6.4.3 Summary

Findings: both LCAnalyzer and the Cloning technique can only detect a small fraction (3%) of the issues in logging code. The results output by the two techniques complement each other.

Implications: there are still many logging code issues which cannot be detected by existing automated techniques. Leveraging our provided dataset, researchers can develop and benchmark new techniques for automatically detecting issues in the logging code and deriving effective logging guidelines. As shown in RQ2, the majority of the LCII changes are related to verbosity level changes and static texts; researchers are recommended to prioritize their effort on detecting and improving the issues in these two categories.

7 Related Works

In this section, we discuss three areas of related research: (1) empirical studies on software logging, (2) research on automated suggestions for the logging code, and (3) research on identifying bug introducing changes.


7.1 Empirical Studies on Software Logging

Empirical studies show that there are no well-established logging practices used in industry (Fu et al. 2014; Pecchia et al. 2015) or in open source projects (Chen and Jiang 2016; Yuan et al. 2012b). Although most of the logging code is actively maintained, some of the existing logging code can be confusing and difficult to understand (Shang et al. 2014b). Various studies have been conducted on the relationship between software logging and code quality. Shang et al. (2015) found that the amount of logging code is correlated with the amount of post-release bugs. Kabinna et al. (2016) studied the migration of logging libraries and its rationales; over 70% of the migrated systems were found to be buggy afterwards. In Zhao et al. (2017), the authors analyzed the logging code changes from three open source systems (Hadoop, HBase, and ZooKeeper). They found that the majority of the logging code changes are due to adding logging statements, and many modifications to the existing logging code are related to verbosity level changes.

In this paper, we studied the historical logging code changes from six open source systems. We performed three empirical studies on the commits with a focus on the issues in the logging code and their fixes. First, we compared the change characteristics between the fixes to the logging code and regular logging code changes. Second, we studied the resolution time for fixing logging code issues and compared it against the resolution time for bugs. Finally, we assessed the effectiveness of the state-of-the-art techniques at detecting issues in the logging code. All three research questions, which had never been studied before, leveraged the extracted dataset presented in this paper.

7.2 Research on Automated Suggestions on the Logging Code

Orthogonal to the feature code, the logging code is considered a cross-cutting concern. Adding and maintaining high quality logging code is very challenging and usually requires a large amount of manual effort. Various research has been conducted in the area of automated suggestions for the logging code:

– where-to-log focuses on the problem of providing automated suggestions on where to add logging code. Yuan et al. (2012a) leveraged program-analysis techniques to suggest code locations for adding logging code for debugging purposes. Zhu et al. (2015) learned common logging patterns from existing code and made automated suggestions based on the resulting learned models. Zhao et al. (2017) introduced Log20, which can automate the placement of logging code under a certain overhead threshold.

– what-to-log is related to the problem of adding sufficient information into the logging code. Yuan et al. (2011) proposed a program analysis-based approach to suggest adding additional variables into the existing logging code to improve the diagnosability of software systems. Li et al. (2017a) learned from the development history to automatically suggest the most appropriate verbosity level for each newly-added logging statement.

– how-to-log is about maintaining high quality logging code. Li et al. (2017b) learned from code-level changes to predict whether the logging code in a particular code commit requires updates (a.k.a. just-in-time logging code changes). However, their technique did not pin-point the exact logging code snippets to be updated within each code commit. Yuan et al. (2012b) leveraged code cloning techniques to detect inconsistent verbosity levels among similar feature code segments. Chen and Jiang (2017) inferred six anti-patterns in the logging code by manually sampling a few hundred logging code changes.


This paper fits in the area of "how-to-log" and is closest to Chen and Jiang (2017). However, the technique to extract issues in the logging code and their fixes has been improved in this paper, so that the resulting dataset is more complete and more accurate. Furthermore, instead of only focusing on a small set of sampled instances, this paper provides a dataset containing all the historical issues in the logging code for six popular open source projects. In addition, we have also conducted three empirical studies on the resulting dataset to demonstrate its usefulness and presented some open research problems in the area of "how-to-log".

7.3 Research on Identifying the Bug Introducing Changes

It is important to identify bug introducing changes so that developers and researchers can learn from them and develop tools to detect and fix them automatically. There are generally two steps in identifying bug introducing changes:

– Step 1 - Identifying bug fixing commits: keyword-heuristics-based approaches have been used to identify bug fixing commits by searching for relevant keywords (e.g., "Fixed 42233" or "Bug 23444") in the code commit logs (Bird et al. 2009; Zimmermann et al. 2007).

– Step 2 - Identifying bug introducing commits based on their fixes: Sliwerski et al. (2005) were the first to develop an automatic algorithm to identify bug introducing changes based on their fixes. Their algorithm, called the SZZ algorithm, initially starts at the code commits which fix the issues, and then tracks backward to the previous commits which touched those source code lines. Afterwards, various modifications were made to the SZZ algorithm (Kim et al. 2006; Williams and Spacco 2008a, b) to improve its precision and to evaluate its effectiveness (da Costa et al. 2017).

This paper is different from the above, as it aims to extract and study the LCII changes instead of software bugs. Software bugs are usually reported to the bug tracking systems and their fixes are logged in the code commit messages. However, issues in the logging code are usually undocumented. Thus, in this paper, we have proposed an approach that first identifies fixes to the LCII changes and then automatically identifies the LCII changes based on their fixes. We have analyzed both the co-changed and the independently changed logging code changes to flag fixes to the LCII changes. Since each logging code snippet has multiple components (e.g., logging library, verbosity level, static texts, and dynamic contents) and can relate to more than one line of feature code, there can be multiple issues from different code commits in one line of the logging code. Hence, we have developed an adapted SZZ algorithm (LCC-SZZ) to identify issues in the logging code changes.

8 Threats to Validity

In this section, we will discuss the threats to validity.

8.1 Internal Validity

There are two general types of logging code changes: independently changed and co-changed logging code changes. We have identified the fixes to the LCII changes by carefully analyzing the contents of both types of changes. For the analysis of the co-changed logging code changes, we manually examined 533 instances. We inspected the source code files which contain


those co-changed logging code changes, and examined the context to verify whether these are fixes to LCII changes. This manual investigation was repeated by both authors of this paper. The resulting outputs from both authors were compared and discussed to generate a reconciled dataset in the end. The resulting LCII changes identified using the LCC-SZZ algorithm have also been verified to be highly accurate (93% accuracy).

8.2 External Validity

In this paper, we extracted the LCII changes from six Java-based systems. We believe our approach is generic and can be applied to identify LCII changes in systems implemented in other programming languages (e.g., C, C#, and Python).

We have conducted three empirical studies on the resulting dataset extracted from the development history of six Java-based projects (Elasticsearch, Hadoop, HBase, Hibernate, Geronimo, and Wildfly), which come from different domains (big data platform, web server, database, middleware, etc.). All of these projects have relatively long development histories and are actively maintained. However, our findings for the research questions may not be generalizable to other programming languages.

8.3 Construct Validity

We have used ChangeDistiller (Fluri et al. 2007) to extract the fine-grained code changes across different code versions. ChangeDistiller has been used in many previous works (Chen and Jiang 2016, 2017) and has proven to be highly accurate and very robust.

We have demonstrated that LCC-SZZ is more effective than SZZ at identifying LCII changes along the following three dimensions: (1) the disagreement ratio between the extraction results of the two algorithms; (2) comparing the earliest appearance of the LCII changes against the bug reporting time; and (3) manual verification. Our evaluation approaches are similar to many of the existing studies in this area (e.g., da Costa et al. 2017; Moha et al. 2010; Palomba et al. 2015; Sajnani et al. 2016).

9 Conclusions and Future Work

Software logging has been used widely in large-scale software systems for a variety of purposes.

It is hard to develop and maintain high quality logging code, as it is very challenging to verify the correctness of a particular logging code snippet. To aid the effective maintenance of logging code, in this paper we have extracted and studied the historical issues in logging code and their fixes from six popular Java-based open source projects. To demonstrate the effectiveness of our dataset, we have conducted four preliminary case studies. We have found that both the co-changed and the independently changed logging code changes can contain fixes to the LCII changes. The change complexity metrics of the fixes to the LCII changes and of other logging code changes are similar. It usually takes much longer to address an LCII change than a regular bug. Existing state-of-the-art techniques for detecting logging code issues cannot detect the majority of the issues in logging code. In the future, we plan to further leverage our derived dataset to: (1) develop better techniques to automatically detect issues in logging code, and (2) derive best practices in terms of developing and maintaining high quality logging code.


Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

JBoss Logging (2018) https://jboss-logging.github.io/jboss-logging-tools/. Last accessed: 11/28/2018
PR5906 (2018) Split all log messages into separate module project codes. https://github.com/wildfly/wildfly/pull/5906. Last accessed: 02/14/2018
Replication Package for the LCAnalyzer work (2018) http://www.cse.yorku.ca/zmjiang/share/replication_package/icse2017_chen/LCAnalyzer.zip. Last accessed: 02/14/2018
The AspectJ Project (2016) https://eclipse.org/aspectj/. Last accessed: 08/26/2016
The replication package (2018) http://www.cse.yorku.ca/zmjiang/share/replication_package/emse2018_chen/replication_package.zip. Last accessed: 04/09/2018
Barik T, DeLine R, Drucker S, Fisher D (2016) The bones of the system: a case study of logging and telemetry at Microsoft. In: Companion Proceedings of the 38th International Conference on Software Engineering
Bird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced?: bias in bug-fix datasets. In: Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering (ESEC/FSE)
Chen B, Jiang ZM (2016) Characterizing logging practices in Java-based open source software projects – a replication study in Apache Software Foundation. Empirical Software Engineering
Chen B, Jiang ZM (2017) Characterizing and detecting anti-patterns in the logging code. In: 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp 71–81
da Costa DA, McIntosh S, Shang W, Kulesza U, Coelho R, Hassan A (2017) A framework for evaluating the results of the SZZ approach for identifying bug-introducing changes. IEEE Trans Softw Eng 43(7):641–657
Davies S, Roper M, Wood M (2014) Comparing text-based and dependence-based approaches for determining the origins of bugs. J Softw: Evol Process 26(1):107–139
Ding R, Zhou H, Lou JG, Zhang H, Lin Q, Fu Q, Zhang D, Xie T (2015) Log2: a cost-aware logging mechanism for performance diagnosis. In: Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference (ATC)
Fluri B, Wursch M, Pinzger M, Gall H (2007) Change distilling: tree differencing for fine-grained source code change extraction. IEEE Trans Softw Eng 33(11):725–743
Fowler M, Beck K, Brant J, Opdyke W, Roberts D (1999) Refactoring: improving the design of existing code. Addison-Wesley Longman Publishing Co. Inc., Reading
Fu Q, Zhu J, Hu W, Lou JG, Ding R, Lin Q, Zhang D, Xie T (2014) Where do developers log? An empirical study on logging practices in industry. In: Companion Proceedings of the 36th International Conference on Software Engineering
HADOOP-12666 (2018a) Support Microsoft Azure Data Lake - as a file system in Hadoop. https://issues.apache.org/jira/browse/HADOOP-12666. Last accessed: 02/06/2018
HADOOP-7358 (2018b) Improve log levels when exceptions caught in RPC handler. https://issues.apache.org/jira/browse/HADOOP-7358. Last accessed: 02/14/2018
HADOOP-8347 (2018c) Hadoop Common logs misspell 'successful'. https://issues.apache.org/jira/browse/HADOOP-8347. Last accessed: 02/14/2018
HBASE-10470 (2018a) Import generates huge log file while importing large amounts of data. https://issues.apache.org/jira/browse/HBASE-10470. Last accessed: 01/24/2018
HBASE-12539 (2018b) HFileLinkCleaner logs are uselessly noisy. https://issues.apache.org/jira/browse/HBASE-12539. Last accessed: 02/14/2018
HBASE-750 (2016c) NPE caused by StoreFileScanner.updateReaders. https://issues.apache.org/jira/browse/HBASE-750/. Last accessed: 08/26/2016
HBASE-8754 (2018d) Log the client IP/port of the balancer invoker. https://issues.apache.org/jira/browse/HBASE-8754. Last accessed: 02/14/2018
HDFS-1073 (2018a) Simpler model for Namenode's fs Image and edit Logs. https://issues.apache.org/jira/browse/HDFS-1073. Last accessed: 02/14/2018
HDFS-11448 (2018b) JN log segment syncing should support HA upgrade. https://issues.apache.org/jira/browse/HDFS-11448. Last accessed: 02/14/2018
HDFS-4122 (2018c) Cleanup HDFS logs and reduce the size of logged messages. https://issues.apache.org/jira/browse/HDFS-4122. Last accessed: 02/07/2018
HDFS-5800 (2018d) Typo: soft-limit for hard-limit in DFSClient. https://issues.apache.org/jira/browse/HDFS-5800. Last accessed: 02/14/2018
HHH-6732 (2018) Some logging trace statements are missing guards against unneeded string creation. https://hibernate.atlassian.net/browse/HHH-6732. Last accessed: 02/14/2018
Humble J, Farley D (2010) Continuous delivery: reliable software releases through build, test, and deployment automation. Addison-Wesley Professional, Reading
Jiang ZM, Hassan AE, Hamann G, Flora P (2009) Automated performance analysis of load tests. In: Proceedings of the 25th IEEE International Conference on Software Maintenance (ICSM)
Kabinna S, Bezemer CP, Shang W, Hassan AE (2016) Logging library migrations: a case study for the Apache Software Foundation projects. In: Proceedings of the 13th International Conference on Mining Software Repositories (MSR)
Kampstra P (2008) Beanplot: a boxplot alternative for visual comparison of distributions. J Stat Softw Code Snippets 28(1):1–9
Kiczales G, Lamping J, Mendhekar A, Maeda C, Lopes C, Loingtier JM, Irwin J (1997) Aspect-oriented programming
Kim S, Zimmermann T, Pan K, Whitehead EJJ (2006) Automatic identification of bug-introducing changes. In: 21st IEEE/ACM International Conference on Automated Software Engineering (ASE'06)
Li H, Shang W, Hassan AE (2017a) Which log level should developers choose for a new logging statement? Empir Softw Eng 22(4):1684–1716
Li H, Shang W, Zou Y, Hassan AE (2017b) Towards just-in-time suggestions for log changes. Empir Softw Eng 22(4):1831–1865
Moha N, Gueheneuc YG, Duchien L, Meur AFL (2010) DECOR: a method for the specification and detection of code and design smells. IEEE Transactions on Software Engineering (TSE)
Oliner A, Ganapathi A, Xu W (2012) Advances and challenges in log analysis. Commun ACM 55(2):55–61
Palomba F, Bavota G, Penta MD, Oliveto R, Poshyvanyk D, Lucia AD (2015) Mining version histories for detecting code smells. IEEE Transactions on Software Engineering (TSE)
Pecchia A, Cinque M, Carrozza G, Cotroneo D (2015) Industry practices and event logging: assessment of a critical software development process. In: Companion Proceedings of the 37th International Conference on Software Engineering
Rigby PC, German DM, Storey MA (2008) Open source software peer review practices: a case study of the Apache server. In: Proceedings of the 30th International Conference on Software Engineering (ICSE)
Romano J, Kromrey JD, Coraggio J, Skowronek J (2006) Appropriate statistics for ordinal level data: should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys? In: Annual Meeting of the Florida Association of Institutional Research
Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) SourcererCC: scaling code clone detection to big-code. In: Proceedings of the 38th International Conference on Software Engineering (ICSE)
Shang W, Jiang ZM, Adams B, Hassan AE, Godfrey MW, Nasser M, Flora P (2014a) An exploratory study of the evolution of communicated information about the execution of large software systems. J Softw: Evol Process 26(1):3–26
Shang W, Nagappan M, Hassan AE, Jiang ZM (2014b) Understanding log lines using development knowledge. In: Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution (ICSME)
Shang W, Nagappan M, Hassan AE (2015) Studying the relationship between logging characteristics and the code quality of platform software. Empir Softw Eng 20(1):1–27
Sliwerski J, Zimmermann T, Zeller A (2005) When do changes induce fixes? In: Proceedings of the 2005 International Workshop on Mining Software Repositories
Williams C, Spacco J (2008a) Branching and merging in the repository. In: Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR '08, pp 19–22, New York
Williams C, Spacco J (2008b) SZZ revisited: verifying when changes induce fixes. In: Proceedings of the 2008 Workshop on Defects in Large Software Systems, DEFECTS '08, pp 32–36, New York
Yuan D, Zheng J, Park S, Zhou Y, Savage S (2011) Improving software diagnosability via log enhancement. In: Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
Yuan D, Park S, Huang P, Liu Y, Lee MM, Tang X, Zhou Y, Savage S (2012a) Be conservative: enhancing failure diagnosis with proactive logging. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI)
Yuan D, Park S, Zhou Y (2012b) Characterizing logging practices in open-source software. In: Proceedings of the 34th International Conference on Software Engineering, ICSE '12. IEEE Press, Piscataway, pp 102–112
Zhao X, Rodrigues K, Luo Y, Stumm M, Yuan D, Zhou Y (2017) Log20: fully automated optimal placement of log printing statements under specified overhead threshold. In: Proceedings of the 26th Symposium on Operating Systems Principles (SOSP)
Zhu J, He P, Fu Q, Zhang H, Lyu MR, Zhang D (2015) Learning to log: helping developers make informed logging decisions. In: Proceedings of the 37th International Conference on Software Engineering
Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for Eclipse. In: Proceedings of the Third International Workshop on Predictor Models in Software Engineering (PROMISE)

Boyuan Chen is a PhD student at the Department of Electrical Engineering and Computer Science, York University, Toronto, Canada. He received his MASc from the EECS department, York University, and his BEng degree from the School of Computer Science at the University of Science and Technology of China. His research interests lie within software engineering, in particular, mining software repositories, source code analysis, software logging, and software testing.

Dr. Zhen Ming (Jack) Jiang is an associate professor at the Department of Electrical Engineering and Computer Science, York University in Toronto, Canada. His research interests lie within software engineering and computer systems, with special interests in software performance engineering, software analytics, source code analysis, software architectural recovery, software visualizations, and debugging and monitoring of distributed systems. Some of his research results are already adopted and used in practice on a daily basis. He is the cofounder of the annually held International Workshop on Load Testing and Benchmarking of Software Systems (LTB). He received several Best Paper Awards including ICST 2016, ICSE 2015 (SEIP track), ICSE 2013, and WCRE 2011. He received the BMath and MMath degrees in computer science from the University of Waterloo, and the PhD degree from the School of Computing at Queen's University.