
Noname manuscript No. (will be inserted by the editor)

Categorizing Defects in Infrastructure as Code

Akond Rahman · Sarah Elder · Faysal Hossain Shezan · Vanessa Frost · Jonathan Stallings · Laurie Williams

Received: date / Accepted: date

Abstract Infrastructure as code (IaC) scripts are used to automate the maintenance and configuration of software development and deployment infrastructure. IaC scripts can be complex in nature, containing hundreds of lines of code, and their defects can be difficult to debug and can lead to wide-scale system discrepancies, such as service outages at scale. Use of IaC scripts is getting increasingly popular, yet the nature of defects that occur in these scripts has not been systematically categorized. A systematic categorization of defects can inform practitioners about process improvement opportunities to mitigate defects in IaC scripts. The goal of this paper is to help software practitioners improve their development process of infrastructure as code (IaC) scripts by categorizing the defects in IaC scripts based upon a qualitative analysis of commit messages and issue report descriptions. We mine open

Akond Rahman
Department of Computer Science, North Carolina State University, Raleigh, NC, USA
E-mail: [email protected]

Sarah Elder
Department of Computer Science, North Carolina State University, Raleigh, NC, USA
E-mail: [email protected]

Faysal Hossain Shezan
Department of Computer Science, University of Virginia, Charlottesville, VA, USA
E-mail: [email protected]

Vanessa Frost
Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL, USA
E-mail: [email protected]

Jonathan Stallings
Department of Statistics, North Carolina State University, Raleigh, NC, USA
E-mail: [email protected]

Laurie Williams
Department of Computer Science, North Carolina State University, Raleigh, NC, USA
E-mail: [email protected]

arXiv:1809.07937v1 [cs.SE] 21 Sep 2018


source version control systems collected from four organizations, namely Mirantis, Mozilla, Openstack, and Wikimedia Commons, to conduct our research study. We use 1021, 3074, 7808, and 972 commits that map to 165, 580, 1383, and 296 IaC scripts, respectively, collected from Mirantis, Mozilla, Openstack, and Wikimedia Commons. With 89 raters we apply the defect type attribute of the orthogonal defect classification (ODC) methodology to categorize the defects. We also review prior literature that has used ODC to categorize defects, and compare the defect category distribution of IaC scripts with 26 non-IaC software systems. Respectively, for Mirantis, Mozilla, Openstack, and Wikimedia Commons, we observe that (i) 49.3%, 36.5%, 57.6%, and 62.7% of the IaC defects are syntax and configuration-related defects; and (ii) syntax and configuration-related defects are more prevalent amongst IaC scripts compared to previously-studied non-IaC software. As syntax and configuration-related defects are the dominant defect category for IaC scripts, based upon ODC recommendations, practitioners may improve their development process for IaC scripts by allocating more code inspection, static testing, and unit testing effort for IaC scripts.

Keywords continuous deployment · defect categorization · devops · infrastructure as code · puppet · orthogonal defect classification

1 Introduction

Continuous deployment (CD) is the process of rapidly deploying software or services automatically to end-users (Rahman et al, 2015). The practice of infrastructure as code (IaC) scripts is essential to implement an automated deployment pipeline, which facilitates CD (Humble and Farley, 2010). Information technology (IT) organizations, such as Netflix 1, Ambit Energy 2, and Wikimedia Commons 3, use IaC scripts to automatically manage their software dependencies and construct automated deployment pipelines (Parnin et al, 2017) (Puppet, 2018) (Rahman and Williams, 2018). Commercial IaC tools, such as Ansible 4 and Puppet 5, provide multiple utilities to construct automated deployment pipelines. For example, Puppet provides the 'sshkey' resource to install and manage secure shell (SSH) host keys and the 'service' resource to manage software services automatically (Labs, 2017). Use of IaC scripts has helped IT organizations to increase their deployment frequency. For example, Ambit Energy used IaC scripts to increase their deployment frequency by a factor of 1,200 (Puppet, 2018).

Similar to software source code, IaC scripts can be complex and contain hundreds of lines of code (LOC). For example, Jiang and Adams (Jiang and

1 https://www.netflix.com/
2 https://www.ambitenergy.com/
3 https://commons.wikimedia.org/wiki/Main_Page
4 https://www.ansible.com/
5 https://puppet.com/


Adams, 2015) have reported the median size of an IaC script to vary between 1,398 and 2,486 LOC. Not applying traditional software engineering practices to IaC scripts can be detrimental, often leading IT organizations 'to shoot themselves in the foot' (Parnin et al, 2017). IaC scripts are susceptible to human errors (Parnin et al, 2017) and bad coding practices (Cito et al, 2015), and can experience frequent churn, making scripts susceptible to defects (Jiang and Adams, 2015) (Parnin et al, 2017). Defects in IaC scripts can have serious consequences, as IT organizations such as Wikimedia use these scripts to provision their development and deployment servers and ensure availability of services 6. Any defect in a script can propagate at scale, leading to wide-scale service outages. For example, in January 2017, execution of a defective IaC script erased home directories of around 270 users in cloud instances maintained by Wikimedia 7. Prior research studies (Jiang and Adams, 2015) (Parnin et al, 2017) and the above-mentioned real-world evidence motivate us to systematically study the defects that occur in IaC scripts.

Categorization of defects can guide IT organizations on how to improve their development process. Chillarege et al. (Chillarege et al, 1992) proposed the orthogonal defect classification (ODC) technique, which included a set of defect categories. According to Chillarege et al. (Chillarege et al, 1992), each of these defect categories maps to a certain activity of the development process which needs attention. For example, the prevalence of functional defects can highlight process improvement opportunities in the design phase of the development process. Since the introduction of ODC in 1992, researchers and industry practitioners have widely used ODC to categorize defects for non-IaC software systems written in general purpose programming languages (GPLs), such as database software (Pecchia and Russo, 2012), operating systems (Cotroneo et al, 2013), and safety critical systems for spacecraft (Lutz and Mikulski, 2004). Unlike non-IaC software, IaC scripts use domain specific languages (DSLs) (Shambaugh et al, 2016). DSLs are fundamentally different from GPLs with respect to comprehension, semantics, and syntax (Fowler, 2010) (Voelter, 2013). A systematic categorization of defects in IaC scripts can help in understanding the nature of IaC defects, and provide actionable recommendations for practitioners to mitigate defects in IaC scripts.

The goal of this paper is to help practitioners improve their development process of infrastructure as code (IaC) scripts by categorizing the defects in IaC scripts based upon a qualitative analysis of commit messages and issue report descriptions.

We investigate the following research questions:

RQ1: What process improvement recommendations can we make for infrastructure as code development, by categorizing defects using the defect type attribute of orthogonal defect classification?

6 https://blog.wikimedia.org/2011/09/19/ever-wondered-how-the-wikimedia-servers-are-configured/

7 https://wikitech.wikimedia.org/wiki/Incident_documentation/20170118-Labs


RQ2: What are the differences between infrastructure as code (IaC) and non-IaC software process improvement activities, as determined by their defect category distribution reported in the literature?
RQ3: Can the size of infrastructure as code (IaC) scripts provide a signal for improvement of the IaC development process?
RQ4: How frequently do defects occur in infrastructure as code scripts?

We use open source datasets from four organizations, Mirantis, Mozilla, Openstack, and Wikimedia Commons, to answer the four research questions. We use 1021, 3074, 7808, and 972 commits that map to 165, 580, 1383, and 296 IaC scripts, respectively, collected from Mirantis, Mozilla, Openstack, and Wikimedia Commons. With the help of 89 raters, we apply qualitative analysis to commit messages from repositories to analyze the defects that occur in IaC scripts. We apply the defect type attribute of ODC (Chillarege et al, 1992) to categorize the defects that occur in the IaC scripts. We compare the distribution of defect categories found in the IaC scripts with the categories for 26 non-IaC software systems, as reported in prior research studies, which used the defect type attribute of ODC and were collected from IEEEXplore 8, ACM Digital Library 9, ScienceDirect 10, and SpringerLink 11.

We list our contributions as follows:

– A categorization of defects that occur in IaC scripts;
– A comparison of the distribution of IaC defect categories to that of non-IaC software found in prior academic literature; and
– A set of curated datasets in which a mapping of defect categories and IaC scripts is provided.

We organize the rest of the paper as follows: Section 2 provides background information and prior research work relevant to our paper. Section 3 describes the methodology we use in this paper. We use Section 4 to describe our case studies. We present our findings in Section 5. We discuss our findings in Section 6. We list the limitations of our paper in Section 7. Finally, we conclude the paper in Section 8.

2 Background and Related Work

In this section, we provide background on IaC scripts and briefly describe related academic research.

8 https://ieeexplore.ieee.org/Xplore/home.jsp
9 https://dl.acm.org/
10 https://www.sciencedirect.com/
11 https://link.springer.com/


# This is an example Puppet script
class ('example') {
  token => 'XXXYYYZZZ'

  $os_name = 'Windows'

  case $os_name {
    'Solaris': { auth_protocol => 'http' }
    'CentOS':  { auth_protocol => getAuth() }
    default:   { auth_protocol => 'https' }
  }
}

Annotated elements: comment, attribute 'token', variable '$os_name', case conditional, calling function 'getAuth()'.

Fig. 1: Annotation of an example Puppet script.

2.1 Background

IaC is the practice of automatically defining and managing network and system configurations, and infrastructure, through source code (Humble and Farley, 2010). Companies widely use commercial tools such as Puppet to implement the practice of IaC (Humble and Farley, 2010) (Jiang and Adams, 2015) (Shambaugh et al, 2016). We use Puppet scripts to construct our dataset because Puppet is considered one of the most popular tools for configuration management (Jiang and Adams, 2015) (Shambaugh et al, 2016), and has been used by companies since 2005 (McCune and Jeffrey, 2011). Typical entities of Puppet include modules and manifests (Labs, 2017). A module is a collection of manifests. Manifests are written as scripts that use a .pp extension.

In a single manifest script, configuration values can be specified using variables and attributes. Puppet provides the utility 'class' that can be used as a placeholder for the specified variables and attributes. For better understanding, we provide a sample Puppet script with annotations in Figure 1. For attributes, configuration values are specified using the '=>' sign, whereas for variables, configuration values are provided using the '=' sign. A single manifest script can contain one or multiple attributes and/or variables. In Puppet, variables store values and have no relationship with resources. Attributes describe the desired state of a resource. Similar to general purpose programming languages, code constructs such as functions/methods, comments, and conditional statements are also available for Puppet scripts.

2.2 Related Work

Our paper is related to empirical studies that have focused on IaC technologies, such as Puppet. Sharma et al. (Sharma et al, 2016) investigated smells in IaC


scripts and proposed 13 implementation and 11 design configuration smells. Hanappi et al. (Hanappi et al, 2016) investigated how convergence of Puppet scripts can be automatically tested, and proposed an automated model-based test framework. Jiang and Adams (Jiang and Adams, 2015) investigated the co-evolution of IaC scripts and other software artifacts, such as build files and source code. They reported that IaC scripts experience frequent churn. In a recent work, Rahman and Williams (Rahman and Williams, 2018) characterized defective IaC scripts using text mining, and created prediction models using text feature metrics. Bent et al. (van der Bent et al, 2018) proposed and validated nine metrics to detect maintainability issues in IaC scripts. Rahman et al. (Rahman et al, 2017) investigated which factors influence usage of IaC tools. In another work, Rahman et al. (Rahman et al, 2018) investigated the questions that programmers ask on Stack Overflow to identify the potential challenges programmers face while working with Puppet.

The above-mentioned studies motivate us to explore the area of IaC by taking a different stance: we analyze the defects that occur in Puppet scripts and categorize them using the defect type attribute of ODC for the purpose of informing process improvement decisions.

Our paper is also related to prior research studies that have categorized defects of software systems using ODC and non-ODC techniques. We briefly describe the related studies as follows:

Chillarege et al. (Chillarege et al, 1992) introduced the ODC technique in 1992. They proposed eight categories as a 'defect type attribute' that are orthogonal to each other. Since then, researchers have used the defect type attribute of ODC to categorize defects that occur in software. Duraes and Madeira (Duraes and Madeira, 2006) studied 668 faults from 12 software systems and reported that 43.4% of the 668 defects were algorithm defects and 21.9% of the defects were assignment-related defects. Freimut et al. (Freimut et al, 2005) applied the defect type attribute of ODC in an industrial setting. Fonseca et al. (Fonseca and Vieira, 2008) used ODC to categorize security defects that appear in web applications. They collected 655 security defects from six PHP web applications and reported that 85.3% of the security defects belong to the algorithm category. Zheng et al. (Zheng et al, 2006) applied ODC on telecom-based software systems, and observed that an average of 35.4% of defects belong to the algorithm category. Lutz and Mikulski (Lutz and Mikulski, 2004) studied defect reports from seven missions of NASA, and observed functional defects to be the most frequent category of the 199 reported defects. Christmansson and Chillarege (Christmansson and Chillarege, 1996) studied 408 faults extracted for the IBM OS, and reported 37.7% and 19.1% of these faults to belong to the algorithm and assignment categories, respectively. Basso et al. (Basso et al, 2009) studied defects from six Java-based software systems, namely Azureus, FreeMind, Jedit, Phex, Struts, and Tomcat, and observed the most frequent category to be algorithm defects. Overall, they reported that the majority of the defects belong to two categories: algorithm and assignment. Cinque et al. (Cinque et al, 2014) analyzed logs from an industrial system that belongs to


the air traffic control domain. In their paper, 58.9% of the 3,159 defects were classified as algorithm defects.

The above-mentioned studies highlight the research community's interest in systematically categorizing the defects in different software systems. These studies focus on non-IaC software, which highlights the lack of studies that investigate defect categories in IaC and further motivates us to investigate defect categories of IaC scripts. Categorization of IaC-related defects can provide practitioners with actionable recommendations on how to mitigate IaC-related defects, and improve the quality of their developed IaC scripts.

3 Categorization Methodology

In this section, we first provide definitions related to our research study, and then provide the details of the methodology we used to categorize the IaC defects.

– Defect: An imperfection in an IaC script that needs to be replaced or repaired (IEEE, 2010).
– Defect-related commit: A commit whose message indicates that an action was taken related to a defect.
– Defective script: An IaC script which is listed in a defect-related commit.
– Neutral script: An IaC script for which no defect has been found yet.

3.1 Dataset Construction

Our methodology of dataset construction involves two steps: repository collection (Section 3.1.1) and commit message processing (Section 3.1.2).

3.1.1 Repository Collection

We use open source repositories to construct our datasets. An open source repository contains valuable information about the development process of an open source project, but the project might have a short development period (Munaiah et al, 2017). This observation motivates us to apply the following selection criteria to identify repositories for mining:

– Criteria-1: The repository must be available for download.
– Criteria-2: At least 11% of the files belonging to the repository must be IaC scripts. Jiang and Adams (Jiang and Adams, 2015) reported that in open source repositories IaC scripts co-exist with other types of files, such as source code and Makefiles. They observed a median of 11% of the files to be IaC scripts. By using a cutoff of 11%, we expect to collect a set of repositories that contain a sufficient amount of IaC scripts for analysis.

8 Akond Rahman et al.

– Criteria-3: The repository must have at least two commits per month. Munaiah et al. (Munaiah et al, 2017) used the threshold of at least two commits per month to determine which repositories have enough activity to develop software for IT organizations. We use this threshold to filter out repositories that contain projects with short development activity. A sketch of how these selection criteria can be applied is shown below.
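Below is a minimal Python sketch of how the two quantitative criteria (Criteria-2 and Criteria-3) could be applied to already-collected repository metadata. The Repo structure, the '.pp' heuristic for identifying IaC scripts, and the helper names are illustrative assumptions, not the authors' actual tooling.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Repo:
    name: str
    files: List[str]                    # all file paths in the repository
    commits_per_month: Dict[str, int]   # e.g. {"2016-01": 7, "2016-02": 3}

def is_iac_script(path: str) -> bool:
    # Puppet manifests use the .pp extension (Section 2.1)
    return path.endswith(".pp")

def satisfies_criteria(repo: Repo,
                       min_iac_ratio: float = 0.11,
                       min_monthly_commits: int = 2) -> bool:
    # Criteria-2: at least 11% of the files must be IaC scripts
    if not repo.files:
        return False
    iac_ratio = sum(is_iac_script(f) for f in repo.files) / len(repo.files)
    if iac_ratio < min_iac_ratio:
        return False
    # Criteria-3: at least two commits in every month of recorded activity
    return all(c >= min_monthly_commits for c in repo.commits_per_month.values())

# Criteria-1 (availability for download) is assumed to be satisfied when the
# repository metadata is collected in the first place.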

3.1.2 Commit Message Processing

Prior research (Ray et al, 2016), (Zhang et al, 2016), (Zhang et al, 2017) leveraged open source repositories that use VCS for software defect studies. We use two artifacts from the VCS of the repositories selected in Section 3.1.1 to construct our datasets: (i) commits that indicate modification of IaC scripts; and (ii) issue reports that are linked with the commits. We use commits because commits contain information on how and why a file was changed. Commits can also include links to issue reports. We use issue report summaries because they can give us more insights on why IaC scripts were changed, in addition to what is found in commit messages. We collect commits and other relevant information in the following manner:

– First, we extract commits that were used to modify at least one IaC script. A commit lists the changes made on one or multiple files (Alali et al, 2008).
– Second, we extract the message of the commit identified from the previous step. A commit includes a message, commonly referred to as a commit message. The commit message indicates why the changes were made to the corresponding files (Alali et al, 2008).
– Third, if the commit message includes a unique identifier that maps the commit to an issue in the issue tracking system, we extract the identifier and use that identifier to extract the summary of the issue. We use regular expressions to extract the issue identifier, and we use the corresponding issue tracking API to extract the summary of the issue; and
– Fourth, we combine the commit message with any existing issue summary to construct the message for analysis. We refer to the combined message as the 'extended commit message (XCM)' throughout the rest of the paper. We use the extracted XCMs to categorize defects, as described in Section 3.2. A sketch of this extraction step is shown below.
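The following Python sketch illustrates the essence of the third and fourth steps for a Bugzilla-style identifier. The 'bug NNNN' pattern and the fetch_issue_summary callback are hypothetical placeholders for the tracker-specific regular expressions and issue tracking APIs mentioned above.

import re
from typing import Callable, Optional

# Hypothetical identifier pattern; real trackers need their own expressions.
BUG_ID_PATTERN = re.compile(r"bug\s+(\d+)", re.IGNORECASE)

def build_xcm(commit_message: str,
              fetch_issue_summary: Callable[[str], Optional[str]]) -> str:
    """Combine a commit message with any linked issue summary into an XCM."""
    match = BUG_ID_PATTERN.search(commit_message)
    if match:
        summary = fetch_issue_summary(match.group(1))  # issue tracker API call
        if summary:
            return f"{commit_message} {summary}"
    return commit_message

# Usage with a stubbed issue tracker lookup:
xcm = build_xcm("bug 867593 use correct regex syntax",
                lambda bug_id: "<summary fetched from the issue tracker>")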

3.2 Categorization of Infrastructure as Code (IaC) Script Defects

We use the defect type attribute of ODC to categorize defects. We select the ODC defect type attribute because ODC uses semantic information collected from the software system to categorize defects (Chillarege et al, 1992). According to the ODC defect type attribute, a defect can belong to one of eight categories: algorithm (AL), assignment (AS), build/package/merge (B), checking (C), documentation (D), function (F), interface (I), and timing/serialization (T).

The collected XCMs derived from commits and issue report descriptions might correspond to feature enhancements or maintenance tasks,


Table 1: Criteria for Determining Defect Categories Based on ODC, adapted from (Chillarege et al, 1992)

Category | Criterion | Process Improvement Activity
Algorithm (AL) | Indicates efficiency or correctness problems that affect the task and can be fixed by re-implementing an algorithm or local data structure. | Coding, Code Inspection, Unit Testing, Function Testing
Assignment (AS) | Indicates changes in a few lines of code. | Code Inspection, Unit Testing
Build/Package/Merge (B) | Indicates defects due to mistakes in change management, library systems, or VCS. | Coding, Low Level Design
Checking (C) | Indicates defects related to data validation and value checking. | Coding, Code Inspection, Unit Testing
Documentation (D) | Indicates defects that affect publications and maintenance notes. | Coding, Publication
Function (F) | Indicates defects that affect significant capabilities. | Design, High Level Design Inspection, Function Testing
Interface (I) | Indicates defects in interacting with other components, modules, or control blocks. | Coding, System Testing
No Defect (N) | Indicates no defects. | Not Applicable
Other (O) | Indicates a defect that does not belong to the categories AL, AS, B, C, D, F, I, N, and T. | Not Applicable
Timing/Serialization (T) | Indicates errors that involve real-time resources and shared resources. | Low Level Design, Low Level Design Inspection, System Testing

which are not related to defects. As an XCM might not correspond to a defect, we added a 'No defect (N)' category. Furthermore, an XCM might not belong to any of the eight categories included in the ODC defect type attribute. Hence, we introduced the 'Other (O)' category. We categorize the XCMs into one of these 10 categories. For the eight ODC categories, we follow the criteria provided by Chillarege et al. (Chillarege et al, 1992), and we use two of our own criteria for the categories 'No defect' and 'Other'. The criteria for each of the 10 categories are described in Table 1.

In Table 1 the ‘Process Improvement Activity Suggested By ODC’ col-umn corresponds to one or multiple activities related to software developmentwhich can be improved based on the defect categories determined by ODC.For example, according to ODC, algorithm-related defects should peak duringthe activities of coding, code inspection, unit testing, and function testing.To reduce the propagation of algorithm-related defects, software teams caninvest more effort for previously-mentioned activities. The ‘Not Applicable’cells correspond to no applicable process improvement recommendations.

We perform qualitative analysis on the collected XCMs to determine the category to which a commit belongs. Qualitative analysis provides the opportunity to increase the quality of the constructed dataset (Herzig et al, 2013).


For performing qualitative analysis, we had raters with software engineering experience apply the 10 categories stated in Table 1 to the collected XCMs. We record the amount of time they took to perform the categorization.

We conduct the qualitative analysis in the following manner:

– Categorization Phase: We randomly distribute the XCMs in a manner so that each XCM is reviewed by at least two raters. We adopt this approach to mitigate the subjectivity introduced by a single rater. Each rater determines the category of an XCM using the 10 categories presented in Table 1. We provide raters with an electronic handbook on IaC (Labs, 2017), and the ODC publication (Chillarege et al, 1992). We do not impose any time constraint on the raters to categorize the defects. We record the agreement level amongst raters using two techniques: (a) by counting the XCMs for which the raters had the same rating; and (b) by computing the Cohen's Kappa score (Cohen, 1960). A sketch of these agreement computations is shown after this list.
– Resolution Phase: For any XCM, raters can disagree on the identified category. In these cases, we use an additional rater's opinion to resolve such disagreements. We refer to the additional rater as the 'resolver'.
– Practitioner Agreement: To evaluate the ratings in the categorization and the resolution phases, we randomly select 50 XCMs for each dataset. We contact the practitioners who authored the commit messages via e-mail. We ask the practitioners if they agree with our categorization of the XCMs. High agreement between the raters' categorization and the practitioners' feedback is an indication of how well the raters performed. The percentage of XCMs upon which practitioners agreed is recorded and the Cohen's Kappa score is computed.
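A small Python sketch of the two agreement measures recorded in the Categorization Phase, assuming the two raters' labels are available as parallel lists; it relies on scikit-learn's cohen_kappa_score, one of several available implementations of Cohen's Kappa.

from sklearn.metrics import cohen_kappa_score

def agreement_stats(rater_a, rater_b):
    """Return (agreed, disagreed, kappa) for two raters' category labels."""
    assert len(rater_a) == len(rater_b)
    agreed = sum(a == b for a, b in zip(rater_a, rater_b))
    kappa = cohen_kappa_score(rater_a, rater_b)
    return agreed, len(rater_a) - agreed, kappa

# Hypothetical labels drawn from the 10 categories of Table 1
agreed, disagreed, kappa = agreement_stats(
    ["AS", "N", "C", "AS", "O"],
    ["AS", "N", "AL", "AS", "N"])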

After applying qualitative analysis based on the 10 defect categories, we find a mapping between each XCM and a category. We list an example XCM that belongs to each of the 10 defect categories in Table 2. We refer to an XCM which is not categorized as 'N' as a 'defect message' throughout the paper. In a similar manner, an XCM categorized as 'AS' identifies that the corresponding commit is related to the category 'Assignment', and the IaC scripts modified in this commit contain an assignment-related defect.

From the defect messages, we separate defect-related commits. Each defect-related commit corresponds to a defect in a script. From the defect-related commits, we determine which IaC scripts are defective, similar to prior work (Zhang et al, 2016). We determine a script to be defective if it contains at least one defect. Defect-related commits list which IaC scripts were changed, and from this list we determine which IaC scripts are defective.

3.3 RQ1: What process improvement recommendations can we make for infrastructure as code development, using the defect type attribute of orthogonal defect classification?

By answering RQ1 we focus on identifying the defect categories that occur in IaC scripts. We use the 10 categories stated in Table 1 to identify the defect


Table 2: Example of Extended Commit Messages (XCMs) for Defect Categories

Algorithm:
– Mirantis: fixing deeper hash merging for firewall
– Mozilla: bug 869897: make watch_devices.sh logs no longer infinitely growing; my thought is logrotate.d but open to other choices here
– Openstack: fix middleware order of proxy pipeline and add missing modules this patch fixes the order of the middlewares defined in the swift proxy server pipeline
– Wikimedia: nginx service should be stopped and disabled when nginx is absent

Assignment:
– Mirantis: fix syntax errors this commit removes a couple of extraneous command introduced by copy/paste errors
– Mozilla: bug 867593 use correct regex syntax
– Openstack: resolved syntax error in collection
– Wikimedia: fix missing slash in puppet file url

Build/Package/Merge:
– Mirantis: params should have been imported; it was only working in tests by accident
– Mozilla: bug 774638 - concat::setup should depend on diffutils
– Openstack: fix db sync dependencies: this patch adds dependencies between the cinder-api and cinder-backup services to ensure that db sync is run before the services are up. change-id:i7005
– Wikimedia: fix varnish apt dependencies these are required for the build it does. also; remove unnecessary package requires that are covered by require package

Checking:
– Mirantis: ensure we have iso/ directory in jenkins workspace we share iso folder for storing fuel isos which are used for fuel-library tests
– Mozilla: bug 1118354: ensure deploystudio user uid is >500
– Openstack: fix check on threads
– Wikimedia: fix hadoop-hdfs-zkfc-init exec unless condition $zookeeper_hosts string was improperly set; since it was created using a non-existent local var '@zoookeeper_hosts'

Documentation:
– Mirantis: fixed comments on pull.
– Mozilla: bug 1253309 - followup fix to review comments
– Openstack: fix up doc string for workers variable change-id:ie886
– Wikimedia: fix hadoop.pp documentation default

Function:
– Mirantis: fix for jenkins swarm slave variables
– Mozilla: bug 1292523 - puppet fails to set root password on buildduty-tools server
– Openstack: make class rally work class rally is created initially by tool
– Wikimedia: fix ve restbase reverse proxy config move the restbase domain and 'v1' url routing bits into the apache config rather than the ve config.

Interface:
– Mirantis: fix for iso build broken dependencies
– Mozilla: bug 859799 - puppetagain buildbot masters won't reconfig because of missing sftp subsystem
– Openstack: update all missing parameters in all manifests
– Wikimedia: fix file location for interfaces

No Error:
– Mirantis: added readme
– Mozilla: merge bug 1178324 from default
– Openstack: add example for cobbler
– Wikimedia: new packages for the server

Other:
– Mirantis: fuel-stats nginx fix
– Mozilla: summary: bug 797946: minor fix-ups; r=dividehex
– Openstack: minor fixes
– Wikimedia: fix nginx config-package-service ordering

Timing/Serialization:
– Mirantis: fix /etc/timezone file add newline at end of file /etc/timezone
– Mozilla: bug 838203 - test_alerts.html times out on ubuntu 12.04 vm
– Openstack: fix minimal available memory check change-id:iaad0
– Wikimedia: fix hhvm library usage race condition ensure that the hhvm lib 'current' symlink is created before setting /usr/bin/php to point to /usr/bin/hhvm instead of /usr/bin/php5. previously there was a potential for race conditions due to resource ordering rules


categories. We answer RQ1 using the categorization achieved through qualitative analysis and by reporting the count of defects that belong to each defect category. We use the metric Defect Count for Category x (DCC), calculated using Equation 1.

Defect Count for Category x (DCC) = (count of defects that belong to category x / total count of defects) × 100    (1)

Answers to RQ1 will give us an overview of the distribution of defect categories for IaC scripts.
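A minimal Python sketch of Equation 1, assuming the per-XCM defect categories from Section 3.2 are available as a list of category codes (Table 1); XCMs labeled 'N' are excluded beforehand, since they are not defects.

from collections import Counter

def defect_count_per_category(categories):
    """DCC (Equation 1): percentage of all defects that fall in each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}

# Hypothetical example: four defects, three of them assignment-related
print(defect_count_per_category(["AS", "AS", "C", "AS"]))  # {'AS': 75.0, 'C': 25.0}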

3.4 RQ2: What are the differences between infrastructure as code (IaC) and non-IaC software process improvement activities, as determined by their defect category distribution reported in the literature?

We answer RQ2 by identifying prior research that has used the defect type attribute of ODC to categorize defects in other systems, such as safety critical systems (Lutz and Mikulski, 2004) and operating systems (Christmansson and Chillarege, 1996). We collect the relevant publications based on the following selection criteria:

– Step-1: The publication must cite Chillarege et al. (Chillarege et al, 1992)'s ODC publication, be indexed by ACM Digital Library, IEEE Xplore, SpringerLink, or ScienceDirect, and have been published on or after 2000. By selecting publications from 2000 onwards, we expect to obtain defect categories of software systems that are relevant and comparable to modern software systems such as IaC.
– Step-2: The publication must use the defect type attribute of ODC in its original form to categorize defects of a software system. A publication might cite Chillarege et al. (Chillarege et al, 1992)'s paper as related work, and not use the ODC defect type attribute for categorization. A publication might also modify the ODC defect type attribute to form more defect categories, and use the modified version of ODC to categorize defects. As we use the original eight categories of ODC's defect type attribute, we do not include publications that modified or extended the defect type attribute of ODC to categorize defects.
– Step-3: The publication must explicitly report the software systems they studied with a distribution of defects across the ODC defect type categories, along with the total count of bugs/defects/faults for each software system. Along with defects, we consider bugs and faults, as in prior work researchers have used bugs (Thung et al, 2012) and faults (Pecchia and Russo, 2012) interchangeably with defects.

Categorizing Defects in Infrastructure as Code 13

Our answer to RQ2 provides a list of software systems with a distribution of defects categorized using the defect type attribute of ODC. For each software system, we report the main programming language in which it was built, and the reference publication.

3.5 RQ3: Can the size of infrastructure as code (IaC) scripts provide a signal for improvement of the IaC development process?

Researchers in prior work (Moller and Paulish, 1993) (Fenton and Ohlsson, 2000) (Hatton, 1997) have observed relationships between source code size and defects for software written in GPLs. A similar investigation for IaC scripts can provide valuable insights, and may help practitioners in identifying scripts that contain defects or specific categories of defects. We investigate if the size of IaC scripts is related to defective scripts by executing the following two steps:

– First, we quantify if defective and neutral scripts in the constructed datasets significantly differ in size. We calculate the size of each defective and neutral script by measuring LOC. Next, we apply statistical measurements between the two groups of defective and neutral scripts. The statistical measurements are the Mann-Whitney U test (Mann and Whitney, 1947) and effect size calculation with Cliff's Delta (Cliff, 1993). Both the Mann-Whitney U test and Cliff's Delta are non-parametric. The Mann-Whitney U test states if one distribution is significantly larger/smaller than the other, whereas effect size using Cliff's Delta measures how large the difference is. Following convention, we report a distribution to be significantly larger than the other if p-value < 0.05. We use Romano et al.'s recommendations to interpret the observed Cliff's Delta values. According to Romano et al. (Romano et al, 2006), the difference between two groups is 'large' if Cliff's Delta is greater than 0.47. A Cliff's Delta value between 0.33 and 0.47 indicates a 'medium' difference. A Cliff's Delta value between 0.14 and 0.33 indicates a 'small' difference. Finally, a Cliff's Delta value less than 0.14 indicates a 'negligible' difference. A statistical sketch of this step is shown after the list. Upon completion of this step we will (i) identify if the sizes of IaC scripts are significantly different between defective and neutral scripts, (ii) identify how large the difference is using effect size, and (iii) report median values for defective and neutral scripts.

– Second, we compare how the size of IaC scripts varies between IaC defect categories by calculating the LOC of scripts that belong to each of the nine defect categories reported in Table 1. We apply a variant of the Scott-Knott (SK) test to compare if the categories of defects significantly vary from each other with respect to size. This variant of SK does not assume the input to have a normal distribution and accounts for negligible effect size (Tantithamthavorn et al, 2017). SK uses hierarchical clustering analysis to partition the input data into significantly (α = 0.05) distinct ranks (Tantithamthavorn et al, 2017). Using the SK test, we can determine if the size of a defect category is significantly higher than that of the other groups. As a hypothetical example,


if defect category 'AS' ranks higher than 'AL', then we can state that IaC scripts that have 'AS'-related defects are significantly larger than scripts with 'AL'-related defects. Along with reporting the SK ranks, we also present the distribution of script sizes in the form of boxplots.
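A sketch of the first step under stated assumptions: SciPy's mannwhitneyu supplies the significance test, Cliff's Delta is computed directly from its definition, and the interpretation thresholds are the Romano et al. values quoted above.

from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    """Cliff's Delta: (#pairs with x > y minus #pairs with x < y) / (|xs| * |ys|)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def compare_script_sizes(defective_loc, neutral_loc, alpha=0.05):
    """Return whether the size difference is significant, and its magnitude."""
    _, p_value = mannwhitneyu(defective_loc, neutral_loc, alternative="two-sided")
    delta = abs(cliffs_delta(defective_loc, neutral_loc))
    if delta > 0.47:
        magnitude = "large"
    elif delta > 0.33:
        magnitude = "medium"
    elif delta > 0.14:
        magnitude = "small"
    else:
        magnitude = "negligible"
    return p_value < alpha, magnitude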

3.6 RQ4: How frequently do defects occur in infrastructure as code scripts?

We answer RQ4 by quantifying the defect density and temporal frequency of IaC scripts. We calculate defect density by counting the defects that appear per 1000 lines (KLOC) of IaC script, similar to prior work (Battin et al, 2001) (Mohagheghi et al, 2004) (Hatton, 1997). We use Equation 2 to calculate defect density (DDKLOC).

DDKLOC = total count of defects for all IaC scripts / (total count of lines for all IaC scripts / 1000)    (2)

DDKLOC provides us with the defect density of the four IaC datasets and gives an assessment of how frequently defects occur in IaC scripts. We select this measure of defect density, as this measure has been used as an industry standard to (i) establish a baseline for defect frequency; and (ii) help assess the quality of the software (Harlan, 1987) (McConnell, 2004).
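A sketch of Equation 2 in Python, applied illustratively to the Mirantis figures later reported in Table 3 (344 defect-related commits and 17,564 LOC); treating each defect-related commit as one defect follows Section 3.2.

def defect_density_per_kloc(total_defects: int, total_loc: int) -> float:
    """DDKLOC (Equation 2): defects per 1000 lines of IaC code."""
    return total_defects / (total_loc / 1000.0)

# Illustrative check against Table 3: Mirantis has 344 defect-related commits
# and 17,564 LOC of Puppet code, i.e. roughly 19.6 defects per KLOC.
print(round(defect_density_per_kloc(344, 17564), 1))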

Temporal frequency provides an overview of how frequently defects appear over time. We compute the proportion of defects that occur every month to quantify the temporal frequency of IaC defects. We use the metric 'Overall Temporal Frequency' and calculate this metric using Equation 3:

Overall Temporal Frequency (m) = (defects in month m / total commits in month m) × 100    (3)

For Overall Temporal Frequency, we apply the Cox-Stuart test (Cox and Stuart, 1955) to determine if the exhibited trend is significantly increasing or decreasing. The Cox-Stuart test is a statistical technique that compares the earlier data points to the later data points in a time series to determine whether the trend observed in the time series data is increasing or decreasing with statistical significance. We use a 95% statistical confidence level to determine which temporal trends are increasing or decreasing. To determine temporal frequencies for both overall defects and category-wise defects, we apply the following strategy (a sketch of the trend test is shown after the list):

– if the Cox-Stuart test output states that the temporal frequency values are 'increasing' with a p-value < 0.05, we determine the temporal trend to be 'increasing'.


– if the Cox-Stuart test output states that the temporal frequency values are 'decreasing' with a p-value < 0.05, we determine the temporal trend to be 'decreasing'.

– if we cannot determine the temporal trend to be 'increasing' or 'decreasing', then we determine the temporal trend to be 'consistent'.
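A sketch of this trend classification under stated assumptions: the Cox-Stuart test is implemented as the standard paired sign test (pairing the first half of the monthly frequency series with the second half) using SciPy's binomtest, since SciPy does not ship a ready-made Cox-Stuart routine.

from scipy.stats import binomtest

def cox_stuart_trend(values, alpha=0.05):
    """Classify a monthly temporal-frequency series as 'increasing',
    'decreasing', or 'consistent' using the Cox-Stuart sign test."""
    n = len(values) // 2
    # Pair each of the first n values with its counterpart in the second half;
    # for odd-length series the middle value is dropped.
    pairs = [(values[i], values[i + len(values) - n]) for i in range(n)]
    ups = sum(1 for first, later in pairs if later > first)
    downs = sum(1 for first, later in pairs if later < first)
    trials = ups + downs  # tied pairs are discarded
    if trials == 0:
        return "consistent"
    if binomtest(ups, trials, 0.5, alternative="greater").pvalue < alpha:
        return "increasing"
    if binomtest(downs, trials, 0.5, alternative="greater").pvalue < alpha:
        return "decreasing"
    return "consistent"

# Hypothetical monthly Overall Temporal Frequency values (Equation 3)
print(cox_stuart_trend([5, 6, 8, 10, 12, 15, 20, 22, 25, 28, 30, 35]))  # 'increasing'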

4 Datasets

We construct case studies using Puppet scripts from open source repositories maintained by four organizations: Mirantis, Mozilla, Openstack, and Wikimedia Commons. We select Puppet because it is considered one of the most popular tools to implement IaC (Jiang and Adams, 2015) (Shambaugh et al, 2016), and has been used by IT organizations since 2005 (McCune and Jeffrey, 2011). Mirantis is an organization that focuses on the development and support of cloud services such as OpenStack 12. Mozilla is an open source software community that develops, uses, and supports Mozilla products such as Mozilla Firefox 13. Openstack is an open-source software platform for cloud computing where virtual servers and other resources are made available to customers 14. The Wikimedia Foundation is a non-profit organization that develops and distributes free educational content 15.

4.1 Repository Collection

We apply the three selection criteria presented in Section 3.1.1 to identify the repositories that we use for analysis. We describe how many of the repositories satisfied each of the three criteria as follows:

– Criteria-1: Altogether, 26, 1,594, 1,253, and 1,638 repositories were publicly available to download for Mirantis, Mozilla, Openstack, and Wikimedia Commons, respectively. We download the repositories from their corresponding online project management systems [Mirantis (Mirantis, 2018), Mozilla (Mozilla, 2018), Openstack (Openstack, 2018), and Wikimedia (Commons, 2018)]. The Mozilla repositories were Mercurial-based, whereas the Mirantis, Openstack, and Wikimedia repositories were Git-based.
– Criteria-2: For Criteria-2 we stated that at least 11% of all the files belonging to the repository must be Puppet scripts. For Mirantis, 20 of the 26 repositories satisfied Criteria-2. For Mozilla, 2 of the 1,594 repositories satisfied Criteria-2. For Openstack, 61 of the 1,253 repositories satisfied Criteria-2. For Wikimedia Commons, 11 of the 1,638 repositories satisfied Criteria-2. Altogether, 74 of the 4,485 repositories satisfy Criteria-2,

12 https://www.mirantis.com/
13 https://www.mozilla.org/en-US/
14 https://www.openstack.org/
15 https://wikimediafoundation.org/


indicating that the amount of IaC scripts is small compared to the organizations' overall codebases.

– Criteria-3: As Criteria-3 we stated that the repository must have at least two commits per month. The 20, 2, 61, and 11 selected repositories for Mirantis, Mozilla, Openstack, and Wikimedia Commons, respectively, that satisfy Criteria-2 also satisfy Criteria-3.

We perform our analysis on 20, 2, 61, and 11 repositories from Mirantis, Mozilla, Openstack, and Wikimedia Commons, respectively. We refer to the datasets from Mirantis, Mozilla, Openstack, and Wikimedia Commons as 'Mirantis', 'Mozilla', 'Openstack', and 'Wikimedia', respectively.

4.2 Commit Message Processing

For Mirantis, Mozilla, Openstack, and Wikimedia, we respectively collect 2749, 60992, 31460, and 14717 commits altogether, from 20, 2, 61, and 11 repositories. Of these 2749, 60992, 31460, and 14717 commits, respectively, 1021, 3074, 7808, and 972 commits modify at least one Puppet script. As shown in Table 3, for Mirantis we collect 165 Puppet scripts that map to 1,021 commits from the 20 repositories. For Mozilla we collect 580 Puppet scripts that map to 3,074 commits from the two repositories. For Openstack, we collect 1,383 Puppet scripts that map to 7,808 commits from the 61 repositories. For Wikimedia, we collect 296 Puppet scripts that map to 972 commits from the 11 repositories. Of the 3074, 7808, and 972 commits, respectively, 2764, 2252, and 210 commit messages included unique identifiers that map to issues in their respective issue tracking systems. Using these unique identifiers to issue reports, we construct the XCMs.

4.3 Determining Categories of Defects

Altogether, we had 89 raters who determined the categories of defect-related commits. Of the 89 raters, three are PhD students and co-authors of this paper. The remaining 86 raters are graduate students recruited from a graduate course related to software engineering, titled 'Software Security'. To recruit the 86 graduate students we follow the Institutional Review Board (IRB) Protocol (IRB#9521 and IRB#1230). We use the 89 raters to categorize the XCMs in the following phases:

– Categorization Phase:
  – Mirantis: We recruit students in a graduate course related to software engineering titled 'Software Security' via e-mail. The number of students in the class was 58, and 32 students agreed to participate. We follow Internal Review Board (IRB) protocol IRB#12130 in the recruitment of students and the assignment of defect categorization tasks. We randomly distribute the 1021 XCMs amongst the students such that


each XCM is rated by at least two students. The average professional experience of the 32 students in software engineering is two years. On average, each student took 2.1 hours.
  – Mozilla: The second and third authors of the paper separately apply qualitative analysis on 3,074 XCMs. The second and third authors, respectively, have three and two years of professional experience in software engineering. The second and third authors respectively took 37.0 and 51.2 hours to complete the categorization.
  – Openstack: The second and fourth authors of the paper separately apply qualitative analysis on 7,808 XCMs from the Openstack repositories. The second and fourth authors, respectively, have two and one years of professional experience in software engineering. The second and fourth authors completed the categorization of the 7,808 XCMs in 80.0 and 130.0 hours, respectively.
  – Wikimedia: 54 graduate students recruited from the 'Software Security' course are the raters. We randomly distribute the 972 XCMs amongst the students such that each XCM is rated by at least two students. According to our distribution, 140 XCMs are assigned to each student. The average professional experience of the 54 students in software engineering is 2.3 years. On average, each student took 2.1 hours to categorize the 140 XCMs.

– Resolution Phase:
  – Mirantis: Of the 1,021 XCMs, we observe agreement for 509 XCMs and disagreement for 512 XCMs, with a Cohen's Kappa score of 0.21. Based on the Cohen's Kappa score, the agreement level is 'fair' (Landis and Koch, 1977).
  – Mozilla: Of the 3,074 XCMs, we observe agreement for 1,308 XCMs and disagreement for 1,766 XCMs, with a Cohen's Kappa score of 0.22. Based on the Cohen's Kappa score, the agreement level is 'fair' (Landis and Koch, 1977).
  – Openstack: Of the 7,808 XCMs, we observe agreement for 3,188 XCMs and disagreement for 4,620 XCMs. The Cohen's Kappa score was 0.21. Based on the Cohen's Kappa score, the agreement level is 'fair' (Landis and Koch, 1977).
  – Wikimedia: Of the 972 XCMs, we observe agreement for 415 XCMs and disagreement for 557 XCMs, with a Cohen's Kappa score of 0.23. Based on the Cohen's Kappa score, the agreement level is 'fair' (Landis and Koch, 1977).

The first author of the paper is the resolver and resolves disagreements for all four datasets. We observe the raters' agreement level to be 'fair' for all four datasets. One possible explanation can be that the raters agreed on whether an XCM is defect-related, but disagreed on what the category of the defect is. For defect categorization, however, fair or poor agreement amongst raters is not uncommon. Henningsson and Wohlin (Henningsson and Wohlin, 2004) also reported a low agreement amongst raters.


– Practitioner Agreement: We report the agreement level between the raters' and the practitioners' categorization for the 50 randomly selected XCMs as follows:
  – Mirantis: We contact three programmers and all of them responded. We observe an 89.0% agreement with a Cohen's Kappa score of 0.8. Based on the Cohen's Kappa score, the agreement level is 'substantial' (Landis and Koch, 1977).
  – Mozilla: We contact six programmers and all of them responded. We observe a 94.0% agreement with a Cohen's Kappa score of 0.9. Based on the Cohen's Kappa score, the agreement level is 'almost perfect' (Landis and Koch, 1977).
  – Openstack: We contact 10 programmers and all of them responded. We observe a 92.0% agreement with a Cohen's Kappa score of 0.8. Based on the Cohen's Kappa score, the agreement level is 'substantial' (Landis and Koch, 1977).
  – Wikimedia: We contact seven programmers and all of them responded. We observe a 98.0% agreement with a Cohen's Kappa score of 0.9. Based on the Cohen's Kappa score, the agreement level is 'almost perfect' (Landis and Koch, 1977).

We observe that the agreement between our categorization and the practitioners' categorization varies from 0.8 to 0.9, which is higher than the agreement between the raters in the Categorization Phase. One possible explanation can be related to how the resolver resolved the disagreements. The first author of the paper has industry experience in writing IaC scripts, which may help to determine categorizations that are consistent with practitioners, potentially leading to higher agreement. Another possible explanation can be related to the sample provided to the practitioners: the provided sample, even though randomly selected, may include commit messages whose categorizations are relatively easy to agree upon.

Finally, upon applying qualitative analysis we identify the category of each XCM. We mark XCMs that do not have the defect category 'N' as defect-related commits. The defect-related commits list the changed IaC scripts, which we use to identify the defective IaC scripts. We present the count of defect-related commits and defective IaC scripts in Table 3. We observe that for Mozilla, 18.1% of the commits are defect-related, even though 89.9% of the commits included identifiers to issues. According to our qualitative analysis, for Mozilla, issue reports exist that are not related to defects, such as support of new features 16 and installation issues 17. In the case of Openstack and Wikimedia, respectively, 28.8% and 16.4% of the Puppet-related commits include identifiers that map to issues.

16 https://bugzilla.mozilla.org/show_bug.cgi?id=868974
17 https://bugzilla.mozilla.org/show_bug.cgi?id=773931


Table 3: Defect Datasets Constructed for the Paper

Properties | Mirantis (MIR) | Mozilla (MOZ) | Openstack (OST) | Wikimedia (WIK)
Time Period | May 2010-Jul 2017 | Aug 2011-Sep 2016 | Mar 2011-Sep 2016 | Apr 2005-Sep 2016
Puppet Commits | 1021 of 2749, 37.1% | 3074 of 60992, 5.0% | 7808 of 31460, 24.8% | 972 of 14717, 6.6%
Puppet Code Size (LOC) | 17,564 | 30,272 | 122,083 | 17,439
Defect-related Commits | 344 of 1021, 33.7% | 558 of 3074, 18.1% | 1987 of 7808, 25.4% | 298 of 972, 30.6%
Defective Puppet Scripts | 91 of 165, 55.1% | 259 of 580, 44.6% | 810 of 1383, 58.5% | 161 of 296, 54.4%

4.4 Dataset Availability

The constructed datasets used for empirical analysis are available online 18.

5 Results

In this section, we provide empirical findings by answering the four research questions.

5.1 RQ1: What process improvement recommendations can we make for infrastructure as code development, using the defect type attribute of orthogonal defect classification?

We answer RQ1 by first presenting the values of defect count per category (DCC) for each defect category mentioned in Table 1. In Figure 2, we report the DCC values for the four datasets.

In Figure 2, the x-axis presents the nine defect categories, whereas the y-axis presents the DCC values for each category. As shown in Figure 2, assignment-related defects account for 49.3%, 36.5%, 57.6%, and 62.7% of the defects for Mirantis, Mozilla, Openstack, and Wikimedia, respectively. Together, the two categories, assignment and checking, account for 55.9%, 53.5%, 64.2%, and 74.8% of the defects for Mirantis, Mozilla, Openstack, and Wikimedia, respectively.

In short, we observe assignment-related defects (AS) to be the dominant defect category for all four datasets, followed by checking-related defects (C). One possible explanation can be related to how practitioners utilize IaC scripts.

18 https://doi.org/10.6084/m9.figshare.6465215


Fig. 2: Defect count per category (DCC) for each defect category: Algorithm (AL), Assignment (AS), Build/Package/Merge (B), Checking (C), Documentation (D), Function (F), Interface (I), Other (O), and Timing/Serialization (T), for the Mirantis, Mozilla, Openstack, and Wikimedia datasets. Defects categorized as 'Assignment' are the dominant category.

IaC scripts are used to manage configurations and deployment infrastructure automatically (Humble and Farley, 2010). For example, practitioners use IaC to provision cloud instances such as Amazon Web Services (AWS) (Cito et al, 2015), or to manage dependencies of software (Humble and Farley, 2010). When assigning configurations of library dependencies or provisioning cloud instances, programmers might inadvertently introduce defects that need fixing. Fixing these defects involves a few lines of code in IaC scripts, and these defects fall in the assignment category. Correcting syntax issues also involves a few lines of code, and also belongs to the assignment category. Examples of defect-related XCMs that belong to the assignment category are: 'fix gdb version to be more precise' and 'bug 867593 use correct regex syntax'. Another possible explanation can be related to the declarative nature of Puppet scripts. Puppet provides syntax to declare and assign configurations for the system of interest. While assigning these configuration values, programmers may inadvertently introduce defects.

Our findings are consistent with prior research related to defect categorization. In the case of machine learning software such as Apache Mahout 19, Apache Lucene 20, and OpenNLP 21, which are dedicated to executing machine learning algorithms, researchers have observed that the most frequently occurring defects are algorithm-related (Thung et al, 2012). For database software, Sullivan et al. (Sullivan and Chillarege, 1991) have observed that the defect distribution is dominated by assignment and checking-related defects. Chillarege et al. (Chillarege et al, 1992) found Sullivan et al. (Sullivan and Chillarege, 1991)'s observations 'reasonable', as few lines of code for database software typically miss a condition, or assign a wrong value.

As shown in Figure 2, for the category Other (O), the defect count per category (DCC) is 12.5%, 9.4%, 6.7%, and 0.8%, for Mirantis, Mozilla, Openstack, and

19 http://mahout.apache.org/
20 https://lucene.apache.org/core/
21 https://opennlp.apache.org/


(a) Mozilla:
- content => inline_template("This is <%= fqdn %> (<%= ipaddress %>)");
+ content => inline_template("This is <%= fqdn %> (<%= ipaddress %>)\n");

(b) Openstack:
- stdlib::safe_package[$nova_title] -> service[$nova_title]
- stdlib::safe_package[$nova_title] ~> service[$nova_title]
+ package[$nova_title] -> service[$nova_title]
+ package[$nova_title] ~> service[$nova_title]

Fig. 3: Code changes in commits categorized as the 'Other' category. Figures 3a and 3b respectively present the code changes for two commits marked as 'Other', obtained from Mozilla and Openstack.

Wikimedia, respectively. This category includes defects where the XCM corresponds to a defect but the rater was not able to identify the category of the defect. Examples of such defect-related commit messages include 'minor puppetagain fixes' and 'summary fix hg on osx; a=bustage'. One possible explanation can be attributed to the lack of information content provided in the messages. Programmers may not strictly adhere to the practice of writing good commit messages, which eventually leads to commit messages that do not provide enough information for defect categorization. For example, the commit message 'minor puppetagain fixes' implies that a programmer performed a fix-related action on an IaC script, but what category of defect was fixed remains unknown. We observe that organization-based guidelines exist on how to write better commit messages for Mozilla 22, Openstack 23, and Wikimedia 24. Our findings suggest that despite the availability of commit message guidelines, programmers do not adhere to these guidelines.

Another possible explanation can be attributed to the lack of context inherent in commit messages. The commit messages provide a summary of the changes being made, but that might not be enough to determine the defect category. Let us consider two examples in this regard, provided in Figures 3a and 3b. Figures 3a and 3b respectively present two XCMs categorized as 'Other', obtained respectively from Mozilla and Openstack. In Figure 3a, we observe that in the commit, a newline is being added for printing purposes, which is not captured in the commit message 'summary: bug 746824: minor fixes'. From Figure 3b, we observe that in the commit, 'stdlib::safe_package' is being replaced by the 'package' syntax, which is not captured by the corresponding commit message 'minor fixes'.

22 http://tiny.cc/moz-commit
23 https://wiki.openstack.org/wiki/GitCommitMessages
24 https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines

Recommendations for IaC Development: Chillarege et al. (Chillarege et al, 1992) have reported that in ODC, each of the defect categories, as determined by the defect type attribute of ODC, provides practitioners the opportunity to improve their software development process. According to Chillarege et al. (Chillarege et al, 1992), each defect category is related to one or multiple activities of software development. For example, function-related defects are related to the design and functional testing of software. Prevalence of defects in a certain category indicates in which software development activities IT organizations can invest more effort. Our findings indicate that assignment-related defects are the most dominant category of defects. According to Chillarege et al. (Chillarege et al, 1992), if assignment-related defects are not discovered earlier with code inspection and unit tests, these defects can continue to grow in later stages of development. Based on our findings, IT organizations can reap process improvement opportunities if they allocate more code inspection and unit testing efforts.

Our findings may have implications for prioritizing verification and validation (V&V) efforts as well. Ideally, we would expect IT organizations to allocate sufficient V&V efforts to detect all categories of defects. But in the real world, V&V efforts are limited. If practitioners want to prioritize their V&V efforts, then they can first focus on assignment-related defects, as this category of defects is dominant for all four organizations.

Assignment is the most frequently occurring defect category for all four datasets: Mirantis, Mozilla, Openstack, and Wikimedia. For Mirantis, Mozilla, Openstack, and Wikimedia, respectively 49.3%, 36.5%, 57.6%, and 62.7% of the defects belong to the assignment category. Based on our findings, we recommend that practitioners allocate more effort to code inspection and unit testing.

5.2 RQ2: What are the differences between infrastructure as code (IaC) and non-IaC software process improvement activities, as determined by their defect category distribution reported in prior literature?

We identify 26 software systems using the three steps outlined in Section 3.4:

– Step-1: As of August 11, 2018, 818 publications indexed by the ACM Digital Library, IEEE Xplore, SpringerLink, or ScienceDirect cited the original ODC publication (Chillarege et al, 1992). Of the 818 publications, 674 publications were published on or after 2000.

– Step-2: Of these 674 publications, 16 applied the defect type attribute of ODC in its original form to categorize defects for software systems.

– Step-3: Of these 16 publications, seven publications explicitly mentioned the total count of defects and provided a distribution of defect categories.

In Table 4, we present the categorization of defects for these 26 software systems. The 'System' column reports the studied software system, followed by the publication reference in which the findings were reported. The 'Count' column reports the total count of defects that were studied for the software system. The next eight consecutive columns respectively present the eight defect categories used in the defect type attribute of ODC: algorithm (AL), assignment (AS), block (B), checking (C), documentation (D), function (F), interface (I), and timing (T). We do not report the 'Other' category, as this category is not included as part of the ODC defect type attribute. The 'Lang.' column presents the programming language in which the system is developed. The dominant defect category is highlighted in bold.

From Table 4, we observe that for four of the 26 software systems, 40% or more of the defects belonged to the assignment category. For 15 of the 26 software systems, algorithm-related defects are dominant. We also observe documentation and timing-related defects to occur rarely in previously studied software systems. In contrast to IaC scripts, assignment-related defects are not prevalent: assignment-related defects were the dominant category for 3 of the 26 previously studied non-IaC software systems. We observe IaC scripts to have a different defect category distribution than non-IaC software systems written in GPLs; the dominant defect category for IaC scripts is assignment.

The differences in defect category distribution may yield a different set of guidelines for software process improvement. For the 15 software systems where algorithm-related defects are dominant, based on ODC, software process improvement efforts can be focused on the following activities: coding, code inspection, unit testing, and functional testing. In the case of IaC scripts, as previously discussed, software process improvement efforts can be focused on code inspection and unit testing, which is different from that of non-IaC systems.

Defects categorized as assignment are more prevalent amongst IaC scripts compared to that of previously studied non-IaC systems. Our findings suggest that process improvement activities will be different for IaC development compared to that of non-IaC software development.

5.3 RQ3: Can the size of infrastructure as code (IaC) scripts provide a signal for improvement of the IaC development process?

As shown in Figure 4, we answer RQ3 by first reporting the distribution of size in LOC for defective and neutral scripts. In Figure 4, the y-axis presents the size of scripts in LOC, whereas the x-axis presents the two groups: 'Defective' and 'Neutral'. The median size of a defective script is 90.0, 53.0, 77.0, and 57.0 LOC, respectively, for Mirantis, Mozilla, Openstack, and Wikimedia. The median size of a defective script is respectively 2.3, 2.1, 1.6, and 2.8 times larger for Mirantis, Mozilla, Openstack, and Wikimedia than that of a neutral script. The size difference is significant for all four datasets (p-value < 0.001), with an effect size of 0.5, 0.5, 0.3, and 0.5, respectively, for Mirantis, Mozilla, Openstack, and Wikimedia.


Table 4: Defect Categories of Previously Studied Software Systems

System | Lang. | Count | AL(%) | AS(%) | B(%) | C(%) | D(%) | F(%) | I(%) | T(%)
Bourne Again Shell (BASH) (Cotroneo et al, 2013) | C | 2 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
ZSNES-Emulator for x86 (Cotroneo et al, 2013) | C, C++ | 3 | 33.3 | 66.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
Pdftohtml-Pdf to html converter (Cotroneo et al, 2013) | Java | 20 | 40.0 | 55.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0
Firebird-Relational DBMS (Cotroneo et al, 2013) | C++ | 2 | 0.0 | 50.0 | 0.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0
Air flight application (Lyu et al, 2003) | C | 426 | 19.0 | 31.9 | 0.0 | 14.0 | 0.0 | 33.8 | 1.1 | 0.0
Apache web server (Pecchia and Russo, 2012) | C | 1,101 | 47.6 | 26.4 | 0.0 | 12.8 | 0.0 | 0.0 | 12.9 | 0.0
Joe-Tex editor (Cotroneo et al, 2013) | C | 78 | 15.3 | 25.6 | 0.0 | 44.8 | 0.0 | 0.0 | 14.1 | 0.0
Middleware system for air traffic control (Cinque et al, 2014) | C | 3,159 | 58.9 | 24.5 | 0.0 | 1.7 | 0.0 | 0.0 | 14.8 | 0.0
ScummVM-Interpreter for adventure engines (Cotroneo et al, 2013) | C++ | 74 | 56.7 | 24.3 | 0.0 | 8.1 | 0.0 | 6.7 | 4.0 | 0.0
Linux kernel (Cotroneo et al, 2013) | C | 93 | 33.3 | 22.5 | 0.0 | 25.8 | 0.0 | 12.9 | 5.3 | 0.0
Vim-Linux editor (Cotroneo et al, 2013) | C | 249 | 44.5 | 21.2 | 0.0 | 22.4 | 0.0 | 5.2 | 6.4 | 0.0
MySQL DBMS (Pecchia and Russo, 2012) | C, C++ | 15,102 | 52.9 | 20.5 | 0.0 | 15.3 | 0.0 | 0.0 | 11.2 | 0.0
CDEX-CD digital audio data extractor (Cotroneo et al, 2013) | C, C++, Python | 11 | 9.0 | 18.1 | 0.0 | 18.1 | 0.0 | 0.0 | 54.5 | 0.0
Struts (Basso et al, 2009) | Java | 99 | 48.4 | 18.1 | 0.0 | 9.0 | 0.0 | 4.0 | 20.2 | 0.0
Safety critical system for NASA spacecraft (Lutz and Mikulski, 2004) | Java | 199 | 17.0 | 15.5 | 2.0 | 0.0 | 2.5 | 29.1 | 5.0 | 10.5
Azureus (Basso et al, 2009) | Java | 125 | 36.8 | 15.2 | 0.0 | 11.2 | 0.0 | 28.8 | 8.0 | 0.0
Phex (Basso et al, 2009) | Java | 20 | 60.0 | 15.0 | 0.0 | 5.0 | 0.0 | 10.0 | 10.0 | 0.0
TAO Open DDS (Pecchia and Russo, 2012) | Java | 1,184 | 61.4 | 14.4 | 0.0 | 11.7 | 0.0 | 0.0 | 12.4 | 0.0
JEdit (Basso et al, 2009) | Java | 71 | 36.6 | 14.0 | 0.0 | 11.2 | 0.0 | 25.3 | 12.6 | 0.0
Tomcat (Basso et al, 2009) | Java | 169 | 57.9 | 12.4 | 0.0 | 13.6 | 0.0 | 2.3 | 13.6 | 0.0
FreeCvi-Strategy game (Cotroneo et al, 2013) | Java | 53 | 52.8 | 11.3 | 0.0 | 13.2 | 0.0 | 15.0 | 7.5 | 0.0
FreeMind (Basso et al, 2009) | Java | 90 | 46.6 | 11.1 | 0.0 | 2.2 | 0.0 | 28.8 | 11.1 | 0.0
MinGW-Minimalist GNU for Windows (Cotroneo et al, 2013) | C | 60 | 46.6 | 10.0 | 0.0 | 38.3 | 0.0 | 0.0 | 5.0 | 0.0
Java Enterprise Framework (Gupta et al, 2009) | Java | 223 | 3.1 | 16.7 | 36.0 | 3.1 | 0.0 | 21.2 | 19.9 | 0.0
Digital Cargo Files (Gupta et al, 2009) | Java | 438 | 3.3 | 12.4 | 31.0 | 10.4 | 0.5 | 31.3 | 9.9 | 1.2
Shipment and Allocation (Gupta et al, 2009) | Java | 649 | 22.8 | 6.7 | 18.7 | 10.9 | 0.5 | 29.8 | 9.3 | 1.6
Mirantis [This paper] | Puppet | 344 | 6.5 | 49.3 | 6.7 | 1.9 | 7.5 | 6.4 | 12.5 | 2.6
Mozilla [This paper] | Puppet | 558 | 7.7 | 36.5 | 6.4 | 17.0 | 2.3 | 10.0 | 1.9 | 8.4
Openstack [This paper] | Puppet | 1,987 | 5.9 | 57.5 | 8.6 | 6.7 | 2.6 | 2.4 | 2.9 | 6.5
Wikimedia [This paper] | Puppet | 298 | 3.3 | 62.7 | 4.7 | 12.0 | 4.3 | 4.3 | 2.6 | 5.1


Fig. 4: Size of defective and neutral IaC scripts, measured in LOC. Figures 4a, 4b, 4c, and 4d respectively present the size of defective and neutral scripts for Mirantis, Mozilla, Openstack, and Wikimedia. [Box plots omitted; y-axis: size (LOC) from 0 to 1000, x-axis: the 'Defective' and 'Neutral' groups of IaC scripts.]


Based on Cliff's Delta values, the size difference between defective and neutral scripts is respectively 'large', 'large', 'small', and 'large', for Mirantis, Mozilla, Openstack, and Wikimedia.
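The comparison above can be illustrated with standard non-parametric statistics. The following is a minimal sketch under assumptions, using hypothetical LOC values: a Mann-Whitney U test for significance and Cliff's delta for effect size, with the magnitude thresholds commonly attributed to Romano et al. (2006); the paper's exact test configuration may differ.

# A minimal sketch (hypothetical data): Mann-Whitney U test on script sizes,
# followed by Cliff's delta as the effect size.
from scipy.stats import mannwhitneyu

def cliffs_delta(xs, ys):
    # Cliff's delta: (#{x > y} - #{x < y}) / (|xs| * |ys|)
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    # Interpretation thresholds commonly attributed to Romano et al. (2006).
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.330:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

defective_loc = [90, 120, 300, 75, 210, 180]  # hypothetical sizes of defective scripts
neutral_loc = [40, 35, 80, 55, 60, 30]        # hypothetical sizes of neutral scripts

stat, p = mannwhitneyu(defective_loc, neutral_loc, alternative="two-sided")
delta = cliffs_delta(defective_loc, neutral_loc)
print(p, round(delta, 2), magnitude(delta))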

Fig. 5: The Scott-Knott (SK) ranks for each defect category and each dataset. For three of the four datasets, the same rank across all defect categories indicates that, with respect to size, there is no significant difference between the defect categories. For Mirantis, Assignment, Build/Package/Merge, and Documentation-related defects are smaller in size than the other defect categories. The ranks are:

Dataset   | AL | AS | B | C | D | F | I | O | T
MIRANTIS  | 1  | 2  | 2 | 1 | 2 | 1 | 1 | 1 | 1
MOZILLA   | 1  | 1  | 1 | 1 | 1 | 1 | 1 | 1 | 1
OPENSTACK | 1  | 1  | 1 | 1 | 1 | 1 | 1 | 1 | 1
WIKIMEDIA | 1  | 1  | 1 | 1 | 1 | 1 | 1 | 1 | 1

In Figure 5, we report the results of our SK test. As shown in Figure 5, we do not observe any significant difference in script size between the defect categories for the three datasets Mozilla, Openstack, and Wikimedia, as the rank is the same for all nine defect categories. For Mirantis, the rank is highest (1) for the categories 'Algorithm', 'Checking', 'Function', 'Interface', 'Timing', and 'Other'. For the categories 'Assignment', 'Build/Package/Merge', and 'Documentation', the rank is lowest (2). Our findings indicate no trends consistent across the four datasets with respect to the relation between size and defect categories. The observation that we see script size differences across defect categories for Mirantis indicates that the IaC development process differs from one organization to another.
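As an illustration of the ranking procedure, the following is a minimal, simplified Scott-Knott-style sketch: defect categories are ordered by mean script size and recursively split at the partition that maximizes the between-group sum of squares, with a one-way ANOVA F-test as the stopping criterion. This is a sketch under assumptions, not the exact statistical procedure used in this paper, and the input groups in the usage comment are hypothetical.

# Simplified Scott-Knott-style ranking of defect categories by script size.
# Rank 1 corresponds to the cluster of categories with the largest sizes.
from itertools import count
import numpy as np
from scipy.stats import f_oneway

def scott_knott(groups, alpha=0.05):
    # groups: dict mapping category name -> list of script sizes (LOC)
    ranks = {}
    next_rank = count(1)

    def partition(names):
        names = sorted(names, key=lambda n: np.mean(groups[n]), reverse=True)
        if len(names) == 1:
            ranks[names[0]] = next(next_rank)
            return
        # Find the contiguous split that maximizes the between-group sum of squares.
        grand_mean = np.concatenate([groups[n] for n in names]).mean()
        best_i, best_ss = 1, -1.0
        for i in range(1, len(names)):
            left = np.concatenate([groups[n] for n in names[:i]])
            right = np.concatenate([groups[n] for n in names[i:]])
            ss = (len(left) * (left.mean() - grand_mean) ** 2
                  + len(right) * (right.mean() - grand_mean) ** 2)
            if ss > best_ss:
                best_ss, best_i = ss, i
        left = np.concatenate([groups[n] for n in names[:best_i]])
        right = np.concatenate([groups[n] for n in names[best_i:]])
        if f_oneway(left, right).pvalue < alpha:
            partition(names[:best_i])   # larger sizes receive lower rank numbers
            partition(names[best_i:])
        else:
            rank = next(next_rank)      # no significant split: one shared rank
            for n in names:
                ranks[n] = rank

    partition(list(groups))
    return ranks

# Hypothetical usage:
# scott_knott({"AS": [40, 52, 61, 45], "AL": [200, 230, 215, 260], "C": [45, 50, 58, 49]})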

As part of our methodology, we also present the distribution of size for each defect category in Figure 6. The y-axis in each subplot presents the size in LOC for each defect category. For Mozilla, considering median script size, category 'Interface (I)' is the largest (87.0 LOC), and category 'Checking (C)' is the smallest (53.5 LOC). For Openstack and Wikimedia, categories 'Documentation (D)' and 'Build/Package/Merge (B)' are the largest, with a median size of 160.0 LOC and 118.0 LOC, respectively. Categories 'Function (F)' and 'Other (O)' are the smallest, with a median size of 86.0 LOC and 46.5 LOC, respectively. However, the differences in script size across defect categories are not significant, as suggested by the Scott-Knott test for Mozilla, Openstack, and Wikimedia (Figure 5). In the case of Mirantis, category 'Build/Package/Merge' is the smallest, with a median size of 91 LOC. Category 'Algorithm' is the largest category, with a median size of 221 LOC.


Fig. 6: Distribution of script size for each defect category. Figures 6a, 6b, 6c, and 6d respectively present the distribution for Mirantis, Mozilla, Openstack, and Wikimedia. [Box plots omitted; y-axis: size (LOC) from 0 to 1000, x-axis: defect categories AL, AS, B, C, D, F, I, O, and T.]


Our findings on the relationship between size and defect categories, and on the defect distribution, highlight organizational process differences. Considering script size, IaC scripts are not all developed in the same manner, nor do they contain the same category of defects. That said, regardless of defect category, defective scripts are significantly larger than neutral scripts. This finding can be useful for (i) practitioners: if their IaC codebase includes large scripts, then those scripts can receive additional V&V resources; and (ii) researchers: future research studies can identify defective IaC scripts through machine learning techniques using the size of IaC scripts.

Defective IaC scripts contain significantly more lines of code than neutral scripts. We recommend that practitioners invest additional verification and validation resources in IaC scripts that are relatively larger in size.

5.4 RQ4: How frequently do defects occur in infrastructure as code scripts?

The defect density, measured in defects per 1000 LOC (DDKLOC), is 27.6, 18.4, 16.2, and 17.1, respectively, for Mirantis, Mozilla, Openstack, and Wikimedia. Prior research shows that defect densities can vary from one software system to another, and from one organization to another. For a Fortran-based satellite planning software system, Basili and Perricone (Basili and Perricone, 1984) have reported defect density to vary from 6.4 to 16.0 defects per 1000 LOC. Mohagheghi et al. (Mohagheghi et al, 2004) have reported defect density to vary between 0.7 and 3.7 per KLOC for a telecom software system written in C, Erlang, and Java. For a Java-based system, researchers (Maximilien and Williams, 2003) have reported a defect density of 4.0 defects/KLOC.
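As a minimal sketch of the metric itself, defect density per KLOC can be computed as follows; the counts are hypothetical placeholders, not values from the studied datasets.

# Defects per 1000 lines of code (DDKLOC); inputs are hypothetical.
def defect_density_per_kloc(defect_count, total_loc):
    return 1000.0 * defect_count / total_loc

print(defect_density_per_kloc(50, 3000))  # ~16.7 defects/KLOC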

We notice that the magnitude of the defect density of IaC scripts is higher than that of previously studied software systems. We use Basili and Perricone's observations (Basili and Perricone, 1984) as a possible explanation for our findings. Basili and Perricone (Basili and Perricone, 1984) suggested that if defects are spread out equally across artifacts, then the overall defect density of the system can be high. Our possible explanation in this regard is that for IaC scripts, defects are equally distributed across scripts, as suggested by Basili and Perricone (Basili and Perricone, 1984).

We report the overall temporal frequency values for the four datasets in Figure 7. In Figure 7, we apply smoothing to obtain visual trends. The x-axis and y-axis respectively present the months and the overall temporal frequency values for each month. For Mozilla and Openstack, we observe constant trends. For Wikimedia, defects initially appear more frequently and decrease over time. According to our Cox-Stuart test results, as shown in Table 5, none of these trends is statistically significant. The p-values obtained from the Cox-Stuart test output for Mirantis, Mozilla, Openstack, and Wikimedia are respectively 0.22, 0.23, 0.42, and 0.13.


Table 5: Cox-Stuart Test Results for Overall Temporal Frequency of Defect Count

Output  | Mirantis   | Mozilla    | Openstack  | Wikimedia
Trend   | Increasing | Decreasing | Decreasing | Decreasing
p-value | 0.22       | 0.23       | 0.42       | 0.13

Our findings indicate that, overall, the frequency of defects does not significantly change over time for IaC scripts. Findings from Figure 7 also highlight adoption time differences for IaC. For Wikimedia, IaC was adopted 11 years ago, whereas for Mirantis IaC was adopted in 2010, and for Mozilla and Openstack IaC was adopted in 2011.
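The trend check can be illustrated with a minimal sketch of the Cox-Stuart sign test; the monthly series below is hypothetical, and the paper may have used a different implementation (for example, in R).

# Cox-Stuart sign test for a monotonic trend in a monthly defect-count series.
from scipy.stats import binomtest

def cox_stuart(series):
    n = len(series)
    c = n // 2
    # For an odd-length series, the middle observation is dropped by the pairing.
    first, second = series[:c], series[n - c:]
    diffs = [b - a for a, b in zip(first, second) if b != a]  # discard ties
    pos = sum(1 for d in diffs if d > 0)
    trend = "increasing" if pos > len(diffs) / 2 else "decreasing"
    # Under the null hypothesis of no trend, the sign counts follow Binomial(m, 0.5).
    p_value = binomtest(pos, n=len(diffs), p=0.5).pvalue
    return trend, p_value

monthly_defect_counts = [30, 28, 25, 27, 22, 20, 21, 18, 19, 17]  # hypothetical
print(cox_stuart(monthly_defect_counts))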

We use our findings related to temporal frequency to draw parallels with hardware and non-IaC software systems. For hardware systems, researchers (Smith, 1993) have reported the 'bathtub curve' trend, which states that when hardware systems are initially put into service, the frequency of defects is high. As time progresses, defect frequency decreases and remains constant during the 'adult period'. However, after the adult period, defect frequency becomes high again, as hardware systems enter the 'wear out period' (Smith, 1993). For IaC defects, we do not observe such a temporal trend.

In traditional, non-IaC software systems, defect frequency is initially high, gradually decreases, and eventually becomes low as time progresses (Hartz et al, 1996). Eventually, all software systems enter the 'obsolescence period', where defect frequency remains consistent, as no significant upgrades or changes to the software are made (Hartz et al, 1996). For Wikimedia, we observe a trend similar to that of traditional software systems: initially defect frequency remains high, but it decreases as time progresses. However, this observation does not generalize to the other three datasets. Also, according to the Cox-Stuart test, the visible trends are not statistically validated. One possible explanation can be attributed to how IT organizations resolve existing defects: for example, while fixing a certain set of defects, programmers may inadvertently introduce a new set of defects, which results in an overall constant trend.

The defect density is 27.6, 18.4, 16.2, and 17.1 defects per KLOC, respectively, for Mirantis, Mozilla, Openstack, and Wikimedia. For all four datasets, we observe IaC defects to follow a consistent temporal trend.

6 Discussion

In this section, we discuss our findings with possible implications:


Fig. 7: Temporal frequency values for all four datasets. Figures 7a, 7b, 7c, and 7d respectively present the temporal frequency values for Mirantis, Mozilla, Openstack, and Wikimedia with smoothing. Overall, defect-related commits exhibit consistent trends over time. [Line plots omitted; y-axis: overall temporal frequency from 0 to 100, x-axis: months, covering 2010-05 to 2016-09 for Mirantis, 2011-08 to 2016-06 for Mozilla, 2011-05 to 2016-05 for Openstack, and 2006-10 to 2016-06 for Wikimedia.]


6.1 Implications for Process Improvement

One finding of our paper is the prevalence of assignment-related defects in IaC scripts. Software teams can use this finding to improve their process in two possible ways: first, they can use the practice of code review for developing IaC scripts. Code reviews can be conducted using automated tools and/or team members' manual reviews. For example, through code reviews software teams can pinpoint the correct value of configurations at the development stage. Automated code review tools such as linters can also help in detecting and fixing syntax issues of IaC scripts at the development stage. Typically, IaC scripts are used by IT organizations that have implemented CD, and for these organizations, Kim et al. (Kim et al, 2016) recommend manual peer review methods such as pair programming to improve code quality.

Second, software teams might benefit from unit testing of IaC scripts to reduce defects related to configuration assignment. We have observed in Section 5.1 that IaC-related defects are mostly related to the assignment category, which includes improper assignment of configuration values and syntax errors. Programmers can test whether correct configuration values are assigned by writing unit tests for components of IaC scripts. In this manner, instead of catching defects at run-time, which might lead to real-world consequences, e.g., the problem reported by Wikimedia Commons 25, software teams might be able to catch defects in IaC scripts at the development stage with the help of testing.

25 http://tiny.cc/wik-outage-jan17

6.2 Future Research

Our findings have the potential to facilitate further research in the area of IaC defects. We have observed in Section 5.3 that the size of IaC scripts correlates with defects. Our findings can be helpful in building size-based prediction models to identify defective IaC scripts. In Sections 5.3 and 5.4, we have observed process differences that occur between IT organizations, and future research can systematically investigate whether process differences exist in IaC development, and why they exist.
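A minimal sketch of the size-based prediction idea follows, assuming hypothetical training data and a simple logistic regression learner; the paper does not prescribe a particular model.

# Logistic regression that uses script size (LOC) as the only feature to
# estimate the probability that an IaC script is defective. Data is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

loc = np.array([[30], [45], [60], [90], [150], [210], [300], [400]])
defective = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # hypothetical labels

model = LogisticRegression().fit(loc, defective)
print(model.predict_proba([[120]])[0, 1])  # estimated probability of being defective

In practice, such a model would be trained on labeled scripts from the studied repositories and could be validated using, for example, the model validation techniques discussed by Tantithamthavorn et al. (Tantithamthavorn et al, 2017).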

We have applied a qualitative process to categorize defects using the defect type attribute of ODC. We acknowledge that our process is manual and labor-intensive. We advocate for future research that can look into how the process of ODC can be automated.

7 Threats To Validity

We describe the threats to validity of our paper as follows:



– Conclusion Validity: Our findings are subject to conclusion validity, as the defect categorization process involves human judgment. Our approach is based on qualitative analysis, where raters categorized XCMs and assigned defect categories. We acknowledge that the process is susceptible to human judgment, and the raters' experience can bias the assigned categories. The accompanying human subjectivity can influence the distribution of the defect categories for the IaC scripts of interest.
We mitigated this threat by assigning multiple raters to the same set of XCMs. Next, we used a resolver, who resolved the disagreements. Further, we cross-checked our categorization with practitioners who authored the XCMs, and observed 'substantial' to 'almost perfect' agreement.

– Internal Validity: Our empirical study is based on OSS repositories maintained by four organizations: Mirantis, Mozilla, Openstack, and Wikimedia Commons. The selected repositories are subject to selection bias, as we extract the required IaC scripts only from the OSS domain. We acknowledge that our sample of IaC scripts can be limiting, and we hope to mitigate this limitation by collecting IaC scripts from proprietary sources.
We have used a combination of commit messages and issue report descriptions to determine if an IaC script is associated with a defect. We acknowledge that these messages might not have given the full context to the raters. Other sources of information, such as practitioner input and the code changes that take place in each commit, could have provided the raters better context to categorize the XCMs.
We acknowledge that we have not used the trigger attribute of ODC, as our data sources do not include any information from which we can infer trigger attributes of ODC such as 'design conformance', 'side effects', and 'concurrency'.

– Construct Validity: Our process of using human raters to determine defect categories can be limiting, as the process is susceptible to mono-method bias, where the subjective judgment of raters can influence the findings. We mitigated this threat by using multiple raters.
Also, for Mirantis and Wikimedia, we used graduate students who performed the categorization as part of their class work. Students who participated in the categorization process can be subject to evaluation apprehension, i.e., consciously or sub-consciously relating their performance to the grades they would achieve for the course. We mitigated this threat by clearly explaining to the students that their performance in the categorization process would not affect their grades.
The raters involved in the categorization process had professional experience in software engineering of at least two years on average. Their experience in software engineering may make the raters curious about the expected outcomes of the categorization process, which may affect the resulting distribution of defect categories. Furthermore, the resolver also has professional experience in software engineering and IaC script development, which could influence the outcome of the defect category distribution.


– External Validity: Our scripts are collected from the OSS domain, and not from proprietary sources. Our findings are subject to external validity, as they may not be generalizable.
We construct our datasets using Puppet, which is a declarative language. Our findings may not generalize to IaC scripts that use an imperative form of language. We hope to mitigate this limitation by analyzing IaC scripts written in imperative languages.

8 Conclusion

Use of IaC scripts is crucial for the automated maintenance of software delivery and deployment infrastructure. Similar to software source code, IaC scripts churn frequently and can include defects that have serious real-world consequences. IaC defect categorization can help IT organizations identify opportunities to improve IaC development. We have conducted an empirical analysis using four datasets from four organizations, namely Mirantis, Mozilla, Openstack, and Wikimedia Commons. With 89 raters, we apply the defect type attribute of the orthogonal defect classification (ODC) methodology to categorize the defects. For all four datasets, we have observed that the majority of the defects are related to assignment, i.e., defects related to syntax errors and erroneous configuration assignments. We have observed that, compared to IaC scripts, assignment-related defects occur less frequently in non-IaC software. We have also observed that the defect density is 27.6, 18.4, 16.2, and 17.1 defects per KLOC, respectively, for Mirantis, Mozilla, Openstack, and Wikimedia. For all four datasets, we observe IaC defects to follow a consistent temporal trend. We hope our findings will facilitate further research in IaC defect analysis.

Acknowledgements The Science of Security Lablet at the North Carolina State University supported this research study. We thank the Realsearch research group members for their useful feedback. We also thank the practitioners who answered our questions.

References

Alali A, Kagdi H, Maletic JI (2008) What's a typical commit? a characterization of open source software repositories. In: 2008 16th IEEE International Conference on Program Comprehension, pp 182–191, DOI 10.1109/ICPC.2008.24
Basili VR, Perricone BT (1984) Software errors and complexity: An empirical investigation. Commun ACM 27(1):42–52, DOI 10.1145/69605.2085, URL http://doi.acm.org/10.1145/69605.2085
Basso T, Moraes R, Sanches BP, Jino M (2009) An investigation of java faults operators derived from a field data study on java software faults. In: Workshop de Testes e Tolerancia a Falhas, pp 1–13
Battin RD, Crocker R, Kreidler J, Subramanian K (2001) Leveraging resources in global software development. IEEE Software 18(2):70–77, DOI 10.1109/52.914750
van der Bent E, Hage J, Visser J, Gousios G (2018) How good is your puppet? an empirically defined and validated quality model for puppet. In: 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp 164–174, DOI 10.1109/SANER.2018.8330206
Chillarege R, Bhandari IS, Chaar JK, Halliday MJ, Moebus DS, Ray BK, Wong MY (1992) Orthogonal defect classification-a concept for in-process measurements. IEEE Transactions on Software Engineering 18(11):943–956, DOI 10.1109/32.177364
Christmansson J, Chillarege R (1996) Generation of an error set that emulates software faults based on field data. In: Proceedings of Annual Symposium on Fault Tolerant Computing, pp 304–313, DOI 10.1109/FTCS.1996.534615
Cinque M, Cotroneo D, Corte RD, Pecchia A (2014) Assessing direct monitoring techniques to analyze failures of critical industrial systems. In: 2014 IEEE 25th International Symposium on Software Reliability Engineering, pp 212–222, DOI 10.1109/ISSRE.2014.30
Cito J, Leitner P, Fritz T, Gall HC (2015) The making of cloud applications: An empirical study on software development for the cloud. In: Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ACM, New York, NY, USA, ESEC/FSE 2015, pp 393–403, DOI 10.1145/2786805.2786826, URL http://doi.acm.org/10.1145/2786805.2786826
Cliff N (1993) Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin 114(3):494–509
Cohen J (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1):37–46, DOI 10.1177/001316446002000104, URL http://dx.doi.org/10.1177/001316446002000104
Commons W (2018) Projects: Gerrit Wikimedia. https://gerrit.wikimedia.org/r/#/admin/projects/, [Online; accessed 12-September-2018]
Cotroneo D, Pietrantuono R, Russo S (2013) Testing techniques selection based on odc fault types and software metrics. J Syst Softw 86(6):1613–1637, DOI 10.1016/j.jss.2013.02.020, URL http://dx.doi.org/10.1016/j.jss.2013.02.020
Cox DR, Stuart A (1955) Some quick sign tests for trend in location and dispersion. Biometrika 42(1/2):80–95, URL http://www.jstor.org/stable/2333424
Duraes JA, Madeira HS (2006) Emulation of software faults: A field data study and a practical approach. IEEE Trans Softw Eng 32(11):849–867, DOI 10.1109/TSE.2006.113, URL http://dx.doi.org/10.1109/TSE.2006.113
Fenton NE, Ohlsson N (2000) Quantitative analysis of faults and failures in a complex software system. IEEE Trans Softw Eng 26(8):797–814, DOI 10.1109/32.879815, URL http://dx.doi.org/10.1109/32.879815
Fonseca J, Vieira M (2008) Mapping software faults with web security vulnerabilities. In: 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN), pp 257–266, DOI 10.1109/DSN.2008.4630094
Fowler M (2010) Domain Specific Languages, 1st edn. Addison-Wesley Professional
Freimut B, Denger C, Ketterer M (2005) An industrial case study of implementing and validating defect classification for process improvement and quality management. In: 11th IEEE International Software Metrics Symposium (METRICS'05), pp 10 pp.–19, DOI 10.1109/METRICS.2005.10
Gupta A, Li J, Conradi R, Rønneberg H, Landre E (2009) A case study comparing defect profiles of a reused framework and of applications reusing it. Empirical Softw Engg 14(2):227–255, DOI 10.1007/s10664-008-9081-9, URL http://dx.doi.org/10.1007/s10664-008-9081-9
Hanappi O, Hummer W, Dustdar S (2016) Asserting reliable convergence for configuration management scripts. SIGPLAN Not 51(10):328–343, DOI 10.1145/3022671.2984000, URL http://doi.acm.org/10.1145/3022671.2984000
Harlan D (1987) Cleanroom Software Engineering
Hartz MA, Walker EL, Mahar D (1996) Introduction to software reliability: A state of the art review. The Center
Hatton L (1997) Reexamining the fault density-component size connection. IEEE Softw 14(2):89–97, DOI 10.1109/52.582978, URL http://dx.doi.org/10.1109/52.582978
Henningsson K, Wohlin C (2004) Assuring fault classification agreement – an empirical evaluation. In: Proceedings of the 2004 International Symposium on Empirical Software Engineering, IEEE Computer Society, Washington, DC, USA, ISESE '04, pp 95–104, DOI 10.1109/ISESE.2004.13, URL http://dx.doi.org/10.1109/ISESE.2004.13
Herzig K, Just S, Zeller A (2013) It's not a bug, it's a feature: How misclassification impacts bug prediction. In: 2013 35th International Conference on Software Engineering (ICSE), pp 392–401, DOI 10.1109/ICSE.2013.6606585
Humble J, Farley D (2010) Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, 1st edn. Addison-Wesley Professional
IEEE (2010) IEEE standard classification for software anomalies. IEEE Std 1044-2009 (Revision of IEEE Std 1044-1993) pp 1–23, DOI 10.1109/IEEESTD.2010.5399061
Jiang Y, Adams B (2015) Co-evolution of infrastructure and source code: An empirical study. In: Proceedings of the 12th Working Conference on Mining Software Repositories, IEEE Press, Piscataway, NJ, USA, MSR '15, pp 45–55, URL http://dl.acm.org/citation.cfm?id=2820518.2820527
Kim G, Debois P, Willis J, Humble J (2016) The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. IT Revolution Press
Labs P (2017) Puppet Documentation. https://docs.puppet.com/, [Online; accessed 19-July-2017]
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174, URL http://www.jstor.org/stable/2529310
Lutz RR, Mikulski IC (2004) Empirical analysis of safety-critical anomalies during operations. IEEE Transactions on Software Engineering 30(3):172–180, DOI 10.1109/TSE.2004.1271171
Lyu MR, Huang Z, Sze SKS, Cai X (2003) An empirical study on testing and fault tolerance for software reliability engineering. In: 14th International Symposium on Software Reliability Engineering, 2003 (ISSRE 2003), pp 119–130, DOI 10.1109/ISSRE.2003.1251036
Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18(1):50–60, URL http://www.jstor.org/stable/2236101
Maximilien EM, Williams L (2003) Assessing test-driven development at IBM. In: Proceedings of the 25th International Conference on Software Engineering, IEEE Computer Society, Washington, DC, USA, ICSE '03, pp 564–569, URL http://dl.acm.org/citation.cfm?id=776816.776892
McConnell S (2004) Code complete - a practical handbook of software construction, 2nd Edition
McCune JT, Jeffrey (2011) Pro Puppet, 1st edn. Apress, DOI 10.1007/978-1-4302-3058-8, URL https://www.springer.com/gp/book/9781430230571
Mirantis (2018) Mirantis. https://github.com/Mirantis, [Online; accessed 12-September-2018]
Mohagheghi P, Conradi R, Killi OM, Schwarz H (2004) An empirical study of software reuse vs. defect-density and stability. In: Proceedings of the 26th International Conference on Software Engineering, IEEE Computer Society, Washington, DC, USA, ICSE '04, pp 282–292, URL http://dl.acm.org/citation.cfm?id=998675.999433
Moller KH, Paulish DJ (1993) An empirical investigation of software fault distribution. In: [1993] Proceedings First International Software Metrics Symposium, pp 82–90, DOI 10.1109/METRIC.1993.263798
Mozilla (2018) Mercurial repositories index. hg.mozilla.org, [Online; accessed 12-September-2018]
Munaiah N, Kroh S, Cabrey C, Nagappan M (2017) Curating github for engineered software projects. Empirical Software Engineering pp 1–35, DOI 10.1007/s10664-017-9512-6, URL http://dx.doi.org/10.1007/s10664-017-9512-6
Openstack (2018) OpenStack git repository browser. http://git.openstack.org/cgit/, [Online; accessed 12-September-2018]
Parnin C, Helms E, Atlee C, Boughton H, Ghattas M, Glover A, Holman J, Micco J, Murphy B, Savor T, Stumm M, Whitaker S, Williams L (2017) The top 10 adages in continuous deployment. IEEE Software 34(3):86–95, DOI 10.1109/MS.2017.86
Pecchia A, Russo S (2012) Detection of software failures through event logs: An experimental study. In: 2012 IEEE 23rd International Symposium on Software Reliability Engineering, pp 31–40, DOI 10.1109/ISSRE.2012.24
Puppet (2018) Ambit energy's competitive advantage? it's really a devops software company. Tech. rep., Puppet, URL https://puppet.com/resources/case-study/ambit-energy
Rahman A, Williams L (2018) Characterizing defective configuration scripts used for continuous deployment. In: 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), pp 34–45, DOI 10.1109/ICST.2018.00014
Rahman A, Partho A, Meder D, Williams L (2017) Which factors influence practitioners' usage of build automation tools? In: Proceedings of the 3rd International Workshop on Rapid Continuous Software Engineering, IEEE Press, Piscataway, NJ, USA, RCoSE '17, pp 20–26, DOI 10.1109/RCoSE.2017.8, URL https://doi.org/10.1109/RCoSE.2017.8
Rahman A, Partho A, Morrison P, Williams L (2018) What questions do programmers ask about configuration as code? In: Proceedings of the 4th International Workshop on Rapid Continuous Software Engineering, ACM, New York, NY, USA, RCoSE '18, pp 16–22, DOI 10.1145/3194760.3194769, URL http://doi.acm.org/10.1145/3194760.3194769
Rahman AAU, Helms E, Williams L, Parnin C (2015) Synthesizing continuous deployment practices used in software development. In: Proceedings of the 2015 Agile Conference, IEEE Computer Society, Washington, DC, USA, AGILE '15, pp 1–10, DOI 10.1109/Agile.2015.12, URL http://dx.doi.org/10.1109/Agile.2015.12
Ray B, Hellendoorn V, Godhane S, Tu Z, Bacchelli A, Devanbu P (2016) On the "naturalness" of buggy code. In: Proceedings of the 38th International Conference on Software Engineering, ACM, New York, NY, USA, ICSE '16, pp 428–439, DOI 10.1145/2884781.2884848, URL http://doi.acm.org/10.1145/2884781.2884848
Romano J, Kromrey J, Coraggio J, Skowronek J (2006) Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys? In: Annual meeting of the Florida Association of Institutional Research, pp 1–3
Shambaugh R, Weiss A, Guha A (2016) Rehearsal: A configuration verification tool for puppet. SIGPLAN Not 51(6):416–430, DOI 10.1145/2980983.2908083, URL http://doi.acm.org/10.1145/2980983.2908083
Sharma T, Fragkoulis M, Spinellis D (2016) Does your configuration code smell? In: Proceedings of the 13th International Conference on Mining Software Repositories, ACM, New York, NY, USA, MSR '16, pp 189–200, DOI 10.1145/2901739.2901761, URL http://doi.acm.org/10.1145/2901739.2901761
Smith AM (1993) Reliability-centered maintenance, vol 83. McGraw-Hill, New York
Sullivan M, Chillarege R (1991) Software defects and their impact on system availability-a study of field failures in operating systems. In: [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium, pp 2–9, DOI 10.1109/FTCS.1991.146625
Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2017) An empirical comparison of model validation techniques for defect prediction models. IEEE Trans Softw Eng 43(1):1–18, DOI 10.1109/TSE.2016.2584050, URL https://doi.org/10.1109/TSE.2016.2584050
Thung F, Wang S, Lo D, Jiang L (2012) An empirical study of bugs in machine learning systems. In: 2012 IEEE 23rd International Symposium on Software Reliability Engineering, pp 271–280, DOI 10.1109/ISSRE.2012.22
Voelter M (2013) DSL Engineering: Designing, Implementing and Using Domain-Specific Languages. CreateSpace Independent Publishing Platform, USA
Zhang F, Mockus A, Keivanloo I, Zou Y (2016) Towards building a universal defect prediction model with rank transformed predictors. Empirical Softw Engg 21(5):2107–2145, DOI 10.1007/s10664-015-9396-2, URL http://dx.doi.org/10.1007/s10664-015-9396-2
Zhang F, Hassan AE, McIntosh S, Zou Y (2017) The use of summation to aggregate software metrics hinders the performance of defect prediction models. IEEE Transactions on Software Engineering 43(5):476–491, DOI 10.1109/TSE.2016.2599161
Zheng J, Williams L, Nagappan N, Snipes W, Hudepohl JP, Vouk MA (2006) On the value of static analysis for fault detection in software. IEEE Transactions on Software Engineering 32(4):240–253, DOI 10.1109/TSE.2006.38