[ieee 2013 4th international workshop on emerging trends in software metrics (wetsom) - san...



Page 1: [IEEE 2013 4th International Workshop on Emerging Trends in Software Metrics (WETSoM) - San Francisco, CA, USA (2013.05.21-2013.05.21)] 2013 4th International Workshop on Emerging

Measuring Software Projects Mayan Style

Siim Karus

University of Tartu

Tartu, Estonia

[email protected]

Abstract—The progress of contemporary software projects is subject to several different measurements. These measurements are often subjective and rely on developers’ personal predictions. In addition, software projects are assumed to progress linearly from beginning to end. This can be a good approximation of progress in projects with rigid, clearly defined and planned deliverables or deadlines, but it is not sufficient for community-driven, loosely defined projects. In this paper, we propose an alternative, cyclical view of measuring progress in software development. Based on cyclical time perceptions from other fields of life, we analysed 23 open source software projects to find reoccurring patterns in them. The empirically derived cyclical and event-based measurements of software projects’ progress do not suffer from the linear approximation issues seen with many other measurements. Accordingly, we believe that the derived progress modelling technique describes community-driven software better, as it adapts to environmental changes and lends itself to building estimations of the projects’ future.

Index Terms—Open source software, wavelet analysis,

patterns, evolution, measurement

I. INTRODUCTION

Open source software projects offer an increasing wealth of

software project evolution data [1]. Much of this data is stored

in different information systems like source code repositories,

change management systems, project management systems and

social information exchange systems like e-mail. All these have

been useful sources of data for researchers aiming to improve

the quality of software or the software development process.

In our study, we are following this long-standing practice by

taking a look at source code repositories in order to find

indicators that could be useful for tracking the projects’

progress. In other words, we are looking for reoccurring

patterns in open source software projects.

Knowing these patterns could help in:

• Identifying project state or relative progress;
• Identifying naturally occurring iterations in loosely led projects;
• Making estimations on subsequent software project evolution cycles;
• Preparing projects for their next milestones;
• Making projects comparable with each other.

In contemporary software development, subjective experience from previous projects and project life cycle theories, which are often based on such experience, are used as the basis for these tasks. Sometimes linear approximations like project size in lines of code (LOC)¹, cumulative code churn (sum of LOC added, deleted, and modified), or calendar-based duration are used to aid in these tasks. Unfortunately, as can be seen from Figure 1, neither project size nor project duration provides a truly meaningful and reliable means of tracking project progress, due to the uneven activity in the projects. Moreover, the projects differ greatly in size, activity and duration, making size- and duration-based comparisons difficult.

Learning from the calendars developed by ancient cultures like the Maya, we know that time and evolution can be considered cyclical. This notion allows us to measure duration and progress even if we do not know the end date. Thus, it follows the practice of open source community-driven software projects that have no planned “end”. It does, however, come at the expense of requiring a “container” for cycles so that we know when an old cycle ends and a new one begins.

We aim to fill that void by applying scale-tolerant

frequent pattern mining to identify similarities in software

projects and offer a common scale for comparing projects.

Correspondingly, we are trying to answer the following

questions:

RQ1. Are there reoccurring common evolution patterns in

open source software projects?

RQ2. If common evolution patterns exist, are they

regular/cyclical?

The paper starts by giving background on the life cycles

of software projects and on the wavelet analysis method in Section II.

The dataset used for the study is explained in Section III and

the method in Section IV. We present the discussion in Section

V and conclude with Section VI.

II. BACKGROUND

To solve the undertaken task, we are making use of a

technique popular in mechanics, medicine, image and audio

processing. However, this technique is rarely used for business

process data exploration, which is the theme in this paper.

Thus, in the next subsections, we will give an overview of the

project life cycle theories followed by a short introduction to

the wavelet analysis technique employed in this paper.

A. Project Life Cycle Theories

In order to determine whether software projects follow natural cyclical patterns, we need to understand the death of software projects, as it marks the end of the final cycle. The death of an open source software project is a little-studied phenomenon. It is as if open source software projects are expected to never die. Inactive projects are removed from the active Web, become difficult to find, and are soon forgotten, which contributes to the illusion.

¹ In this paper, lines of code comprises all lines (including empty ones) of all textual files stored in the project repository.

978-1-4673-6331-0/13/$31.00 © 2013 IEEE. WETSoM 2013, San Francisco, CA, USA

There exists a wealth of studies into the vitality of open source software [2]. These studies have focused on the projects’ ability to provide support and grow in the number of releases. We are not interested in measuring projects’ strength or success; we only look for signs related to the development process, ignoring community support phases. This gives a very different interpretation to the “death” and success of a project, as a project where development has ceased can still be successful in terms of user adoption and continued community support, as described in Section II.B. Most importantly, cessation of development does not always mean failure.

Success in formal closed source in-firm software development projects is clearly defined by their financial success. However, this measure of success does not apply to many open source projects, which are community led and distributed freely. This has led to a redefinition of success in open source software projects. Success in open source software projects can be defined as a high level of end-user adoption [3], high software quality [4], a high level of developer engagement, or in many other objective or subjective respects [5]. This distinction, along with the volunteer-oriented processes, means that a different approach to development needs to be taken.

Eric Raymond explored the differences between open

source projects and closed source projects in his book “The

Cathedral and the Bazaar” [6], which has become one of the

most cited books on the management of community-driven

open projects. In his book he compares the closed team projects

to building a cathedral, where the process is planned, and open

team projects to a bazaar, where participants are constantly

competing and collaborating to reach the target in independent

small scale steps. The bazaar-like agile behaviours are clearly

noticeable in open source development processes; however, the

two distinct management philosophies have begun merging as

open source projects have increased in scale [7, 8].

A frequent criticism of studies of open source software is the low number of projects involved [9]. In this study, we are using 23 projects from different repositories and development teams to reduce the inherent risks of data sampling.

B. Causes of Death

Even though open source software has been around since

the advent of programming, the nature of open source software

has become more varied. In particular, improved opportunities

to cooperate have made community-developed long-lived

software more common. This can be seen as a necessary

evolutionary step to handle the increasing complexity and

volume of modern software.

A common aspect of interest in those projects is the

motivation behind them. There are claims of intrinsic

motivation in open source software development as well as

external motivators in play [10, 11]. Whatever the exact motivators for participating in open source software development, it is clear that they have a significant impact on the way the projects are managed and progress. This impact will also show in the evolutionary patterns found in the projects.

Figure 1. Relative progress by date vs. relative progress by cumulative LOC churn. Bubble size reflects number of commits.

There is much research on identifying what makes some open source projects successful and why some projects never seem to catch on. The research on the success of open source software projects is fragmented by the various definitions of success of open source software projects [12]. The determinants of open source project success can be external or internal [13, 2]. For example, developers and end-users have shown a preference for more open project licenses [14].

The studies on the success of projects are mainly focused on identifying the reasons for popularity among end-users. While this is a valid definition of success, we are left wondering why some projects seem to carry on forever, releasing new versions every now and then, while others seem to cease evolving (independent of their popularity).

For the purposes of this study, we consider two main causes of discontinued development of community-driven open source software projects: loss of popularity and reaching maturity. In the case of solo projects, we could also include cases where the developer is simply no longer available (e.g. due to death, employment, etc.), which is not the case in multiple-developer community-driven projects.

1) Loss of Popularity

Loss of popularity can be a result of many different events.

For example, a product might lose its share due to another

product replacing it. Another cause of loss of popularity is

the technical advancement of platforms. For example,

applications built for older operating systems or deprecated

hardware will be unusable or not needed in the new setting.

2) Reaching Maturity

If a project reaches maturity, it will be used without any changes for the foreseeable future. That is, the product either becomes future-proof or achieves a high level of forward compatibility. This differs from development process maturity as defined by CMMI [15] or OMM [16].

In case of becoming future-proof, the product will function

as-is without any change. A common way of achieving this is

by building in extensibility options. This makes it a preferred aim

for software libraries and protocols (e.g. TCP/IP or HTTP),

which can go without change for decades.

Forward compatibility on the other hand is achieved by

adaptation. This could mean application of fuzzy logic,

extensibility, modularity, natural-language-processing, change

estimation, or even self-evolving programs (e.g. worms and

viruses often apply this approach). A simple example of this

kind of project is a wrapper for different libraries.

Even though differentiating between the causes of death would be advisable, there is too little data on the use of software to make such a distinction. At best we could say that projects which are no longer available on the Web have died due to loss of popularity. Unfortunately, the unavailability of these projects makes it impossible to gather data about them. Thus, we can assume that all the projects involved in this study have at least a minimal user base for which the software is sufficiently mature and the projects have died a “good” death.

C. Wavelet Analysis

Wavelet analysis is analysis of signals (time series) by

decomposing the signal into wavelet coefficients (also known

as shift coefficients) and scaling coefficients based on wavelet

functions (also known as filters). Such decomposition allows

compression as the resulting number of coefficients can be

smaller than the number of original samples. This

decomposition can be repeated on the wavelet coefficients until

the number of resulting wavelet coefficients is smaller than the

filter length.

In this study we chose to use a Daubechies filter of length 2 (also known as the Haar wavelet [17]) due to its simplicity and simple interpretation. We apply the discrete wavelet transform, meaning that we use discrete shifts when matching the time series with the wavelet. A decomposition of the LOC series of project “docbook2X” against different time series to different levels of wavelet transform is shown in Figure 2. In this figure, lines mark scaling coefficients (V) and bars mark wavelet coefficients (W). All these coefficients are normalised, and the last level of decomposition is excluded as it has only 1 value.
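To make the decomposition concrete, the following is a minimal sketch of a Haar discrete wavelet transform in plain Python. It is not the code used in the study (which relied on the R “wavelets” package). The function names, the sample values and the sign convention of the wavelet coefficients are assumptions made for illustration. The sketch follows the common pyramid formulation, in which each further level is computed from the scaling coefficients of the previous level.

```python
import math

def haar_step(series):
    """One Haar level: pairwise scaled differences (wavelet) and sums (scaling)."""
    if len(series) < 2 or len(series) % 2:
        raise ValueError("series length must be even and at least 2")
    norm = 1 / math.sqrt(2)
    pairs = list(zip(series[0::2], series[1::2]))
    wavelet = [(b - a) * norm for a, b in pairs]  # detail: local change
    scaling = [(a + b) * norm for a, b in pairs]  # smooth: local average
    return wavelet, scaling

def haar_dwt(series):
    """Return [(wavelet, scaling), ...] per level, stopping when no pair is left."""
    levels = []
    current = list(series)
    while len(current) >= 2 and len(current) % 2 == 0:
        wavelet, current = haar_step(current)
        levels.append((wavelet, current))
    return levels

# Hypothetical LOC samples, one per aggregation frame.
loc = [4, 6, 10, 12, 8, 6, 5, 5]
for level, (w, v) in enumerate(haar_dwt(loc), start=1):
    print(level, [round(x, 3) for x in w], [round(x, 3) for x in v])
```

Each level halves the number of coefficients, so a series of 8 samples yields 3 levels, and the last level holds a single value, which is why it is omitted from Figure 2.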

Wavelet transform has proven important in signal

processing thanks to its inherent properties which allow

comparisons at different scales and shifts. This gives three

important advantages compared to many other time series

analysis techniques:

• Scaling coefficients allow fuzzy matching as differences in details are “smoothed out”.
• Filter coefficients allow detection of small anomalies in series.
• Discrete transform levels make series of different lengths or scale comparable.

These advantages have been beneficial in financial

analytics for identification of anomalies and correlations to

identify opportunities [18, 19, 20]. The fuzzy matching and

scale comparisons have proven useful for clone detection in

image processing [21, 22]. The advantages of wavelet analysis

techniques are also useful for frequent pattern analysis of time

series data such as that used in this study.

III. DATA

A. Projects

The analysis was performed on 23 open source software projects. 18 of these projects were from a dataset of software projects chosen randomly using Google Code Search. The projects were selected from various repositories, employing different source code languages and having multiple developers in a team. This ensured that we represent different team sizes and project types. 15 of these projects are ongoing and 3 have had no development activity for at least a year (are “dead”). The alive projects have stayed alive for a minimum of 4 years and at least 3 years after the last data sample timestamp used in the analysis. Projects “fbug-read-only” and “vim7” have an earlier last data sample date as the development in these projects was moved to another repository. The list of projects in this dataset is given in Table I. This dataset was verified to have source code and activity structure similar to the dataset of more than 400,000 open source software projects tracked by ohloh.net². Thus, this dataset should represent the overall state of open source software development fairly well.

The dataset of 18 projects was complemented by 5 dead

projects from sourceforge.net. The aim of complementing the

original dataset was to balance the number of dead and alive

projects in the study. These 5 projects were also used as an

independent dataset for verifying some of the patterns found in

the 18 project dataset.

We used specially built software to implement ETL (extract-transform-load). The software downloaded the projects’ CVS and SVN repository data into a SQL Server³ database and processed the history data to compute lines of code (LOC) and code churn metrics. A commit log of the projects was exported from SQL Server for wavelet analysis.
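As an illustration of what such a commit log can be turned into, the sketch below derives the two progress dimensions used later in the paper (days since the first commit and cumulative code churn) from per-commit line counts. The `Commit` record and its field names are hypothetical; the actual schema of the exported log is not given in the paper. Churn is taken here as the sum of lines added, modified and deleted.

```python
from dataclasses import dataclass
from datetime import date
from itertools import accumulate

@dataclass
class Commit:
    # Hypothetical record layout; the real export format is not documented.
    day: date
    added: int
    modified: int
    deleted: int

def churn(commit):
    """Code churn of one commit: lines added + modified + deleted."""
    return commit.added + commit.modified + commit.deleted

def progress_dimensions(commits):
    """Return (days since first commit, cumulative churn) for each commit."""
    ordered = sorted(commits, key=lambda c: c.day)
    first_day = ordered[0].day
    days = [(c.day - first_day).days for c in ordered]
    cumulative = list(accumulate(churn(c) for c in ordered))
    return days, cumulative

log = [
    Commit(date(2010, 1, 1), added=100, modified=20, deleted=0),
    Commit(date(2010, 1, 4), added=10, modified=15, deleted=5),
    Commit(date(2010, 1, 11), added=200, modified=60, deleted=40),
]
days, cumulative_churn = progress_dimensions(log)
print(days, cumulative_churn)
```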

B. Metrics

We conducted wavelet analysis with respect to two different time series dimensions: days since the first commit and cumulative code churn. Code churn is the sum of code added, modified and removed [23]. Those two time series were chosen due to their popularity in project process measurement frameworks. Even though some solutions use lines of code (LOC) in a project snapshot to measure progress in software development, we consider it bad practice, as LOC does not grow monotonically throughout the development process (see Figure 2). This measure can still be used for comparing progress to the estimated final size of the software code base. Future cumulative code churn can also be estimated with reasonable accuracy based on project snapshots [24]. Thus, cumulative code churn as a development progress measure combines some of the benefits of measuring progress in time spent and in LOC of final code produced.

² http://www.ohloh.net/

³ http://www.microsoft.com/en-us/sqlserver/

The data series used in the analysis were related to the

developers participating in the projects, code churn, and project

size.

The metrics relating to developers were:

• Average number of active developers – it is reasonable to assume that more active developers will write code faster (more cumulative code in the same timeframe)
• Cumulative number of developers – this reflects the diversity of knowledge of the code as different developers work on different sections of the code
• Number of commits – we would expect the commit frequency to drop gradually before the death of a project
• Relative team size (cumulative number of developers divided by the total number of developers at the date of the last data point collected about the project) – as the inclusion of other/new developers might be planned, we get to know how many different developers have already touched the code

Figure 2. DWT decomposition of LOC series by time (left) and cumulative churn (right). Bottom graphs show original series, number in brackets shows transform level, lines show normalized scaling coefficients, bars normalized wavelet coefficients.

The metrics relating to code churn were:

• Mean LOC added, modified, deleted, and churned per commit (4 metrics) – large commits might lead to defects, which could be the cause for a project to be abandoned
• Cumulative LOC added, modified, deleted, and churned per commit (4 metrics) – the size of a project history relates to the complexity and abundance of different thought patterns, which could be a deterrent to new and old developers
• Relative cumulative LOC churn (only for dead projects) – the progress of development measured in LOC

The metrics relating to project size were:

• Mean LOC – the size of a project relates to the complexity, which could be a deterrent to new and old developers
• Mean number of files – the size of a project relates to the complexity, which could be a deterrent to new and old developers
• Relative progress by date (only for dead projects)
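The relative metrics in the lists above (relative team size, relative cumulative LOC churn, relative progress by date) share one form: a cumulative value divided by its value at the last data point. The short sketch below shows that normalisation; the function name is hypothetical and the numbers are invented for illustration.

```python
def relative_series(cumulative):
    """Scale a cumulative series so that its last data point equals 1.0."""
    final = cumulative[-1]
    if final == 0:
        raise ValueError("the final value must be non-zero")
    return [value / final for value in cumulative]

# Cumulative number of developers at four data points (made-up values).
print(relative_series([1, 2, 2, 4]))  # relative team size over time
```

For dead projects the same normalisation applied to days since the first commit and to cumulative churn gives the two axes of Figure 1.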

In our study, lines of code is measured by counting all text lines, including source code, comments, configuration settings, readme, and build files. This takes into account our previous findings showing that on average 4 different types of code are used in an open source software project, and that plaintext or configuration files are a significant portion of that code [25].

IV. METHOD

The analysis and data preparation were conducted in several steps: data aggregation, discrete wavelet transform, similar region detection and grouping.

In the first step, data series were aggregated along the two time series dimensions. For days since first commit, the data was aggregated in 7-day frames (corresponding to a week). For cumulative code churn, a frame of 1000 LOC was used instead.
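The aggregation step can be sketched as follows. This is an illustration only: the study used the R “zoo” package, and the paper does not state which aggregation function was applied or how empty frames were handled. Here the mean is used within a frame and empty frames repeat the previous frame's value; both choices, as well as the function name, are assumptions.

```python
from statistics import mean

def aggregate_frames(positions, values, width, agg=mean):
    """Group values into fixed-width frames along one progress dimension.

    positions: days since first commit, or cumulative churn in LOC
    width:     7 for weekly frames, 1000 for churn frames
    Empty frames carry the previous value forward (an assumption).
    """
    buckets = {}
    for pos, val in zip(positions, values):
        buckets.setdefault(pos // width, []).append(val)
    series = []
    last = None
    for frame in range(min(buckets), max(buckets) + 1):
        if frame in buckets:
            last = agg(buckets[frame])
        series.append(last)
    return series

loc = [100, 140, 180, 260]                                   # LOC after each commit
weekly = aggregate_frames([0, 3, 10, 30], loc, width=7)
by_churn = aggregate_frames([120, 150, 450, 2300], loc, width=1000)
print(weekly, by_churn)
```

Either call yields an evenly spaced series, which is what the discrete wavelet transform in the next step expects.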

TABLE I. LISTING OF PROJECTS INVOLVED IN THE STUDY.

| Name | State | Duration (weeks) | Cumulative churn (kLOC) | Location |
|---|---|---|---|---|
| bibliographic | inactive | 309 | 348 | www.openoffice.org/bibliographic |
| bizdev | inactive | 272 | 531 | www.openoffice.org/bizdev |
| commons | active | 121 | 2498 | wso2.org |
| dia | active | 641 | 2521 | live.gnome.org/Dia |
| docbook | active | 454 | 9073 | docbook.sourceforge.net |
| docbook2X | inactive | 432 | 234 | docbook2x.sourceforge.net |
| esb | active | 121 | 1057 | wso2.org |
| exist | active | 363 | 3578 | exist.sourceforge.net |
| fbug-read-only | repo moved | 22 | 39 | fbug.googlecode.com |
| feedparser-read-only | active | 246 | 105 | feedparser.googlecode.com |
| gnome-doc-utils | active | 263 | 64 | live.gnome.org/GnomeDocUtils |
| gnucash | active | 604 | 4835 | gnucash.org |
| groovy | active | 321 | 1775 | svn.codehaus.org/groovy |
| ivam | inactive | 19 | 19 | ivam.sourceforge.net |
| jackcc | inactive | 12 | 1152 | jackcc.sourceforge.net |
| jd4x | inactive | 173 | 25 | jdx.sourceforge.net |
| jedidbd | inactive | 42 | 293 | jedidbd.sourceforge.net |
| tei | active | 276 | 3642 | tei.sourceforge.net |
| valgrind | active | 375 | 2646 | valgrind.org |
| vim7 | repo moved | 21 | 496 | vim.org |
| VirtualDubMod15 | inactive | 157 | 24922 | virtualdubmod.sourceforge.net |
| wsas | active | 120 | 2077 | wso2.org |
| wsf | active | 121 | 3630 | wso2.org |


In the second step, the discrete wavelet transform with a Daubechies filter of length 2 (also known as the Haar filter) was applied to the data series. This gives us two coefficient vectors (wavelet and scaling coefficients) for each transform (compression) level.

Linearly positively similar (maximum deviation 0.5%) maximal sub-sequences of the coefficient vectors were identified in the third step. We considered only sub-sequences of length 3 or more, as any two 2-value sequences are linearly similar (though not necessarily positively). We only looked for similarities in the same dimension and the same type of coefficient (for example, we did not look for similarities between cumulative LOC added filter coefficient vectors and number of commits scaling coefficient vectors).
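The paper does not spell out the similarity test, so the following sketch rests on an assumed reading: two equal-length sequences x and y are linearly positively similar if y is approximated by a·x + b with a > 0, and every point deviates by at most 0.5% of the largest absolute value in y. The least-squares fit, the deviation scale and both function names are assumptions. The sketch only compares fixed-length windows and leaves out the extension to maximal sub-sequences.

```python
def linearly_positively_similar(x, y, tol=0.005):
    """Check whether y is close to a*x + b for some slope a > 0."""
    n = len(x)
    if n != len(y) or n < 2:
        return False
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0:
        return False  # a constant x has no shape to compare (assumption)
    a = sxy / sxx     # least-squares slope
    if a <= 0:
        return False  # similar only in the positive direction
    b = my - a * mx
    scale = max(abs(yi) for yi in y) or 1.0
    return all(abs(a * xi + b - yi) <= tol * scale for xi, yi in zip(x, y))

def similar_windows(u, v, length=3, tol=0.005):
    """List (i, j) start indices of similar windows of the given length."""
    return [
        (i, j)
        for i in range(len(u) - length + 1)
        for j in range(len(v) - length + 1)
        if linearly_positively_similar(u[i:i + length], v[j:j + length], tol)
    ]

print(similar_windows([1, 2, 3, 7], [0, 5, 7, 9, 2]))
```

With windows of length 2 the fit is always exact, so only the sign of the slope would discriminate, which matches the remark above about 2-value sequences.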

The analysis and data aggregation were performed using the R statistics suite⁴ with the “wavelets”, “zoo”, and “chron” packages. Package “wavelets” includes discrete wavelet transform methods, package “zoo” includes time series aggregation methods, and package “chron” extends support for date and time manipulations.

This method allows identification of patterns that could

present themselves in different levels of detail in respect to

cumulative LOC churn and date. However, there might be

metrics that are not covered in this study showing similar

evolution patterns between projects. Thus, the patterns

identified in this study cannot be considered a complete listing

of reoccurring patterns shared between projects.

V. DISCUSSION

The analysis of similar and very common patterns identified 58 patterns and sub-patterns that occurred more than once in at least 14 projects. No pattern was identified as common to all projects. Two patterns of steady increase of cumulative LOC added, when plotted with respect to cumulative code churn, were each found in 18 projects (different sets of projects), making them the most universal patterns found in the study. These patterns occurred on average 4 times in a project. These patterns were: an approximately 2.1% increase in cumulative LOC added for three consecutive periods (P1), and an approximately 4.4% increase followed by a 3.8% increase and a 3.4% increase in cumulative LOC added (P2). These can be summed up as a stable growth pattern and a decreasing growth speed pattern.

An important aspect of these patterns is that they occurred at different scales. That is, both patterns P1 and P2 contained themselves. More specifically, P1 occurred up to twice at a lower level before occurring again at a higher level. Pattern P2 did not display such a cyclical pattern, as each scaling level increased the frequency 2-7 times.

The most common patterns in relation to calendar time were found to relate to cumulative LOC added (P3: an increase of about 0.4% followed by two periods of no change) and cumulative LOC churned (P4: a very small code churn increase slowing down in the following periods). Both of these patterns were present in 16 projects and occurred on average 7.5 times in each project. Pattern P4 showed cyclical behaviour, as its occurrences were about twice as frequent at a lower level as at a higher level, while pattern P3 did not show cyclical behaviour.

⁴ http://www.r-project.org/

When measuring project age using pattern P4, we notice that all dead projects that contain this pattern die at an age of around 0:2:0, where age is noted as O1:O2:O3: O1 is the number of pattern occurrences in reverse scaling level 4 (the time series is divided into 2^4 sections), O2 is the number of occurrences on a lower scaling level (2 more detailed level than O1) since the previous O1 occurrence, and O3 is the number of occurrences on yet another lower level. An odd dead project is “bizdev”, which does not have O2 occurrences at all, dying at 0:0:4, which is close to 0:2:0 due to the average frequency multiplier of 2 between subsequent scaling levels. “VirtualDubMod15” is another deviation: it has a total of 5 O3 occurrences and dies at the old age of 0:3:0.

Alive projects fall into two categories: projects that have not reached 0:2:0 (most of the alive projects fall into this category) and projects that have at least one O1 occurrence (“wsf” and “feedparser-read-only” are examples of this category). The project closest to reaching 0:2:0 is “commons”, which stood at 0:1:1 at the time of data collection. A similar observation on a different subset of projects can be made using pattern P1.
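The ages quoted above can be compared by converting them to a common unit. The sketch below expresses an O1:O2:O3 age in O3-level occurrences, assuming that the average frequency multiplier of 2 holds between each pair of adjacent levels; the paper states the multiplier only as an average, and the function name is hypothetical.

```python
def age_in_finest_units(age, multiplier=2):
    """Convert an 'O1:O2:O3' age into equivalent O3-level occurrences."""
    o1, o2, o3 = (int(part) for part in age.split(":"))
    return (o1 * multiplier + o2) * multiplier + o3

# Ages mentioned in the text above.
for project, age in [("typical dead project", "0:2:0"), ("bizdev", "0:0:4"),
                     ("VirtualDubMod15", "0:3:0"), ("commons", "0:1:1")]:
    print(project, age, age_in_finest_units(age))
```

Under this assumption 0:0:4 and 0:2:0 both correspond to 4 units, which is why “bizdev” is described as dying close to 0:2:0, while “commons” at 0:1:1 corresponds to 3 units.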

There exist complementary patterns to P3 and P4 that

cover projects not covered by P3 and P4 respectively. P4’s

complementary pattern is P5: a 1.4% increase in cumulative LOC

churn followed by a 1.1% increase and a 0.6% increase.

Interestingly, this pattern does not have cyclical properties. On

the other hand, P3’s complementary pattern P6 of diminishing

increase in cumulative LOC added does repeat itself in lower

scaling levels. Despite the similarities, the project sets sharing

patterns P3-P6 are all different.
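To make the shape of patterns such as P3 and P5 concrete, the following sketch searches a cumulative series for a run of period-to-period percentage increases. It works on plain relative differences rather than on wavelet coefficients, so it is a simplified stand-in for the analysis used in the study. The tolerance value and the function names are hypothetical.

```python
from typing import List, Sequence

# Percentage increases per period, as described in the text.
P3 = (0.4, 0.0, 0.0)   # cumulative LOC added
P5 = (1.4, 1.1, 0.6)   # cumulative LOC churn


def relative_increases(series: Sequence[float]) -> List[float]:
    """Percentage increase between each pair of consecutive periods."""
    increases = []
    for previous, current in zip(series, series[1:]):
        if previous == 0:
            # Growth from zero has no finite percentage; never matches.
            increases.append(0.0 if current == 0 else float("inf"))
        else:
            increases.append((current - previous) / previous * 100.0)
    return increases


def find_pattern(series: Sequence[float],
                 pattern: Sequence[float],
                 tolerance: float = 0.2) -> List[int]:
    """Return start indices where the increases match `pattern`.

    `tolerance` is an assumed matching threshold in percentage points.
    """
    increases = relative_increases(series)
    width = len(pattern)
    return [
        start
        for start in range(len(increases) - width + 1)
        if all(abs(increases[start + k] - pattern[k]) <= tolerance
               for k in range(width))
    ]
```

For example, `find_pattern([1000, 1004, 1004, 1004], P3)` reports a match starting at the first period, since the series grows by 0.4% and then stays unchanged for two periods.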

The reoccurring and cyclical patterns are similar to ancient

calendars in that the cycle length is not fixed (it is approximate).

Instead, the cycle end is determined by a pattern occurring at a

higher scale (in calendars, it is determined by the cycle of

another celestial body).

Other reoccurring patterns were also found in project size in

LOC (non-cyclical) and number of files (non-cyclical) in

relation to calendar date, and in cumulative number of developers

(cycle time around 1.5 occurrences) and number of files (non-

cyclical) in relation to cumulative code churn. When patterns

common to fewer projects were allowed, more than 250

additional patterns satisfied the condition of occurring at least 4

times on average in a project. These patterns were identified in

the cumulative number of developers, cumulative churn and its

components, LOC, and number of files in relation to both time

series.
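The selection condition described above (a pattern must be shared by enough projects and occur at least 4 times on average in a project) can be sketched as a simple filter. The input layout, the function name, and the choice to average only over projects that contain the pattern are assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, Tuple


def common_patterns(occurrences: Dict[str, Dict[str, int]],
                    min_projects: int,
                    min_mean_occurrences: float = 4.0
                    ) -> Dict[str, Tuple[int, float]]:
    """Select patterns that are common enough across projects.

    `occurrences` maps pattern name -> {project name -> occurrence count}.
    Returns pattern name -> (number of projects, mean occurrences), where
    the mean is taken over projects that contain the pattern (assumption).
    """
    selected = {}
    for pattern, per_project in occurrences.items():
        counts = [count for count in per_project.values() if count > 0]
        if (counts
                and len(counts) >= min_projects
                and mean(counts) >= min_mean_occurrences):
            selected[pattern] = (len(counts), mean(counts))
    return selected


# Hypothetical counts, for illustration only.
example = {
    "PA": {"project-a": 8, "project-b": 7},
    "PB": {"project-a": 2, "project-b": 3},
    "PC": {"project-a": 10},
}
result = common_patterns(example, min_projects=2)
```

Lowering `min_projects` corresponds to allowing patterns common to fewer projects, which is the relaxation that admits the additional patterns mentioned in the text.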

VI. CONCLUSIONS

We have demonstrated that open source software projects

contain reoccurring and similar evolution patterns (RQ1). We

also confirm that there is no universal reoccurring evolution

pattern – instead, there are several reoccurring patterns that

show similarities between different projects. Thus, if one is to

utilise these similarities to adjust different projects into a

common scale, one needs to start by identifying similarity

patterns in these projects.


The second result of the study is the confirmation that open

source software projects evolve in a cyclical pattern (RQ2). That

is, evolution patterns at a low scale are periodically repeated at a

larger scale. This can be useful in understanding the seemingly

missing deadlines of the projects.

The cyclical patterns turn out to be suitable as a common

scale for the projects in the dataset. This is shown by the

uniformity in the calculated ages of the projects at the time of

their death. This uniformity was achieved using unadjusted

parameters for common pattern identification; thus, a further

study could introduce even better and more uniform patterns by

identifying better commonality criteria for the patterns.

The path of using pattern matching to create common

baselines for comparing software projects can be extended by

looking at interactions between the evolution patterns in the

projects. That is, we might find that some features have regular

repeating patterns within a repeat cycle of another feature.

Another type of interaction that we might be interested in is the

classification of pattern occurrence events by another feature

(for example, distinguishing churn increase patterns according

to whether they coincide with increase in LOC). Such studies

have the potential of uncovering new estimation and planning

models for software development. That is, we could identify

when the projects reach critical stages depending on

administrative and architectural choices like openness to

developers or use of agile development processes.

A path we are pursuing by introducing wavelet analysis to

software evolution analysis services is validation and

verification of such patterns in industrial settings. Whilst this

approach might not identify project end correctly, it does offer

a secret-preserving means to interact with and process

proprietary source code in order to improve or extend our

models with industry experience.

ACKNOWLEDGMENT

This research is partly funded by ERDF via the Estonian

Centre of Excellence in Computer Science.

REFERENCES

[1] A. Deshpande and D. Riehle, “The Total Growth of Open Source,” in

Open Source Development, Communities and Quality, 2008.

[2] U. Raja and M. Tretter, “Defining and Evaluating a Measure of Open

Source Project Survivability,” IEEE Transactions on Software

Engineering, vol. 38, no. 1, pp. 163-174, 2012.

[3] J. M. Beaver, X. Cui, J. L. St Charles and T. E. Potok, “Modeling success

in FLOSS project groups,” in Proceedings of the 5th International Conference on Predictor Models in Software Engineering

(PROMISE09), Vancouver, British Columbia, Canada, 2009.

[4] C. Conley and L. Sproull, “Easier Said than Done: An Empirical Investigation of Software Design and Quality in Open Source Software

Development,” in 42nd Hawaii International Conference on System

Sciences (HICSS '09), 2009.

[5] A. H. Ghapanchi, A. Aurum and G. Low, “A taxonomy for measuring

the success of open source software projects,” First Monday, vol. 16, no. 8, 2011.

[6] E. S. Raymond, The Cathedral and the Bazaar, vol. 3.0, Thyrsus Enterprises, 2000.

[7] J. Wesselius, “The Bazaar inside the Cathedral: Business Models for Internal Markets,” IEEE Software, vol. 25, no. 3, pp. 60-66, 2008.

[8] S. Black, P. Boca, J. Bowen, J. Gorman and M. Hinchey, “Formal Versus Agile: Survival of the Fittest,” Computer, vol. 42, no. 9, pp. 37-45, 2009.

[9] K. Crowston, K. Wei, J. Howison and A. Wiggins, “Free/Libre open-source software development: What we know and what we do not

know,” ACM Comput. Surv., vol. 44, no. 2, p. 35, 2008.

[10] J. Bitzer, W. Schrettl and P. J. Schröder, “Intrinsic motivation in open source software development,” Journal of Comparative Economics,

vol. 35, no. 1, pp. 160-169, 2007.

[11] P. V. Singh, “The small-world effect: The influence of macro-level

properties of developer collaboration networks on open-source project

success,” ACM Trans. Softw. Eng. Methodol., vol. 20, no. 2, p. 27, August 2010.

[12] K. Crowston, J. Howison and H. Annabi, “Information systems success

in free and open source software development: theory and measures,” Software Process: Improvement and Practice, vol. 11, no. 2, pp. 123-148, 2006.

[13] S.-Y. T. Lee, H.-W. Kim and S. Gupta, “Measuring open source software

success,” Omega, vol. 37, no. 2, pp. 426-438, 2009.

[14] C. Subramaniam, R. Sen and M. L. Nelson, “Determinants of open source software project success: A longitudinal study,” Decision Support

Systems, vol. 46, no. 2, pp. 576-585, 2009.

[15] D. Ahern, A. Clouse and R. Turner, CMMI® Distilled: A Practical

Introduction to Integrated Process Improvement, Third ed., Addison-Wesley

Professional, 2008.

[16] E. Petrinja, R. Nambakam and A. Sillitti, “Introducing the OpenSource

Maturity Model,” in Proceedings of the 2009 ICSE Workshop on Emerging Trends in Free/Libre/Open Source Software Research and

Development, 2009.

[17] R. S. Stanković and B. J. Falkowski, “The Haar wavelet transform: its status and achievements,” Computers & Electrical Engineering, vol. 29,

no. 1, pp. 25-44, 2003.

[18] F. In and S. Kim, “The Hedge Ratio and the Empirical Relationship

between the Stock and Futures Markets: A New Approach Using

Wavelet Analysis,” The Journal of Business, vol. 79, no. 2, pp. 799-820, 2006.

[19] A. Rua and L. C. Nunes, “International comovement of stock market

returns: A wavelet analysis,” Journal of Empirical Finance, vol. 16, no. 4, pp. 632-639, 2009.

[20] J. Yang and P. Lin, “Dynamic risk measurement of futures based on wavelet theory,” in Seventh International Conference on Computational

Intelligence and Security (CIS), 2011.

[21] S. Khan and A. Kulkarni, “Reduced Time Complexity for Detection of Copy-Move Forgery Using Discrete Wavelet Transform,” International

Journal of Computer Applications, vol. 6, no. 7, pp. 31-36, September

2010.

[22] Y. Wang, K. Gurule, J. Wise and J. Zheng, “Wavelet Based Region

Duplication Forgery Detection,” in Ninth International Conference on Information Technology: New Generations (ITNG), 2012.

[23] J. Munson and S. Elbaum, “Code churn: a measure for estimating the

impact of code change,” in Proceedings of the International Conference on Software Maintenance, 1998.

[24] S. Karus and M. Dumas, “Code Churn Estimation Using Organisational and Code Metrics: An Experimental Comparison,” Information and

Software Technology, vol. 54, no. 2, pp. 203-211, February 2012.

[25] S. Karus and H. Gall, “A study of language usage evolution in open

source software,” in Proceedings of the 8th International Working

Conference, Honolulu, HI, USA, 2011.
