
Proceedings

The First International Workshop on Software Productivity Analysis and

Cost Estimation (SPACE'07)

Sponsored by

Information Processing Society of Japan (IPSJ), Special Interest Group on Software Engineering (SIG-SE)

In cooperation with

The Institute of Electronics, Information and Communication Engineers (IEICE), Special Interest Group on Software Science (SIG-SS)

The Institute of Electronics, Information and Communication Engineers (IEICE), Special Interest Group on Knowledge-based Software Engineering (SIG-KBSE)

Japan Society for Software Science and Technology (JSSST), FOSE


SPACE: Software Productivity Analysis and Cost Estimation

The 1st International Workshop, SPACE 2007

4 December 2007

Nagoya, Japan

APSEC 2007 Workshop Proceedings

Editor:

Jacky Keung (NICTA, Australia)



Conference Chairs

General Chair

Dr. Jacky Keung

Empirical Software Engineering

National ICT Australia Ltd. (NICTA)

Sydney, Australia

Research Program Chair

Prof. Ross Jeffery

Empirical Software Engineering

National ICT Australia Ltd. (NICTA)

Sydney, Australia

Publicity Chair

Assoc. Prof. Makoto Nonaka

Faculty of Business Administration

Toyo University

Tokyo, Japan


Program Committee

Prof. Barbara Kitchenham Keele University, UK

Dr. Mahmood Niazi Keele University, UK

Prof. Martin Shepperd Brunel University, UK

Dr. Sarah Beecham Hertfordshire University, UK

Prof. Ross Jeffery NICTA, Australia

Dr. Jacky Keung NICTA, Australia

Dr. Liam O’Brien NICTA, Australia

Prof. Magne Jørgensen SIMULA Lab, Norway

Dr. Jürgen Münch IESE, Germany

Dr. JingYue Li NTNU, Norway

Prof. Stephen MacDonell AUT, New Zealand

Dr. Emilia Mendes Auckland University, New Zealand

Ms. Carol Dekkers QualityPlusTech, USA

Prof. Qing Wang ISCAS, China

Prof. Hajimu Iida NAIST, Japan

Prof. Hironori Washizaki NII, Japan

Dr. Makoto Nonaka Toyo University, Japan

Dr. Naoki Ohsugi NTT Data, Japan


Preface

Welcome to the First International Workshop on Software Productivity Analysis and Cost Estimation

(SPACE 2007), held in conjunction with the IEEE Asia-Pacific Software Engineering Conference

(APSEC 2007), Nagoya, Japan.

Software project managers require reliable methods for estimating software project costs, and assessing

software development productivity. For over 25 years, considerable research effort has been directed towards software cost estimation and software productivity analysis, with various algorithmic approaches developed and their performance reported in the research literature. Nevertheless, cost estimation and productivity analysis remain a complex problem in the software industry.

The goal of the workshop on software productivity analysis and cost estimation, SPACE 2007, is to bring

together practitioners and researchers for discussion and presentation on the emerging aspects

pertaining to software cost estimation, productivity analysis, prediction models and techniques, and

lessons learned. The workshop provides a leading forum to present new ideas and to explore future

directions in these areas for software engineering and software project management.

I would like to take this opportunity to thank the following, who have all made significant

contributions to the success of SPACE 2007:

• The APSEC 2007 organizing committee

• The SPACE 2007 organizing committee

• Members of the program committee

• The keynote speaker Professor Magne Jørgensen

And finally, thanks to all the participants in SPACE 2007, especially those of you from overseas. We hope you find the workshop intellectually stimulating and that you also enjoy some of the many attractions that Nagoya has to offer.


Dr. Jacky Keung

General Chair, SPACE 2007
NICTA (Sydney, Australia)


Biography

Dr. Jacky Keung is a Postdoctoral Researcher in the Empirical Software Engineering Research Group at National ICT Australia (NICTA), based in Sydney. NICTA is Australia's centre of excellence for Information and Communications Technology R&D. He also holds an academic fellow position in the School of Computer Science and Engineering at the University of New South Wales, Sydney, Australia. He completed his B.S. (Hons) in Computer Science at the University of Sydney, and received his Ph.D. from the University of New South Wales for his research into statistical methods for software cost estimation. Jacky works for NICTA in a range of technical roles, including consulting in software measurement and cost estimation for a number of software engineering organizations in Australia and Japan. His current research interests are in software measurement and its application to project management, cost estimation, quality control and risk management, as well as software process improvement. He is a member of the Australian Computer Society and a member of the IEEE Computer Society.


Table of Contents

✤ Keynote

When to Use Estimation Models and When to Use Expert Judgment? .................................... 1
Professor Magne Jørgensen (SIMULA Research Lab, Norway)

✤ Accepted Research Papers

1. Measuring Productivity Using the Infamous Lines of Code Metric ........................................ 3

Benedikt Mas y Parareda and Markus Pizka

2. Fair Software Value Quantification by Productivity Data Analysis ......................................... 11
Naoki Ohsugi, Nobuto Inoguchi, Hideyuki Yamamoto, Seisuke Shimizu, Noboru Hattori, Jun Yoshino, Takeshi Hayama and Tsuyoshi Kitani

3. A Critique of How We Measure and Interpret the Accuracy of Software Development Effort Estimation ............................................................................................................................. 15

Magne Jørgensen

4. A SemiQ Model of Test-and-Fix Process of Incremental Development ................................23

He Zhang, Barbara Kitchenham and Ross Jeffery

5. Issues of Implementing the Analogy-based Effort Estimation in a Large IT Company ..........31

Jingyue Li and Reidar Conradi

6. Profitability Estimation of Software Projects: A Combined Framework ................................ 37
Stefan Wagner, Songmin Xie, Matthias Rubel-Otterbach and Burkhard Sell

7. Utilizing Functional Size Measurement Methods for Embedded Systems ........................... 45
Ali Nazima Ergun and Cigdem Gencel

8. Evaluation of Ensemble Learning Methods for Fault-Prone Module Prediction .................... 53
Sousuke Amasaki


Keynote

When to Use Estimation Models and When to Use Expert Judgment?

Professor Magne Jørgensen, Simula Research Laboratory, Oslo, Norway

Formal software development effort estimation models have been around for more than 40 years. They

are the subject of more than one thousand research studies and experience reports. They are described

and promoted in many software engineering textbooks and guidelines. They are supported by user

friendly tools and advisory services from consultancy companies. In spite of this massive effort and

promotion, formal estimation models are not in much use by the software industry. Judgment-based

software development effort estimation ("expert estimation") has not been much studied, is not much promoted and typically has no supporting tools, but is nevertheless the preferred estimation approach of

the majority of software companies. Is this a situation where software professionals are unwilling to use

models likely to increase estimation accuracy, i.e., irrational behavior, or is expert estimation just as good

as, or maybe even better than, formal effort estimation models? In this keynote, I will address these

questions through a summary of empirical evidence from software development and other disciplines. I

will outline the situations where models and where expert judgments are likely to provide the most

accurate effort estimates. Based on the available evidence, I will argue that much more of future

estimation process improvement and research initiatives should aim at better expert estimation processes, and not so much at improved formal models.

Biography

Professor Magne Jørgensen received the Diplom Ingenieur degree in Wirtschaftswissenschaften from the University of Karlsruhe, Germany, in 1988 and the Dr. Scient. degree in informatics from the University of Oslo, Norway, in 1994. He has about 10 years of industry experience as a software developer, project leader and manager. He is now a professor of software engineering at the University of Oslo and a member of the software engineering research group of Simula Research Laboratory in Oslo, Norway, with a research focus on software cost estimation. Magne Jørgensen has supported software project estimation improvement work and has been responsible for estimation courses in several software companies.


Research Papers



Measuring Productivity Using the Infamous Lines of Code Metric

Benedikt Mas y Parareda and Markus Pizka
itestra GmbH

Ludwigstrasse 35, 86916 Kaufering, Germany
[email protected]

Abstract

Nowadays, software must be developed at an ever-increasing rate and, at the same time, a low defect count has to be accomplished. To improve in both aspects, an objective and fair benchmark for the productivity of software development projects is needed.

Lines of Code was one of the first widely used metrics for the size of software systems and the productivity of programmers. Due to inherent shortcomings, a naive measurement of Lines of Code does not yield satisfying results. However, by combining Lines of Code with knowledge about the redundancy contained in every software system and regarding total project costs, the metric becomes viable and powerful.

The metric "Redundancy-free Source Lines of Code per Effort" is very hard to fake as well as objective and easy to measure. In combination with a second metric, the "Defects per Source Lines of Code", a fair benchmark for the productivity of software development teams is available.

1. The need to measure

The ability to produce innovative software at a high rate is of utmost importance for software development companies in order to persist in a competitive and fast moving market. At the same time, with the increasing dependence of business processes on software, the ability to deliver high-quality software becomes crucial for economic success [9].

We define productivity as the ratio of the size of the output versus the consumed input, i.e. the effort required to produce one unit of output. Applied on a coarse level to an entire software development project, this definition retrospectively describes the performance of the development effort.

Achieving high productivity is not easy and maintaining it requires constant attention. The economic incentive for improvement is enormous, as advancing productivity not only increases the profit margin of individual projects, but also makes it possible to implement more projects at the same time.

A prerequisite to manage and improve productivity is the ability to measure it and compare it against industry standards and internal benchmarks. All processes that exert influence on productivity need to be appraised in order to identify potential for improvement that can lead to optimal performance.

Factors that might impair or advance development productivity range from external influences, such as the temperature in office spaces, through the motivation of developers, to tricky technical challenges [11] [13]. However, the individual examination of all these factors is virtually impossible in commercial environments. Therefore, we are interested in a productivity metric that captures their combined effects in a single assessment.

Counting Lines of Code is one of the oldest and most widely used software metrics [13] [7] to assess the size of a software system. It has been argued repeatedly that this metric does not adequately capture the complexity of software systems or the development process. Hence, it is considered wrong to rely on Lines of Code for appraising the productivity of developers or the complexity of a development project [5] [11].

We will show that by excluding redundant parts of code from the Lines of Code count and combining the result with the total effort needed and the defect rate in the outcome, a highly objective and efficiently measurable productivity benchmark for software development projects is obtained.

2 Related work

We are well aware of the controversy that our suggestion to use some kind of Lines of Code as a foundation for measuring productivity might cause. Advantages and disadvantages of Lines of Code have been discussed in great detail by various authors [13] [7] [11]; a summary of this discussion can be found on the Internet [4]. Some authors even went as far as comparing Lines of Code to a weight-based metric that is based on the paper printout of a system [5].

Capers Jones explains in detail how the measurement of Lines of Code might result in apparently erratic productivity rates (see Section 5.1.2). Therefore, instead of using Lines of Code, he suggests using function point based metrics [11].

To our knowledge, no attempt has been made yet to combine redundancy metrics [6] with Lines of Code to assemble a more suitable definition of the relevant size of a software product.

3 Basics of the productivity metric

A desirable property of a software productivity metric is the ability to compare a large variety of different systems. Furthermore, the computation of the metric must incur only low cost. This leads to four key requirements:

• The metric should be applicable to many different programming languages.

• The effort (cost and time) to perform a measurement should be low.

• The metric has to be objective and repeatable. This means that an assessment of the same system executed by different individuals has to produce identical results.

• Errors in measurement should be ruled out.

This indicates the use of metrics that can, at least to a significant extent, be assessed automatically by tools. We suggest combining the following key performance indicators into a productivity metric:

Redundancy: The redundancy of a software system describes the amount of code that is dispensable, i.e. the parts of code that are semantically duplicated or simply unused.

Source Lines of Code: A count of relevant Source Lines of Code (SLOC) of a system, using a standard definition. Various extensive and precise definitions of Lines of Code are available [8] [12] [14].

RFSLOC: By ignoring redundant parts of a system in the SLOC count, we obtain the number of Redundancy-free Source Lines of Code (RFSLOC).

Defect count: Defects include software bugs as well as failures to implement required functionality.

Man-days: The total count of the man-days (MD) a software development took or takes from setup until delivery.

To achieve fair results, these performance indicators have to be examined after completion of a software development project. They are combined into two metrics.

The composed Redundancy-free Source Lines of Code per Man-day metric,

RFSLOC/MD

represents a simple yet viable measure for the productivity of a software project. The second metric, the Defects per Redundancy-free Source Lines of Code,

defects/RFSLOC

is an important indicator for the quality of the outcome of the project. Obviously, when judging productivity using RFSLOC/MD, the defect rate has to be considered, too.
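As a minimal illustration of how the two composed metrics would be computed once the base indicators are available at project completion, the sketch below uses hypothetical figures and helper names of our own choosing; the paper itself only defines the two ratios.

```python
# A minimal sketch of the two composed metrics defined above. The function
# names and the example figures are hypothetical; the paper only defines the
# two ratios RFSLOC/MD and defects/RFSLOC.
def productivity(rfsloc: int, man_days: float) -> float:
    """Redundancy-free Source Lines of Code per Man-day (RFSLOC/MD)."""
    return rfsloc / man_days

def defect_density(defects: int, rfsloc: int) -> float:
    """Defects per Redundancy-free Source Line of Code (defects/RFSLOC)."""
    return defects / rfsloc

# Hypothetical completed project: 48,000 RFSLOC, 600 man-days, 120 defects.
print(productivity(48_000, 600))    # 80.0 RFSLOC per man-day
print(defect_density(120, 48_000))  # 0.0025 defects per RFSLOC
```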

4 Performance indicators

In order to ensure fairness and comparability, the four base indicators used in the composed productivity and quality metrics have to be explained in greater depth.

4.1 Redundancy-free Source Lines of Code

The precise definition of Source Lines of Code is largely irrelevant to the productivity metric, as long as the same definition is reused for the projects that are compared. However, the Source Lines of Code count should not include the following classes of code:

• Re-used code (such as external libraries)

• Test code; test code is not part of the final product

• Redundant code (see 4.1.1)

• Dead code, i. e. unreachable code

• Generated code (see below)

A standard code formatting has to be applied to the source code before the counting takes place.

As the intricate differences between programming languages make a uniform formatting for all languages impossible, the definition has to be extended by language-specific code formatting rules. If need be, normative factors can be defined to diminish differences in formatting between programming languages.

Particular attention has to be paid to generated code. Generating code is usually much faster than writing code. Hence, if this type of code were included in the metric, the productivity of projects using code generation would be elevated artificially. To prevent this, we omit generated code. Instead, we include the size of the configuration and the input to the code generator, thereby regarding all human achievements as part of productivity.


One advantage of regarding a variation of Lines of Code is that the assessment of the metric can be performed at very little cost. For virtually any programming language, counting Lines of Code is easy. David A. Wheeler's script SLOCCount [3] alone is capable of counting source lines of code for approximately 25 different programming languages.
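For illustration, a highly simplified counter in the spirit of such tools might look as follows; the comment handling, file selection and language table are naive assumptions of ours, and the exclusions from Section 4.1 (reused, test, redundant, dead and generated code) would still have to be applied separately.

```python
# A simplified physical-SLOC counter in the spirit of tools such as SLOCCount:
# it counts non-blank lines that are not pure line comments. The comment
# prefixes and the lack of block-comment handling are deliberate simplifications.
from pathlib import Path

LINE_COMMENT = {".py": "#", ".java": "//", ".c": "//", ".cpp": "//"}

def count_sloc(path: str) -> int:
    p = Path(path)
    prefix = LINE_COMMENT.get(p.suffix)
    sloc = 0
    for line in p.read_text(errors="ignore").splitlines():
        stripped = line.strip()
        if not stripped:
            continue                      # blank line
        if prefix and stripped.startswith(prefix):
            continue                      # pure line comment
        sloc += 1
    return sloc

# Example usage: total SLOC over a source tree (the path is hypothetical).
total = sum(count_sloc(str(f)) for f in Path("src").rglob("*.py"))
print(total)
```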

4.1.1 Redundancy

By copying code instead of re-using it, the volume of a system increases significantly while the functionality is hardly extended. Moreover, the changeability of the software system suffers.

In order to take this effect, which could have a strong impact on the productivity metric, into account, we always rely on Redundancy-free Source Lines of Code: lines that do not contribute to the functionality of the system are excluded. As a consequence, it becomes very difficult to fake the productivity metric; the most effective way of faking productivity measures, copying code lines, is ruled out.

Redundancy is defined on two layers:

• Two code snippets are syntactically redundant if they are syntactically similar with respect to some definition of edit distance. Such textual copies are most frequently produced if developers fail to recognize a possibility for re-use or are unable to implement the necessary abstractions and instead use copy and paste.

• The definition of redundancy can be extended to semantic redundancy: two pieces of code that implement similar behavior with respect to a given narrow definition of similarity are considered semantically redundant.

The vast amount of redundancy contained in a software system can be detected, or at least estimated, with specialized clone analysis tools such as ConQAT [2]. To ensure fairness, all assessments have to be carried out using a standard tool configuration.
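As a rough sketch of the syntactic layer of this definition, duplicated windows of normalized lines can be flagged as follows; the window size and the normalization rule are arbitrary illustration choices of ours and fall far short of a real clone analysis tool such as ConQAT.

```python
# A toy sketch of syntactic redundancy detection: flag lines that are part of
# a repeated window of normalized consecutive lines. Note that the flagged set
# contains every copy of a clone; a real RFSLOC count would keep one instance.
from collections import defaultdict

def redundant_lines(lines: list[str], window: int = 5) -> set[int]:
    """Return indices of lines that are part of at least one repeated window."""
    normalized = [" ".join(l.split()) for l in lines]   # collapse whitespace
    seen = defaultdict(list)
    for i in range(len(normalized) - window + 1):
        seen[tuple(normalized[i:i + window])].append(i)
    flagged = set()
    for key, starts in seen.items():
        if len(starts) > 1 and any(key):                # repeated, non-empty window
            for start in starts:
                flagged.update(range(start, start + window))
    return flagged

# Example usage on a single (hypothetical) source file.
code = open("example.py").read().splitlines()
print(len(code) - len(redundant_lines(code)))
```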

4.2 Defect count

It is generally accepted that producing error-free code is considerably more expensive than implementing "quick-and-dirty" solutions that possibly contain a substantial amount of bugs. To ensure fairness of the metric, only projects that display a similar defect rate can be compared with each other.

4.3 Man-days

The count of man-days includes all effort put into completion of the project. This includes activities such as administrative tasks, requirements analysis, implementation or testing. Overtime and unpaid work should also be considered.

The inclusion of activities that are not directly related to the implementation means that the efficiency of, e.g., project management will be reflected in the result. This behavior of the proposed productivity metric is absolutely desirable, as all tasks performed within the development project contribute to its (economic) success or failure.

5 Justification

"Measuring programming progress by Lines of Code is like measuring aircraft building progress by weight." (Bill Gates [1])

For various reasons, Lines of Code are regarded as delivering wrong and misleading results for measuring the productivity of developers. If applied to the comparison of systems written in different programming languages, the metric is even considered to be "professional malpractice" [11].

We will now justify how the proposed metric compensates for many of the shortcomings of Lines of Code. While certain limitations to its applicability will remain, we believe that these are of limited importance for practical application in commercial environments.

5.1 Compensating LOC disadvantages

The following sections summarize the most commonly cited disadvantages of Lines of Code-based metrics from various sources [4] [11] [5]. The first paragraph of each section describes the shortcoming; the paragraphs below it explain our counter-arguments.

5.1.1 Lack of accountability

The implementation phase of a software project makes up only one third of the overall effort of a software development project. Besides, the implementation is only one of many results. Hence, measuring the productivity of a project by Lines of Code means ignoring the bigger part of the effort.

The development of any non-trivial software system certainly requires a complicated process that creates several additional results, such as a user manual and design documentation. While these artefacts are valuable in their own right, the main outcome of the project is still executable code, i.e. the implementation. Additionally, the investment in, e.g., requirements analysis and system design is usually made because these activities are necessary for a successful completion of the project in the first place and because they increase overall productivity and software quality:


• An increased effort in system design will reduce the effort required for implementation.

• A well-designed and engineered system will be implemented with fewer defects per RFSLOC. As the fixing of defects is costly, producing fewer defects initially will increase the overall productivity rate of a software development project [11].

Therefore, assessing the productivity of a project by the resulting system itself already includes intermediate results apart from the implementation. In order to incorporate the complexity of all contributing activities, it is even essential to include the efforts of all activities and not only the effort put into the implementation phase.

The volume of supplemental results that were possibly required by the customer of the project is not included in the assessment, as our definition of productivity focuses on the development of the software itself, without possibly additionally requested byproducts.

However, a requirement to produce extensive documentation, for example, justifies some reduction in productivity.

5.1.2 Lack of correlation with functionality

Different programming languages differ in their verbosity, the statements offered by the language and, most importantly, the functionality expressed by a fixed number of lines.

If two systems written in different programming languages exhibit identical functionality, they will still be expressed in a different number of Lines of Code. Due to the diseconomy of scale, this might lead to confusing results. Consider the following example [11]:

A program written in Assembler requires 1,000,000 SLOC to implement and implementation takes 10,000 days. Implementing the same program in C takes 6,250 days and requires 500,000 SLOC. The Assembler version of the program is obviously the economically worst option, as the total cost is higher. However, if the metric RFSLOC/MD is calculated, the Assembler version will yield a ratio of 100 while the C version will only yield 80 SLOC per man-day.
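For reference, the ratios in the example follow directly from the definition of the metric: $1{,}000{,}000\ \mathrm{SLOC} / 10{,}000\ \mathrm{MD} = 100$ SLOC per man-day for the Assembler version, and $500{,}000\ \mathrm{SLOC} / 6{,}250\ \mathrm{MD} = 80$ SLOC per man-day for the C version.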

In contrast to Jones, we argue that the example supports the correctness of the proposed metric and that the result is exactly what is expected from a productivity metric.

It is generally accepted that project size and effort do not scale equally; big projects require a disproportionate effort due to increased complexity and are less likely to conclude successfully than smaller projects [10]. Hence, being double in size, the Assembler project can be expected to require a disproportionately higher effort than the C project.

Consequently, if the bigger project is able to produce more redundancy-free source lines of code per man-day than the smaller project and achieves a similar defect count (a prerequisite for the application of the productivity metric), the productivity of this project has to be rated higher. This holds true even if one considers that the total cost of that project is higher.

5.1.3 Lack of correlation with effort

For various implementation tasks, effort and lines of code do not correlate well. Activities such as bug fixing usually require great effort, but do not add a significant number of lines. Hence, while these tasks are necessary for the completion of functionality, their valuable contribution is not reflected in the metric. This shows that the metric occasionally delivers incorrect results.

The reason for this failure of the metric is the restriction of the assessment to a particular activity or time frame within a project. The assessment will only be fair if performed on the total cost of a completed project.

Depending on the development process, different activities are required in different phases of the project. As each activity influences the lines of code count differently, it is impossible to assess productivity at an arbitrary point in time or to compare the productivity of individual activities. On completion of the project, the individual tasks no longer carry weight, as they are subsumed in the total effort of the project, and the metric will be correct, showing a correlation with effort.

5.1.4 Code verbosity

Skilled development teams are able to develop the same functionality with less code than less skilled development teams. In a metric based on lines of code, the more skilled development teams will come off worse than less skilled development teams.

While we agree that a theoretical possibility for such misleading results exists, we believe that they are very unlikely to occur in real-world development projects, because:

• Verbose code becomes redundant quickly. Hence, large parts of overly verbose code usually do not affect the RFSLOC count.

• Skilled development teams will produce code that contains fewer bugs than unskilled development teams and will spend less time on redoing and undoing work.

• Skilled development teams will be more likely to fulfill the requirements, resulting in a lower defect count, and will receive fewer productivity penalties for defects.

• Concise code is easier to understand than verbose code. Therefore, expensive activities such as fixing (inevitable) bugs will be cheaper in the less verbose code base. This will affect productivity during development, making it unlikely that less skilled teams achieve the same productivity rate if it is measured on the total cost of a project.

5.1.5 Generated code

As generated code is excluded from the Lines of Code count, the metric might display undesirable behaviour: if project A uses code generation for developing a certain functionality and project B is developing the same functionality without using code generation, it is likely that project B will be more expensive in total and take longer. But as project A produces fewer lines of code, the projects might achieve an almost identical rate of RFSLOC/MD.

Similar to the code verbosity problem, we consider this scenario to be very unlikely in reality. If significant parts of a system are generated, an alternative manual implementation will also be of significant size. Hence, project B would be significantly larger than project A. As explained in Section 5.1.2, due to the diseconomy of scale this makes it very unlikely that project B will achieve a similar or even higher level of productivity than project A.

Additionally, it can be assumed that fixing bugs in the generated code requires less effort than in the manually written code. First, there will hardly be any bugs in the generated code with a reasonable generator. Second, reconfiguration of the code generator can be expected to be less extensive than changing the manual implementation. The time needed to detect and fix bugs in the manual implementation decreases its productivity.

5.1.6 Project complexity and comparability

In general, different projects cannot be compared to each other due to various external influences and characteristics that severely affect productivity. These include:

• The particular customer causes high management overhead.
• The implementation technology is new.
• The project domain is particularly complex.
• The project is using a certain development process.
• Volatile requirements cause frequent change.

We consider these characteristics to be typical and common challenges for the management of software development projects that have to be dealt with as part of the development process. If these factors affect productivity by increasing the total cost of a project, this is also considered part of the outcome of a project and should therefore be reflected in the metric.

5.2 Limitations

The proposed metric compensates for many of the commonly cited shortcomings of Lines of Code-based metrics. However, limitations to the comparability between projects remain.

5.2.1 Non-functional requirements

Development efforts often have to deal with non-functional requirements such as:

• Performance requirements

• Security requirements

• Availability and reliability requirements

Non-functional requirements have a considerable impact on the complexity of a development project. The development of a critical system with high performance requirements is not comparable to the development of a less demanding system, as increasing the performance of a system is expensive and laborious.

This is not reflected adequately in the RFSLOC count, and less demanding projects will generally perform better. Therefore, only the comparison of systems with similar non-functional requirements is valid.

6 Conclusion

The proposed productivity metric has several advantages over alternative assessment methods:

• The metric can be measured using tools.

• The metric is very hard to fake.

• RFSLOC are universally applicable: any programming language as of today is based upon code. Even visual tools result in code that requires compiling.

• The metric is fair, if a few limitations are observed.

• Effects of the environment, such as ineffective tools and management, are reflected in the metric.


We were able to employ the proposed metric in the assessment of the application portfolio of a large industrial partner, examining systems comprising more than 20 million lines of code in total. The results of this experience showed that productivity can be measured effectively using Lines of Code, redundancy and the defect count. Additionally, we could ascertain that possible failures of the metric (such as those described in 5.1.2, 5.1.4 and 5.1.5) have a low probability for all practical purposes.

7 Future work

The productivity metric has to be refined to become universally applicable. A major gap is the comparison of projects of different size. To allow the comparison of projects of arbitrary size, the productivity rate should be normalized according to the size of each project. However, at the moment it is unclear how this normalization could take place.

7.1 Maintenance projects

More immediately, the metric will be extended to cover the efficiency of software maintenance services. Maintenance can be conducted successfully without increasing the lines of code count while greatly advancing the functionality of a system. Obviously, the absolute count of lines of code cannot be used any more.

Key to evaluating the efficiency of maintenance could be a count of the lines that were added, removed or changed. These lines can be counted easily through clone detection: all lines that were not changed will be detected as clones in a clone assessment that compares the resulting system with itself before the maintenance activities.
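A simplified stand-in for this comparison is a plain line diff between the system before and after a maintenance activity, as sketched below; unlike the clone-based assessment proposed here, a plain diff does not tolerate moved or reformatted code, and the file paths are hypothetical.

```python
# Count lines added and removed between two versions of a source file using a
# plain line diff. This is only an approximation of the clone-based comparison
# described in the paper.
import difflib

def changed_line_count(before: list[str], after: list[str]) -> dict[str, int]:
    counts = {"added": 0, "removed": 0}
    matcher = difflib.SequenceMatcher(a=before, b=after)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            counts["removed"] += i2 - i1
        if tag in ("replace", "insert"):
            counts["added"] += j2 - j1
    return counts

before = open("release_1.0/module.py").read().splitlines()   # hypothetical paths
after = open("release_1.1/module.py").read().splitlines()
print(changed_line_count(before, after))
```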

As maintenance is an entirely different task from the green-field development of a system, this metric cannot be compared to the results found for development projects. New experimental applications to maintenance projects are needed, and the metric will probably need to be refined to address the particularities of maintenance activities.

References

[1] Best programming quotations. World Wide Web, Aug. 2007. http://www.linfo.org/q_programming.html.

[2] ConQAT - Continuous Quality Assessment Toolkit. World Wide Web, Aug. 2007. http://conqat.cs.tum.edu/.

[3] SLOCCount. World Wide Web, Aug. 2007. http://www.dwheeler.com/sloccount/.

[4] Wikipedia: Source lines of code. World Wide Web, Aug. 2007. http://en.wikipedia.org/wiki/Source_lines_of_code.

[5] P. G. Armour. Beware of counting LOC. Communications of the ACM, 47(3):21-24, 2004.

[6] B. S. Baker. On finding duplication and near-duplication in large software systems. In L. Wills, P. Newcomb, and E. Chikofsky, editors, Second Working Conference on Reverse Engineering, pages 86-95, Los Alamitos, California, 1995. IEEE Computer Society Press.

[7] B. W. Boehm. Software Engineering Economics. Advances in Computing Science & Technology. Prentice-Hall, Englewood Cliffs, NJ, USA, Dec. 1981.

[8] S. D. Conte, H. E. Dunsmore, and V. Y. Shen. Software engineering metrics and models. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1986.

[9] W. E. Deming. Out of the Crisis. The MIT Press, 1986.

[10] T. S. G. I. Inc. Chaos: A recipe for success, 1999.

[11] C. Jones. Software Assessments, Benchmarks and Best Practices. Information Technology Series. Addison Wesley, 2000.

[12] R. E. Park. Software size measurement: A framework for counting source statements. Technical Report CMU/SEI-92-TR-20, Software Engineering Institute, Carnegie Mellon University, Sept. 1992.

[13] C. E. Walston and C. P. Felix. A method of programming measurement and estimation. IBM Systems Journal, 16(1):54-73, 1977.

[14] D. A. Wheeler. More than a gigabuck: Estimating GNU/Linux's size. World Wide Web, July 2002.


Fair Software Value Quantification by Productivity Data Analysis

Naoki Ohsugi, Nobuto Inoguchi, Hideyuki Yamamoto, Seisuke Shimizu, Noboru Hattori, Jun Yoshino, Takeshi Hayama, Tsuyoshi Kitani

NTT DATA Corporation, Tokyo, Japan
{oosugin, inoguchin, yamamotohdy, shimizusi, hattorinb, yoshinoj, hayamatk, kitanit}@nttdata.co.jp

1. Introduction

Value-based pricing has attracted much attention from both customers and vendors in the software industry. Customers expect it to provide more reliable evidence for software prices than the cost data reported by vendors [3], because value-based pricing sets prices based on the customers' perception of the value of the software (hereafter, software value) [3]. Vendors, on the other hand, expect it to increase their long-term profit: under this pricing scheme the price is independent of the production cost, so cost reduction efforts yield larger benefits to the vendors than under other pricing schemes.

However, there is no standard for quantifying software value that is fair from the perspectives of both the customers and the vendors. For example, Function Points measure the volume of a software system's functions as perceived by the customers [4], but they are not fair from the vendors' perspective: Function Points do not measure the quality and performance of software, although these are important cost drivers. A fairer standard is needed to build consensus on software prices.

We propose an analysis procedure for building such a standard to quantify software value. Because the analysis involves both the vendors' data and the customers' opinions, it can derive results that are fairer for both parties. Close collaboration between the customers and the vendors is vital for the proposed analysis: the customers have to express their opinions about how they perceive value, the vendors have to disclose the analysis results of their productivity data, and both have to discuss how to refine the analysis results into the standard.

The remainder of this paper is structured as follows. Section 2 presents related work, especially on value-based software pricing, to outline the background of the discussion. Section 3 explains the procedure for quantifying software value. Section 4 describes some potential issues with this idea. Section 5 concludes the paper and provides directions for further research.

2. Related Work

Many vendors have studied customers' perceived value. Much of this work focused on mass marketing to the consuming public [1], [5], while some studies explored the buying behavior of enterprises [6], [7], including customers of information systems [2]. However, most studies aimed at building business strategies to increase profit, and thus present analysis results from the vendors' perspective only. In contrast, here we describe an analysis intended as a tool for consensus building, which should be fair from the perspectives of both the customers and the vendors.

3. Software Value Quantification

The proposed standard consists of one or more quantification models and their manuals. Each model is derived as a prediction model of software value: statistical methods such as regression analysis derive the model from the vendors' data, and the customers and the vendors then discuss and adjust the model so that it reflects their mutual interest. If necessary, further models can be built by stratifying the data by project type, business field and so on. Once a set of agreed models is obtained, manuals are written to describe the models.

Figure 1 shows the procedure for building a standard model to quantify software value. In the figure, an orb symbol denotes a stakeholder of the standard, a paper symbol denotes an involved electronic document, and a diamond symbol denotes a mathematical model. A solid line denotes a stakeholder action and a dotted line denotes derivation from an artifact.


Customers and vendors perform the procedure according to the numbers in the figure, as follows (a minimal sketch of the model-fitting step is given after the list):

1. The customers make a list of their perceivable attributes that seem to affect the software value.

2. The vendors review the list of attributes and adjust it through discussions with the customers.

3. The vendors collect productivity data and the list of attributes from their development projects.

4. The vendors merge the productivity data and the list of attributes into one data set.

5. The vendors analyze the data after appropriate stratification.

6. The vendors derive a prediction model of the software value by some statistical method.

7. The customers review the derived model and adjust it through discussions with the vendors.

8. The prediction model is refined to the standard quantification model.
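As a minimal sketch of step 6 under stated assumptions: the attribute names (function_points, quality_score, performance_score), the sample data, and the use of ordinary least squares are illustrative choices of ours, not prescriptions of the paper.

```python
# A minimal sketch of step 6: deriving a prediction model of software value
# from the merged productivity data by ordinary least-squares regression.
import numpy as np

# Merged data (step 4): one row per completed project.
# Columns: function_points, quality_score, performance_score (hypothetical).
X = np.array([
    [120.0, 3.5, 2.0],
    [450.0, 4.0, 3.5],
    [300.0, 2.5, 4.0],
    [800.0, 4.5, 4.5],
])
# Prediction target agreed by customers and vendors, e.g. price (hypothetical units).
y = np.array([12.0, 48.0, 35.0, 95.0])

# Fit value = b0 + b1*FP + b2*quality + b3*performance.
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def predict_value(function_points: float, quality: float, performance: float) -> float:
    """Quantify the software value of a new project with the derived model."""
    return float(coef @ np.array([1.0, function_points, quality, performance]))

print(predict_value(500.0, 4.0, 3.0))
```

In practice the derived coefficients would then be reviewed and adjusted with the customers (step 7) before being refined into the standard quantification model (step 8).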

4. Possible Issues

Although the procedure itself is not complex, there are many complex issues in establishing the above standard. This section describes some of these issues and ways to deal with them.

One of the major issues is the definition of software value. The customers and the vendors have to define a prediction target (i.e., the objective variable) and some stratification criteria in order to build a prediction model. The prediction target can be the software's price, the development cost, or the customer's benefit from using the software. Stratification criteria can be the customers' business field, or the development type, such as new development, enhancement or maintenance. The customers and the vendors have to define these before data collection and analysis.

A major issue for the customers is the authenticity of the vendors' data. The vendors can add a margin to increase upward pressure on the price, and that margin leads the model to overestimate software value. To alleviate this issue, third-party organizations such as government and academic organizations can be involved in the data validation process.

On the other hand, a major issue for the vendors is the customers' price-cutting pressure. The customers can put pressure on the vendors to cut prices, because sharing price information gives them a stronger position. To alleviate this issue, the customers and the vendors should agree on how this quantification standard will be used before price negotiations. In addition, the vendors have to prepare evidence showing that excessive price cuts decrease the quality of the software.

5. Conclusion

In this paper, we described an idea for using productivity data analysis to build a software value quantification standard that is fair to both parties. Such a standard can increase the transparency of the vendors' asking prices and decrease the customers' excessive price-cutting pressure. In addition, the discussions required to develop the standard themselves promote mutual understanding between the customers and the vendors.

6. References

[1] A. R. Andreasen, “A Taxonomy of Consumer Satisfaction/Dissatisfaction Measures,” In Proc. of Conceptualization and Measurement of Consumer Satisfaction and Dissatisfaction, pp.11-35, 1977.

[2] T. Chikara, and K. Fujino, “Models and Techniques for Measuring Customer Satisfaction so as to Evaluate Information Systems,” Trans. of Information Processing Society of Japan, vol. 38, no.4, pp.891-903, 1997.

[3] R. Harmon, D. Raffo, and S. Faulk, “Value-Based Pricing for New Software Products: Strategy Insights for Developers,” In Proc. of the Portland Intl. Conf. on the Management of Eng. and Tech. (PICMET’04), 2004.

[4] IFPUG, Function Point Counting Practices Manual, Release 4.1, IFPUG, Mequon, Wisconsin, USA, 1999.

[5] R. Kohli, and V. Mahajan, “A Reservation-Price Model for Optimal Pricing of Multiattribute Products in Conjoint Analysis,” Journal of Marketing Research, vol.28, pp.347-54, 1991.

[6] U. B. Ozanne, and G.A. Churchill Jr., “Five Dimensions of the Industrial Adoption Process,” Journal of Marketing Research, vol.8, pp.322-328, 1971.

[7] J. N. Sheth, “A Model of Industrial Buyer Behavior,” Journal of Marketing, vol.37, pp.50-56, 1973.


Figure 1. Procedure for building a standard model to quantify software value


A Critique of How We Measure and Interpret the Accuracy of Software Development Effort Estimation

Magne Jørgensen
Simula Research Laboratory, Oslo, Norway

[email protected]

Abstract

This paper criticizes current practice regarding the

measurement and interpretation of the accuracy of software development effort estimation. The shortcomings we discuss are related to: 1) the meaning of 'effort estimate', 2) the meaning of 'estimation accuracy', 3) estimation of moving targets, and 4) the assessment of the estimation process, and not only of the discrepancy between the estimated and the actual effort, when evaluating estimation skill. It is possible to correct several of the discussed shortcomings by better practice. However, there are also inherent problems related to both laboratory and field analyses of the accuracy of software development effort estimation. It is essential that both software researchers and professionals are aware of these problems and their implications for the analysis of the measurement of effort estimation accuracy.

1. Introduction

As early as 1980, Boehm and Wolverton [1] wrote about the “need to develop a set of well-defined, agreed-on criteria for the 'goodness' of a software cost model; the need to evaluate existing and future models with respect to these criteria; and the need to emphasize 'constructive' models which relate their cost estimates to actual software phenomenology and project dynamics”. We here claim that there are essential unsolved problems related to the accuracy of effort estimation measurements and that some of these problems are not due to lack of maturity or training, but to inherent problems that may be impossible to solve.

1.1 Problems with MMRE and PRED

The typical current practice when comparing estimation models or evaluating the estimation performance of software organizations in field settings

is to apply the accuracy measures Mean Magnitude of Relative Error (MMRE) and PRED, or similar measures. MMRE and PRED seem to have been introduced to the software community by Conte, Dunsmore and Shen in 1985 [2] and are defined as follows:

$\mathrm{MMRE} = \text{mean MRE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|\mathrm{Est}_i - \mathrm{Act}_i|}{\mathrm{Act}_i}$

$\mathrm{PRED}(r) = \frac{k}{n}$, where $k$ is the number of projects in a set of $n$ projects whose $\mathrm{MRE} \le r$.

The MRE-based accuracy measures have been criticized by several researchers in software engineering, e.g., [3-5]. Several alternative measures have been proposed, e.g., the Mean Balanced Relative Error (MBRE) [6], the Weighted Mean of Quartiles of relative errors (WMQ) [7] and the Mean Variation from Estimate (MVFE) [8]. Especially illuminating of the problems related to the interpretation of MMRE and PRED as accuracy measures is the paper by Kitchenham et al. [4]. That paper proposes that MMRE and PRED should not be interpreted as accuracy measures at all, but as measures of the spread and the kurtosis of the distribution of the estimation accuracy variable z, where z = estimated effort / actual effort.
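As a minimal illustration of these definitions, the sketch below computes MMRE and PRED(r) for a set of hypothetical estimate/actual pairs (the data values are invented):

```python
# A minimal sketch of MMRE and PRED(r) as defined above; the example
# estimate/actual values are hypothetical illustration data.
def mmre(estimates, actuals):
    """Mean Magnitude of Relative Error: mean of |Est_i - Act_i| / Act_i."""
    mres = [abs(e - a) / a for e, a in zip(estimates, actuals)]
    return sum(mres) / len(mres)

def pred(estimates, actuals, r=0.25):
    """PRED(r): fraction of projects whose MRE is at most r."""
    mres = [abs(e - a) / a for e, a in zip(estimates, actuals)]
    return sum(1 for m in mres if m <= r) / len(mres)

est = [100, 120, 250, 400]   # estimated effort, e.g. work-hours
act = [110, 180, 240, 500]   # actual effort
print(f"MMRE = {mmre(est, act):.2f}, PRED(0.25) = {pred(est, act):.2f}")
```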

Conte, Dunsmore and Shen [2] refer to the accuracy measure as mean MRE (MMRE). This is surprising, given that MAPE (Mean Absolute Percentage Error) and MARE (Mean Absolute Relative Error) are the common terms for the same accuracy measure in most other disciplines involving quantitative forecasting. We are not aware of any other community that applies the term MMRE for this accuracy measure. This choice of non-standard terminology is unfortunate. It may have contributed to a surprising lack of references in software engineering papers on how to measure and analyze effort estimation accuracy to the large amount of relevant forecasting, prediction and estimation research papers outside the software engineering community in [9].

Conte, Dunsmore and Shen are also frequently referred to as sources for the claim that effort estimation models should have an MMRE <= 0.2 and a PRED(0.25) >= 0.75. We examined their paper on this topic [2] and found no reference to studies or argumentation providing evidence for these values. Instead, we found undocumented claims such as this: "Most researchers have concluded that an acceptable criterion for an effort prediction model is PRED(0.25) >= 0.75". Many researchers have applied these arbitrarily set criterion values to evaluate the estimation accuracy of estimation models. We believe that this has reduced the quality of their studies and should be avoided in future studies. The acceptable accuracy of a specific model or judgment process depends on many factors, e.g., the accuracy of alternative ways of providing effort estimates in a particular context.

Although there are strong limitations in the accuracy measures themselves, as illustrated by previous papers and our own discussion of MMRE and PRED, it is our opinion that the most severe problems are more basic and, to a large extent, not mentioned in textbooks and papers on software effort estimation. Such problems include those related to what we mean by 'effort estimate' and 'more accurate than', the effect of issues of system dynamics on the meaningfulness of accuracy measurement, and problems related to the outcome focus of the measurement, i.e., the fact that we are only evaluating the outcome of an estimation process and not the estimation process itself. Neglect of these topics by the software engineering research community and the software industry motivates this paper.

2. What is an 'effort estimate'?

To date, the software development community does not have a precise, agreed-upon definition of its most central term, 'effort estimate' [10]. The term 'effort estimate' is sometimes used to mean the "planned effort", sometimes the "budgeted effort", sometimes the "most likely use of effort" (the modal value), and sometimes "the effort with a 50% probability of not being exceeded" (the median value). Sometimes it is not even possible to identify an unambiguous meaning. This confusion may be worst in judgment-based effort estimation (expert estimation), where we have observed that software professionals frequently communicate their effort estimates without stating, or sometimes even being aware of, which interpretation they have used. One consequence of the lack of a clear definition of the term 'effort estimate' is that surveys on the accuracy of software development effort estimation are inherently difficult to interpret; see [11] for an overview. While a 30% effort overrun is likely to result in significant management problems when the effort estimate was meant to be the planned or budgeted effort, such problems might not arise if it was meant to be the most likely use of effort and the manager added sufficiently large contingency buffers. We seriously doubt the usefulness of surveys where the core concept under investigation, i.e., 'effort estimate', is not precisely defined, is interpreted inconsistently, and where there is no information about the degree to which interpretation is inconsistent among the respondents. Unfortunately, this may be the case in most surveys, including our own.
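To make the ambiguity concrete, the following hypothetical simulation (all parameters invented) shows how the modal value, the median, and a buffered planned figure can diverge for one and the same project, so that the reported "estimate" depends entirely on which interpretation is used:

```python
# Illustration only: for a right-skewed effort distribution, the most likely
# value (mode), the median (50% probability of not being exceeded), and a
# planned figure with a contingency buffer differ substantially.
import random
import statistics

random.seed(1)
# Simulated possible effort outcomes for one project (work-hours), right-skewed.
outcomes = [random.lognormvariate(mu=6.9, sigma=0.4) for _ in range(10_000)]

median_estimate = statistics.median(outcomes)                      # p50 interpretation
mode_estimate = statistics.mode(round(x, -2) for x in outcomes)    # rough modal value
planned_effort = sorted(outcomes)[int(0.85 * len(outcomes))]       # p85 plan with buffer

print(f"mode ~ {mode_estimate:.0f}, median ~ {median_estimate:.0f}, "
      f"p85 plan ~ {planned_effort:.0f} work-hours")
```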

Even in highly controlled situations, e.g., when comparing the estimates of estimation models in laboratory settings, it is frequently not clear what is meant by an effort estimate. For example, it is not clear that the effort estimates derived from analogy-based effort estimation models are of the same type as those derived from regression-based effort estimation models. They have different "loss functions" (optimization functions), and one type of model may, for example, systematically provide higher estimates than others. As a result, an assessment that one estimation model performs better than another may be a consequence of different interpretations of 'effort estimate'.¹ However, problems related to the interpretation of the term 'effort estimate' are less severe with estimation models than with the more intuition-based expert judgment models, because their estimation processes are more explicit.

A reasonable precise interpretation of the term ‘effort estimate’ is an obvious prerequisite for meaningful measures of estimation accuracy. Without this, it will frequently be difficult to determine whether differences or trends in estimation accuracy result from differences or trends in estimation performance or

1 Assume, for example, a comparison between analogy and regression-based effort estimation models. The regression-model, due to the least square optimization, will tend to emphasize the historical projects that spent unusually much or little effort compared with the other projects. By contrast, an analogy-based model may assign the same weight to all similar projects when calculating the effort estimate. If the high-impact projects in the data set used for the regression-based estimation models tend to be the projects with unusually high usage of effort, as is frequently the case in software development, the consequence is that the estimates of the regression-based model tend to be higher than those of the analogy-based models. The MRE measures punish over-estimation more than under-estimation. This may mean that the difference in the type of estimates may give the regression model a slight advantage over analogy-based models when applying MRE-based measures.

SPACE 2007 - 17 -

Page 28: Proceedings - pdfs.semanticscholar.org · software engineering research group of Simula Research Laboratory in Oslo, Norway with research focus on software cost estimation. Magne

from different interpretations of ‘effort estimate’. As an illustration, estimations of agile projects tend to be based on “how much work can we put into an increment?” rather than the “how much effort will a project require?” of more traditional projects. This means that effort estimates probably play a different role and, possibly, have a different interpretation in agile projects. Thus, comparing estimation accuracy of agile projects with those of traditional projects may be misleading.
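The effect described in the footnote can be made concrete with a small, contrived illustration (hypothetical size/effort data of our own; simple one-variable least squares and an unweighted two-nearest-neighbour analogy, which is only one of many possible analogy variants):

# Hypothetical historical projects: (size in KLOC, effort in work-hours).
history = [(10, 100), (12, 120), (14, 140), (16, 160), (50, 900)]  # last project is a high-effort outlier

# Ordinary least squares for effort = a + b * size.
n = len(history)
mx = sum(x for x, _ in history) / n
my = sum(y for _, y in history) / n
b = sum((x - mx) * (y - my) for x, y in history) / sum((x - mx) ** 2 for x, _ in history)
a = my - b * mx

new_size = 15
regression_estimate = a + b * new_size

# Analogy: mean effort of the two projects closest in size, all weighted equally.
neighbours = sorted(history, key=lambda p: abs(p[0] - new_size))[:2]
analogy_estimate = sum(y for _, y in neighbours) / 2

print(round(regression_estimate), round(analogy_estimate))  # regression comes out higher here (173 vs 150)

Whether, and by how much, the two estimate types differ depends entirely on the data; the point is only that the two methods optimize different loss functions and therefore need not produce the same kind of ‘estimate’.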

3. A moving target

Software development effort estimation differs from many other types of forecasting, e.g., weather or stock price forecasting. While, for example, a weather forecast has no impact on the actual weather and there is typically little doubt about what the forecast refers to, the same is not the case in software development projects. In software development projects the target (the required software product) is typically not well defined, the requirements may change during the project’s lifetime, and the estimate itself may impact the process of reaching the target [12].

As an illustration of the implications for the measurement of estimation accuracy, assume that a software provider’s estimate of a project is 1000 work-hours and that the price is based on that estimate. Once the project has begun, it soon becomes evident that the estimate of 1000 work-hours was far too low. Hence, the provider either has to face the prospect of huge financial losses or act opportunistically2 to avoid losses. If the provider acts opportunistically, he might spend less effort on the maintainability, usability and robustness properties of the software. As a consequence, the measured estimation accuracy will improve, due to the provider’s ability and willingness to adjust the work to fit the initial estimate. Consequently, it is not clear to what extent the measured estimation accuracy reflects properties of the estimate or the project’s ability to fit the work to the estimate.

2 Opportunistic behaviour (sometimes termed “moral hazard”) is a term commonly used in economics. It occurs in software development projects when the provider takes advantage of his superior knowledge about the development processes and software product properties to deliver a product of lower quality than is expected by the client. This would occur, for example, if a provider delivers software products with quality problems that are not likely to be discovered by the client. The likelihood of opportunistic behaviour increases with the degree of information asymmetry between the client and the provider, with incentives to deviate from acting in accordance with the client’s goals, and with the client’s lack of ability to specify and monitor the development process and product. There are controversies regarding to what degree and when opportunistic behaviour is neutralized by so-called “altruistic behaviour” (the opposite of egoistic behaviour) and “work ethics”, but there is no doubt that the phenomenon occurs frequently in software development as well as in other disciplines.

The system dynamics of software effort estimation is also important for understanding why it does not necessarily help to ‘add 30% to every effort estimate’ to remove the bias towards optimism in effort estimation. For example, if a project leader knows that 30% will be added, he may: i) remove 30% from his estimate in advance to get the effort estimate he believes in, or ii) adjust the delivered product to be larger or better. The latter is frequently possible because the effort estimate to some extent defines, as well as reflects, the requirement specifications. Rewarding accurate effort estimates does not work well for similar reasons. For example, we know one company that rewarded those project leaders who had the most accurate effort estimates. The immediate effect was that effort estimates increased and productivity fell. The reason was simply that the project leaders discovered that a simple strategy, rational for them but not necessarily for the company, to achieve accurate effort estimates was to provide higher effort estimates and spend any remaining effort on improving the product (‘gold-plating’).

In [13] we suggest several methods for measuring estimation accuracy when the target is moving. Unfortunately, it seems that several of the problems involved are inherent and not easy to adjust for. The methods, such as adjusting effort for differences between the specified and actual product, may alleviate problems, but are seldom able to eliminate them completely.

The problems with measuring the accuracy of software estimation that result from the target constantly moving apply mainly in field settings. In laboratory settings, the estimation models are developed from historical data about the completed products and the estimates are derived from those models. In that case, the actual products are used both to develop the model and to evaluate the accuracy of the estimates; hence, there is no moving-target problem. However, as soon as we apply and evaluate the models in real-life settings, the problems with respect to measuring estimation accuracy that are induced by the fact that the target may move begin to appear. These problems may strongly restrict the extent to which measures of estimation accuracy relate to properties of the estimates, as opposed to the development process, regardless of the type of accuracy measure chosen.


4. What is the meaning of ‘more accurate than’?

Preferably, a measure of estimation accuracy should reflect its users’ intuitive understanding of important relationships that pertain to accuracy, such as the relationship ‘more accurate than’. Otherwise, it may be difficult to find the measure meaningful and to communicate the results. If, for example, most people would agree that the effort estimate X is more accurate than the effort estimate Y, but our accuracy measure tells us the opposite, there may be problems related to communication and willingness to adopt the accuracy measure.

From discussions with software professionals, we have found that their understanding of ‘more accurate than’ deviates substantially from that implied by common accuracy measures. The main reason for this is related to the system dynamics of software development, i.e., the types of reasons discussed in Section 3. For example, a project manager who had overestimated a project by 20% held the opinion that his effort estimate was much more accurate than that of a project that was underestimated by 20%. The reason for the project manager’s holding that his estimate was the more accurate was that he knew perfectly well how easily he could have expended exactly as much effort as was estimated, while this option was not available for the project that was underestimated. It is evident that software professionals do not necessarily have a strictly quantitative interpretation of ‘more accurate than’.

The lack of a precise and commonly accepted understanding of what we mean by ‘more accurate than’ may have been a major reason for the acceptance of measures like the MMRE. As an illustration, when we accept that MMRE is a meaningful measure for comparing the estimation accuracy of different estimation methods, we implicitly accept that an effort estimate where the actual effort is 300% of the estimate (MRE = 0.67) is more accurate than an effort estimate where the actual effort is 59% of the estimate (MRE = 0.69). This entails, for example, that if we estimate the effort to be 1000 work-hours and it turns out to be 3000 work-hours, an acceptance of the MMRE measure means that we should agree that we have estimated more accurately than if the actual effort turns out to be 590 work-hours. However, the opposite would probably be the case in practice, we believe.
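The asymmetry can be checked directly (plain arithmetic on the numbers used above):

mre = lambda actual, estimate: abs(actual - estimate) / actual

# An estimate of 1000 work-hours in both cases.
print(round(mre(3000, 1000), 2))  # actual is 300% of the estimate -> 0.67
print(round(mre(590, 1000), 2))   # actual is 59% of the estimate  -> 0.69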

Other accuracy measures that correspond better to common interpretations of ‘more accurate than’ and other accuracy relations will hardly ever solve this problem. Even if researchers managed to agree on an ‘empirical relational system’, i.e., a set of relations and definitions sufficiently precise for the purpose of estimation accuracy measurement, it is not reasonable to assume that the software industry would share this ‘empirical relational system’. The software industry’s loss functions and measurement goals vary a lot, depending on such factors as application domain, person roles, and type of client and project.

We recommend that the estimation accuracy measures be tailored to the situation at hand and, even more importantly, based on a clear understanding of the purpose of the measurement. This may, in several particular situations, lead to precise and meaningful interpretations of relationships that pertain to accuracy, such as ‘more accurate than’. Assume, for example, that an organization wants to monitor the proportion of projects that have large estimation overruns to determine whether this proportion increases or decreases over time. The organization must then define precisely what it means by ‘large estimation overrun’. They may decide that, for their measurement purposes, situations with ‘large estimation overruns’ can meaningfully be defined as situations in which the actual effort is more than 30% and more than 500 work-hours higher than the estimated effort, after adjusting the actual effort for differences between planned and actually delivered functionality and quality. By taking these steps, the organization introduces an ordinal scale of estimation overrun measurement, where estimates are categorized and ordered into the categories ‘no large effort overrun’ and ‘large effort overrun’.
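A definition of this kind translates directly into a measurement rule. The sketch below is our illustration of the example in the text (it assumes the actual effort has already been adjusted for delivered functionality and quality):

def large_overrun(estimated, adjusted_actual):
    # 'Large estimation overrun' as defined above: the (adjusted) actual effort
    # exceeds the estimate by more than 30% AND by more than 500 work-hours.
    overrun = adjusted_actual - estimated
    return overrun > 0.3 * estimated and overrun > 500

print(large_overrun(1000, 1400))  # 40% and 400 hours over -> False (hour threshold not met)
print(large_overrun(2000, 2700))  # 35% and 700 hours over -> True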

If the purpose of the measurement is not only to monitor estimation overruns, but also to understand the reasons behind them, several other steps have to be taken, e.g., the steps suggested in [13]. Otherwise, it will not be possible to determine whether an improvement in estimation accuracy within a company is a result of increased estimation skill, a change in the interpretation of ‘effort estimate’, less complex projects, or better project management.

5. Focus on the outcome of accuracy measurement

The two main strategies when evaluating the accuracy of judgments are: i) determining coherence with a normative process (the coherence-based strategy) and ii) determining correspondence with the real world (the correspondence-based strategy) [14]. While many studies on human judgment are based on coherence with a normative strategy, software development effort estimation accuracy evaluations are, as far as we know, based solely on correspondence with the actual effort. A possible reason for this strong reliance on correspondence is that it is frequently not obvious what a normative effort estimation process should look like. However, the disadvantage of this reliance on correspondence is that an effort estimate may be accurate for the wrong reasons.

We are currently analyzing the judgmental processes of 28 software professionals who estimate the effort of software projects and maintenance tasks (work-in-progress). Frequently, we observed, the effort estimation strategies deviated from what we consider to be normative responses, e.g., the response that best reflects the historical data. Sometimes, however, the less defensible estimates were more accurate than those based on normative estimation strategies. Figure 1 displays one such situation.

Figure 1: Estimation of programming productivity (histogram of frequency versus estimated productivity in LOC/man-month, with the values MPJava and PCl1 marked)

In Figure 1, PCl1 denotes the productivity of the closest analogy (the project most similar to the one to be estimated), while MPJava denotes the mean productivity of all previous tasks of same type (five Java tasks). In this estimation situation, the only information about the new task was its estimated size and development platform. This was introduced to force the software professionals to base their estimates on the historical data, so that we could analyze their use of these data.

It is evident from Figure 1 that most software professionals estimated effort values that implied a productivity of the estimated task close to PCl1, MPJava or a combination of these two values. However, there were a few estimates of more than 2000 LOC/man-month. These estimates were difficult to defend on the basis of the historical data received by the software developers. These non-normative estimates may have been caused by: i) calculation error, ii) an assumption of organizational learning (not supported by the data), or iii) an assumption of much higher productivity for smaller tasks (not supported by the data).

The actual productivity of the task was 2333 LOC/man-month. Hence, those who followed the seemingly least normative estimation strategies had the least discrepancy between estimated and actual effort. This illustrates that if we want to use estimation accuracy as a measure of estimation skill, or predict the future estimation accuracy of a software developer, a strong reliance on outcome without considering the normativeness of the estimation strategy may lead to poor interpretations.

In order to improve the interpretation of estimation accuracy measurement, we recommend that the normativeness of the estimation strategy should be evaluated. Elements of an evaluation of the normativeness of an estimation process should include:
• Including variables that historically have affected the use of effort.
• Excluding variables that have had no or very limited impact on the effort.
• Basing the evaluation on a defensible strategy, e.g., similarity to other projects.
• Regressing towards the mean effort or productivity of a larger group of similar tasks at higher uncertainty levels.
• Not assuming substantial learning from experience, i.e., better performance than on previous projects, unless a strong argument for this is provided.

We acknowledge that it is typically very difficult to evaluate the degree of normativeness of judgment-based estimation strategies. There may, for example, be an essential difference between a software professional’s claim that he has not been affected by irrelevant variables and the actual effect of those variables [15].

6. Final reflections

Common measures of the accuracy of software development effort estimation have, we believe, shortcomings that severely affect their communication, interpretation and meaningful use. Several of these shortcomings have, as far as we are aware, received little attention from the software development communities. We believe that researchers and organizations that apply these effort estimation accuracy measures will benefit from a greater awareness of these shortcomings.

We also believe that a necessary precondition for the sustainable improvement of effort estimation is that the evaluation of judgmental strategies and formal estimation models has a solid foundation. We find that this is currently not the case, and that there is a strong need for more mature analyses and studies on this topic. If we are not able to evaluate and compare estimation models and processes properly, any measure of change may be a result of shortcomings with respect to how we measure, rather than an actual change in estimation performance.

7. References

1. Boehm, B.W. and R.W. Wolverton, Software cost modeling: Some lessons learned. Journal of Systems and Software, 1980. 1: p. 195-201.

2. Conte, S.D., H.E. Dunsmore, and V.Y. Shen, Software effort estimation and productivity. Advances in Computers, 1985. 24: p. 1-60.

3. Shepperd, M., M. Cartwright, and G. Kadoda, On building prediction systems for software engineers. Empirical Software Engineering, 2000. 5(3): p. 175-182.

4. Kitchenham, B.A., et al., What accuracy statistics really measure. IEE Proceedings Software, 2001. 148(3): p. 81-85.

5. Foss, T., et al., A simulation study of the model evaluation criterion MMRE. IEEE Transactions on Software Engineering, 2003. 29(11): p. 985-995.

6. Miyazaki, Y., et al., Robust regression for developing software estimation models. Journal of Systems and Software, 1994. 27(1): p. 3-16.

7. Lo, B.W.N. and X. Gao, Assessing software cost estimation models: criteria for accuracy, consistency and regression. Australian Journal of Information Systems, 1997. 5(1): p. 30-44.

8. Hughes, R.T., A. Cunliffe, and F. Young-Martos, Evaluating software development effort model-building techniques for application in a real-time telecommunications environment. IEE Proceedings Software, 1998. 145(1): p. 29-33.

9. Jørgensen, M. and M. Shepperd, A systematic review of software cost estimation studies. IEEE Transactions on Software Engineering, 2007. 33(1): p. 33-53.

10. Grimstad, S., M. Jørgensen, and K. Moløkken-Østvold, Software Effort Estimation Terminology: The Tower of Babel. Information and Software Technology, 2006. 48(4): p. 302-310.

11. Moløkken, K. and M. Jørgensen. A review of software surveys on software effort estimation, in International Symposium on Empirical Software Engineering. 2003. Rome, Italy: Simula Res. Lab. Lysaker Norway.

12. Jørgensen, M. and D.I.K. Sjøberg, Impact of effort estimates on software project work. Information and Software Technology, 2001. 43(15): p. 939-948.

13. Grimstad, S. and M. Jørgensen. A Framework for the Analysis of Software Cost Estimation Accuracy. in ISESE. 2006. Rio de Janeiro: ACM Press.

14. Hammond, K.R., Human judgement and social policy: Irreducible uncertainty, inevitable error, unavoidable injustice. 1996, New York: Oxford University Press.

15. Jørgensen, M., A review of studies on expert estimation of software development effort. Journal of Systems and Software, 2004. 70(1-2): p. 37-60.


A SemiQ Model of Test-and-Fix Process of Incremental Development

He Zhang1,2, Barbara Kitchenham3, Ross Jeffery1

1 School of Computer Science and Engineering, University of New South Wales

2 National ICT Australia(NICTA) Sydney, Australia

3 School of Computing and Mathematics, Keele University

1,2 {he.zhang, ross.jeffery}@nicta.com.au, [email protected]


A SemiQ Model of Test-and-Fix Process of Incremental Development

He Zhang1,2, Barbara Kitchenham3, Ross Jeffery1,2 1School of Computer Science and Engineering, University of New South Wales

2National ICT Australia 3School of Computing and Mathematics, Keele University

1,2{he.zhang, ross.jeffery}@nicta.com.au, [email protected]

Abstract

Software process modeling has become an

important technique for managing software development processes. However, purely quantitative process modeling requires very accurate measurement of the software process attributes, which in turn relies on accurate historical data. This paper presents a semi-quantitative (SemiQ) modeling approach. It allows the software process to be modeled even when there is the uncertainty about the values of software process attributes. We demonstrate its value and flexibility by developing SemiQ models of the test-and-fix process of incremental software development. 1. Introduction

Software process modeling has become an important technique for managing the software development processes. However, purely quantitative process modeling requires very accurate measurement of the software process attributes, which in turn relies on accurate historical data.

In contrast, Semi-Quantitative (SemiQ) modeling and simulation allows the software process to be modeled when there is uncertainty about the actual values of software process attributes. We introduced this approach in a previous study [1]. In this paper, we demonstrate how an incremental development process model can be developed using SemiQ modeling, and present the simulation results.

2. Background and approach

2.1. Software process modeling

Osterweil identified two complementary types of software process research [2], which can be characterized as macro-process research, focused on phenomenological observations of external behaviors of processes, and micro-process research, focused on the study of the internal details and workings of processes. At macro-process level, a process model is composed of a set of equations whose variables describe a process. The input variables deal with the observable and manageable aspects of the process.

In this paper, we model software processes semi-quantitatively by identifying the inherent cause-effect relationships in the process. Currently we restrict ourselves to the macro-process level.

2.2. Incremental development processes

Incremental development is a broad term covering a range of development processes, including iterative development, versioned implementation, etc. The basic idea is to divide the system into smaller subsystems, which are gradually integrated to become the full system [3].

Basically, there are three phases in one increment: analysis, implementation, and testing. In this article, we model the implementation and test-and-fix processes of an intermediate increment. After a release is implemented, there is a test-and-fix process, where the release is thoroughly tested and corrected. During this period, no new functionality is added. The testing process aims to ensure that each release provides a robust foundation for subsequent releases.

2.3. SemiQ modeling and simulation

Semi-quantitative modeling is implemented in two stages: first the system is modeled qualitatively, and then quantitative constraints are placed on the model. Figure 1 shows how a system described by a single QDE (qualitative differential equation) enables semi-quantitative modeling and simulation.


Figure 1. Semi-quantitative modeling and simulation

Qualitative models reflect systems in the real world at an abstract level. Fewer assumptions are required than for quantitative models. Qualitative models can be further used for process simulation by QSIM [4]. The output generated is a set of possible qualitative behaviors, and each behavior consists of a sequence of states that describe open temporal intervals or time points [5]. These states represent the system behavior from its initial state to its final point. Time is treated as a qualitative variable in the model. Qualitative landmarks are created when necessary; they indicate critical points of the model parameters.

Quantitative constraints use bounding intervals to represent partial quantitative knowledge. Q2 is the semi-quantitative extension to QSIM. Given quantitative interval bounds on the values of some landmarks and envelope functions, Q2 defines a constraint-satisfaction problem (CSP). A solution to the CSP is an assignment of a value range to each landmark consistent with the constraints. This reduces the number of valid behaviors, since a contradiction refutes any inconsistent qualitative behavior and all its associated behaviors [5].

3. SemiQ modeling for test-and-fix process of incremental development

The primary purpose of the test-and-fix process model is to examine whether the increment can be completed in the desired time period with the required quality. This section presents the qualitative modeling and semi-quantitative constraints for this process.

3.1. Related models

Some previous researchers have investigated the software testing process using quantitative models.

Abdel-Hamid and Madnick (AHM) modeled the basic software testing process, which is a part of their integrated software process model, using System Dynamics [6]. However, their model is based on the waterfall testing process, rather than on incremental testing. There is only one explicit type of error, namely “Error”, in the model. They did not differentiate between newly generated errors and undetected errors from previous increments. Furthermore, their model neglects the switchover phenomenon of error-fixing productivity.

Huff et al. developed an alternative causal model for the test-and-fix process of incremental development [7]. They quantified the system using quantitative equations. Nevertheless, their models did not identify the propagation of remaining active errors in succeeding increments.

Tvedt developed a comprehensive process model of concurrent incremental development [8]. He considered the impact on defect creation of the engineer’s capability, technical risk, and inter-dependency among the concurrent increments, which we do not model. However, active and passive errors were not explicitly handled in his model.

3.2. Qualitative modeling

The qualitative models consist of two interlinked models, which are developed to model the implementation (error generation) process and test-and-fix (error detection and correction) process for producing an intermediate release.

At the qualitative modeling stage, we investigate the incremental development processes, and abstract error-related features in qualitative form from them.

3.2.1. Modeling implementation process. One widely used linear model for software implementation is employed as the basic skeleton of this model, i.e. given workforce (WF), release size (S), and unit productivity (PD), the elapsed calendar days to complete the release can be calculated by S/(WF*PD). However, because we are interested in the processes related to error generation and testing during each increment, more features related to error generation need to be included in this model. In addition, the management overheads [9] (including communications, adaptation, staff absence, configuration management, and so on) need to be considered when calculating the elapsed time.
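As a rough numerical illustration of this skeleton (our own hypothetical numbers; folding the management overhead in as a simple multiplicative fraction is an assumption, not the paper's exact formulation):

def elapsed_days(size_loc, workforce, productivity_loc_per_day, overhead_fraction):
    # Basic linear skeleton S / (WF * PD), inflated by a management-overhead fraction.
    nominal = size_loc / (workforce * productivity_loc_per_day)
    return nominal * (1 + overhead_fraction)

# Hypothetical release: 32,000 LOC, 15 engineers, 30 LOC/person-day, 20% overhead.
print(round(elapsed_days(32000, 15, 30, 0.20), 1))  # -> 85.3 days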

During the implementation, there are two basic types of errors generated: active errors and passive errors. An undetected “active error” may propagate more active or passive errors in its succeeding increments until it is fixed or retired. The undetected “passive errors” remain dormant until they are fixed.

We concentrate on the active and passive errors in the implementation process because they will affect the total number of errors introduced to test-and-fix. If the system is developed with incremental top-down strategy [3], then in the early releases, most of the errors committed are in the core or high level components and become active. If these errors are not detected, they tend to propagate through the succeeding increments that build on one another.

Therefore, the errors generated through each implementation in the model come in two ways. The first is through the development of the incremental code for each release; the second is through the propagation of active errors.

However, for many undetected active errors, the propagation will not continue after producing one or two “generations” of errors [6]. In this case, they effectively become undetected passive errors.

The qualitative assumptions for modeling the implementation process are explicitly given below:
1. All resources are focused on one increment, i.e. increments are linked sequentially, and there are no concurrent increments at any time.
2. The size of the current release (S) does not change during the current increment.
3. The current release size includes the necessary effort for design, coding, unit testing and rework, except the test-and-fix effort.
4. Active errors (EAO) from previous releases will propagate active and passive errors (EARp and EPRp) in the current implementation.
5. Increasing the software development rate (RSD) increases both the active and the passive error generation rate (REAG and REPG).
6. Increasing the team size (WF) leads to larger implementation overheads (ROh).
7. The average development productivity (PDIm) does not change during the current implementation.
8. A fraction of the remaining active errors (EARt) are retired to become a fraction of the passive errors (EPO) in each intermediate release.

Based on the above qualitative assumptions, the qualitative model for implementation is given in Figure 2, where an asterisk denotes an input parameter and an apostrophe indicates an output parameter.

Figure 2. Qualitative implementation process model

3.2.2. Modeling test-and-fix process. The objective of test-and-fix process is to achieve an “acceptable level of quality”, meaning that a certain percentage of errors will remain unidentified upon release of the software [10]. In incremental development, a small percentage of errors may fail to be detected in the current release, and remain in the next increment.

In test-and-fix process, a specific test suite is run and analyzed, detected errors are reported, assigned, and eventually fixed. In practice, the test cases are usually run by a standalone team to avoid any potential bias. The work of correcting errors mostly falls back to the implementation team.

The test suite contains multiple test cases, which are prepared prior to testing. We assume that the black-box testing strategy is applied in this process. The time spent detecting errors depends on the size of test suite (or the number of test cases in queue) for the current release. Besides, the time spent on correcting errors depends on the number of detected errors.

Huff et al. argued that the error clearing rate (productivity) slows as the average error queue size drops below a certain number [7]. Such a switchover normally happens near the end of the error-fixing work, when there are only a small number of errors in the queue for developers. One multiplier is proposed here to reflect the productivity drop.
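One simple way to express such a switchover is sketched below (our illustration; the rate, threshold and multiplier values are placeholders, not calibrated values from the model):

def error_fix_rate(queue_size, team_size, nominal_rate_per_person,
                   switchover_factor=0.75, queue_threshold_per_person=2.5):
    # Nominal clearing rate, reduced by a multiplier once the error queue per
    # developer drops below a threshold (the 'switchover' described by Huff et al.).
    rate = team_size * nominal_rate_per_person
    if queue_size < queue_threshold_per_person * team_size:
        rate *= switchover_factor
    return rate

print(error_fix_rate(queue_size=100, team_size=15, nominal_rate_per_person=2.0))  # before switchover: 30.0
print(error_fix_rate(queue_size=20, team_size=15, nominal_rate_per_person=2.0))   # after switchover: 22.5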


The qualitative assumptions for modeling the test-and-fix process are summarized as:
1. The test suite (Ts) is prepared before test-and-fix starts, and does not change during the process.
2. Errors (active and passive, old and new) are uniformly distributed across the test suite for the current release. In other words, given the implementation and test suite, completing more test cases results in more errors being detected.
3. Errors generated in the current release (EAN and EPN) have a higher probability of being detected by a test case than the old errors (EAO and EPO) remaining from previous releases.
4. The nominal average effort required to correct errors does not change during the process.
5. The team clearing errors is the same team that developed the current release (WF).
6. Increasing the team size leads to larger test-and-fix overheads (RTOh).
7. All detected errors in the current release (EF) are cleared before the next release.
8. No new errors are generated (bad fixes) during error correction.
9. The error fixing rate decreases after the switchover point, which is represented by a decreased value of the multiplier (mf).

Figure 3. Qualitative test-and-fix process model

We pay particular attention to the “new” and “old” errors in the test-and-fix process model (Figure 3), because they are associated with different probabilities of detection and different efforts to clear, which further influence the performance of the test-and-fix process.

The model can have multiple exits when performing reasoning. This process is not completed until all detected errors are fixed, or a required percentage of test cases are passed within desired period, and so on. These two models are connected iteratively to model the whole process. Two QDEs have to be coded to represent the above qualitative models. This formal representation is required for the further simulation of the qualitative models.

3.3 Quantitative Constraints

Using quantitative constraints, the qualitative model can be quantified by partially quantitative equations, which include intervals and envelope functions that represent incomplete quantitative knowledge and uncertain scenarios.

Envelope functions. Monotonic functions (M+ and M-) may be restricted to take on values within an envelope defined by upper and lower edges. The edges are defined by a quantitative function (linear or nonlinear) and its inverse function [5]. A pair of values satisfies the constraints if the intervals associated with its landmarks fall within the envelope.
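For a monotonically increasing (M+) constraint, a consistency check of this kind can be sketched as follows (our illustration; the linear band and the interval values are arbitrary placeholders, and Q2 itself performs a more general constraint propagation):

def consistent_with_envelope(x_interval, y_interval, lower_edge, upper_edge):
    # For an M+ (monotonically increasing) constraint, the feasible y-range over
    # an x-interval [xlo, xhi] is [lower_edge(xlo), upper_edge(xhi)]. The pair of
    # intervals is consistent if the y-interval overlaps that feasible range.
    xlo, xhi = x_interval
    ylo, yhi = y_interval
    return not (yhi < lower_edge(xlo) or ylo > upper_edge(xhi))

# Placeholder envelope: a linear band y in [0.8*x, 1.2*x].
lower = lambda x: 0.8 * x
upper = lambda x: 1.2 * x
print(consistent_with_envelope((10, 20), (9, 15), lower, upper))   # True: overlaps [8, 24]
print(consistent_with_envelope((10, 20), (30, 40), lower, upper))  # False: lies above the band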

Parameter intervals. Units and unit conversion are not addressed in the qualitative model because the landmark values are symbolic names for unknown real values that are assumed to have appropriate and compatible units. However, the units must be explicitly denoted in the quantitative constraints because numeric intervals are involved. The interval span mainly depends on the completeness, consistency, and certainty of knowledge or empirical data. It could be narrow for a software process expert in a mature organization, or, conversely, broad for novices in an ad hoc project.

3.3.1. Quantifying implementation process. As shown in Figure 2, there are six monotonic functions in the implementation model. Thus, we have to specify the envelope functions for them separately.

To simplify the discussion based on the above-mentioned assumptions, we assume linear relationships between the software development rate and the error generation rate, between escaping active errors and retired errors, and between old active errors and propagated errors (indicated by dashed lines in Figure 2). Therefore, some auxiliary parameters need to be introduced to specify the quantitative constraints: error densities (EDAt and EDPt) in Portion 1, the error retirement rate (RERt) in Portion 2, and error propagation rates (REAR and REPR) in Portion 3 (Figure 4). A discussion of the relation between team size (WF) and overheads (ROh) can be found in [1].

Figure 4. Refinement of portions of implementation model

Based on the above assumptions, the envelope functions for these monotonic constraints can be converted to the assignment of value ranges. For a ten-increment project, with reference to [6], the retirement rate can be quantified by the value ranges in Table 1.

Table 1. Value ranges for active error retirement rate
Increment   1,2,3    4,5,6      7,8       9         10
RERt        [0, 0]   [.05, .1]  [.1, .3]  [.4, .8]  [.9, 1]
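Represented as data, such a schedule of value ranges is simply a lookup from increment number to an interval (a sketch using the ranges in Table 1):

# Active error retirement rate RERt per increment, as (lower, upper) bounds.
RERT_RANGES = {
    (1, 2, 3): (0.0, 0.0),
    (4, 5, 6): (0.05, 0.1),
    (7, 8):    (0.1, 0.3),
    (9,):      (0.4, 0.8),
    (10,):     (0.9, 1.0),
}

def rert_bounds(increment):
    for increments, bounds in RERT_RANGES.items():
        if increment in increments:
            return bounds
    raise ValueError("increment outside the ten-increment schedule")

print(rert_bounds(5))   # (0.05, 0.1)
print(rert_bounds(10))  # (0.9, 1.0)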

3.3.2. Quantifying test-and-fix process. In qualitative test-and-fix model (Figure 3), four types of errors, the combinations of new-old and active-passive error types, are modeled individually. The new errors and old errors are associated with different hitting rates (i.e. probability of detecting an error).

Figure 5. Refinement of portion of test-and-fix model (legend: hn: hitting rate of new errors; ho: hitting rate of old errors; EDANTc: detected active new errors per test case; EDPNTc: detected passive new errors per test case; EDAOTc: detected active old errors per test case; EDPOTc: detected passive old errors per test case)

The error detection rate may change across the test cases. Its value depends on the applied testing strategy, the design of the test case, and the portfolio of test suite. In the SemiQ model, we assume the hitting rates do not change across the testing cases. Thus, we use two hitting rates (hn>ho) and the corresponding error detection rates (Figure 5). Their value ranges can be based on historical records, such as [.85, 1.0] (hn) and [0, .1] (ho) of the nominal detection rate.
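With interval-valued hitting rates, the expected number of errors detected per test case is itself an interval. A small sketch (hypothetical error counts; treating the nominal detection rate as an equal share of the remaining errors per test case is our simplifying assumption):

def interval_scale(count, rate_interval):
    lo, hi = rate_interval
    return (count * lo, count * hi)

def interval_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

hn = (0.85, 1.0)   # hitting-rate range for new errors (fraction of the nominal detection rate)
ho = (0.0, 0.1)    # hitting-rate range for old errors

# Hypothetical errors remaining in the release and size of the test suite.
new_errors, old_errors, test_cases = 120, 40, 200
nominal_per_tc = 1.0 / test_cases   # each test case nominally covers an equal share of errors

detected_per_tc = interval_add(
    interval_scale(new_errors * nominal_per_tc, hn),
    interval_scale(old_errors * nominal_per_tc, ho))
print(tuple(round(v, 2) for v in detected_per_tc))  # -> (0.51, 0.62) errors per test case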

The estimate of the multiplier (mf) depends on the time period when the average size of a developer’s error queue falls below a specified value. Huff et al. reported that the switchover normally happens when the number of detected errors to be fixed drops to [2, 3] times the team size; they further reasoned that a developer then spends one quarter of the time waiting for an error to arrive and three quarters handling a single error in the queue [7]. However, this value is difficult to observe and measure directly. It seems more reasonable to set the value in a range, say [.7, .8], in a general context.

4. Simulation with SemiQ model

4.1 Qualitative Simulation

Given the qualitative assumptions and QDE models, no specific numeric values are needed to perform qualitative simulation. We can simulate one intermediate increment by assuming the volume of escaped active and passive errors (EAES, EPES) from previous releases qualitatively. The simulation generates 1762 possible qualitative behaviors for this intermediate increment. Figure 6 shows four variables’ state changes of one behavior in terms of time.

Figure 6. Behavior 750 of Qualitative Simulation

Most of the qualitative behaviors observed by simulation represent the complicated relationships in the variable space, including EAEs vs. EAN, EPEs vs. EPN, EEs (EAEs + EPEs) vs. TS, RFX vs. RTc, etc. For example, the value of EAEs can be increased by a low active error retirement rate (RERt) together with a high active error generation rate (REA) or a high active error propagation rate (REAR), and vice versa. In addition, we observed a small number of behaviors consistent with some extreme situations, which can be ignored.

4.2 Semi-Quantitative Simulation

The SemiQ model simulates all possible behaviors of a specific system with incomplete quantitative knowledge. Hence, it cannot be fully evaluated by particular sets of quantitative data. In this section, we employ our SemiQ model to simulate one incremental development project, which was used as a case study for Tvedt’s comprehensive quantitative (System Dynamics) model [8]. Furthermore, we conduct a brief comparison between the model outputs.

Table 2. Characteristics of the baseline project
Attributes           Values
Project size         90,667 LOC
  Increment 1        26,667 LOC
  Increment 2        32,000 LOC
  Increment 3        32,000 LOC
Project schedule     250 days
Project team size    15 engineers
Estimated budget     3,750 man-days

Tvedt’s model used the project shown in Table 2. The variables in the test-and-fix process are also affected by auxiliary factors from other related sectors, such as the human resource sector. The SemiQ model treats different types of defects separately, gains more insight into error propagation across increments, and allows the impacts from other sectors to be partly ignored.

Figure 7. One behavior tree of test-fix process

Figure 7 shows one behavior tree produced by simulation for Increment 2. The number of possible behaviors is much smaller than for the qualitative model. Each dot indicates one critical time point during the process. The transition points (TP) are denoted by landmarks “Φ”. For instance, the first TP of each behavior (branch) marks the start of testing, the second indicates the switchover point, and the process ends at the last point.

Table 3. Simulation results and comparison
                 EN           EEs        Density
Increment 1      [480, 533]   [0, 84]    <3.1/KLoc
Increment 2      [576, 716]   [0, 183]   <2.8/KLoc
Increment 3      [576, 801]   [0, 351]   <3.9/KLoc
Tvedt's model    279                     3.5/KLoc

In contrast, Tvedt’s model simulated concurrent development between increments. Thus, it is not necessary to compare the elapsed times. We present only their quality features for the brief comparison in Table 3. Our model estimates the project can produce the software product with no more than 351 defects, which is consistent with the simulation result generated from Tvedt’s model.

The quantitative constraints that were applied in our SemiQ model are mostly defined based on the literature and experience data, and are not calibrated against the baseline project. The information about the baseline project is used only to specify the initial state and to control the simulation. The outputs show that this approach can produce reasonable ranges as predictions under circumstances of uncertainty and contingency.

5. Conclusions

This paper reports a study of modeling incremental development, focusing on the test-and-fix process, and proposes process modeling using the SemiQ approach, a powerful supplementary technique to quantitative modeling that handles incomplete knowledge and uncertainty effectively.

Our future research includes the following aspects:
1. Extending the SemiQ process model by including more factors, e.g. bad fixes, test suite regression, and increment dependency.
2. Investigating SemiQ process modeling at the micro-process level.

References

[1] H. Zhang and B. Kitchenham, "Semi-Quantitative Simulation Modeling of Software Engineering Process," in Software Process Workshop/International Workshop on Software Process Simulation and Modeling (SPW/ProSim), Shanghai: Springer, 2006.

[2] L. J. Osterweil, "Unifying Microprocess and Macroprocess Research," in Software Process Workshop (SPW) Beijing: Springer, 2005.

[3] E.-A. Karlsson, "Incremental Development - Terminology and Guidelines," in Handbook of Software Engineering & Knowledge Engineering, vol. 1, S.-K. Chang, Ed.: World Scientific, 2001.

[4] QSIM, version 4.3 Alpha 4, UT Qualitative Reasoning Software, http://www.cs.utexas.edu/users/qr/QR-software.html.

[5] B. Kuipers, Qualitative Reasoning: Modeling and Simulation with Incomplete Knowledge: MIT Press, 1994.

[6] T. K. Abdel-Hamid and S. E. Madnick, Software Project Dynamics: An Integrated Approach. Englewood Cliffs, N.J.: Prentice Hall, 1991.

[7] K. E. Huff, J. V. Sroka, and D. D. Struble, "Quantitative Models for Managing Software Development Processes," Software Engineering Journal, vol. 1, pp. 17-23, 1986.

[8] J. D. Tvedt, "An Extensive Model for Evaluating the Impact of Process Improvements on Software Development Cycle Time," PhD dissertation, Arizona State University, 1996.

[9] F. P. Brooks, The Mythical Man-month: Essays on Software Engineering, Anniversary ed.: Addison-Wesley Longman, 1995.

[10] D. Galin, Software Quality Assurance: from Theory to Implementation: Pearson, 2004.


Issues of Implementing the Analogy-based Effort Estimation in a Large IT Company

Jingyue Li and Reidar Conradi

Norwegian University of Science and Technology, Trondheim, 7491, Norway

{jingyue, conradi} at idi.ntnu.no


Issues of Implementing the Analogy-based Effort Estimation in a Large IT Company

Jingyue Li Reidar Conradi

Norwegian University of Science and Technology, Trondheim, 7491, Norway {jingyue, conradi} at idi.ntnu.no

Abstract

Analogy-based estimation seems to be a very promising estimation method. A large Norwegian IT consulting company is going to use the analogy-based method for future effort estimation. Thus, a historical project database must be built. Several issues have come up when discussing the feasibility of building such a project database. One challenging issue is which projects should be included in the database and how to classify these projects. Another issue is which project features should be measured and how to use these features for further analogy-based effort estimation. This position paper investigates these issues and proposes solutions for discussion.

1. Introduction

There are several formal methods that have been proposed to support software effort estimation, and most methods rely on the same assumption: Similar projects are likely to have similar cost characteristics. Among these methods, two of the most commonly applied effort estimation methods are analogy-based [31] and regression-based [17]. Analogy-based effort estimation is similar to expert judgment in that it relies on a comparison and adjustment between previous completed projects and the new project. The similarity between each historical project and the target project is usually expressed as a distance measurement [31].

Although it is still inconclusive whether analogy-based estimation outperforms regression techniques with respect to estimation accuracy [22], analogy-based estimation has the advantage of transparency (as opposed to, for example, black-box approaches) [5] and is likely to be accepted by practitioners [24].

One large Norwegian software consulting company is going to implement analogy-based effort estimation internally. However, several issues have been observed, and these need to be addressed by researchers and practitioners.

2. Company Background

The company is a large Nordic IT company with around 4000 employees. The application domains of its software and IT solutions include banking and finance, telecom, retail, and industry. The company is working to improve its effort estimation practices, and analogy-based estimation has been selected as the most promising method for further elaboration.

3. Analogy-based Estimation

When estimating a new project, the analogy-based method consists of several steps. First, measure the observable features of the new project at the time of estimation. Then identify projects with similar feature values from a historical project database. Finally, determine the new estimate using the known costs of chosen historical projects. More precisely, a project p is described by a list of features <e, d1,…dl>, where d1,…dl denote those features that are observed at the time of estimation, and e denotes the project’s effort upon completion [31]. The similarity of two projects p and p’ can be defined as a weighted Euclidean distance over the features d1,…dl, where wi is the weight of feature di.

\delta(p, p') = \sqrt{\sum_{i=1}^{l} w_i \, (d_i - d_i')^2}
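A minimal sketch of these retrieval steps follows (hypothetical features, weights and project data; the choice of k and the aggregation by mean are illustrative, not the company's actual procedure):

import math

def distance(p, q, weights):
    # Weighted Euclidean distance over the feature values d1..dl.
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(p, q, weights)))

def analogy_estimate(new_features, history, weights, k=2):
    # history: list of (features, effort) for completed projects.
    ranked = sorted(history, key=lambda h: distance(new_features, h[0], weights))
    return sum(effort for _, effort in ranked[:k]) / k

# Hypothetical feature vectors: (function points, max team size), with effort in work-hours.
history = [((120, 5), 900), ((200, 8), 1600), ((90, 4), 700), ((150, 6), 1100)]
weights = (1.0, 10.0)   # illustrative feature weights wi
print(analogy_estimate((140, 6), history, weights, k=2))  # -> 1000.0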

A small distance indicates a high degree of similarity. When a new project is estimated, its distance to each project in the historical feature database is calculated. The effort of the most similar projects then determines the new cost estimate.

4. Issues to be addressed

In order to precisely estimate the effort of a new project using analogy-based estimation, all parameters of the Euclidean distance equations should be accurate, that is:

• The projects (i.e. p) in the historical project databases must be comparable with the new project.

• The features (i.e. d1,…dl) for comparison must be relevant.

• The feature weights (i.e. wi) must be precise.


At the moment, the company has no historical project database that can be used for analogy-based estimation. A plan has been made to build such a database. However, most relevant empirical studies [3, 5, 7, 8, 9, 10, 11, 12, 14, 15, 16, 23, 25, 26, 27, 28, 30, 32] focus on algorithm improvement of analogy-based method, and use projects in existing project databases to evaluate the accuracy of analogy-based estimation. Few studies have addressed how to build a historical project dataset for future analogy-based estimation.

4.1. The issues of selecting relevant projects

During the feasibility analysis of establishing a historical project database for analogy-based estimation, the first question is: Which projects should be put into the historical database?

The company has many divisions building different software products, such as finance and bank, public sector, retail, industry, and telecom. Does it make sense to put all projects into the project databases and use all of them as a basis for analogy comparison with a new project? Several studies have compared the accuracy of using analogy-based methods and regression-based method to estimate project effort using data from heterogeneous companies [7, 14, 15, 25, 28]. Three studies [14, 25, 28] claimed that analogy-based estimation is better than regression-based estimation for predictions across large heterogeneous data, while two other studies [7, 15] refute this conclusion.

Shepperd and Schofield have observed that dividing the dataset on the basis of different development environments (e.g. programming language) leads to enhanced accuracy for both analogy-based estimation and regression-based estimation [31]. Thus, it may be more reasonable to classify projects into different categories based on certain characteristics of the projects, such as application domains, application platforms, and development languages. However, few empirical studies have investigated which characteristics should be used to divide the dataset, apart from the development environment examined in [31]. In addition, the company has over 40 years of experience in software development, and many projects were finished a long time ago. Due to the fast pace of IT technology, these old projects may not be comparable with current projects. Thus, another issue is: How should we deal with the old projects? Is the age of a project also a characteristic by which to divide projects? To address this issue, our planned solution includes two steps (a sketch of the resulting partitioning follows the list):
• The first step is to divide the projects based on the system application domain. The rationale is that we assume developers may have similar productivity when they work on projects within the same domain, although this assumption needs to be empirically tested [29]. Using the mean value of productivity has been shown to improve the accuracy of analogy-based estimation [16].
• The second step is to divide the projects based on programming language, because this may lead to enhanced accuracy for estimation [31]. This division will also address the age issue of the projects, because the programming languages used in very old projects are usually different from those used in the newest projects.
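The two-step division amounts to grouping the historical projects by a (domain, language) key before any analogy comparison takes place; a minimal sketch with hypothetical project records:

from collections import defaultdict

# Hypothetical project records: (id, application domain, main programming language, effort).
projects = [
    ("P1", "banking", "Java", 1200),
    ("P2", "banking", "COBOL", 2100),
    ("P3", "telecom", "Java", 900),
    ("P4", "banking", "Java", 1500),
]

partitions = defaultdict(list)
for pid, domain, language, effort in projects:
    partitions[(domain, language)].append((pid, effort))

# A new banking/Java project would only be compared against this partition:
print(partitions[("banking", "Java")])   # [('P1', 1200), ('P4', 1500)]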

4.2. The issues of selecting relevant features

To estimate the effort of a new project, the feature values of the new project need to be compared with the feature values of projects in the historical database. Different datasets with different feature sets have been used in analogy-based estimation studies, as shown in Table 1. It shows that some studies used datasets including only features related to the size of a system [3, 5, 10, 11, 28, 30], while other studies involve datasets with more features [7, 8, 9, 12, 14, 15, 20, 23, 25, 26, 27, 32, 33], such as project-related and personnel-related information. Surprisingly, the feature "development language", which is used in [31] to divide the projects into different datasets, has been used in [5, 7, 8, 9, 32, 33] to compare projects within one dataset. Thus, the question is: Should we use this feature to divide projects into different datasets or use it to compare projects within a merged dataset? A further question is: What features should be measured and stored in the historical project database? To address this issue, a three-step investigation will be performed (a sketch of the comparison in the third step is given after the list):
• First, a large feature set will be selected from the factors listed in Table 1 of the study [6]. The values of these features for historical projects will be collected and stored in the project database.
• Second, three feature sets will be independently selected from the above, based on the following three methods respectively:
  - A structured interview will be performed to collect the senior project manager's opinions on the relevant features.
  - The methods proposed in [5] can compare the features and give weights to each feature. The features with low weights can be regarded as irrelevant and excluded.
  - The method proposed in [19] also gives weights to project features. However, the method in [19] is different from the one used in [5].
• Third, the accuracy of analogy-based estimation using these three feature sets will be compared. The feature set with the best estimation accuracy will be selected for further use. Another strategy is to build a dynamic feature set by combining the most relevant features selected from the above three mechanisms.
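The third step can be organized as a simple leave-one-out comparison: estimate each historical project from the others using each candidate feature set, and keep the set with the lowest MMRE (or whichever accuracy criterion the company settles on). A compact sketch with a 1-nearest-neighbour analogy and hypothetical data (a deliberate simplification of the weighted method in Section 3):

def loocv_mmre(projects, feature_idx):
    # projects: list of (feature_vector, effort); feature_idx: indices of the candidate feature set.
    errors = []
    for i, (feats, actual) in enumerate(projects):
        rest = projects[:i] + projects[i + 1:]
        # 1-nearest-neighbour analogy restricted to the chosen features (unweighted).
        nearest = min(rest, key=lambda p: sum((p[0][j] - feats[j]) ** 2 for j in feature_idx))
        estimate = nearest[1]
        errors.append(abs(actual - estimate) / actual)
    return sum(errors) / len(errors)

# Hypothetical projects: features = (function points, team size, num. interfaces), effort.
projects = [((100, 5, 3), 800), ((150, 6, 4), 1150), ((90, 4, 10), 700),
            ((200, 9, 5), 1700), ((120, 5, 6), 950)]

for name, idx in {"size only": (0,), "size+team": (0, 1), "all": (0, 1, 2)}.items():
    print(name, round(loocv_mmre(projects, idx), 2))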


Table 1. Summary of features investigated in analogy-based effort estimation studies

Studies | Dataset | Num. of features | Detailed features
[3] | 24 IBM DP projects [2] | 4 | Number of external inputs, number of external outputs, number of external and internal files, and number of user inquiries.
[3] | 21 projects from Canada | 10 | Number of data element types for internal logical files, number of file types referenced for inputs, number of data element types for inquiries, etc.
[5, 8] | 108 projects from the European Space Agency | 11 | Programming language, maximum size of implementation team, lines of code, required software reliability, execution time constraint, main storage constraint, etc.
[5, 30] | 24 IBM DP projects [2] | 6 | Lines of code, number of function points, file handles, masks for input, and output inquiries.
[5, 30] | 77 Canadian software house commercial projects | 8 | Adjusted function points, raw function points, number of transactions, number of entities, technology adjustment factor, experience of project management, experience of equipment, and development environment.
[5, 30] | 15 large business applications [18] | 2 | Quantitative (e.g. SLOC, database size) and qualitative (e.g. complexity rating).
[7, 33] | 206 projects from 26 Finnish companies | 19 | Function points, organization type, application type, target platform, and 15 productivity factors.
[9] | 81 projects | 10 | Team experience, number of entities, development environment, etc.
[10] | 23 IBM projects [21] | 5 | Input count, output count, query count, file count, and adjustment factor.
[10] | 21 projects of a financial organization [1] | 6 | Input count, output count, inquiry count, internal logical file count, external interface files, and adjustment factor.
[11] | 229 projects | 3 | Unadjusted function points, adjusted function points, productivity.
[12] | 4 projects | 10 | Number of data items passed to and from the system, signals passed, etc.
[14] | 324 projects | 6 | Function points, maximum team size, development platform, etc.
[15] | 451 projects from multiple companies and 19 projects from one company | 3 | System size (unadjusted function points), max team size, and project delivery rate.
[20] | 77 modules of a medical records database system | 26 | The number of entities, the number of transactions, the number of different types, etc.
[23] | 77 projects | 8 | Team experience in years, project manager's experience in years, number of entities, unadjusted function points, etc.
[25] | 67 web projects | 43 | Service the company provides, application domain, number of web pages, number of new web pages, number of text pages scanned, number of new images, etc.
[27] | An experiment including 48 projects | 10 | Users, sites, companies, interfaces, etc.
[26] | 37 web projects | 8 | Number of HTML files, number of media files, number of JavaScript files, etc.
[28] | 149 maintenance tasks | 5 | External input, external output, external inquiry, internal logical file, external logical file.
[32] | 19 Australian projects [13] | 7 | Unadjusted function points, maximum team size, distributed system or not, programming language, design experience, etc.

5. Discussions and conclusion

To use analogy-based estimation efficiently, a historical project database with relevant and complete project information is essential. However, most previous empirical studies on analogy-based estimation use existing project databases and assume that the data in these databases are relevant and accurate. This paper has presented two major issues of building a high-quality historical project database from a practitioner's viewpoint.


The first issue is how to determine whether a project is relevant and should be included in the project database. The second issue is how to decide which features of a project are relevant and how they should be measured. We have proposed possible solutions to deal with these two issues. Several further empirical studies will be performed to validate the accuracy and reliability of our proposals.

6. References

[1] Abran, A. and Robillard, P. N., "Function Points Analysis: An Empirical Study of Its Measurement Processes," IEEE Transactions on Software Engineering, 22(12): 895-910, Dec. 1996.
[2] Albrecht, A. J. and Gaffney, J. R., "Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation," IEEE Transactions on Software Engineering, 9(6): 639-648, Nov. 1983.
[3] Angelis, L. and Stamelos, I., "A Simulation Tool for Efficient Analogy Based Cost Estimation," Journal of Empirical Software Engineering, 5(1): 35-68, March 2000.
[4] Atkinson, K. and Shepperd, M. J., "The Use of Function Points to Find Cost Analogies," Proc. European Software Cost Modelling Conference, Ivrea, Italy, 1994.
[5] Auer, M., et al., "Optimal Project Feature Weights in Analogy-Based Cost Estimation: Improvement and Limitations," IEEE Transactions on Software Engineering, 32(2): 83-92, Feb. 2006.
[6] Boehm, B., et al., "Software Development Cost Estimation Approaches - A Survey," Annals of Software Engineering, 10(1-4): 177-205, 2000.
[7] Briand, L. C., et al., "An Assessment and Comparison of Common Software Cost Estimation Modelling Techniques," Proc. 21st Int'l Conf. Software Engineering, Los Angeles, California, pp. 313-322, May 1999.
[8] Briand, L. C., et al., "A Replicated Assessment and Comparison of Common Software Cost Modelling Techniques," Proc. 22nd Int'l Conf. Software Engineering, Limerick, Ireland, pp. 377-386, June 2000.
[9] Burgess, C. J. and Lefley, M., "Can Genetic Programming Improve Software Effort Estimation? A Comparative Evaluation," Journal of Information and Software Technology, 43(14): 863-873, Dec. 2001.

[10] Chiu, N. H. and Huang, S. J., "The Adjusted Analogy-based Software Effort Estimation Based on Similarity Distances," Journal of Systems and Software, 80(4): 628-640, April 2007.
[11] Finnie, G. R., et al., "A Comparison of Software Effort Estimation Techniques: Using Function Points with Neural Networks, Case-based Reasoning and Regression Models," Journal of Systems and Software, 39(3): 281-289, Dec. 1997.
[12] Hughes, R. T., et al., "Evaluating Software Development Effort Model-building Techniques for Application in a Real-time Telecommunications Environment," IEE Proceedings - Software, 145(1): 29-33, Feb. 1998.
[13] Jeffery, D. R. and Stathis, J., "Function Point Sizing: Structure, Validity, and Applicability," Journal of Empirical Software Engineering, 1(1): 11-30, Jan. 1996.
[14] Jeffery, D. R., Ruhe, M. and Wieczorek, I., "Using Public Domain Metrics to Estimate Software Development Effort," Proc. 7th Int'l Software Metrics Symposium, London, UK, April 2001, pp. 16-27.
[15] Jeffery, R., et al., "A Comparative Study of Two Software Development Cost Modelling Techniques Using Multi-organizational and Company-specific Data," Journal of Information and Software Technology, 42(14): 1009-1016, Nov. 2000.
[16] Jørgensen, M., et al., "Effort Estimation: Software Effort Estimation by Analogy and Regression toward the Mean," Journal of Systems and Software, 68(3): 253-262, Dec. 2003.
[17] Jørgensen, M., "Regression Models of Software Development Effort Estimation Accuracy and Bias," Journal of Empirical Software Engineering, 9(4): 297-314, Dec. 2004.
[18] Kemerer, C. F., "An Empirical Validation of Software Cost Estimation Models," Communications of the ACM, 30(5): 416-429, May 1987.
[19] Keung, J. W. and Kitchenham, B., "Optimising Project Feature Weights for Analogy-based Software Cost Estimation Using the Mantel Correlation," to appear in Proc. 14th Asia-Pacific Software Engineering Conference, Nagoya, Japan, Dec. 2007.


[20] MacDonell, S. G. and Shepperd, M. J., "Combining Techniques to Optimize Effort Predictions in Software Project Management," Journal of Systems and Software, 66(2): 91-98, May 2003.
[21] Matson, J. E., et al., "Software Development Cost Estimation Using Function Points," IEEE Transactions on Software Engineering, 20(4): 275-287, April 1994.
[22] Mair, C. and Shepperd, M., "The Consistency of Empirical Comparisons of Regression and Analogy-based Software Project Cost Prediction," Proc. 2005 Int'l Symp. Empirical Software Engineering, Noosa Heads, Australia, Nov. 2005, pp. 509-518.
[23] Mair, C., et al., "An Investigation of Machine Learning Based Prediction Systems," Journal of Systems and Software, 53(1): 23-29, July 2000.
[24] Mendes, E., et al., "Do Adaptation Rules Improve Web Cost Estimation?" Proc. 14th ACM Conf. on Hypertext and Hypermedia, Nottingham, UK, Aug. 2003, pp. 173-183.
[25] Mendes, E. and Kitchenham, B., "Further Comparison of Cross-Company and Within-Company Effort Estimation Models for Web Applications," Proc. 10th Int'l Symp. Software Metrics, Chicago, Illinois, Sept. 2004, pp. 348-357.
[26] Mendes, E. and Mosley, N., "Further Investigation into the Use of CBR and Stepwise Regression to Predict Development Effort for Web Hypermedia Applications," Proc. 2002 Int'l Symp. Empirical Software Engineering, Nara, Japan, Oct. 2002, pp. 79-90.

[27] Myrtveit, I. and Stensrud, E., "A Controlled Experiment to Assess the Benefits of Estimating with Analogy and Regression Models," IEEE Transactions on Software Engineering, 25(4): 510-525, July 1999.
[28] Niessink, F. and van Vliet, H., "Predicting Maintenance Effort with Function Points," Proc. Int'l Conf. Software Maintenance, Bari, Italy, Oct. 1997, pp. 32-39.
[29] Premraj, P., et al., "An Empirical Analysis of Software Productivity Over Time," Proc. 11th Int'l Symp. Software Metrics, Como, Italy, Sept. 2005, p. 37.
[30] Shepperd, M. and Schofield, C., "Effort Estimation Using Analogy," Proc. 18th Int'l Conf. on Software Engineering, Berlin, Germany, May 1996, pp. 170-178.
[31] Shepperd, M. and Schofield, C., "Estimating Software Project Effort Using Analogies," IEEE Transactions on Software Engineering, 23(12): 736-743, Nov. 1997.
[32] Walkerden, F. and Jeffery, R., "An Empirical Study of Analogy-Based Software Effort Estimation," Journal of Empirical Software Engineering, 4(2): 135-158, June 1999.
[33] Wieczorek, I. and Ruhe, M., "How Valuable is Company-Specific Data Compared to Multi-company Data for Software Cost Estimation?" Proc. 6th Int'l Symp. Software Metrics, June 2002, pp. 237-246.


Profitability Estimation of Software Projects: A Combined Framework

Stefan Wagner, Songmin Xie, Matthias Rübel-Otterbach and Burkhard Sell

Institut für Informatik, Technische Universität München, München, Germany
Kabel Deutschland Breitbandservices GmbH, 85774 Unterföhring, Germany


Profitability Estimation of Software Projects: A Combined Framework

Stefan Wagner, Songmin Xie
Institut für Informatik
Technische Universität München
Boltzmannstr. 3
85748 Garching b. München, Germany
{wagnerst,xies}@in.tum.de

Matthias Rübel-Otterbach, Burkhard Sell
Kabel Deutschland Breitbandservices GmbH
Betastr. 6-8
85774 Unterföhring, Germany
{Matthias.Ruebel,Burkhard.Sell}@kabeldeutschland.de

Abstract

Decisions on carrying out software projects are a recurring problem for managers. These decisions should ideally be based on solid estimates of the profitability of the projects. However, no single solution has been established for this task. This paper combines the German WiBe framework for costs and benefits of IT projects with selected cost estimation approaches in order to ensure reliable profitability estimates. The applicability of the framework is shown in an industrial case study.

1. Introduction

A central and recurring question for software managers is: should this project be carried out? This question is usually followed by: is it profitable? Hence, methods for estimating the profitability of software projects should be in the toolbox of every manager in the software business. However, there is no established set of methods recommended for this task, even for the cost estimation part alone; there is a large variety of possible methods. Currently, expert estimation is the most commonly used technique [5], but it is not clear whether it is the most accurate or effective. Jørgensen showed in his review [5] that the 15 available studies comparing different methods are not conclusive: 5 studies found expert estimation to be more accurate, 5 found no difference, and 5 found model-based estimation to be more accurate. For profitability, it is even difficult to find a fully-fledged method.

Problem. There is a lack of established methods for profitability estimation, although this is a common, day-to-day problem in software management. Typically, managers resort to expert estimation. However, empirical research has not been able to show conclusively that this method is the most accurate.

Contribution. We propose a method for profitability estimation for software projects, called SW-WiBe, that is based on (1) a proven framework for IT profitability estimation and (2) the results of empirical research on software cost estimation. In essence, we employ the WiBe framework, which has been in use for 15 years and provides a means for estimating non-quantifiable benefits. The cost side is estimated by (at least) three different methods comprising expert-based as well as model-based methods. To improve accuracy, these methods are combined by the Wideband Delphi process. The applicability of SW-WiBe is shown in an industrial case study.

2. Profitability analysis

Profitability analysis is concerned with the relation of costs and benefits. Hence, it shows whether an endeavour is profitable.

2.1. Costs and Profitability

A large part of research in software economics deals with the estimation (or prediction) of the costs of software development and maintenance, e.g. [2, 4]. Although even in that area no conclusive results have been reached on which approaches (expert or model-based) are better in which cases, cost estimation is only half of the task. It is equally important to analyse the benefits, quantifiable and non-quantifiable, in order to decide on a project's profitability. This part is largely underdeveloped in software economics research [3].

2.2. WiBe

WiBe [8] stands for Wirtschaftlichkeitsbetrachtung and is a method for estimating and calculating the profitability of IT projects.


It was developed for the German Federal Ministry of the Interior and has been improved several times over the last 15 years. It has been used in various public projects involving information technology. Hence, it is an established and proven method for profitability analysis of such projects, with an emphasis on in-house development.

WiBe is suitable for a comprehensive analysis of software projects. It is especially interesting that it consists of a set of building blocks for the analysis and mainly proposes a framework. It does not prescribe specific methods for estimating costs and benefits. However, it considers non-quantifiable benefits explicitly and describes a utility analysis for them. In this way, the non-quantifiable benefits can be dealt with appropriately. The main modules of WiBe are the following:

• Monetarily quantifiable costs and benefits (WiBe KN)

– Monetarily quantifiable costs

– Monetarily quantifiable benefits

• Non-quantifiable benefits

– Urgency (WiBe D)

– Qualitative and strategic importance (WiBe Q)

– External effects (WiBe E)

The monetarily quantifiable costs and benefits must be provided by some other estimation method. Then the net present value is calculated in order to account for their temporal distribution. The non-quantifiable benefits are handled using utility analysis, a standard approach for qualitative factors. In essence, experts assign points to various qualitative issues, and these points form the basis for the decision on the profitability.

3. Combined Framework: SW-WiBe

The WiBe method provides a solid ground for analysing the profitability of IT projects. However, it mainly provides a framework for this analysis. Hence, we propose a concrete instantiation for software projects: SW-WiBe. We aim to fulfil two goals with our method:

1. Providing a complete method for the analysis of the profitability of software projects

2. Ensuring reliable estimates

To reach the first goal, we choose concrete cost estimation methods for software to be used inside WiBe. The second goal is supported by using several diverse methods and combining them based on the Delphi method [2, 10]. An overview of SW-WiBe is given in Fig. 1. This strategy is supported by Jørgensen [5], who showed that it is most beneficial to combine estimates from different experts and estimation strategies as well as to ask the estimators to justify and criticise their estimates.

Figure 1. Overview of the SW-WiBe method (quantifiable costs estimated via activities, another expert and COCOMO; quantifiable benefits estimated by an expert; non-quantifiable benefits via utility analysis; all combined by the Delphi method)

3.1. Quantifiable costs

As shown in Fig. 1, we propose to use (at least) three different methods for estimating the quantifiable costs of the project. In order to achieve reliable estimates, these methods should be diverse. Therefore, the activity-based method (an expert method) and COCOMO II (a model-based method) are an integral part of SW-WiBe. The third method can be flexibly fitted to the available competence; to improve diversity, this is usually a different kind of expert estimation. This way, we include at least three different views on the estimation problem:

1. An activity-based expert estimation that brings in the experience of an expert. Furthermore, the structuring with activities allows an easy check of the estimates during the project.

2. A different expert estimation that introduces a different view and uses different experiences. It may use a different work breakdown structure (WBS) [10].

3. The model-based method COCOMO II, which is based on an explicit model and empirical data.

For economically correct handling of the quantifiable costs, the distribution of the costs over time must be considered. For this, the standard method of net present value is available. It allows the present value of the whole costs to be calculated by discounting them with respect to the point in time when they occur.
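To make the discounting step concrete, the following is a minimal Python sketch, assuming yearly cost figures and a constant discount rate; the function name and the numbers are purely illustrative and not prescribed by WiBe.

def net_present_value(cash_flows, discount_rate):
    # Discount a list of yearly amounts (index 0 = now) to their present value.
    return sum(amount / (1.0 + discount_rate) ** year
               for year, amount in enumerate(cash_flows))

# Illustrative cost stream in Euro over three years, discounted at 5%.
yearly_costs = [800000, 300000, 260000]
print(round(net_present_value(yearly_costs, 0.05)))   # present value of the costs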

3.2. Quantifiable benefits

The estimation of the benefits of a software project is more difficult. WiBe structures this task into estimating the quantifiable benefits and the non-quantifiable benefits. How non-quantifiable benefits are estimated is explained later.


However, for the quantifiable benefits there are no common methods. Furthermore, Boehm and Sullivan [3] point out that "effective methods for modelling software benefits tend to be highly domain-specific." Hence, we cannot use a "one-size-fits-all" approach in this case. In many instances, an expert estimation based on available accounting data will be a possibility. In any case, the net present value of the estimated quantifiable benefits has to be used.

3.3. Non-quantifiable benefits

The most difficult part to handle is the non-quantifiable benefits. The standard WiBe suggests using utility analysis. In essence, this is a ranking of the various influential criteria on a qualitative basis. The ranking uses points that are associated with the different qualitative ranks. An example for the criterion stability of the legacy system: downtime is shown in Tab. 1. It gives six possible qualitative ratings for the downtime of the system to be replaced and the corresponding points.

The influential criteria have been compiled by the original WiBe authors based on their experience with IT projects. However, the list can be tailored to the specifics of the project. The general classification is into (1) urgency, (2) qualitative and strategic importance, and (3) external effects, as described in Sec. 2.2. Examples are abidance by the law, reuse of existing technology, or acceleration of work processes. A complete list can be found in [8]. Each criterion has a weight that reflects its importance and that is multiplied with the point value. Then all the weighted points are added up for each of the three above-mentioned classes of criteria. These sums are later used for decision-making.
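As an illustration of the weighted-point aggregation described above, here is a minimal Python sketch; the criteria names, weights and points are invented for the example and are not taken from the WiBe catalogue [8].

# Each module (D, Q, E) maps criteria to (weight, points on the 0-10 scale).
ratings = {
    "D": {"stability of the legacy system": (3, 6), "deadline pressure": (2, 4)},
    "Q": {"reuse of existing technology": (2, 8), "process acceleration": (4, 6)},
    "E": {"customer-facing effects": (5, 7)},
}

def module_scores(ratings):
    # Sum weight * points per module; the sums feed the decision rules of Sec. 3.6.
    return {module: sum(w * p for w, p in criteria.values())
            for module, criteria in ratings.items()}

print(module_scores(ratings))   # e.g. {'D': 26, 'Q': 40, 'E': 35}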

3.4. Roles and workflow

All the results for the three WiBe parts are compared and adjusted in a Delphi process. The detailed workflow is depicted in the activity diagram in Fig. 2. Four roles are necessary for the application of SW-WiBe. First, the Project manager is supposed to have all the basic information about the project; for example, this information should contain specifications of the functional and quality requirements of the system to be built. Second, Expert A uses an activity-based method to estimate the costs. This means that the project is broken down into the activities that need to be performed to develop the system based on its specification. These activities are estimated separately and then combined into an overall estimate.

Third, Expert B uses a different estimation method than Expert A. This can be another kind of activity-based estimation or an estimation based on another WBS, such as the components of the software system. Fourth, the COCOMO expert uses the project information to estimate the size of the system and all the parameters necessary to perform a cost estimation. All these roles can be filled by more than one person, and one person can work in more than one role. However, we suggest that at least three different experts take part in the estimation process to ensure enough diversity. Furthermore, it is beneficial to have technical as well as non-technical professionals in these different roles. As Moløkken-Østvold and Jørgensen show in [7], professionals in technical roles tend to be too optimistic in their estimates. Hence, a combination can mitigate this.

Figure 2. Activity diagram of the workflow (lanes: Project manager, Expert A, Expert B, COCOMO expert; each expert estimates costs, quantifiable and non-quantifiable benefits and revises the estimates until the differences between the estimates fall below d)

3.5. Delphi process

All three experts also give estimates for the quantifiable and non-quantifiable benefits of the software system. Although they cannot use their diverse cost estimation methods here, the differing estimates are still useful for the Delphi process. This process is then used to adjust the estimates for the three parts. We use the Wideband Delphi process as described in [2, 10] to adjust the estimates and minimise the variation. For this, the experts explain their estimates in a group session and are allowed to adjust them. These new estimates are again collected and discussed in another meeting. This is repeated until the variation lies below a threshold value d, which is usually 10-15%.
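The stopping criterion can be read as a simple relative-spread check. The Python sketch below is one possible interpretation, assuming the spread is measured as (max - min) divided by the mean; it is not the definitive rule from [2, 10], and the figures are invented.

def delphi_converged(estimates, d=0.15):
    # True if the relative spread of the experts' estimates is below the threshold d.
    return (max(estimates) - min(estimates)) / (sum(estimates) / len(estimates)) < d

round_1 = [1400, 1550, 14000]   # invented person-day figures; the outlier forces another round
round_2 = [1500, 1580, 1700]
print(delphi_converged(round_1), delphi_converged(round_2))   # False True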

Using such a group process is vital for reliable estimates. As Moløkken-Østvold and Jørgensen found in [6], estimates based on a group process have higher accuracy than the single estimates or even than the direct average of several individual experts. The latter is also described by Shepperd in [9]. Moreover, he proposes the combination of several predictors as one of the main research challenges.

Table 1. Example criteria for stability of the legacy system: downtime

Points | 0 | 2 | 4 | 6 | 8 | 10
Rating | not at risk | hardly affected | tolerable | troublesome | highly troublesome | intolerable

3.6. Decision

For the profitability estimate, we use the four modules of WiBe as explained in Sec. 2.2: quantifiable costs and benefits (KN), urgency (D), qualitative-strategic importance (Q) and external effects (E). The KN module is used to calculate the basic profitability. Taking the modules D, Q and E into account allows the extended profitability to be estimated.

Basic profitability. The basic profitability is simply the quantifiable profitability. Hence, it constitutes the difference between the quantifiable benefits and the quantifiable costs. Depending on the context, the costs should usually be divided into development costs and maintenance costs. The costs as well as the benefits need to be distributed over the estimation period (usually 5-10 years). Based on this temporal distribution, the net present value (NPV) is calculated using standard methods. The basic profitability (BP) is then:

BP = NPV(quantifiable benefits) - NPV(quantifiable costs)    (1)

It can already be used for decision making. If the BP is positive, i.e. the quantifiable benefits are greater than the costs, the project should definitely be carried out. If the result is a negative BP, the non-quantifiable criteria should be considered in the extended profitability.

Extended profitability. The extended profitability (EP) introduces the non-quantifiable aspects of urgency (D), qualitative-strategic importance (Q) and external effects (E) into the decision making. As described above, the weighted points for each module are accumulated. These sums represent the non-quantifiable necessity of the software project. A set of rules guides the decision-making based on these points; they are summarised in Tab. 2.

Table 2. Decision rules for the extended profitability

Guard | Result
Abidance by the law = not abided (10 points) | Must be carried out
Significance inside the IT concept = key position (10 points) | Should be carried out
D > 50 ∨ Q > 50 ∨ E > 50 | Can be carried out

The project must be carried out if the current system does not abide by the law (any more); there is obviously no way to avoid this. It might, however, be possible to improve the BP by decreasing the costs. The second rule highly recommends executing the project in spite of a negative BP if it is central to the general IT concept of the company: other future developments in the software landscape of the company may depend on this project, and hence it can be necessary to build it. Finally, the standard rule is that the accumulated points of at least one module need to be higher than 50 in order to carry out the project. This implies that the project is either significantly urgent, strategically important or has significant external effects that justify its execution. In the two cases in which the project should or can be carried out, the amount of quantifiable costs obviously needs to be weighed against the benefits in order to come to a decision.
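A compact way to read these rules is as a small decision function. The Python sketch below is our own illustrative encoding of Tab. 2 and the basic-profitability rule (the ordering of the checks is one possible interpretation, not code from the WiBe standard).

def profitability_decision(bp, law_points, it_concept_points, d, q, e):
    # Apply the decision rules of Tab. 2 on top of the basic profitability BP.
    if law_points == 10:                 # the law is not abided by the current system
        return "must be carried out"
    if bp > 0:
        return "carry out (positive basic profitability)"
    if it_concept_points == 10:          # key position inside the IT concept
        return "should be carried out"
    if d > 50 or q > 50 or e > 50:
        return "can be carried out"
    return "should not be carried out"

# Illustrative values resembling the case study: negative BP, module scores below 50.
print(profitability_decision(bp=-1360000, law_points=0, it_concept_points=0,
                             d=37, q=43, e=44))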

4. Case study

The applicability of the proposed method in a real industrial environment is shown in a case study with a large German cable network operator. The profitability of a web portal project is analysed.

4.1. Environment

Kabel Deutschland Breitbandservices GmbH is the leading cable network operator in Germany. The company provides TV, radio, Internet and telephony via its cable network. It employs about 2,500 people in seven locations. The department Web Applications is a service provider for the other departments, providing infrastructure for process support. The department develops the Internet and the applications for the agents, marketing staff and end customers, which are connected to the business logic, core applications and logistics of the company.

Over 3 years, 7 web portals have been developed that serve the communication with these different stakeholders. Currently, it is foreseeable that new requirements will come up for these portals.


Hence, the software consulting house softlab was entrusted with providing proposals for a restructuring of the portals. The proposal contains a unification of the different portals based on a uniform technology. The management at Kabel Deutschland Breitbandservices GmbH is now interested in the profitability of the proposal. The profitability analysis will be the basis for the decision on carrying out the project.

4.2. Profitability estimation

Roles. The roles of SW-WiBe are filled with personnel as follows. The role of the Project manager is performed by the department manager at Kabel Deutschland Breitbandservices GmbH (one of the authors), using information from the study by softlab. Together with the second author from the company, he also occupies the role of Expert A. Expert B is filled by the technology experts at softlab. The authors from TU München together with experts of Kabel Deutschland Breitbandservices GmbH fill the role of the COCOMO expert.

Costs. All experts made their estimates based on a specification of the necessary solution for the unified web portal. Expert A used the percentage method [1], in which one project phase (implementation) was estimated and extrapolated to the other phases based on experience data. Expert B used an activity-based method, i.e. breaking down the project into activities and estimating each activity individually. The COCOMO expert made a size estimate based on the existing portals and determined the necessary COCOMO parameters in expert interviews at Kabel Deutschland Breitbandservices GmbH.

The three individual estimates were then compared in the Delphi process. It turned out that Expert A and Expert B made rather close estimates, whereas the COCOMO estimate was an order of magnitude higher. Hence, this result was re-examined and errors in the size estimate were uncovered. In the second estimation round, the estimates were inside the range of d < 15%. The cost estimates are depicted in Fig. 3. The final agreement was to use the average of 1,678 person-days.

Benefits. It was decided by the Project manager that there are no benefits of the project that are currently quantifiable. Hence, the experts concentrated on the non-quantifiable benefits. Two experts, Expert A and Expert B, performed a utility analysis on the basis of the criteria provided by WiBe (cf. Sec. 2.2). The results, divided into the three modules of non-quantifiable benefits, are provided in Tab. 3. The results were not subjected to a further Delphi process because (1) they had already been discussed inside Kabel Deutschland Breitbandservices GmbH and softlab separately and (2) the Project manager wanted to see the differences between the internal and external opinions.

Figure 3. The cost estimates of the different roles (effort in person-days for Expert A, Expert B, the COCOMO expert and the average)

Table 3. Utility analysis of the non-quantifiable benefits

Module | Expert A | Expert B
Urgency (D) | 37 | 19
Qualitative/strategic (Q) | 43 | 50
External effects (E) | 44 | 52

Decision. The collected results for the costs and benefits now need to be combined in order to reach a decision about the project. The quantitative module of SW-WiBe can easily be calculated because there are no quantifiable benefits. Note that we expect all the costs to occur in the first year, and hence no discounting is applied. The basic profitability for the project is therefore negative:

BP = -1,678 person-days ≈ -1,360,000 Euro    (2)

Following our decision rules, the project should not be carried out unless the utility analysis can provide counter-arguments. The first two rules from Tab. 2 do not apply to this project: no laws are violated by the legacy system, nor does the project (currently) hold a key position in the IT concept of the company. These ratings were consistent for both experts.

The third rule then determines whether at least one of the modules urgency, qualitative and strategic importance, or external effects is strong enough to justify the project. As can be seen in Tab. 3, none of the module ratings of Expert A is higher than 50, and only the external effects are rated as such by Expert B.


The Project manager sees this as not enough justification to carry out the project. Nevertheless, the ratings are close to 50, especially for WiBe Q and WiBe E. This means that the necessity of the project might change in the future. Moreover, we have to note that it is planned to investigate possible quantitative benefits, which might change this decision as well.

4.3. Discussion

The case study with Kabel Deutschland Breitbandservices GmbH demonstrates a real-world application of the SW-WiBe framework. It shows that SW-WiBe is applicable in such situations and that it can provide guidance for the whole profitability analysis process. The combination of several cost estimation methods led to an estimate that is far more trustworthy than single estimates alone. This can be seen in the fact that we actually discovered an error in the application of COCOMO by comparing it to the other estimates in the Delphi process.

However, the cost estimate alone would have been difficult to use as a basis for the decision about carrying out the project. If the costs lay beyond the possible budget, the project could not be done anyway; if not, further guidance is needed. This guidance is given by the utility analysis of SW-WiBe. It ranks important non-quantifiable or difficult-to-quantify factors and combines them into three modules that affect the decision. This was perceived as very useful in the case study.

5. Conclusions

Estimating the profitability of a software project before it is carried out is a common problem in practical software engineering. A method that helps in that estimation process would be a useful tool in the toolbox of software managers. However, there are only few such approaches.

We propose SW-WiBe as a method for such profitability estimations for software projects. It is based on the WiBe framework for the profitability of IT projects. Based on current research results, the framework is filled with diverse expert and model-based cost estimation methods that are combined by a Delphi process. This improves the reliability and accuracy of the estimates. The difficult part of the non-quantifiable benefits is handled by a utility analysis that provides a set of important criteria and possible rankings. The quantified estimates together with the utility analysis result in the decision about the project's profitability.

SW-WiBe was applied in a real industrial environment. A project for the restructuring of the web portals of a large German cable network operator was analysed. The method proved to be applicable in that environment. The combined cost estimates as well as the utility analysis were perceived as very useful for reaching the profitability decision.

We plan to apply SW-WiBe in further case studies. Obviously, we need to follow the planned projects (in case they are carried out) and to compare the estimates with the actual costs and benefits. This would allow us to test more formally the hypothesis that the estimates are more accurate and reliable.

Acknowledgements

We are grateful to the engineers from softlab for their help on the specification and their estimates.

References

[1] H. Balzert. Lehrbuch der Software-Technik. Spektrum Akademischer Verlag, 1996.
[2] B. W. Boehm. Software Engineering Economics. Prentice Hall, 1981.
[3] B. W. Boehm and K. J. Sullivan. Software economics: A roadmap. In Proc. 22nd International Conference on Software Engineering (ICSE '00). ACM Press, 2000.

[4] L. C. Briand, K. El Emam, and F. Bomarius. COBRA: A hybrid method for software cost estimation, benchmarking, and risk assessment. In Proc. International Conference on Software Engineering (ICSE '98), pages 390-399. IEEE Computer Society, 1998.

[5] M. Jørgensen. A review of studies on expert estimation of software development effort. The Journal of Systems and Software, 70:37-60, 2004.
[6] K. Moløkken-Østvold and M. Jørgensen. Group processes in software effort estimation. Empirical Software Engineering, 9:314-334, 2004.
[7] K. Moløkken-Østvold and M. Jørgensen. Expert estimation of web-development projects: Are software professionals in technical roles more optimistic than those in non-technical roles? Empirical Software Engineering, 10:7-29, 2005.
[8] P. Röthig, K. Bergmann, and C. Müller. WiBe 4.0. Empfehlung zur Durchführung von Wirtschaftlichkeitsbetrachtungen in der Bundesverwaltung, insbesondere beim Einsatz der IT. Schriftenreihe der KBSt, Band 68, Bundesministerium des Innern, 2004.
[9] M. Shepperd. Software project economics: a roadmap. In 2007 Future of Software Engineering (FOSE '07), pages 304-315. IEEE Computer Society, 2007.
[10] A. Stellman and J. Greene. Applied Software Project Management. O'Reilly, 2005.


Utilizing Functional Size Measurement Methods for Embedded Systems

Ali Nazima Ergun and Cigdem Gencel

KAREL ARGE, Bilkent, Ankara, Turkey, [email protected]
Middle East Technical University, Ankara, Turkey, [email protected]


Utilizing Functional Size Measurement Methods for Embedded Systems

Ali Nazima Ergun
KAREL ARGE, Bilkent, Ankara, Turkey
[email protected]

Cigdem Gencel
Middle East Technical University, Ankara, Turkey
[email protected]

Abstract

There have been many attempts at measuring software size based on the amount of functionality laid out in project requirements. Since the projects that triggered the need for proper software management have typically been database-driven enterprise software, most estimation methods were well suited to analyzing such systems, and the question of how to analyze embedded systems has only begun to be explored since the late nineties. This study aims to evaluate and discuss the results of the implementation of two ISO-certified Functional Size Measurement (FSM) methods, MkII FPA and COSMIC FFP, in a case study involving three embedded projects. Based on the functional size figures and the actual development effort, the productivity delivery rates of the company are derived and discussed as well.

1. Introduction

Since the early 1980s, great effort has been put forth to identify and fine-tune the "software process" and its proper management. Unique tools and techniques have been developed for size, effort, and cost estimation to address the challenges facing the management of software development projects [23][32][33].

Since software size is one of the key measures for most effort and cost estimation methods, a significant amount of this effort has been put into software size measurement. Earlier size measurement methods are based on the Source Lines of Code (SLOC) of the software; however, this information is not available until after the project is completed.

Recent methods attempt to measure project size by capturing the amount of functionality laid out in the project requirements, which is available earlier in the project lifecycle. The topic of Function Point Analysis (FPA) has evolved considerably since the introduction of the concept by Albrecht in 1979 [1]. Many variations of and improvements on the original idea have been suggested [32], some of which proved to be milestones in the development of Functional Size Measurement (FSM).

In 1996, the International Organization for Standardization (ISO) established the common principles of FSM methods and published the ISO/IEC 14143 standard in order to promote the consistent interpretation of FSM principles [11][12][13][14][15][16][17].

Currently, the MkII FPA [18][31][34], International Function Point Users Group (IFPUG) FPA [19][35], Common Software Metrics International Consortium Full Function Points (COSMIC FFP) [20][36] and Netherlands Software Metrics Association (NESMA) FSM [21][37] methods have been certified as international FSM standards by ISO.

Owing to this constructive progress, FSM has begun to be applied more and more worldwide. The amount of benchmarking data on projects measured by FSM methods has increased significantly in well-known and recognized benchmarks such as the one by ISBSG [28], with more than 4,100 projects.

On the other hand, one of the major uses of software size is in software effort estimation. A number of studies already exist with an emphasis on the relationship between software size and effort. However, although different estimation models are reported to be used successfully in certain environments, none of them has been deemed universally applicable by the software community.

Therefore, until such a widely accepted estimation model is developed, it is usually recommended [23][33] to use the conventional approach to estimating effort, which takes into account two key components: the size of the software and the rate of work, also called the 'Productivity Delivery Rate' (PDR). The effort is estimated from the size of the software and the PDR. However, this confronts software organizations with the challenging requirement to identify their PDR values for specific types of projects or to use industry average values taken from benchmarking datasets. To use the second option, however, the benchmark dataset must contain a very similar project, which is usually not available.
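To make the conventional approach concrete, here is a minimal Python sketch of PDR-based effort estimation. The function names are our own, the historical figures are taken from Project B as reported later in the paper, and the new 95 CFSU project is hypothetical.

def productivity_delivery_rate(actual_effort_hours, functional_size):
    # PDR = effort per unit of functional size (e.g. person-hours per CFSU).
    return actual_effort_hours / functional_size

def estimate_effort(functional_size, pdr):
    # Conventional estimate: size of the new project times the historical PDR.
    return functional_size * pdr

# Historical project: 132 person-hours for 79 CFSU (cf. Project B in Sec. 3.2).
pdr = productivity_delivery_rate(132, 79)       # about 1.67 person-hours per CFSU
print(round(estimate_effort(95, pdr)))          # estimate for a hypothetical 95 CFSU project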

This paper draws upon our case study of three projects completed at Karel Arge, which we used to pit MkII FPA, a popular method in Turkey that is widely required in defense contracts, against COSMIC FFP, in order to explore the suitability of these methods for the company. The average PDR values for the company are also derived from the functional size values obtained and the actual effort utilized, and are discussed.

The related research on software functional size measures and methods is briefly summarized in the second section. In the third section, the descriptions of the case projects, the application of MkII FPA and COSMIC FFP to these projects, the results obtained and the comparison of the two methods are presented. The discussion of the results of the case study is presented in the last section.

2. Background

Albrecht's 1979 proposal [1] for estimating functional size became a serious contender for software size measurement. During the following years, variations of the original method have been developed [8]. Currently, four methods are certified by ISO: the MkII FPA [34], IFPUG FPA [35], COSMIC FFP [36] and NESMA FSM [37] methods.

Albrecht's original idea has become the basis for IFPUG FPA [35], one of the earliest ISO-standardized FSM methods [19]. IFPUG FPA enjoys widespread popularity and large publicly available data sets for those who wish to train their own company-specific IFPUG model, or to compare their measurements with others. It is based on the idea of measuring the amount of functionality delivered to users in terms of Function Points (FP). IFPUG FPA was mainly developed to measure data-strong systems such as Management Information Systems (MIS).

MkII FPA was developed by Symons in 1988 in order to improve the original FPA method [31]. He made some suggestions to reflect the internal complexity of a system. Currently, the Metrics Practices Committee (MPC) of the UK Software Metrics Association (UKSMA) is the design authority of the method [34]. It was mainly designed to measure business information systems, but it can also be applied to other application domains, such as scientific and real-time software, with some modifications. MkII FPA has been approved as conformant to ISO/IEC 14143 and became an international ISO standard in 2002 [18]. This method views the system as a set of Logical Transactions (LTs) and calculates the functional size by counting the Input Data Element Types (DETs), the Data Entity Types Referenced and the Output Data Element Types for each LT.

NESMA FPA [37] is also based on the principles of the IFPUG FPA method. The function types used for sizing the functionality are the same as in IFPUG FPA. NESMA gives more concrete guidelines, hints and examples. It was certified by ISO in 2005 [21].

COSMIC FFP [36], on the other hand, is a fairly recent method that is gaining ground in the international community thanks to its ability to measure real-time systems, as opposed to earlier variants, which shined at measuring data-intensive MIS software. COSMIC FFP was approved as an international ISO standard in 2003 [20]. COSMIC FFP is designed to measure the functional size of software based on a count of four types of data movement, i.e. the number of Entry, Exit, Read, and Write operations.

In parallel to these developments in FSM, significant research has been going on into software effort estimation models and techniques that are based on software size [3][4][10][22][24][25]. COCOMO II [4], the revised version of the original COCOMO, provides for measuring functional size and converting the result to SLOC. However, this technique, called 'backfiring', still cannot account for the extra uncertainty introduced by adding another level of estimation [7][29][30].

In a number of studies, such as [2][5][6][27], the related work on estimation models is assessed and compared. However, the common conclusion of these studies is that although different models are used successfully by different groups and in particular domains, none of them has gained general acceptance in the software community.

Effort estimation based on functional size figures has only begun to emerge as more empirical data are collected in benchmarking datasets.

3. Case study

In this case study, we evaluate and discuss the implementation of two ISO-certified FSM methods, MkII FPA and COSMIC FFP, for measuring three embedded projects which had already been completed.

Both MkII FPA and COSMIC FFP measure the Functional User Requirements (FURs) to determine the functional size. The FURs are decomposed into Base Functional Components (BFCs). The BFCs of MkII FPA are the Logical Transactions (LTs). The LTs are identified by decomposing each FUR into its elementary components. Each LT has three constituents: input, process and output components. The base counts are derived by counting the Input Data Element Types (DETs) for the input component, the Data Entity Types Referenced for the process component, and the Output DETs for the output component. The functional size of each LT is computed by multiplying the count for each component by an industry-average weight factor. The functional sizes of the LTs are summed to compute the functional size of the whole system.

In COSMIC FFP, the BFCs are Functional Processes. Each Functional Process comprises a set of sub-processes, which perform either a data movement or a data manipulation. Currently, only data movement types are considered, which are further categorized into Entry, Exit, Read, and Write data movements. The functional size of each Functional Process is determined by counting the Entries, Exits, Reads and Writes in that Functional Process. Then the functional sizes of all Functional Processes are aggregated to compute the overall size of the system.
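The two counting schemes can be sketched in Python as follows. This is our own simplified illustration: the MkII FPA industry-average weights used (0.58 for input DETs, 1.66 for entities referenced, 0.26 for output DETs) are assumed from the published method description rather than taken from this paper, and the example counts are invented.

# MkII FPA: weighted sum over the logical transactions (LTs) of a system.
MKII_WEIGHTS = (0.58, 1.66, 0.26)   # assumed industry-average weights for
                                    # (input DETs, entities referenced, output DETs)

def mkii_size(logical_transactions):
    wi, we, wo = MKII_WEIGHTS
    return sum(wi * i + we * e + wo * o for i, e, o in logical_transactions)

# COSMIC FFP: every Entry, Exit, Read and Write counts as one CFSU.
def cosmic_size(functional_processes):
    return sum(entry + exit_ + read + write
               for entry, exit_, read, write in functional_processes)

# Invented example: two LTs / functional processes of a small embedded controller.
print(mkii_size([(3, 1, 2), (5, 1, 1)]))          # (input DETs, entities, output DETs)
print(cosmic_size([(2, 1, 1, 1), (1, 2, 2, 0)]))  # (Entry, Exit, Read, Write)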

3.1 Description of the software organization

Karel Arge, founded in 1986, is a local and regional telecommunications company which designs and manufactures PBX systems. Karel Arge was spun off from Karel in 2005 and performs R&D of PBX systems and telecommunication products. Its R&D capability is also available for outsourcing in electronic control and military projects. The majority of the projects are real-time embedded systems, often accompanied by data-intensive GUI or client/server applications. In 2006, Karel Arge decided to define its software processes to conform to CMMI (Capability Maturity Model Integration) level 3.

3.2 Description of the case projects

We picked three embedded projects undertaken at Karel Arge for our case study. Each of these projects was completed by a different, single senior engineer. These three engineers typically work together in the same team of four engineers, and the projects undertaken are always embedded control circuitry for household appliances.

Since this team works on these projects on a contract basis, both the project requirements and the effort information were well documented.

A short description of each project is as follows:

Project A – Fridge and freezer combo (Real-time): A microprocessor monitors certain switches and temperature sensors to control heaters and compressors within various cooling and defrost cycles. The program cycles are rigorously algorithmic. All stages of this project were completed by a single experienced software designer using C, utilizing 538 person-hours.

Project B – Combo boiler (Real-time): A simple, user-programmable residential heater. The user can program multiple on and off schedules via buttons and a simple display. All stages of this project were completed by a single senior software designer using C, utilizing 132 person-hours.

Project C – Shower control (Real-time): A microcontroller-based system controls shower heat and water flow per user input. It monitors sensors and controls heaters and relays. The system behavior is explained in the requirements over a state machine, which points to a considerable algorithmic element. All stages of this project were completed by a single senior software designer using C, utilizing 40 person-hours.

All projects were completed by different engineers; however, the engineers who worked on Projects A, B and C work together on many other projects and most likely have similar design practices.

4. Case study conduct and results

For measuring the functional sizes of the projects, we picked MkII FPA and COSMIC FFP. COSMIC FFP is the first ISO-standardized method with a clear emphasis on measuring real-time systems, and MkII FPA counts are widely required from defense contractors, which set the bar in Turkish software development practices nationwide.

To help with the MkII FPA measurement, we modeled each project using E-R diagrams. In order to perform the COSMIC FFP measurement systematically, we defined the relevant data groups and their members for each project. We used the "end-user viewpoint" for all COSMIC FFP measurements.

One of the authors performed all the measurements; he works for the development organization and had direct access to the people who worked on the projects, while the other author reviewed and verified the measurements for objectivity. She is experienced in measuring with both methods.

The source lines of code (SLOC) counts provided are logical SLOC and were obtained using an automated tool developed in-house at Karel Arge.


The effort information provided was likewise taken from a now-obsolete, in-house developed timesheet-keeping software. This software required users to enter their effort information on a weekly basis, in person-hours, at a level of detail of one tenth of an hour.

Project A – Fridge and freezer combo: The development effort utilized for the project is 538 person-hours and the size of the code is 2044 SLOC. The functional size measurement results by COSMIC FFP and MkII FPA, the PDR values and the SLOC-to-functional-size ratio for the project are given in Table 1.

Modeling an embedded system with an E-R diagram turned out to be a major difficulty. A useful analogy was to consider the microprocessor as the user and the sensors, triggers, etc. as the interface. Timers and similar asynchronous interrupts, frequent in embedded applications, still proved to be problematic. Defining entities as part of the E-R diagram was highly subjective; there is no record-based database, hence everything could be considered a single entity. And it would not matter in reality, as inter-entity complexity does not exist as it does in databases. Hence, MkII FPA's counting of entities referenced (which increases complexity significantly) does not really apply well to embedded systems.

Applying COSMIC FFP, on the other hand, was very easy and intuitive.

Project B – Combo boiler: The development effort utilized for the project is 132 person-hours and the size of the code is 1151 SLOC. The functional size measurement results by COSMIC FFP and MkII FPA, the PDR values and the SLOC-to-functional-size ratio for the project are given in Table 2.

E-R modeling for MkII FPA was very counterintuitive, and the analysis failed to capture much of the algorithm stated in the requirements.

Along with a proper data structure modeling, the COSMIC analysis yielded a functional size of 79 CFSU (COSMIC Functional Size Units). COSMIC FFP was suitable for analyzing this project. However, being a small project, Project B had little movement of data, and so the algorithmic component had more prominence. Excluding the algorithmic complexity arguably resulted in an oversimplification of the size measurement.

Project C – Shower control: The development effort utilized for the project is 40 person-hours and the size of the code is 432 SLOC. The functional size measurement results by COSMIC FFP and MkII FPA, the PDR values and the SLOC-to-functional-size ratio for the project are given in Table 3.

Table 1. Fridge and freezer combo measurement results

FSM Method | Functional Size | PDR (Effort/Functional Size) | SLOC/Functional Size | Measurement Effort (person-hours)
MkII FPA | 75.98 | 7.08 | 26.90 | 2.00
COSMIC FFP | 119.00 | 4.52 | 4.52 | 0.33

Table 2. Combo boiler measurement results

FSM Method | Functional Size | PDR (Effort/Functional Size) | SLOC/Functional Size | Measurement Effort (person-hours)
MkII FPA | 54.40 | 2.43 | 2.43 | 0.60
COSMIC FFP | 79.00 | 1.67 | 1.67 | 0.30

Table 3. Shower control measurement results

FSM Method | Functional Size | PDR (Effort/Functional Size) | SLOC/Functional Size | Measurement Effort (person-hours)
MkII FPA | 65.04 | 0.62 | 0.62 | 0.80
COSMIC FFP | 53.00 | 0.75 | 0.75 | 0.30

The E-R modeling required for MkII FPA was very hard and failed to capture the algorithm. However, MkII FPA was better than the COSMIC analysis at capturing interface actions (counting the actual number of interface parameters changed, i.e. two motors controlled, rather than counting one motor control access). The COSMIC FFP functional size was obtained as 53 CFSU. The analysis was simple and straightforward.

Figure 1 shows the relationship between the logarithmic transformations of functional size and actual development effort for all projects.

Figure 1. Ln(Actual Effort) versus Ln(Functional Size) for MkII FPA, COSMIC FFP and SLOC

The results show that the functional sizes measured with COSMIC FFP yield a desirable, approximately linear graph. The MkII FP counts, on the other hand, show an irregular graph. In this case study, the number of data points is too small to draw conclusions on the strength of the relationship between functional size and effort. However, we derive the PDR values (see Table 4) and evaluate the methods based on them.

Table 4. PDR values for Karel Arge

PDR | Min | Med | Max | Mean | Var
Effort (person-hrs) / Size (MkII FP) | 0.62 | 2.43 | 7.08 | 3.37 | 11.13
Effort (person-hrs) / Size (CFSU) | 0.75 | 1.67 | 4.52 | 2.32 | 3.86
Effort (person-hrs) / SLOC | 0.09 | 0.11 | 0.26 | 0.16 | 0.01

The variance of the PDR values when measured by MkII FPA is much greater than for the COSMIC measurement. However, the variance of the PDR values is still large for the COSMIC FFP measurements. This may be attributed to the fact that although all projects are real-time embedded systems, they represent different complexities: Projects A and C are more algorithm-intensive, while Projects B and C allow for greater user interaction, which leads to different levels of total effort.
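For reference, the summary statistics in Table 4 can be reproduced from the per-project effort and size figures reported above; the following Python sketch (shown for the COSMIC FFP column) is purely illustrative.

from statistics import mean, median, variance

efforts = {"A": 538, "B": 132, "C": 40}        # person-hours, from the case projects
cosmic_sizes = {"A": 119, "B": 79, "C": 53}    # CFSU, from the case projects

pdrs = [efforts[p] / cosmic_sizes[p] for p in efforts]
print(min(pdrs), median(pdrs), max(pdrs), mean(pdrs), variance(pdrs))
# roughly 0.75, 1.67, 4.52, 2.32 and 3.86, matching the CFSU row of Table 4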

5. Conclusion

In this study, we evaluated the MkII FPA and COSMIC FFP methods by applying them to three real-time embedded projects. Applying MkII FPA typically requires at least a simple entity-relationship analysis. Forcing such a database-oriented model onto the requirements of real-time projects is counterintuitive, and the E-R models derived are highly subjective.


COSMIC FFP, on the other hand, can easily be applied to real-time embedded projects, because focusing on data movements instead of a storage hierarchy better suits the embedded domain. However, an extension for algorithm-intensive systems is still needed and eagerly awaited.

The PDR values obtained show significant variance. However, there is also a strong correlation between measured sizes and actual efforts. More empirical studies are needed to further analyze the nature of this relationship.

For Karel Arge, we decided to collect more measurement data, form homogeneous datasets for different kinds of projects, and use the average PDR values for specific project types for estimation purposes.


Evaluation of Ensemble Learning Methods for Fault-Prone Module Prediction

Sousuke Amasaki

Department of Information Systems

Tottori University of Environmental Studies

1-1-1 Wakabadai-kita, Tottori City, Tottori, 689–1111, Japan

[email protected]


Abstract

Recently, achieving quality, cost, and schedule targets has become more demanding. Although many quality prediction methods have therefore been proposed, comparative evaluation studies of these methods sometimes show inconsistent results regarding predictive performance. Among many possible causes, we focus on the influence of dataset characteristics, because a dataset is often so complex and uncertain that a prediction model cannot always capture its characteristics sufficiently. In this paper, we show the effectiveness of employing ensemble learning methods for fault-prone module prediction. Ensemble learning methods are meta-models that combine prediction or estimation models to improve overall predictive performance. Using empirical datasets, we examine four ensemble learning methods: Bagging, Boosting, Voting, and Stacking. The results show that some of these methods are valuable because they improve predictive performance by reducing the influence of the uncertainty and complexity of a dataset. We believe this study contributes a way to improve the predictive accuracy of conventional prediction models.

1. Introduction

Recently, achieving quality, cost, and duration targets has become more demanding in software development projects. Making an appropriate project plan is important to achieving these goals. For this purpose, a correct understanding of a software development project and an appropriate prospect based on that understanding are desirable. Quality prediction methods are used to assess the quality of a product at a given time, and effort estimation methods are used to judge the feasibility of a development project plan. Therefore, many methods for quality prediction and effort estimation have been proposed. Many of these methods are based on statistical methods, data mining methods, and machine learning methods [4], because these approaches bring two advantages: a dataset collected from past projects can be exploited, and objective prediction and estimation can be performed.

Comparative studies of the proposed methods have also been carried out. One of their purposes is to answer the question of which method is the most powerful. Yet there is no definitive answer to this question, because some of the comparative studies show inconsistent results. We think this inconsistency is partly due to two characteristics of a dataset that may affect predictive performance: uncertainty and complexity.

Uncertainty comes from random or uncertain factors involved in a software development project. These factors appear by chance in a dataset as fluctuations. This means it is unlikely that the values of a given metric collected from two projects in the same situation are identical. We think that fluctuations cause unstable predictive performance because a prediction model is created with parameters optimized for a particular dataset.

Complexity comes from the situations surrounding a software development project. Today, each software development project is carried out with different goals in different situations. Collecting metrics on such situations may be valuable for prediction and estimation, but it cannot be done thoroughly. If a prediction model learns the complex characteristics of a training dataset insufficiently, predictive performance will suffer. We therefore think that the influence of dataset characteristics on predictive performance is not small, and that reducing it is necessary.

In this paper, we examine the effectiveness of employing ensemble learning methods for fault-prone module prediction. Ensemble learning methods are meta-models that combine prediction or estimation models to reduce the risk of poor prediction caused by dataset characteristics. In some areas, the superiority of these methods has been reported [2, 9]. We examine four ensemble learning methods: Bagging, Boosting, Voting, and Stacking. Finally, we confirm that ensemble learning methods are a valuable tool for improving the predictive performance of prediction models.


2. Ensemble learning methods

2.1. Overview

Ensemble learning methods are prediction or estimation methods based on a strategy of combining multiple results from base learners. Here, a base learner means a realization of a certain prediction or estimation model obtained by feeding it a training dataset. This strategy aims to mitigate degradation of predictive performance and to improve predictive performance.

In this paper, we examine four representative ensemble learning methods: Bagging, Boosting, Voting, and Stacking. To confirm the usefulness of this approach, we use these methods as is. Realizations of these methods share the common structure shown in Figure 1. The differences among the ensemble learning methods come from three types of elements in Figure 1: Learner, Weight, and Aggregator. In the prediction or estimation procedure, input values corresponding to the explanatory variables are distributed to each base learner Learner_i. Each Learner_i then outputs an estimate or a prediction. Some ensemble learning methods weight these outputs with Weight_i. Finally, the Aggregator aggregates all outputs and produces the final estimate or prediction. Features of these ensemble learning methods are shown in Table 1. The following subsections describe their details and the differences among them.
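To make the Learner/Weight/Aggregator structure of Figure 1 concrete, the following is a minimal sketch of that generic structure (not code from the paper); the function and argument names (ensemble_predict, learners, weights, aggregate) are our own, and the base learners are assumed to expose a scikit-learn-style predict method.

```python
from collections import Counter

def ensemble_predict(x, learners, weights=None, aggregate="vote"):
    """Combine base-learner outputs for one input x, optionally weighted."""
    outputs = [learner.predict([x])[0] for learner in learners]   # Learner_i outputs
    if aggregate == "mean":                      # e.g. averaging numeric outputs
        w = weights or [1.0] * len(outputs)
        return sum(wi * oi for wi, oi in zip(w, outputs)) / sum(w)
    if weights is None:                          # unweighted majority vote
        return Counter(outputs).most_common(1)[0][0]
    score = Counter()                            # weighted vote (Weight_i), e.g. Boosting
    for wi, oi in zip(weights, outputs):
        score[oi] += wi
    return score.most_common(1)[0][0]            # Aggregator: highest-scoring class
```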

2.2. Bagging

The Bagging (bootstrap aggregating) algorithm was proposed by Breiman [1]. In the Bagging algorithm, the Learner_i are created using different training data subsets and a single estimation or prediction model. Each training data subset is obtained by random sampling with replacement from the entire training dataset, and usually contains the same number of instances as the entire training dataset. Weight is not used in the Bagging algorithm. The Aggregator is defined as either an average or a majority vote of the outputs from the base learners.
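A minimal sketch of the Bagging procedure just described (bootstrap samples of the same size as the training set, a single base model, majority-vote aggregation). It uses scikit-learn's CART decision tree purely as a convenient base model; the paper itself uses Weka implementations, so this is an illustration rather than the authors' setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_learners=10, seed=0):
    """Train n_learners trees, each on a bootstrap sample the size of the full set."""
    rng = np.random.default_rng(seed)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), size=len(X))      # sampling with replacement
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    """Aggregator: majority vote; assumes integer class labels (e.g. 0 = nfp, 1 = fp)."""
    votes = np.array([l.predict(X) for l in learners])  # shape (n_learners, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```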

2.3. Boosting

The Boosting algorithm is similar in idea to the Bagging algorithm. Although many variations of the Boosting algorithm have been proposed, our study adopts AdaBoost [3].

[Figure 1. The structure of ensemble learning methods]

AdaBoost and Bagging differ in two points. First, AdaBoost uses Weight. In the learning process, AdaBoost creates Learner_i in the same manner as Bagging. Then, the entire training dataset is classified by Learner_i and the misclassified instances are identified; Weight_i is calculated from the ratio of these misclassifications. Second, AdaBoost samples the training data subset for Learner_i in a different manner: the distribution over the entire training dataset is modified so that instances misclassified in the previous step are sampled more frequently. Therefore, each training data subset is sampled from a different distribution.
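For illustration, a minimal sketch of the reweighting formulation of AdaBoost for binary labels coded as -1/+1; Weka's AdaBoostM1, used later in the paper, can also resample from the reweighted distribution as described above. The depth-1 trees and helper names are our own choices, not the paper's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """y must be encoded as -1 / +1. Returns (learners, alphas)."""
    n = len(X)
    w = np.full(n, 1.0 / n)                        # initial instance weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)     # weighted misclassification rate
        if err >= 0.5:
            break
        err = max(err, 1e-10)                      # guard against a perfect learner
        alpha = 0.5 * np.log((1 - err) / err)      # Weight_i for this learner
        learners.append(stump)
        alphas.append(alpha)
        w *= np.exp(-alpha * y * pred)             # up-weight misclassified instances
        w /= w.sum()
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Aggregator: sign of the alpha-weighted vote."""
    return np.sign(sum(a * l.predict(X) for a, l in zip(alphas, learners)))
```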

2.4. Voting

Voting [17] is the simplest way to combine the outputs from the Learners. Weight is not used, and the Aggregator is defined as either an average or a majority vote. The difference from the previous two methods is that each Learner_i is created by feeding the entire training dataset into one of multiple different estimation or prediction models.
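A minimal sketch of Voting using scikit-learn's VotingClassifier as a stand-in for the Weka setup, with stand-ins for the three prediction models used later in the paper (CART for C4.5, 1-nearest-neighbour for the analogy-based model); each base model sees the entire training set and the aggregator is a majority vote.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

vote = VotingClassifier(
    estimators=[
        ("logistic", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier()),            # CART stand-in for C4.5
        ("analogy", KNeighborsClassifier(n_neighbors=1)),
    ],
    voting="hard",                                     # majority vote aggregator
)
# Usage: vote.fit(X_train, y_train); vote.predict(X_test)
```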

2.5. Stacking

The Stacking algorithm was proposed by Wolpert [18]. It differs from the Voting algorithm in its Aggregator: Stacking can adopt an arbitrary classification or regression model as the Aggregator. Here, the Aggregator is called the meta-learner.

Each Learner_i is created by feeding the entire training dataset into one of the L kinds of prediction or estimation models employed in Stacking. The training dataset for the meta-learner is generated using cross-validation. Because the meta-learner receives the L outputs of the base learners as inputs, this training dataset has L kinds of explanatory variables. In each cross-validation iteration, the entire training dataset is first divided into a training data subset and a testing data subset. Next, the training data subset is fed into each estimation or prediction model employed for the base learners, yielding L learners. These learners are applied to the testing data subset, and their predicted or estimated results are obtained. After cross-validation, these predicted or estimated results are gathered up and combined with the corresponding actual results. Finally, the meta-learner is created with this combined dataset.
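A minimal sketch of this procedure using scikit-learn's cross_val_predict to build the meta-learner's training set from out-of-fold base-learner outputs; the base models and the Naive Bayes meta-learner are stand-ins for the Weka classifiers used later in the paper, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def stacking_fit(X, y, cv=10):
    base_models = [LogisticRegression(max_iter=1000),
                   DecisionTreeClassifier(),               # CART stand-in for C4.5
                   KNeighborsClassifier(n_neighbors=1)]    # analogy-based stand-in
    # Out-of-fold predictions become the L explanatory variables of the meta-learner.
    meta_X = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in base_models])
    meta_learner = GaussianNB().fit(meta_X, y)
    # Finally, refit each base learner on the entire training dataset.
    base_models = [m.fit(X, y) for m in base_models]
    return base_models, meta_learner

def stacking_predict(base_models, meta_learner, X):
    meta_X = np.column_stack([m.predict(X) for m in base_models])
    return meta_learner.predict(meta_X)
```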

Table 1. Features of ensemble learning methods

                                     Bagging [1]       Boosting                          Voting            Stacking
Learner: # of prediction model(s)    1                 1                                 L                 L
Learner: # of training subset(s)     L                 L                                 1                 1
Weight                               none              estimated from training subset    none              none
Aggregator                           mean or voting    mean or voting                    mean or voting    meta-learner

(L is the number of base learners.)

Table 2. Description of datasets

Dataset    Language    # of fp modules    # of modules
CM1        C           49                 498
KC1        C++         326                2109
KC2        C++         107                522
PC1        C           77                 1109

Table 3. Example of a confusion matrix

actual \ predicted    fp          nfp         total
fp                    TP (70)     FN (80)     150
nfp                   FP (100)    TN (750)    850
total                 170         830         1000

3. Experiment

3.1. Prediction Models

The experiment aims to evaluate the effect of ensemble learning methods on the predictive performance of fault-prone module prediction. Thus, we need to select different types of major fault-prone prediction models for generality and coverage of the experiment. We investigated the literature on fault-prone module prediction and found three popular approaches: linear models, tree models, and analogy-based models.

From each approach, we then select a prediction model on the basis of its popularity: logistic regression, C4.5 [13], and an analogy-based model [14]. Here, the analogy-based model uses the following options: number of neighbors k = 1, Euclidean distance, and no weighting for variables. These methods have been widely used, for example in [5, 7, 11].
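For reference, scikit-learn stand-ins for these three baselines under the stated options (k = 1, Euclidean distance, no variable weighting); note that DecisionTreeClassifier implements CART rather than C4.5, and the paper itself uses the Weka implementations listed later in Table 4.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

baselines = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "C4.5 (CART stand-in)": DecisionTreeClassifier(),
    # Analogy-based model: k = 1, Euclidean distance, no variable weighting.
    "analogy-based": KNeighborsClassifier(n_neighbors=1, metric="euclidean",
                                          weights="uniform"),
}
```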

3.2. Datasets

Our experiment uses four datasets collected at NASA as part of the Software Metrics Data Program: CM1, KC1, KC2, and PC1. These datasets have been widely used and can be obtained easily and freely from the NASA MDP Data Repository or the PROMISE Engineering Repository [15]. Each instance in a dataset corresponds to a module and contains 21 metrics, such as McCabe's and Halstead's software metrics. As in other studies, an instance is labeled fault-prone (fp) if the corresponding module had one or more reported defects; otherwise, it is labeled not-fault-prone (nfp). Module defects were detected during development or after deployment.

In this paper, we employed the dataset files provided in the PROMISE Engineering Repository. These dataset files include the fp or nfp labels. Table 2 shows the number of modules, the number of fp modules, and the programming language used for each dataset. Further details are described in the comments of the dataset files.
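A minimal loading sketch, assuming ARFF files from the PROMISE repository with a class attribute named "defects" encoded as true/false; the file name and attribute name are assumptions about the data files, not details given in the paper.

```python
import pandas as pd
from scipy.io.arff import loadarff

# Load one PROMISE dataset (e.g. CM1) and derive the fp / nfp label.
data, meta = loadarff("cm1.arff")              # hypothetical local file name
df = pd.DataFrame(data)
# Assumed label column: modules with one or more reported defects are fault-prone.
y = (df["defects"] == b"true").astype(int)     # 1 = fp, 0 = nfp
X = df.drop(columns=["defects"]).to_numpy(dtype=float)
print(f"{len(df)} modules, {int(y.sum())} fault-prone")
```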

3.3. Performance measures

Ma et al. noted that the overall accuracy rate is not appropriate for evaluating a fault-prone prediction model, because fault-prone modules are usually rare in a dataset and this imbalance leads to misleading interpretations of model performance [10]. Therefore, this experiment adopts more appropriate measures: F-measure, G-mean1, and G-mean2.

F-measure, G-mean1, and G-mean2 are built on three basic measures: precision, recall, and specificity. Precision represents the accuracy of fault-prone module detection. Recall represents the coverage of fault-prone module detection. Specificity represents the coverage of identification of not-fault-prone modules. There is a trade-off among these basic measures. Table 3 shows an example confusion matrix to illustrate the definitions of these measures. Assume there are 1000 modules, 150 of which are actually fault-prone. The number in the cell labeled TP means that 70 fault-prone instances are correctly predicted as fp. The basic measures are defined as follows:

precision   = TP / (TP + FP) = 70/170 ≈ 0.41
recall      = TP / (TP + FN) = 70/150 ≈ 0.47
specificity = TN / (TN + FP) = 750/850 ≈ 0.88


Table 4. Base learner combinations for experimental evaluation

No.   Ensemble learning method (Weka implementation)   Prediction model(s) (Weka implementation)
1     None                                             logistic regression (Logistic)
2     None                                             C4.5 (J48)
3     None                                             analogy-based model (IB1)
4     None                                             Naive Bayesian classifier (NaiveBayes)
5     Bagging (Bagging)                                logistic regression
6     Bagging (Bagging)                                C4.5
7     Bagging (Bagging)                                analogy-based model
8     Boosting (AdaBoostM1)                            logistic regression
9     Boosting (AdaBoostM1)                            C4.5
10    Boosting (AdaBoostM1)                            analogy-based model
11    Voting (Voting)                                  logistic regression + C4.5 + analogy-based model
12    Stacking (Stacking)                              logistic regression + C4.5 + analogy-based model [meta-learner: Naive Bayesian classifier]

F-measure, G-mean1, and G-mean2 are defined as follows:

F-measure = (2 × precision × recall) / (recall + precision) ≈ 0.44
G-mean1   = √(recall × precision) ≈ 0.44
G-mean2   = √(recall × specificity) ≈ 0.64

These measures are designed to balance two of the three basic measures by taking either the harmonic mean or the geometric mean.
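The worked example above can be recomputed in a few lines directly from the Table 3 counts:

```python
from math import sqrt

TP, FN, FP, TN = 70, 80, 100, 750          # counts from Table 3

precision   = TP / (TP + FP)               # ~0.41
recall      = TP / (TP + FN)               # ~0.47
specificity = TN / (TN + FP)               # ~0.88

f_measure = 2 * precision * recall / (precision + recall)   # ~0.44
g_mean1   = sqrt(recall * precision)                         # ~0.44
g_mean2   = sqrt(recall * specificity)                       # ~0.64
print(f"F={f_measure:.2f} G1={g_mean1:.2f} G2={g_mean2:.2f}")
```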

3.4. Experimental design

Table 4 shows the 12 combinations of an ensemble learning method and a prediction model. This experiment uses the data mining software Weka [16]. The name of the corresponding Weka implementation is shown in parentheses after each method name.

No. 1-3 are used as baselines and are therefore not combined with any ensemble learning method. No. 5-10 correspond to combinations employing either Bagging or Boosting. Three combinations are prepared for each of these methods because Bagging and Boosting can take only a single classification method. Although Bagging and Boosting can use an arbitrary number of base learners, this experiment uses 10, which is the default setting in Weka. No. 11 and 12 are combinations employing Voting and Stacking, respectively. Both Voting and Stacking take all three prediction models as base learners. In this experiment, Stacking uses the Naive Bayesian classifier as the meta-learner. No. 4 is used to examine the effect of the meta-learner on the predictive performance of Stacking.

This experiment performs 10 repetitions of stratified 10-fold cross-validation for every combination and every dataset. The experimental results are then evaluated using F-measure, G-mean1, and G-mean2.
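A minimal sketch of this protocol for a few representative Table 4 combinations, using scikit-learn's RepeatedStratifiedKFold and f1_score (for the fp class) as stand-ins for the Weka setup and the F-measure; X and y are assumed to be numpy arrays prepared as in Section 3.2.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

def evaluate(model, X, y, repeats=10, folds=10, seed=0):
    """10 x stratified 10-fold CV; returns the mean F-measure for the fp class (label 1)."""
    cv = RepeatedStratifiedKFold(n_splits=folds, n_repeats=repeats, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], model.predict(X[test_idx]), pos_label=1))
    return np.mean(scores)

# A few representative combinations from Table 4 (Nos. 1, 2, 6, 9):
combos = {
    "No.1 logistic":        LogisticRegression(max_iter=1000),
    "No.2 C4.5 stand-in":   DecisionTreeClassifier(),
    "No.6 Bagging + tree":  BaggingClassifier(DecisionTreeClassifier(), n_estimators=10),
    "No.9 Boosting + tree": AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=10),
}
# Usage: for name, model in combos.items(): print(name, evaluate(model, X, y))
```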

4. Results

The experimental results are shown in Tables 5–7. The analogy-based model (No. 3) is consistently comparable or superior to the other baselines (No. 1–2). On most datasets, C4.5 (No. 2) is comparable or superior to logistic regression, but it is clearly inferior to logistic regression on KC1. These results are inconsistent.

We also observed two things from the comparisons among the baselines. First, the performance measures sometimes show different tendencies. For example, the comparison between logistic regression and the analogy-based model on CM1 results in almost equivalent F-measure and G-mean1 values but quite different G-mean2 values. Second, a given comparison sometimes shows different tendencies on different datasets. For example, in terms of all performance measures, the baselines are competitive on KC2 but quite different and clearly ranked on PC1. These observations imply that a dataset is relatively complex and that each prediction model learns different characteristics of a dataset according to its ability and/or its theoretical background.

The major improvements achieved by the ensemble learning methods are summarized in Table 8, which shows that all methods except Voting worked well in some situations. According to Tables 5–7, both Bagging (No. 6) and Boosting (No. 9) improved C4.5 (No. 2) but did not improve the other prediction models (No. 5 and 8 against No. 1, and No. 7 and 10 against No. 3). Bagging and Boosting are designed to use an unstable model, that is, one sensitive to differences in the training dataset. This design aims to learn the complex characteristics of a dataset and to diminish uncertain fluctuations in it; it also makes an unstable model more stable. C4.5 is a tree model known to be unstable and is often used for the base learners of Bagging and Boosting. These observations suggest that Bagging and Boosting are useful for improving an unstable prediction model and for checking the stability of a prediction model.

Table 5. Experimental results (F-measure)

Dataset \ No.   1     2     3      4      5     6     7      8     9     10    11    12
CM1             0.18  0.07  0.18   0.26*  0.17  0.09  0.18   0.18  0.17  0.20  0.06  0.24
KC1             0.32  0.36  0.43   0.39   0.32  0.40  0.43   0.32  0.41  0.42  0.41  0.46*
KC2             0.48  0.49  0.50   0.50   0.49  0.52  0.50   0.48  0.51  0.50  0.51  0.55*
PC1             0.13  0.32  0.42*  0.18   0.15  0.32  0.42*  0.13  0.37  0.41  0.32  0.37

(* the best result for the dataset)

Table 6. Experimental results (G-mean1)

Dataset \ No.   1     2     3      4      5     6     7     8     9     10    11    12
CM1             0.20  0.07  0.18   0.27*  0.18  0.10  0.18  0.20  0.18  0.21  0.07  0.25
KC1             0.36  0.37  0.43   0.40   0.36  0.42  0.43  0.36  0.42  0.42  0.43  0.47*
KC2             0.50  0.50  0.51   0.52   0.50  0.53  0.51  0.50  0.52  0.51  0.53  0.56*
PC1             0.17  0.37  0.43*  0.27   0.18  0.37  0.42  0.17  0.39  0.42  0.36  0.38

(* the best result for the dataset)

Bagging and Boosting showed different degrees of improvement (No. 6 and No. 9 in Tables 5–7). Boosting improved C4.5 more clearly and made C4.5 consistently superior to logistic regression. We think that the difference between the random sampling policies of Bagging and Boosting causes this result: while Bagging samples instances uniformly at random, Boosting more frequently samples instances that were learned insufficiently.

Table 8 shows that Voting (No. 11) did not improve any prediction model on any dataset. Here, an improvement means that a method achieves higher performance than the baselines; thus, the "Model" column has no value for Voting. Tables 5–7 indicate that Voting shows moderate performance compared with the baselines on all datasets except CM1. The results on CM1 suggest that the predictions of the analogy-based model and logistic regression rarely agreed, because Voting is inferior to these two models even though they show similar performance. These observations indicate that Voting can reduce the risk of selecting a poor prediction model when appropriate prediction models are combined.

Table 8 shows that Stacking (No. 12) is superior to all baselines on all datasets except PC1. Stacking shows relatively low performance on PC1 and relatively high performance on CM1 compared with the baselines, and the Naive Bayesian classifier shows similar tendencies. These results suggest that the Naive Bayesian classifier may have a dominant effect on the performance of Stacking. We therefore examined Stacking with all combinations of two of the three prediction models on PC1 and CM1. For example, on PC1, while Stacking using all three prediction models (No. 12) shows lower performance than the analogy-based model (No. 3), Stacking using only C4.5 and the analogy-based model shows a slightly higher score than the analogy-based model. We concluded from such examinations that the improvements shown in No. 12 are due to Stacking. The reason why Stacking can improve predictive performance while Voting cannot seems to be the difference in how the base learners' outputs are synthesized; that is, the base learners may reduce the complexity of a dataset, but their outputs are still so complex that simple voting cannot deal with them.

Overall, we confirmed that ensemble learning methods can improve predictive performance.

5. Related work

Many prediction and estimation methods have been proposed, and many comparative studies of these methods have also been performed. Nevertheless, it is sometimes difficult to interpret the results of these studies because some of them show inconsistent results. This lack of convergence among comparative studies of effort estimation methods was pointed out by Myrtveit et al. [12]. For fault-prone prediction models, Ma et al. showed the effects of varying performance measures [10]. They also investigated dataset characteristics, but neither study investigated the influence of dataset characteristics on predictive performance.

Some studies have evaluated the predictive performance of a subset of the ensemble learning methods. Khoshgoftaar et al. evaluated ensemble learning methods for fault-prone module prediction [6, 8]. In [8], AdaBoost was applied to fault-prone module prediction. In [6], Bagging and Boosting based on C4.5 and decision stumps were evaluated. However, the other prediction models used in this paper were not employed in their experiments. Furthermore, to the best of our knowledge, Stacking and Voting have not yet been evaluated in either effort estimation or quality prediction.

Table 7. Experimental results (G-mean2)

Dataset \ No.   1     2     3      4      5     6     7      8     9     10     11    12
CM1             0.25  0.11  0.32   0.44*  0.24  0.12  0.32   0.25  0.27  0.37   0.09  0.40
KC1             0.45  0.52  0.60   0.58   0.46  0.54  0.60   0.45  0.57  0.60   0.55  0.64*
KC2             0.61  0.63  0.66   0.62   0.61  0.64  0.66   0.61  0.64  0.66   0.65  0.70*
PC1             0.21  0.44  0.62*  0.50   0.24  0.44  0.62*  0.21  0.53  0.62*  0.44  0.57

(* the best result for the dataset)

Table 8. The improvements achieved by ensemble learning methods

Method     Model   Datasets
Bagging    C4.5    CM1, KC1, KC2
Boosting   C4.5    All
Voting     —       none
Stacking   —       KC1, KC2, PC1

6. Conclusion

This study examined the effectiveness of four ensemble learning methods on the fault-prone module prediction problem using three major fault-prone prediction models. As a result, we confirmed that ensemble learning methods are useful for achieving higher predictive performance. Bagging and Boosting improved the predictive performance of the unstable C4.5 by reducing the influence of random fluctuations in the metric values of a dataset that stem from uncertainty. Stacking achieved the highest predictive performance on all datasets except CM1 by summarizing the complex characteristics of a dataset with multiple prediction models that learn different characteristics of the dataset.

The major limitation is that the datasets were collected in a single organization; further experiments with more varied datasets are future work. For Stacking, how to select the prediction models that improve predictive performance should also be investigated. To improve accuracy further, we should make these methods more suitable for the fault-prone prediction problem.

References

[1] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[2] M. Doumpos and C. Zopounidis. Model combination for credit risk assessment: A stacked generalization approach. Annals of Operations Research, 2006.

[3] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[4] M. Jørgensen and M. Shepperd. A systematic review of software development cost estimation studies. IEEE Transactions on Software Engineering, 2007.

[5] T. M. Khoshgoftaar, E. B. Allen, and J. Deng. Using regression trees to classify fault-prone software modules. IEEE Transactions on Reliability, 51(4):455–462, 2002.

[6] T. M. Khoshgoftaar, E. Galeyn, and L. Nguyen. Empirical case studies of combining software quality classification models. In Proc. of the 3rd International Conference on Quality Software (QSIC'03), pages 40–49, 2003.

[7] T. M. Khoshgoftaar, K. Ganesan, E. B. Allen, F. D. Ross, R. Munikoti, N. Goel, and A. Nandi. Predicting fault-prone modules with case-based reasoning. In Proc. of the 8th International Symposium on Software Reliability Engineering (ISSRE '97), pages 27–35, Los Alamitos, CA, USA, 1997. IEEE Computer Society.

[8] T. M. Khoshgoftaar, E. Geleyn, L. Nguyen, and L. Bullard. Cost-sensitive boosting in software quality modeling. In Proc. of the 7th IEEE International Symposium on High Assurance Systems Engineering (HASE'02), pages 51–59, 2002.

[9] A. Lemmens and C. Croux. Bagging and boosting classification trees to predict churn. Journal of Marketing Research, 2006.

[10] Y. Ma and B. Cukic. Adequate and precise evaluation of quality models in software engineering studies. In Proc. of PROMISE'07, pages 1–9, 2007.

[11] J. Munson and T. Khoshgoftaar. The detection of fault-prone programs. IEEE Transactions on Software Engineering, 18(5), 1992.

[12] I. Myrtveit, E. Stensrud, and M. Shepperd. Reliability and validity in comparative studies of software prediction models. IEEE Transactions on Software Engineering, 31(5):380–391, 2005.

[13] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

[14] M. Shepperd, C. Schofield, and B. Kitchenham. Effort estimation using analogy. In Proc. of the 18th International Conference on Software Engineering, pages 170–178, 1996.

[15] J. S. Shirabad and T. J. Menzies. The PROMISE repository of software engineering databases. School of Information Technology and Engineering, University of Ottawa, Canada, 2005.

[16] Weka Machine Learning Project. Weka 3: Data mining software in Java. http://www.cs.waikato.ac.nz/~ml/weka/

[17] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

[18] D. H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.


Copyright (C) 2007 by Information Processing Society of Japan. All rights reserved.
No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or any means, without permission in writing from the publisher.
ISBN 978-4-915256-72-1 C3040
Publisher: Information Processing Society of Japan
Kagaku-kaikan (Chemistry Hall) 4F, 1-5 Kanda-Surugadai, Chiyoda-ku, Tokyo 101-0062, JAPAN
Tel: +81-3-3518-8374  Fax: +81-3-3518-8375