
International Journal of Forecasting 23 (2007) 469–471
www.elsevier.com/locate/ijforecast

Discussion

Difficulty and complexity as factors in software effort estimation

Fred Collopy

Why do models not perform as well as, or better than, humans at software estimation? In general, the results have gone the other way. Two explanations come to mind. One is that this task is just so difficult that no one can do it very well, so that the “person vs. model” comparison is less relevant than for tasks that are closer to the margin of human performance. Another is that because each software task is unique, and programmer productivity is so variable, there is not really a single task to analyze.

1. An intrinsically difficult task

In a study comparing four popular algorithmic models for software estimation across 15 large data-processing projects developed in one firm, Kemerer (1987) found average percentage errors of 772% (for SLIM), 601% (for COCOMO), 102% (for Albrecht's Function Points) and 85% (for ESTIMACS). Many of the projects had errors in the 500–600% range. Interestingly, the errors from SLIM and COCOMO were consistently biased, with effort being overestimated in all 15 cases for both models.
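These figures are easier to interpret with the error measure spelled out. The sketch below is a minimal illustration, assuming the magnitude-of-relative-error definition commonly used in this literature (per-project error of |estimate − actual| / actual, then averaged); the effort values are made up and are not Kemerer's data.

```python
# Minimal sketch of an average percentage error calculation, assuming the
# per-project error is |estimate - actual| / actual (magnitude of relative
# error). The effort figures below are illustrative, not Kemerer's data.

def percentage_error(estimated: float, actual: float) -> float:
    """Absolute percentage error of one effort estimate."""
    return abs(estimated - actual) / actual * 100.0

def mean_percentage_error(estimates: list[float], actuals: list[float]) -> float:
    """Average percentage error across a set of projects."""
    errors = [percentage_error(e, a) for e, a in zip(estimates, actuals)]
    return sum(errors) / len(errors)

# Hypothetical effort values in person-months: a model that consistently
# overestimates (as SLIM and COCOMO did in Kemerer's sample) can easily
# produce average errors of several hundred percent.
actual_effort = [23.2, 15.8, 120.6, 45.0]
model_estimate = [180.0, 95.0, 870.0, 310.0]

print(f"{mean_percentage_error(model_estimate, actual_effort):.0f}% average error")
```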

Why do these widely used models have such a difficult time estimating a project's complexity? Kemerer suggests that it may be because the models “do not seem to capture productivity factors very well” (p. 428). The COCOMO-Intermediate and Detailed models, which include productivity factors, did no better than the Basic models. Raw Function Counts correlated as well as Function Point numbers which included processing complexity adjustments. ESTIMACS, which includes 20 additional productivity-related questions, did less well than Function Counts, and SLIM consistently generated productivity factors that were too low.
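For readers unfamiliar with the adjustment being referred to, the following sketch assumes the standard Albrecht/IFPUG-style calculation, in which an unadjusted function count is scaled by a value adjustment factor derived from 14 general system characteristics, each rated 0–5; the count and ratings here are made up for illustration.

```python
# Rough sketch of an Albrecht/IFPUG-style complexity adjustment, assuming
# the usual formulation: adjusted FP = unadjusted count * (0.65 + 0.01 * TDI),
# where TDI is the sum of 14 general system characteristic ratings (0-5 each).
# The count and ratings below are made up for illustration.

UNADJUSTED_FUNCTION_COUNT = 212  # hypothetical raw function count

# One rating per general system characteristic (data communications,
# performance, transaction rate, ...); 14 values, each in the range 0-5.
gsc_ratings = [3, 4, 2, 5, 3, 1, 0, 4, 3, 2, 5, 1, 2, 3]
assert len(gsc_ratings) == 14 and all(0 <= r <= 5 for r in gsc_ratings)

total_degree_of_influence = sum(gsc_ratings)                        # 0..70
value_adjustment_factor = 0.65 + 0.01 * total_degree_of_influence   # 0.65..1.35
adjusted_function_points = UNADJUSTED_FUNCTION_COUNT * value_adjustment_factor

print(f"VAF = {value_adjustment_factor:.2f}, "
      f"adjusted FP = {adjusted_function_points:.1f}")
# Kemerer's finding was that this adjustment added little: raw counts
# correlated with effort about as well as the adjusted figures.
```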

2. A unique assemblage of tasks

Each piece of software is a new creation. As Bollinger (1997) puts it: “The creation of genuinely new software has far more in common with developing a new theory of physics than it does with producing cars or watches on an assembly line” (p. 125). This is a problem for software developers and those who manage them. Many things explain software's complexity.

Variations in programmer productivity are the stuff of legend. Those attempting to put a number to these differences propose that some programmers are between 5 and 100 times as productive as others. In any case, everyone who has worked around programmers knows that there are large productivity differences among them. DeMarco and Lister (1989) compared programs produced by 118 programmers implementing the same functionality using the same language (COBOL). The number of lines used to solve the problem varied as much as ten times. In a comparison of 41 students in an advanced programming course at Yale, the time students took to complete twelve programming assignments ranged from four to 77 hours. What is more, there was no relationship between the amount of time that students spent and the quality of the resulting programs as assessed by how many automated tests they passed (Spolsky, 2005).

Software is more malleable than most other things that we design. In many cases of software design, both the goals and the criteria we use to judge progress change throughout the project. And despite what is often assumed while making estimates, such features as reliability, changeability, testability and structure are more likely to determine how long it takes to complete a module than function or performance (Brooks, 2000). It is not uncommon for programmers to find themselves spending a lot of time on seemingly minor elements of a program's construction, long after its basic functionality is well under control. And if programmers finish work on the functionality of a module before the scheduled completion date, they often turn their attention to improving its structure, reliability or robustness.

3. How studies in Jorgensen relate to these explanations

When I first examined the results in Jorgensen's Appendix A, I was struck by the number of large errors, reminiscent of the errors reported in Kemerer (1987). These suggested that software estimation might just be a difficult task. Across the 12 studies for which percentage errors were presented or easily calculated, the average percentage errors for methods within a study ranged from 13% to 413%. These averages cluster into three groups. There are five studies with relatively large errors (67% to 413%), three for which the errors are quite small (13% to 17%) and four in an intermediate range between those groups (29% to 56%). The five studies with large errors are lab studies. The three with small errors are field studies. And the four in the intermediate range are hybrid studies that include both lab and field elements.

Though lab studies lack the realism of field studies, they have the benefit of controlling for many factors that cannot be controlled in field settings. It seems important, therefore, that the errors in the lab studies are the highest of the 12 studies, with average errors of 413% (study 1), 337% (study 2), 201% (study 3), 67% (study 11) and 157% (study 12). These results suggest that in the lab, it is difficult to estimate completion times. Jorgensen suggests that this may be because relevant information is lacking. Alternatively, these studies may reflect the intrinsic difficulty of the task.

When the lab studies are set aside, the intrinsic difficulty issue seems to go away, at least for the studies reviewed here. Once the lab studies are excluded, the range of average errors in the remaining studies – from 13% to 57% – looks much more reasonable.

Of the seven studies that remain, the models in two studies (7 and 9) were not applied in an ex ante fashion. Because the models had access to information about completed projects that was not available to the experts (and that would not be available when making real estimates), these results should be discounted.

This leaves five studies. Study 8 is a fairly straightforward single company field study. The model performed better than the experts (10% vs. 20%), though it appears that the model received some help. A more textbook application of Function Points did considerably less well (58%). Study 10 is a similar study, but of student projects. Data from the previous year's projects were available both to student groups and to the model, and the average difference between them was small (16% vs. 19%). In the remaining field study (16), the model was more accurate (7% vs. 18%). The results from the two remaining hybrid studies (13 and 15) are also divided, with the model clearly doing better in one and worse in the other. So in these more realistic settings, the model has a slight advantage over the judges, even given the complexity of the tasks. These comparisons are probably the most like the earlier “person vs. model” comparisons, in that they are within a useful range of performance and are estimating the same thing based largely on the same information.

Acknowledgments

I wish to thank Scott Armstrong, Frederick Brooks, Jason Dana, Robin Hogarth, Chris Kemerer, J. P. Lewis and Joel Spolsky for useful comments on earlier drafts.

References

Bollinger, T. (1997). The interplay of art and science in software. IEEE Computer, 128, 125–127.

Brooks, F. P., Jr. (2000). The design of design. Turing Award Lecture, at http://terra.cs.nps.navy.mil/DistanceEducation/online.siggraph.org/2001/SpecialSessions/2000TuringLecture-DesignOfDesign/session.html (accessed on October 1, 2006).

DeMarco, T., & Lister, T. (1989). Software development: State of theart vs. state of the practice. Proceedings of the 11th internationalconference on software engineering (pp. 271−275).

Kemerer, C. F. (1987). An empirical validation of software cost estimation models. Communications of the ACM, 30, 416–429.

Spolsky, J. (2005). Hitting the high notes. At http://www.joelonsoftware.com/articles/HighNotes.html (accessed on October 1, 2006).