on predicting software development effort using machine learning techniques and local data.pdf

Download ON PREDICTING SOFTWARE DEVELOPMENT EFFORT USING MACHINE LEARNING TECHNIQUES AND LOCAL DATA.pdf

Post on 14-Apr-2018

214 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

  • 7/29/2019 ON PREDICTING SOFTWARE DEVELOPMENT EFFORT USING MACHINE LEARNING TECHNIQUES AND LOCAL DATA.pdf

    1/14

    On Predicting Software Development Effort using Machine Learning Techniques and Local Data

    Vol. 2, No. 2, July-December 2010 12

    * Institute of Information Technology in Management, Faculty of Economics and Management, University of Szczecin, PolanE-mail: lukrad@uoo.univ.szczecin.pl, woodieh@gmail.com

    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND COMPUT

    International Science Press, ISSN: 2229-7

    ON PREDICTING SOFTWARE DEVELOPMENT EFFORT USING

    MACHINE LEARNING TECHNIQUES AND LOCAL DATA

    Lukasz Radlinski and Wladyslaw Hoffmann

    This paper analyses the accuracy of predictions for software development effort using various machine learning techniques. Tmain aim is to investigate the stability of these predictions by analyzing if particular techniques achieve a similar level accuracy for different datasets. Two key assumptions are that (1) predictions are performed using local empirical data and very little expert input is required. The study involves using 23 machine learning techniques with four publicly availabdatasets: COCOMO, Desharnais, Maxwell and QQDefects. The results show that the accuracy of predictions for each techniqvaries depending on the dataset used. With feature selection most techniques provide higher predictive accuracy and this accurais more stable across different datasets. The highest positive impact of feature selection on the accuracy has been observed for tK* technique, which has generated the most accurate predictions across all datasets.

    Keywords: effort prediction, local data, process factors, machine learning, prediction accuracy

    1. INTRODUCTION

    Software development effort can be predicted usingvarious approaches. Some of them require largedataset of past projects while others require stronginput from domain expert. Some approaches areeasy to use while others are difficult to follow andtime consuming. Jrgensen and Shepperd [10]performed a systematic review of softwaredevelopment effort estimation studies. They haveprovided and analyzed an extensive list of relevantpublications in their paper. They have alsoperformed various classifications for this subject.

    In this study, we focus on the subfield of thispopular research field by attempting to answer thefollowing main question: Is it possible to easilypredict software development effort from localdata?

    There are two key constraints limiting the scope

    of this study: easy prediction and using local data.By easy prediction we mean using such procedure,which can be followed even without havingbackground in quantitative statistical techniques.Some of such techniques can often be properlyapplied only after performing various checksrelated to input data, for example about the

    normality of distributions or linear relationshibetween variables to name just a couple of thsimplest. The procedures of using such techniqumay be complex and time-consuming. Thus, in thstudy we examine a performance of a set of machinlearning techniques that can be applied easily the empirical input data. Naturally, they alsrequire some basic data preparation but this step

    very simple.The second constraint is to use local data. B

    this, we mean data from a single software companor a software development division of a largorganization. Various researchers have analyzeeffort predictions from local and cross-compandata. In [12, 16] the authors compare the accuracof predictions from local and cross-company datincluding results achieved in earlier work by othauthors. This analysis revealed that in somexperiments similar accuracy have been achieve

    while in other experiments predictions based olocal data have been more accurate than based ocross-company data. When gathering croscompany data there are typically more peopinvolved than when gathering local datFurthermore, for cross-company data there islarger risk in different understanding the data b

    mailto:lukrad@uoo.univ.szczecin.plmailto:woodieh@gmail.comINTERNATIONALmailto:woodieh@gmail.comINTERNATIONALmailto:woodieh@gmail.comINTERNATIONALmailto:lukrad@uoo.univ.szczecin.pl
  • 7/29/2019 ON PREDICTING SOFTWARE DEVELOPMENT EFFORT USING MACHINE LEARNING TECHNIQUES AND LOCAL DATA.pdf

    2/14

    Lukasz Radlinski and Wladyslaw Hoffmann

    124 International Journal of Software Engineering and Computi

    people who gather them. For example, theassessment very high for team experience ishighly subjective not only for different individualsbut also from a perspective of specific company, itsenvironment and culture. In addition, there arevarious ways of calculating project effort depending

    on the types of activities included, people involvedin development, and many others. For thesereasons, to reduce the difficulties associated withusing cross-company data, we have decided to useonly local data of software projects.

    To support answering the main researchquestion of this paper we analyze the following setof more detailed questions:

    What is the accuracy of predictionsobtained using various machine learningtechniques?

    What are the differences betweenpredictions using these techniques?

    What is the impact of feature selection onachieved accuracy of predictions?

    How stable are predictions from particulartechnique across various datasets?

    This paper is organized as follows: Section 2provides information on the procedure, techniquesand datasets used in this experiment. Main pointsof the data preparation stage have beensummarized in Section 3. Section 4 discusses

    achieved results. An overview of the threats tovalidity has been included in Section 5. We sum upthe whole study in Section 6.

    2. RESEARCH APPROACH

    2.1.Dataset Used

    In this study we have used four publicly availabledatasets: COCOMO [2, 3], Desharnais [5], Maxwell

    Table 1Summary of Datasets Used

    Datasets Number of Project Effort Number of numeric Number of ordinal Number of nomin

    cases size predictors predictors predict

    COCOMO 93 KLOC person- 2 15[2, 3] months

    Desharnais 81 function person- 7 0[5] points hours

    Maxwell 62 function person- 3 15[15] points hours

    QQDefects 29 KLOC person- 1 27[6, 7] hours

    [15], and QQDefects [6, 7]. They have beesummarized in Table 1. All datasets are availabin the PROMISE repository [4]. Most of thedatasets, i.e. COCOMO, Desharnais, and Maxwehave been widely used in earlier experimentSelected results from these studies have bee

    discussed in Section 4.3.QQDefects has been originally used in [6, 7]

    predict the number of defects. In this study we hanot used a variable number of defects and attemto use the dataset for effort prediction. However,slightly modified version of this dataset has beeused. This includes an additional predictor projetype and additional cases for which the data hnot been previously published. It should be notethat QQDefects is the only dataset where thnumber of variables (31) exceeds the number

    cases (29).Tables 25 list the variables in particular datas

    used in this study. Most original datasets contaother variables. However, since they are nrelevant to this study, we have removed them annot listed in these tables. Detailed descriptions datasets and variables are available in relevaoriginal studies.

    We can observe that there are high differencbetween these datasets, related to the number cases and number of predictors of specific type

    Additionally, projects in different datasets atypically described from different point of view. Fexample, Maxwell dataset contains data on thdevelopment process but is more focused on thnature of projects. On the other hand, in QQDefecdataset almost all variables describe thdevelopment process. Furthermore, very fevariables have been included in all datasets,

  • 7/29/2019 ON PREDICTING SOFTWARE DEVELOPMENT EFFORT USING MACHINE LEARNING TECHNIQUES AND LOCAL DATA.pdf

    3/14

    On Predicting Software Development Effort using Machine Learning Techniques and Local Data

    Vol. 2, No. 2, July-December 2010 12

    Table 2List of Variables in COCOMO Dataset

    Symbol Name Type Symbol Name Type

    projectname name of project nominal acap analysts capability ordina

    cat category of application nominal aexp application experience ordina

    forg flight or ground system nominal pcap programmers capability ordina

    center which NASA center nominal vexp virtual machine experience ordina

    mode semidetached, embedded, organic nominal lexp language experience ordina

    rely required software reliability ordinal modp modern programing ordinapractices

    data data base size ordinal tool use of software tools ordina

    cplx process complexity ordinal sced schedule constraint ordina

    time time constraint for cpu ordinal year year of development numer

    stor main memory constraint ordinal kloc number of KLOC numer

    virt machine volatility ordinal effort development effort numer

    turn turnaround time ordinal

    Table 3List of Variables in Desharnais Dataset

    Symbol Name Type Symbol Name Type

    TeamExp team experience numeric Entities number of entities numer

    Language programming language nominal FPNon function points non- numerAdjust adjusted

    ManagerExp manager experience numeric Adjustment function point complexity numeradjustment factor

    YearEnd year project ended numeric Effort actual effort numer

    Transactions number of transactions numeric

    Table 4List of Variables in Maxwell Dataset

    Symbol Name Type Symbol Name Type

    App application type nominal Complex software logical complexity ordina

    Har hardware platform nomina