
Data Mining Application in a Software Project Management Process

Richi Nayak, School of Information Systems, QUT, Brisbane

Tian Qiu, EDS Credit Services, Adelaide, Australia

Agenda

Motivation
Data Pre-processing
Data Modelling
Analysis of Results
Conclusion

Motivation

A software engineering project leader manages a project involving several issues, such as project planning and scheduling, code implementation, testing and release.

It is difficult to precisely estimate the project duration in advance.

However, the project leader can use data mining methods to make more accurate estimates by learning from the information gained in previous projects.

The project leader can also eliminate several potential problems when a pattern appears in the current project that is similar to one that caused problems in previous projects.

Data mining techniques have been successfully applied to various areas such as marketing, medicine and finance.

However, few such applications can currently be seen in the software engineering domain.

Data Acquisition

Every year, more than 50 software projects are carried out in MASC, more than ten thousand lines of code are created, and thousands of pages of documents are released to various customers.

In order to control the problems appearing during a project life cycle and to improve the working efficiency, MASC has set up a series of processes such as Software Configuration Management, Software Risk Management, Software Project Metric Report and Software Problem Report Management.

The Software Problem Report (PR) management data is chosen as the focus of this research.

A software bug-tracking system, GNATS (A Tracking System by GNU), is set up on the MASC intranet to collect and maintain all PRs raised by every department and individual within MASC.

Currently the GNATS system stores more than 40,000 problem reports.

Structure of a PR

Data Pre-processing: Defining Goals (1)

Finding precise estimates for bug fixing or estimation work at the early stage of a project.

Few or no tools exist that can be used for both the estimation and the project problem reasoning stages.

Project team members can only give estimates based on their own experience from previous projects.

If the current project is outside the topics they are familiar with, the accuracy of the estimates becomes worse.

Fixing a PR (problem report) also becomes more tedious when the responsible person cannot estimate the time needed to fix the problem.

Accurate estimation will bring great cost savings and accurate progress control to the development team and to the organization.

Data Pre-processing: Defining Goals (2)

Improving control over PR fixing and reducing the project cycle time.

Currently the GNATS system has no actual database management system implemented.

This limits the potential benefits to software engineers, who could obtain valuable information if the existing PR data were analysed.

Such analysis is especially useful when a programmer is struggling with a bug whose resolution may already be hidden in the knowledge that can be derived from previous similar problems.

Data Pre-processing: Field Selection (1)

Whenever a PR is raised, a project leader has to find answers to the following questions:

How severe is the problem (customer impact)?
What is the impact of the problem on the project schedule (cost and team priority)?
What type of problem is it (a software bug or a design flaw)?
How long will it take to fix?
How many people are available?

Attributes Severity, Priority, Class, Arrival-Date, Closed-Date, Responsible, and Synopsis are considered for mining.

Several fields such as Confidential, Submitter-ID, Environment, Fix, Release Note, Audit Trail, the associated project name and PR number are not considered during mining.

All PRs with a closed value in their State field are chosen.
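The field selection above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representation of a PR as a dictionary, the function name select_for_mining and the sample records are hypothetical and do not come from the GNATS export format.

```python
# Attributes kept for mining, as listed on the slide.
MINING_FIELDS = ["Severity", "Priority", "Class", "Arrival-Date",
                 "Closed-Date", "Responsible", "Synopsis"]

def select_for_mining(prs):
    """Keep only closed PRs and project them onto the mining attributes."""
    selected = []
    for pr in prs:
        # Assumption: the State field holds the literal value "closed".
        if pr.get("State", "").strip().lower() != "closed":
            continue
        selected.append({field: pr.get(field) for field in MINING_FIELDS})
    return selected

# Hypothetical sample records, invented purely for illustration.
sample_prs = [
    {"State": "closed", "Severity": "critical", "Priority": "high",
     "Class": "sw-bug", "Arrival-Date": "1999-03-01",
     "Closed-Date": "1999-03-02", "Responsible": "dev1",
     "Synopsis": "crash on start", "Submitter-ID": "x", "Environment": "unix"},
    {"State": "open", "Severity": "serious", "Priority": "medium",
     "Class": "doc-bug", "Arrival-Date": "1999-04-01",
     "Closed-Date": None, "Responsible": "dev2",
     "Synopsis": "manual typo", "Submitter-ID": "y", "Environment": "unix"},
]

mining_set = select_for_mining(sample_prs)
print(len(mining_set), sorted(mining_set[0]))
```

Fields such as Submitter-ID and Environment are dropped simply because they are not in MINING_FIELDS, which keeps the exclusion list implicit.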

Data Pre-processing: Field Selection (2)

The attribute ‘Class’ is chosen as the target attribute in order to discover valuable knowledge about the relationship between the type of a problem and the rest of the PR attributes.

Association or characteristic rules can reveal the relationship between the fix effort (in time) and the PR class.

A project leader can analyse the fix effort versus the human resources available, and put it in the schedule and resource plan.

The Synopsis field is pure text briefly describing what the problem is in the associated project. Text mining is used to analyse this data.

Data Pre-processing: Data Transformation

Attributes Arrival-Date, Closed-Date and Responsible are used to calculate the time spent to fix a PR, assuming that the derived attribute is the total time spent to fix a problem if only one person is involved.

This attribute is discretized with cut points at one day (1), half a week (3 days), one week (7), two weeks (14), one month (30), one quarter (90 days), half a year (180 days) and more than one year (360 days).

As a result, the mining results do not depend heavily on the exact human resources involved but give an approximate estimate, allowing for minor changes in human resources.
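The discretization described above can be sketched as follows. The cut points are the ones listed on the slide; the label text, the choice of inclusive upper bounds and the function names are assumptions. The slides do not say how Responsible enters the calculation, so this sketch uses only the two dates.

```python
from bisect import bisect_left
from datetime import date

# Cut points in days, as listed on the slide.
CUT_POINTS_DAYS = [1, 3, 7, 14, 30, 90, 180, 360]
# One label per interval; label text and inclusive upper bounds are assumptions.
LABELS = [f"<={c}d" for c in CUT_POINTS_DAYS] + [f">{CUT_POINTS_DAYS[-1]}d"]

def time_to_fix_days(arrival: date, closed: date) -> int:
    """Elapsed calendar days between arrival and closure of a PR."""
    return (closed - arrival).days

def discretize(days: int) -> str:
    """Return the label of the first cut point that the duration does not exceed."""
    return LABELS[bisect_left(CUT_POINTS_DAYS, days)]

# Hypothetical PR dates, invented for illustration.
days = time_to_fix_days(date(1999, 3, 1), date(1999, 3, 6))
print(days, discretize(days))
```

Because neighbouring durations fall into the same bin, a fix that took 5 days and one that took 6 days are treated alike, which is the tolerance to small staffing changes that the slide describes.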

Data Modelling and Mining

Prediction modelling on the time-consumption patterns of the PR data to make estimates.
C5: http://www.rulequest.com/see5-info.html
CBA: http://www.comp.nus.edu.sg/~dm2/

Link analysis to discover associations among the contents/values of the selected variables.
CBA

Text mining to analyse the Synopsis field.
TextAnalyst: http://www.megaputer.com/company/index.html

Classification and Association Rule Mining

Initial data set: 40,000 PRs

Pre-processed data set: 10,765 PRs

Covering more than 120 projects from 1996 to 2000

Data sets of different sizes are used in training

Data Mining result example in CBA

CBA Results: All PRs coming from one project

PR number in the training set: 1224
PR number in the testing set: 9541

       Number    Prediction inaccuracy   Training time    Prediction inaccuracy   Testing time
       of rules  on training data (%)    cost (seconds)   on testing data (%)     cost (seconds)
SSMD   10        45.180                  1.01             47.56                   0.07
SSAD   15        46.16                   1.00             52.94                   0.08
MSMD   9         45.18                   1.01             47.56                   0.10
MSAD   11        47.059                  1.04             47.49                   0.09

CBA Results: large size training set

Large training set of 5381 PRs from all software projects.

PR number in the training set: 5381
PR number in the testing set: 5384

       Number    Prediction inaccuracy   Training time    Prediction inaccuracy   Testing time
       of rules  on training data (%)    cost (seconds)   on testing data (%)     cost (seconds)
SSMD   41        57.04                   0.41             51.95                   1.1
SSAD   18        57.39                   0.44             58.09                   1.3
MSMD   21        59.10                   0.44             58.25                   1.0
MSAD   12        58.45                   0.45             58.91                   1.2

CBA Results: equally distributed target value

• 342 PRs with the change-request value of ‘Class’ from all software projects
• 1000 PRs each with the sw-bug, doc-bug and support values

PR number in the training set: 3342
PR number in the testing set: 7423

       Number    Prediction inaccuracy   Training time    Prediction inaccuracy   Testing time
       of rules  on training data (%)    cost (seconds)   on testing data (%)     cost (seconds)
SSMD   20        43.51                   2.2              43.5                    2.0
SSAD   15        43.6                    1.6              43.8                    2.1
MSMD   15        46.5                    1.6              45.1                    1.9
MSAD   15        46.5                    1.6              46.9                    1.6
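A training set with a roughly equal distribution of target values, like the one used for this experiment, could be drawn as in the sketch below. It caps each Class value at a fixed number of PRs and keeps all PRs of classes that have fewer. Random selection within a class, the function name balanced_sample and the toy data are assumptions; the slides do not say how the 1000 PRs per class were picked.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(prs, cap=1000, seed=0):
    """Take at most `cap` PRs per Class value; smaller classes are kept whole."""
    by_class = defaultdict(list)
    for pr in prs:
        by_class[pr["Class"]].append(pr)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = []
    for cls in sorted(by_class):
        members = by_class[cls]
        if len(members) > cap:
            members = rng.sample(members, cap)
        sample.extend(members)
    return sample

# Hypothetical toy data: class sizes are invented for illustration.
toy_sizes = {"sw-bug": 5, "doc-bug": 4, "support": 3, "change-request": 2}
toy_prs = [{"Class": cls, "id": i}
           for cls, n in toy_sizes.items() for i in range(n)]

picked = balanced_sample(toy_prs, cap=3)
print(Counter(pr["Class"] for pr in picked))
```

With cap=1000 and a class that has only 342 members, the same logic yields 342 + 3 × 1000 = 3342 records, which matches the training set size reported above.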

CBA Results: Cross-validation

PR number in the data set: 10765

          Prediction inaccuracy (%)   Mining time (seconds)
CV-SSMD   46.05                       25.00
CV-SSAD   50.5                        25.345
CV-MSMD   45.02                       28.944
CV-MSAD   48.87                       25.365

CBA Results: Rules

If severity = non-critical and Time-to-fix = 3 to 30 days and priority = medium, then class = doc-bug. Confidence = 76.6%, Support = 2.7%

If severity = critical and Time-to-fix = less than 3 days and priority = high, then class = sw-bug. Confidence = 75.2%, Support = 2.3%

Conclusion: All software related bugs can be fixed within 3 days with above 75% confidence if they have high priority and are in critical condition.

It may take 3 months to fix the problem if the corresponding priority and severity are graded as medium and serious.

No rule has a confidence value larger than 80%; however, the rules do describe some characteristics of the PR fixing patterns. They are therefore useful in software project management for estimating the time needed for bug fixing.
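For readers unfamiliar with the two figures quoted for each rule, the sketch below computes support and confidence in their standard sense: support is the fraction of all records that match both the condition and the conclusion, and confidence is the fraction of records matching the condition that also match the conclusion. The function name rule_metrics and the five toy records are hypothetical; this is not CBA's implementation.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    def matches(record, pattern):
        return all(record.get(key) == value for key, value in pattern.items())

    with_antecedent = [r for r in records if matches(r, antecedent)]
    with_both = [r for r in with_antecedent if matches(r, consequent)]
    support = len(with_both) / len(records) if records else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical toy records, invented for illustration.
toy_records = [
    {"severity": "critical", "time_to_fix": "<3 days", "priority": "high", "class": "sw-bug"},
    {"severity": "critical", "time_to_fix": "<3 days", "priority": "high", "class": "sw-bug"},
    {"severity": "critical", "time_to_fix": "<3 days", "priority": "high", "class": "doc-bug"},
    {"severity": "non-critical", "time_to_fix": "3-30 days", "priority": "medium", "class": "doc-bug"},
    {"severity": "serious", "time_to_fix": "30-90 days", "priority": "medium", "class": "sw-bug"},
]

support, confidence = rule_metrics(
    toy_records,
    antecedent={"severity": "critical", "time_to_fix": "<3 days", "priority": "high"},
    consequent={"class": "sw-bug"},
)
print(f"support={support:.1%} confidence={confidence:.1%}")
```

In the toy data three records match the condition and two of those are sw-bug, so the rule covers 2 of 5 records (support) and holds for 2 of 3 matching records (confidence).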

CBA Results: Summary

In general, the prediction mismatch on the training data sets is around 46%; the lowest is 43.51% and the highest is 59.10%.

The prediction mismatch on the testing data sets is above 51%; the lowest is 43.51% and the highest is 58.25%.

The attempt to improve prediction accuracy by using samples with equally distributed target values does not lead to much change.

With the multiple-support mining engine, the error rates are higher and the number of extracted rules is lower than with the single-support mining engine.

For both the multiple-support and the single-support mining engines, the error rates on automatically discretized values are usually higher than on manually discretized values.

C5 results: Original Data Set

Total number of PR records: 10765

                          NORMAL   BOOSTING   CROSS_VAL
Number of created rules   51       N/A        57.7
Error rate of rules (%)   41.5     41.3       43.9
Size of created tree      141      N/A        121.9
Error rate of trees (%)   40.3     39.4       44.1
Process time (seconds)    5.6      37.7       41.1

C5 results: equally distributed target value

Total number of PR records: 5681 (3342 + 2339)

                          NORMAL   BOOSTING   CROSS_VAL
Number of created rules   11       N/A        12.4
Error rate of rules (%)   42.6     42.6       42.8
Size of created tree      21       N/A        17.4
Error rate of trees (%)   42.5     42.6       43.1
Process time (seconds)    0.2      0.4        1.1

C5 Results: Rules

When a PR has low priority and the time spent is around half a day (0.5 day), the rule classifies the bug as a document-related bug with high probability (87.5% confidence).

When a PR has medium priority and non-critical severity and the time spent is around 1.1 days, the rule classifies the bug as a document-related bug with 84.6% confidence.

When a PR has low priority and the time spent on fixing is around 1 week, the rule classifies the bug as a software bug with 83.3% confidence.

In general, the error rate of the rules from the training set is around 42%; the lowest is 40.3% and the highest is 43.9%.

The error rate of the generated trees on the testing data set is similar; the lowest is 39.4% and the highest is 44.1%.

Both of the rates are better than CBA results. The time efficiency of C5 is also better than CBA.

Text Mining

The analysis of the text together with the rules obtained from classification and association can more accurately predict the time and cost of fixing PRs.

The aim is to automatically summarise the pure text data and extract some valuable rules.

TextAnalyst automatically creates the semantic network based on the structure, vocabulary and volume of the analysed text, without any predefined rules.
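TextAnalyst's semantic network cannot be reproduced here, but a much simpler stand-in conveys the idea of deriving structure from vocabulary alone: counting how often two words occur in the same Synopsis. The sketch below is plain word co-occurrence counting, not TextAnalyst's algorithm; the function name cooccurrence, the minimum word length and the sample synopses are assumptions.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(synopses, min_len=3):
    """Count unordered word pairs that appear together in the same synopsis."""
    pairs = Counter()
    for text in synopses:
        # Lower-case alphabetic tokens; very short words are skipped.
        words = sorted({w for w in re.findall(r"[a-z]+", text.lower())
                        if len(w) >= min_len})
        pairs.update(combinations(words, 2))
    return pairs

# Hypothetical synopsis lines, invented for illustration.
toy_synopses = [
    "login page crashes",
    "login page timeout",
    "manual page typo",
]

pair_counts = cooccurrence(toy_synopses)
print(pair_counts.most_common(3))
```

Pairs with high counts are candidate links between concepts; a real tool would also weight them and prune weak links.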

Existing problems

The error rates of testing data sets in both CBA and C5 are higher than expected.

The value distribution is non-uniform (57%, 25%, 15%, 3%), indicating that some noise still exists in the data.

The relationship between PRs and human resources within a particular project has a great impact.

The time needed to fix a bug is different for each project depending upon the actual human resources available.

Only the attribute ‘Responsible’ is used to indicate the human resource available.

To use time patterns to help project leaders predict time consumption more accurately, the relationship with the human resources available in the past projects whose data was analysed is needed.

Using the additional data source ‘Change Request data set’, which records all customer request process data, may rectify this problem.

Conclusion

Data mining techniques were applied to a set of data collected from the software engineering process in a real software business environment.

Useful rules were inferred on the time patterns of PR fixing and on the relationship between the content and the type of a PR, in the form of association rules, classification rules or semantic trees.

The scale of this data mining task is limited.

Future work: apply data mining to different phases of software development, for example to software quality data.

Thank You!!

Questions?
