Introduction and Overview to Mining Software Repository

Zoltan Karaszi, zkaraszi (at) kent.edu

MS/PHD seminar (cs6/89191)

November 9th, 2011

Abstract

Based on the following survey paper: “A survey and taxonomy of approaches for mining software repositories in the context of software evolution” by Huzefa Kagdi, Michael L. Collard and Jonathan I. Maletic, 2007

After defining MSR and giving the background and the different classifications, my main goal is to give a general picture of MSR

After showing the different MSR approaches, I will focus on one example of frequent-pattern mining that examines the changes and evolution of software

Outline

1. Introduction

2. Dimensions of the survey

3. Layered taxonomy of MSR approaches

4. Software repository mining overview

5. Example: Frequent-pattern mining

6. Discussion and open issues

7. Concluding remarks

8. References

1. Introduction

1.1. Terms

Mining Software Repositories (MSR): a term created to describe a broad class of investigations into the examination of software repositories

Software Repositories (SRs): artifacts produced and archived during software evolution

Concurrent Versions System (CVS): a free client-server revision control system that keeps track of all changes in a set of files

1.2. Premise

Empirical and systematic investigations of repositories

Uncover information, relationships or trends

Shed new light on the process of software evolution and the changes

1.3. Scope, background and history

Scope

Surveys the literature up to June 2006

Specifically covers investigations of the evolutionary changes of software artifacts

Background

No earlier survey covered investigations that examine the changes and evolution of software using data mining and similar techniques

In the past

MSR investigations were conducted on industrial systems

Research efforts were limited to a few software systems

Currently

The large increase in open-source software raises the question of how to manage this challenge

1.4. Goals of the survey

Form a basis for researchers interested in MSR to better understand the evolution of software systems

Create a taxonomy to assist in the continued advancement of the field

Gain a clearer understanding to support the development of tools, methods and processes that more precisely reflect the actual nature of software evolution

2. Dimensions of the survey

2.1. Information sources

Categories of information in SR

Metadata about the software change: comments, user-ids, timestamps

Differences between the versions: addition, deletion or modification

Classification of different software versions (artifacts)

Version control and bug-tracking systems

CVS – does not maintain explicit branch and merge points

Subversion (more modern) – builds the change-set

Bugzilla – bug-tracking system that keeps the history of the entire lifecycle of a bug (bug report)

2.2. Purpose

Extract information and uncover relationships or trends in source code evolution

Two classes of MSR questions

Market-Basket Question (MBQ), formulated as: If A occurs, then what else occurs on a regular basis?

Prevalence Question (PQ), formulated as: Was a particular function added/deleted/modified? How many and which of the functions are reused?

2.3. Methodology

Researchers utilize software repositories in multiple ways

Limit the studies to the metadata directly available from the repositories, used in the traditional, semantic manner

Directly use the functionality of source code repositories (CVS commands) to get a particular version of the code, then apply an adopted or invented methodology

2.4. Evaluation

Assessment metrics

Precision: how much of the information found is relevant

Recall: how much of all of the relevant information is found
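The two metrics can be made concrete with a small sketch. The function name `precision_recall` and the file names are hypothetical, chosen only for illustration; the definitions follow the two bullets above.

```python
def precision_recall(found, relevant):
    """Return (precision, recall) for the found items against the relevant items."""
    found, relevant = set(found), set(relevant)
    hits = found & relevant  # items that are both found and relevant
    precision = len(hits) / len(found) if found else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: a tool suggests three files, two of which really changed,
# out of four files that changed in total.
suggested = {"a.c", "b.c", "x.c"}
actually_changed = {"a.c", "b.c", "c.c", "d.c"}
precision, recall = precision_recall(suggested, actually_changed)
```

Here two of the three suggestions are relevant (precision 2/3), and two of the four relevant files are found (recall 1/2).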

3. Layered taxonomy of MSR approaches

All of the works investigated in the survey paper:

Work on version-release histories

Work at the same level of granularity

Ask and answer very similar types of MSR questions

Analyze the information and derive conclusions within the context of software evolution

The four-layer taxonomic description [1]

4. Software repository mining overview

4.1. Metadata analysis

Lightweight methodology to analyze metadata

Utilizes the metadata stored in software repositories

Straightforward first choice – easily accessible (CVS log)
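As a minimal sketch of metadata analysis, the example below works on already-parsed log records. The `LogEntry` record, its fields and the keyword heuristic are assumptions made for illustration; real CVS log output would first have to be parsed into this shape.

```python
from collections import Counter, namedtuple

# Hypothetical, simplified record for one file revision taken from a log.
LogEntry = namedtuple("LogEntry", "file author date comment")


def revisions_per_author(entries):
    """Count how many file revisions each author committed."""
    return Counter(e.author for e in entries)


def bug_fix_entries(entries, keywords=("fix", "bug")):
    """Entries whose comment mentions one of the keywords (a crude heuristic)."""
    return [e for e in entries if any(k in e.comment.lower() for k in keywords)]


log = [
    LogEntry("parser.c", "alice", "2006-01-10", "Fix null pointer bug in parser"),
    LogEntry("lexer.c", "bob", "2006-01-11", "Add support for new tokens"),
    LogEntry("parser.h", "alice", "2006-01-10", "Fix null pointer bug in parser"),
]
per_author = revisions_per_author(log)
fixes = bug_fix_entries(log)
```

Only comments, user-ids and timestamps are used here; the source code itself is never inspected, which is what makes this kind of analysis lightweight.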

4.2. Static source code analysis

Good approach to extract facts and other information from versions of a system

Used for bug finding and fixing

4.3. Source code differencing and analysis

Further extension of MSR with regard to source code changes

Works in a more source code ‘aware’ manner
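For contrast, a plain line-based difference between two versions can be sketched with Python's standard `difflib` module. This is only the baseline that the more source code ‘aware’ approaches improve on: it reports changed lines, not changed syntactic entities. The function name `line_changes` and the two code fragments are hypothetical.

```python
import difflib


def line_changes(old_source, new_source):
    """Return (added, deleted) lines between two versions of a file."""
    diff = difflib.ndiff(old_source.splitlines(), new_source.splitlines())
    added, deleted = [], []
    for line in diff:
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            deleted.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return added, deleted


old = "int add(int a, int b) {\n    return a + b;\n}\n"
new = "int add(int a, int b) {\n    return b + a;\n}\nint zero() { return 0; }\n"
added, deleted = line_changes(old, new)
```

A source code aware approach would instead report that function `add` was modified and function `zero` was added.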

4.4. Software metrics

Quantitatively measure various aspects of software products and projects

Include size, effort, cost, functionality, quality, complexity and efficiency

4.5. Visualization

Interactive visual representation of data to amplify cognition and to support software maintenance and evolution

Very task specific

Based on the mined data and how one separates approach categories

4.6. Clone-detection methods

Approaches to identify both exact and near-miss clones

Clones: source code entities with similar textual, structural and semantic composition
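A rough sketch of how near-miss clones might be flagged: identifiers and numbers are replaced by placeholders, and the resulting token sequences are compared. This is a simplified textual technique assumed for illustration, not one of the surveyed tools; `normalize`, `similarity` and the fragments are hypothetical, and keywords are treated like any other identifier.

```python
import difflib
import re


def normalize(fragment):
    """Tokenize a fragment, replacing identifiers with ID and numbers with NUM."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", fragment)
    out = []
    for tok in tokens:
        if re.fullmatch(r"[A-Za-z_]\w*", tok):
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)  # operators and punctuation are kept as-is
    return out


def similarity(a, b):
    """Similarity in [0, 1] between the normalized token sequences."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


f1 = "total = total + price * 2;"
f2 = "sum = sum + cost * 3;"
f3 = "if (x) { return; }"
clone_score = similarity(f1, f2)
other_score = similarity(f1, f3)
```

The first two fragments differ only in names and constants, so they normalize to the same token sequence; the third has a different structure and scores lower.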


4.7. Information-Retrieval (IR) methods

Classification and clustering of textual units

Applied to many software engineering problems: traceability, program comprehension, and software reuse

Applied to CVS comments, textual descriptions of bug reports, and e-mails
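One common IR building block is comparing textual units by TF-IDF weighted cosine similarity. The sketch below uses only the standard library; the weighting scheme (term count times log(N / document frequency)) is one standard choice assumed here, and the three texts are invented examples of a commit comment, a bug report and an unrelated comment.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    token_lists = [tokenize(d) for d in documents]
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "Fix crash when parsing empty file",       # commit comment
    "Parser crash on empty input file",        # bug report
    "Update copyright year in documentation",  # unrelated commit comment
]
vecs = tfidf_vectors(docs)
related = cosine(vecs[0], vecs[1])
unrelated = cosine(vecs[0], vecs[2])
```

The commit comment and the bug report share terms and score above zero, which is the kind of signal a traceability link between the two could be based on.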

4.8. Classification with supervised learning

Supervised learning: a technique for creating a cause–effect function from training data

4.9. Social network analysis

For deriving and measuring ‘invisible’ relationships between social entities

Used to discover developer roles, contributions and associations in software development
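A minimal sketch of deriving such ‘invisible’ relationships: developers are linked when they changed the same file. The input mapping, the developer names and the use of node degree as a measure are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def developer_network(file_authors):
    """Build weighted edges between developers who changed the same file."""
    edges = defaultdict(int)
    for authors in file_authors.values():
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1  # one more file shared by a and b
    return dict(edges)


def degree(edges):
    """Number of distinct collaborators per developer."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)


# Hypothetical history: file name -> authors of its revisions.
history = {
    "parser.c": ["alice", "bob"],
    "lexer.c": ["alice", "carol"],
    "main.c": ["alice", "bob", "alice"],
    "README": ["dave"],
}
edges = developer_network(history)
degrees = degree(edges)
```

In this toy history alice is connected to two other developers, while dave shares no file with anyone and does not appear in the network.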



5. Example: Frequent-pattern mining

Discover implicit knowledge from large datasets (patterns, trends, rules)

Encompasses IR, statistical analysis and modeling and machine learning

Applied to uncover software entities that frequently change together (frequent patterns)

Include the ordering information

[34]


5.1. Evolutionary couplings and change predictions

Zimmermann et al. [15] aimed to identify co-occurring changes in a software system

Purpose: find changes of the form: source code entity (function A) modified ⇒ other entities (functions B and C) modified

Use the ROSE tool to parse the source code (C++, Java, Python)

Use an association-rule mining technique to determine rules of the form B ⇐ A

Derived association rules such as: a change to a particular ‘type’ definition leads to changes

In instances of variables of that ‘type’

In the coupling between interface and implementation
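How such a rule is scored can be sketched with the two standard association-rule measures, support and confidence. This is a generic illustration, not the ROSE implementation; the function names and the four example commits are hypothetical.

```python
def support_count(transactions, items):
    """Number of transactions that contain all of the given items."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def rule_metrics(transactions, antecedent, consequent):
    """Return (support count, confidence) for the rule antecedent => consequent."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    ante = support_count(transactions, antecedent)
    confidence = both / ante if ante else 0.0
    return both, confidence


# Hypothetical transactions: the functions changed together in one commit.
commits = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
]
support, confidence = rule_metrics(commits, {"A"}, {"B"})
```

Function A changed in three commits and B changed with it in two of them, so the rule "A modified, then B modified" has support 2 and confidence 2/3.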


5.2. Capabilities of technique

Ability to identify addition, modification and deletion of syntactic entities

Handles various programming languages and HTML documents

Detection of hidden dependencies

Figure 1.2: Programmers who Changed this Function also Changed…[15]


5.3. Extension of their work [33]

Allows prediction of additions to and deletions from entities

ROSE was evaluated for

Navigation (recommendation of other affected entities)

Closure (false suggestions for missing entities)

Granularity (fine versus coarse)

Maintenance (modified only)


5.4. Evaluation (‘interactive power’ of ROSE tool)

Period: at least one month selected for eight open-source projects

Prediction: based on previous versions, predict the changes that occurred during the evaluation period

New additional measure feedback: percentage of queries

Average precision, recall, and feedback values

Navigation and prevention support is better with coarse level than with fine level granularity

Average feedback values in the case of closure: 1.9% for fine granularity and 3% for coarse granularity


5.5. Advantages of extended ROSE tool

Needs only a few weeks of history to make suggestions

Results can be improved by assigning higher weight to rapid renames and moves

Similar approach

Ying et al. [34] - approach for source code change prediction at a file level

Use: association-mining technique based on FP-tree item-set mining

Evaluated: version histories of Mozilla and Eclipse projects
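File-level frequent item-set mining can be sketched as follows. Note that this is a simple level-wise (Apriori-style) search, swapped in for brevity; it is not the FP-tree algorithm mentioned above, although on the same input both should find the same frequent item-sets. The file names and the `min_support` threshold are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    result = {}
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join frequent k-item sets into candidate (k+1)-item sets.
        candidates = list(
            {a | b for a, b in combinations(frequent, 2) if len(a | b) == len(a) + 1}
        )
    return result


# Hypothetical change history: the files changed together in each commit.
changes = [
    {"ui.c", "ui.h", "model.c"},
    {"ui.c", "ui.h"},
    {"ui.c", "ui.h", "docs.txt"},
    {"model.c", "docs.txt"},
]
patterns = frequent_itemsets(changes, min_support=2)
```

With a minimum support of 2, the only frequent pair is `ui.c` with `ui.h`, which suggests recommending one file when the other is changed.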


6. Discussion and open issues

Need to be able to perform MSR on fine-grained entities

Standards for validation must be developed

7. Concluding remarks

Over 80 investigations were surveyed

A layered taxonomy was derived

MSR investigations are a promising avenue to help support and understand software evolution!


8. References

[1]. Kagdi, H., Collard, M.L., Maletic, J.I., "A Survey and Taxonomy of Approaches for Mining Software Repositories in the Context of Software Evolution", in the Journal of Software Maintenance and Evolution: Research and Practice (JSME), Vol. 19, No. 2, 2007, pp. 77-131.

[15]. Zimmermann T, Weißgerber P, Diehl S, Zeller A. Mining version histories to guide software changes. Proceedings 26th International Conference on Software Engineering (ICSE’04). IEEE Computer Society Press: Los Alamitos CA, 2004.

[33]. Zimmermann T, Zeller A, Weißgerber P, Diehl S. Mining version histories to guide software changes. IEEE Transactions on Software Engineering 2005; 31(6):429–445.

[34]. Ying ATT, Murphy GC, Ng R, Chu-Carroll MC. Predicting source code changes by mining change history. IEEE Transactions on Software Engineering 2004; 30(9):574–586.

Thank you for your time!
