Dr. Markus Scheidgen
Model-based Analysis of Large Scale Software Repositories
■ problem
■ creating models of software repositories
■ the means for analyzing such models
■ example analysis
Problem
Is Software Engineering a Science?
■ Def.: Science (from Latin scientia) is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
■ Testable? Example theses:
★ DSLs allow domain experts to develop software more effectively and more efficiently than with GPLs.
★ Static type systems lead to safer programming and fewer bugs.
★ Functional programming leads to less performant programs.
★ Scrum allows programs to be developed faster.
★ My framework allows you to develop ... more, faster ... with less, fewer ...
■ Methods for quantitatively measuring software properties (metrics) are mostly used to assess the state of software projects, and rarely for empirical studies of software engineering itself.
Reasons
inaccessibility
• new methods have to be used first to produce data
• industry cooperations necessary
• open-source repositories are a possibility
data quality
• not easy to distinguish between written code, generated code, and test code
• there are maintained projects, developed projects, aborted projects
heterogeneity
• different project structures
• different paradigms
• different languages
• different APIs
amounts of data
• SourceForge hosts >350,000 projects
• current snapshot of the Linux kernel contains 10^8 AST nodes
• EMF's 50 MB Git repository takes 20 GB as binary-encoded AST data
Relevant Fields with Partial Solutions
Mining Software Repositories (MSR)
analysis of the rich data contained in software-engineering-related repositories such as version control systems, mailing lists, and bug-tracking systems
• guiding software development
• defect detection, prediction, resolution
• gaining actionable knowledge about software projects and software engineering methodologies
• static, language independent
• syntax based
• scale: single projects, large scale (Eclipse, Apache), ultra-large scale (SourceForge, GitHub)
Software Metrics
definition, acquisition, and analysis of quantitative measures of certain software properties
• assessment of engineering costs for development, change, maintenance, etc.
• comparative analysis of software systems or analysis of software evolution
• comparative analysis of software engineering methodologies
• language independent (e.g. LOC)
• syntax based (e.g. McCabe)
• static, dynamic (evolution)
Reverse Engineering
analysis of existing code bases to create representations at a higher level of abstraction (models)
• understanding existing software for development, change, maintenance, etc.
• deriving AST, UML, or KDM models from software
• syntax (structure, behavior)
• semantics
Problem Statement: Everything is there, but ...
1. Missing abstractions:
■ no general abstractions that cover multiple languages/repositories are used
■ only proprietary solutions and systems tailored to specific algorithms/databases, languages, repositories
2. Scalability is an issue:
■ for ultra-large-scale repositories, only VCS meta-data is used
■ for large-scale repositories, only language-independent analysis at file-level granularity is possible
■ only for single software projects is language-dependent analysis at AST-level detail feasible
Proposed Solution: Scalable Model-based Framework
■ Meta-model and reverse-engineering based approach to analyze code models on different, well-defined levels of abstraction instead of the code itself.
■ Query and transformation languages as well as model persistence based on the Map/Reduce big-data paradigm.
■ Target: AST-level analysis of large-scale repositories, e.g. git.eclipse.org (>300 projects)
SrcRepo: A Framework for Large Scale Repository Analysis
Model-based Analysis of Large Scale Software Repositories
[Figure: framework overview with the stages VCS, Model, Metrics, and Better Understanding of Software Engineering]
1) Reverse engineering to create AST-level models of software and its evolution
2) Transformations based on MSR algorithms to derive implicit dependencies
2) Queries to perform measurements based on structural, causal, and implicit dependencies
3) Statistical analysis
1) Reverse Engineering Software in Version Control Systems (VCS)
[Figure: "Code in a VCS" (code arranged by files and revisions) next to the "Software Model" with structural relations and causal relations]
1) Models of Source Repositories (github.com/markus1978/srcrepo)
[Figure: SrcRepo architecture. Components shown: SrcRepo, EMF Compare, JGit, MoDisco, EMF/EMF-Fragments; input: git repository with Java sources]
1) Models of Source Repositories (github.com/markus1978/srcrepo)
[Figure: example model and its metamodel. Model: revisions labeled B1.R1 to B1.R4, B2.R1, B2.R2 with elements A, B, C, D. Metamodel classes: Repository, Revision (prev/next), Diff, CompilationUnit, Model, Package, Class, ...; parts labeled JGit and MoDisco; references include usageInPackageAccess, package, and extends, stereotyped «relation» and/or «fragmentation»]
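The class names in the metamodel figure suggest roughly the following shape. This is a minimal Scala sketch for orientation only, not SrcRepo's actual EMF metamodel: only the class names come from the slide, while all field names, the containment structure, and the `history` helper are assumptions made for illustration.

```scala
// Hypothetical, simplified rendering of the repository metamodel.
// Only the class names appear on the slide; fields and containment are assumed.
object RepoModelSketch {
  final case class JavaClass(name: String, superClass: Option[String] = None)
  final case class JavaPackage(name: String, classes: List[JavaClass])
  final case class Model(packages: List[JavaPackage])
  final case class CompilationUnit(path: String, model: Model)
  // A diff points to the new state of a file; None stands for a deleted file.
  final case class Diff(path: String, unit: Option[CompilationUnit])
  // `prev` holds the ids of the preceding revisions.
  final case class Revision(id: String, prev: List[String], diffs: List[Diff])

  final case class Repository(revisions: Map[String, Revision]) {
    // Walk from `id` back to the root, following only the first predecessor.
    def history(id: String): List[Revision] =
      revisions.get(id) match {
        case None    => Nil
        case Some(r) => r :: r.prev.headOption.map(history).getOrElse(Nil)
      }
  }
}
```

In this sketch the revision graph is stored by id, so a revision can be looked up without loading its neighbours, which loosely mirrors the idea of fragmenting a large model.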
1) Models of Source Repositories: Scalability
SrcRepo is based on EMF-Fragments (https://github.com/markus1978/emf-fragments)
[Figure: technology stack]
• map/reduce (Hadoop)
• "Share Nothing" nodes cluster
• DFS (HDFS)
• key-value store (EMF resources) (HBase)
• structured data (EMF model), model transformations
2) Scala for queries and transformations: Syntax (internal DSL: from OCL to Scala)
Filip Krikava: Enriching EMF Models with Scala (quick overview), Eclipse Summit, Oct 24 2012
2) Scala for Queries: Syntax
    def exists(predicate: (E) => Boolean): Boolean
    def forAll(predicate: (E) => Boolean): Boolean

    def select(predicate: (E) => Boolean): Collection[E]
    def reject(predicate: (E) => Boolean): Collection[E]
    def collect[R](expr: (E) => R): Collection[R]
    def collectAll[R](expr: (E) => Collection[R]): Collection[R]
    def closure(expr: (E) => Collection[E]): Collection[E]

    def aggregate[R](expr: (E) => R, start: () => R, aggr: (R, R) => R): R
    def sum(expr: (E) => Double): Double
    def product(expr: (E) => Double): Double
    def max(expr: (E) => Double): Double
    def min(expr: (E) => Double): Double
    def average(expr: (E) => Double): Double
    ...

    def run(runnable: (E) => Unit): Unit
2) Scala for Queries: Syntax
■ example SrcRepo query: “average number of methods per class”
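    def avgMethodsPerClass(self: Model): Double = {
      // all packages, reached transitively through nested packages
      val packages = self.getOwnedPackages().closure((p) => p.getOwnedPackages())
      // all classes of these packages, plus their inner classes (collectAll flattens)
      val classes = packages.collectAll((p) => p.getOwnedClasses()).closure((c) => c.getInnerClasses())
      // mean number of methods over all collected classes
      classes.average((c) => c.getOwnedMethods().size().toDouble)
    }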
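To make the query above concrete, here is a self-contained sketch that runs the same logic on a toy in-memory model. The classes `Clazz`, `Pkg`, and `Model` and the `OclOps` extension are hypothetical stand-ins for the generated EMF/MoDisco classes and the collection API listed earlier; `closure` is assumed to return the source elements plus everything reachable from them.

```scala
// Toy stand-ins for the generated model classes (hypothetical).
object QuerySketch {
  final case class Clazz(name: String, methods: List[String], inner: List[Clazz] = Nil)
  final case class Pkg(name: String, classes: List[Clazz], sub: List[Pkg] = Nil)
  final case class Model(packages: List[Pkg])

  // A few of the OCL-like operations, added to any List via an implicit class.
  implicit class OclOps[E](private val self: List[E]) {
    // Map every element to a collection and flatten the results.
    def collectAll[R](expr: E => List[R]): List[R] = self.flatMap(expr)

    // Source elements plus everything reachable through `expr`, each kept once
    // (elements are compared with ==, so equal case-class values are merged).
    def closure(expr: E => List[E]): List[E] = {
      @annotation.tailrec
      def loop(todo: List[E], seen: List[E]): List[E] = todo match {
        case Nil                                 => seen.reverse
        case head :: tail if seen.contains(head) => loop(tail, seen)
        case head :: tail                        => loop(expr(head) ++ tail, head :: seen)
      }
      loop(self, Nil)
    }

    // Arithmetic mean; defined as 0.0 for an empty collection in this sketch.
    def average(expr: E => Double): Double =
      if (self.isEmpty) 0.0 else self.map(expr).sum / self.size
  }

  def avgMethodsPerClass(model: Model): Double = {
    val packages = model.packages.closure(_.sub)
    val classes = packages.collectAll(_.classes).closure(_.inner)
    classes.average(_.methods.size.toDouble)
  }
}
```

For a model with class `A` (two methods, one inner class with one method) and a nested package holding class `B` (three methods), `QuerySketch.avgMethodsPerClass` yields 2.0.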
2) Scala and internal DSLs: Semantics
■ Three different semantics, one interface
■ immediate collection
■ lazy iterator
■ Map/Reduce database
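The idea of one interface with several evaluation semantics can be sketched as follows. The trait `Query` and the classes `Immediate` and `Lazy` are hypothetical names, not part of SrcRepo; the Map/Reduce backend is left out because it needs a cluster-side runtime.

```scala
object SemanticsSketch {
  // One query interface ...
  trait Query[E] {
    def select(p: E => Boolean): Query[E]
    def collect[R](f: E => R): Query[R]
    def sum(f: E => Double): Double
  }

  // ... evaluated immediately: every step builds a full intermediate list.
  final class Immediate[E](elems: List[E]) extends Query[E] {
    def select(p: E => Boolean): Query[E] = new Immediate[E](elems.filter(p))
    def collect[R](f: E => R): Query[R] = new Immediate[R](elems.map(f))
    def sum(f: E => Double): Double = elems.map(f).sum
  }

  // ... or lazily: steps are only chained; elements flow through when `sum` runs.
  final class Lazy[E](elems: () => Iterator[E]) extends Query[E] {
    def select(p: E => Boolean): Query[E] = new Lazy[E](() => elems().filter(p))
    def collect[R](f: E => R): Query[R] = new Lazy[R](() => elems().map(f))
    def sum(f: E => Double): Double = elems().map(f).sum
  }

  // The same query text runs against either implementation.
  def totalLengthOfLongNames(q: Query[String]): Double =
    q.select(_.length > 3).collect(_.length).sum(_.toDouble)
}
```

Because the query is written only against `Query`, switching the evaluation strategy means passing a different object, e.g. `new Immediate(names)` or `new Lazy(() => names.iterator)`.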
Example Analysis
First Example Case Study: Design Structure Matrices (DSM) and Propagation Cost
Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software Designs, Management Science, 2006
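As a rough illustration of the measure, the sketch below treats a DSM as a Boolean dependency matrix, derives the visibility matrix as its reflexive-transitive closure, and reports propagation cost as the share of filled cells. Counting the diagonal (every element "sees" itself) and the names `visibility` and `propagationCost` are assumptions of this sketch, not taken from the slides.

```scala
object PropagationCostSketch {
  // dsm(i)(j) == true means element i depends directly on element j.
  type Matrix = Vector[Vector[Boolean]]

  // Visibility matrix: direct and indirect dependencies (Warshall's algorithm),
  // with the diagonal set, i.e. paths of length zero are counted.
  def visibility(dsm: Matrix): Matrix = {
    val n = dsm.length
    val v = Array.tabulate(n, n)((i, j) => i == j || dsm(i)(j))
    for (k <- 0 until n; i <- 0 until n if v(i)(k); j <- 0 until n if v(k)(j))
      v(i)(j) = true
    v.map(_.toVector).toVector
  }

  // Propagation cost: fraction of filled cells in the visibility matrix.
  def propagationCost(dsm: Matrix): Double = {
    val n = dsm.length
    if (n == 0) 0.0
    else visibility(dsm).map(_.count(identity)).sum.toDouble / (n * n)
  }
}
```

For a chain of three elements where the first depends on the second and the second on the third, six of the nine cells are filled, so the sketch reports a cost of 6/9.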
Second Example Case Study: Detecting Cross-Cutting Concerns
Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns, MSR '06, Shanghai, 2006
■ The same set of methods being called from different locations within the same transaction (commits in a small time window by the same committer) indicates the introduction of a cross-cutting concern.
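The heuristic can be sketched as follows. This is a simplified reading of the statement on the slide, not the algorithm of the cited paper: the record type `AddedCall`, the gap-based transaction grouping, and the `minLocations` threshold are assumptions made for illustration.

```scala
object CrossCuttingSketch {
  // Hypothetical input record: one method call introduced by a commit.
  final case class AddedCall(committer: String, time: Long, location: String, callee: String)

  // Transactions: calls by the same committer, with gaps of at most `window`
  // between consecutive calls (time is in arbitrary units).
  def transactions(calls: List[AddedCall], window: Long): List[List[AddedCall]] =
    calls.groupBy(_.committer).values.toList.flatMap { own =>
      own.sortBy(_.time).foldLeft(List.empty[List[AddedCall]]) {
        case (cur :: done, c) if c.time - cur.head.time <= window => (c :: cur) :: done
        case (acc, c)                                             => List(c) :: acc
      }
    }

  // Candidates: a set of callees that shows up at >= minLocations locations
  // within one transaction, paired with those locations.
  def candidates(calls: List[AddedCall], window: Long,
                 minLocations: Int = 2): Set[(Set[String], Set[String])] =
    transactions(calls, window).flatMap { tx =>
      val calleesByLocation =
        tx.groupBy(_.location).map { case (loc, cs) => loc -> cs.map(_.callee).toSet }
      calleesByLocation.groupBy(_._2).collect {
        case (callees, locs) if locs.size >= minLocations => (callees, locs.keySet)
      }
    }.toSet
}
```

If one committer adds calls to `lock` and `unlock` in both `A.foo` and `B.bar` within the window, the sketch reports the callee set {lock, unlock} together with those two locations; a single later call elsewhere falls into a separate transaction and is not reported.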
Summary
[Figure: framework overview with the stages VCS, Model, Metrics, and Better Understanding of Software Engineering]
1) Reverse engineering to create AST-level models of software and its evolution
2) Transformations based on MSR algorithms to derive implicit dependencies
2) Queries to perform measurements based on structural, causal, and implicit dependencies
3) Statistical analysis