Model-based Analysis of Large Scale Software Repositories


Upload: markus-scheidgen

Post on 03-Jul-2015


Page 1: Model-based Analysis of Large Scale Software Repositories

Dr. Markus Scheidgen

Model-based Analysis of Large Scale Software Repositories

■ problem
■ creating models of software repositories
■ the means for analyzing such models
■ example analysis

Page 2: Model-based Analysis of Large Scale Software Repositories

Problem


Page 3: Model-based Analysis of Large Scale Software Repositories

Is Software Engineering a Science?

■ Def.: Science (from Latin scientia) is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
■ Testable? Example theses:
★ DSLs allow domain experts to develop software effectively and more efficiently than with GPLs.
★ Static type systems lead to safer programming and fewer bugs.
★ Functional programming leads to less performant programs.
★ Scrum allows programs to be developed faster.
★ My framework allows developing ... more, faster ... with less, fewer ...
■ Methods for quantitative measures of software properties (metrics) are mostly used to assess the state of software projects, and rarely for empirical studies on software engineering itself.

Page 4: Model-based Analysis of Large Scale Software Repositories

Reasons

inaccessibility
• new methods have to be used first to produce data
• industry cooperations necessary
• open-source repositories are a possibility

data quality
• not easy to distinguish between written code, generated code, and test code
• there are maintained projects, developed projects, aborted projects

heterogeneity
• different project structures
• different paradigms
• different languages
• different APIs

amounts of data
• SourceForge hosts >350,000 projects
• current snapshot of the Linux kernel contains 10^8 AST nodes
• EMF's 50 MB Git repository takes 20 GB as binary-encoded AST data

Page 5: Model-based Analysis of Large Scale Software Repositories

Relevant Fields with Partial Solutions

Mining Software Repositories (MSR)
analysis of rich data contained in software-engineering-related repositories such as version control systems, mailing lists, and bug-tracking systems
• guiding software development
• defect detection, prediction, resolution
• gaining actionable knowledge about software projects and software engineering methodologies

• static language independent
• syntax based
• scale: single projects, large scale (Eclipse, Apache), ultra large scale (SourceForge, GitHub)

Software Metrics
definition, acquisition, and analysis of quantitative measures of certain software properties
• assessment of engineering costs for development, change, maintenance, etc.
• comparative analysis of software systems or analysis of software evolution
• comparative analysis of software engineering methodologies

• language independent (e.g. LOC)
• syntax based (e.g. McCabe)
• static, dynamic (evolution)

Reverse Engineering
analysis of existing code bases to create representations at a higher level of abstraction (models)
• understanding existing software for development, change, maintenance, etc.
• derive AST, UML, or KDM models from software

• syntax (structure, behavior)
• semantics

Page 6: Model-based Analysis of Large Scale Software Repositories

Problem Statement: Everything is there, but ...

1. Missing abstractions:
■ no general abstractions covering multiple languages/repositories are used
■ only proprietary solutions and systems tailored for specific algorithms/databases, languages, repositories

2. Scalability is an issue:
■ for ultra large scale repositories, only VCS meta-data is used
■ for large scale repositories, only language-independent analysis at file-level granularity is possible
■ language-dependent analysis at AST-level detail is feasible only for single software projects

Page 7: Model-based Analysis of Large Scale Software Repositories

Proposed Solution: Scalable Model-based Framework

■ Meta-model and reverse engineering based approach to analyze code models on different, well-defined levels of abstraction instead of the code itself.
■ Query and transformation languages as well as model persistence based on the Map/Reduce BigData paradigm.
■ Target: AST-level analysis of large-scale repositories, e.g. git.eclipse.org (>300 projects)

Page 8: Model-based Analysis of Large Scale Software Repositories

SrcRepo: A Framework for Large Scale Repository Analysis


Page 14: Model-based Analysis of Large Scale Software Repositories

Model-based Analysis of Large Scale Software Repositories

[Figure: VCS → Model → Metrics → Better Understanding (Software Engineering)]

1) Reverse engineering to create AST-level models of software and its evolution (VCS → Model)
2) Transformations based on MSR algorithms to derive implicit dependencies (Model → Model)
2) Queries to perform measurements based on structural, causal, and implicit dependencies (Model → Metrics)
3) Statistical analysis (Metrics → Better Understanding)

Page 15: Model-based Analysis of Large Scale Software Repositories

1) Reverse Engineering Software in Version Control Systems (VCS)

[Figure: "Code in a VCS" (code arranged by files and revisions) next to "Software Model" (the same code connected by structural relations and causal relations)]

Page 16: Model-based Analysis of Large Scale Software Repositories

1) Models of Source Repositories (github.com/markus1978/srcrepo)

[Figure: SrcRepo component stack: SrcRepo, EMF Compare, JGit, MoDisco, EMF/EMF-Fragments; input: git repository with Java sources]

Page 17: Model-based Analysis of Large Scale Software Repositories

1) Models of Source Repositories (github.com/markus1978/srcrepo)

[Figure: an example repository with two branches of revisions (B1.R1 to B1.R4, B2.R1, B2.R2) over files A, B, C, D, next to the SrcRepo metamodel. JGit part: Repository, Revision (prev/next), Diff. MoDisco part: CompilationUnit, Model, Package, Class, ... References are marked «relation» and/or «fragmentation»; further labels in the diagram: usageInPackageAccess, package, extends.]
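The metamodel above can be pictured as a small set of Scala types. This is only a minimal sketch under assumptions: the case classes, the `parents` field (standing in for prev/next), and the `example` data are hypothetical and are not the SrcRepo, JGit, or MoDisco API.

```scala
// Hypothetical, simplified stand-ins for the metamodel classes named on the slide.
object RepoModelSketch {
  final case class ClassDecl(name: String, methods: List[String])
  final case class PackageDecl(name: String, classes: List[ClassDecl])
  final case class CompilationUnit(path: String, pkg: PackageDecl)
  // A Diff records what one revision did to one file; None means the file was deleted.
  final case class Diff(path: String, unit: Option[CompilationUnit])
  final case class Revision(id: String, parents: List[String], diffs: List[Diff])

  final case class Repository(revisions: Map[String, Revision]) {
    // Follow the first-parent chain, newest revision first (the causal relation).
    def history(id: String): List[Revision] =
      revisions.get(id) match {
        case None    => Nil
        case Some(r) => r :: r.parents.headOption.map(p => history(p)).getOrElse(Nil)
      }

    // Rebuild the set of compilation units visible at a revision by replaying diffs, oldest first.
    def snapshot(id: String): Map[String, CompilationUnit] =
      history(id).reverse.foldLeft(Map.empty[String, CompilationUnit]) { (files, rev) =>
        rev.diffs.foldLeft(files) { (fs, d) =>
          d.unit match {
            case Some(cu) => fs.updated(d.path, cu)
            case None     => fs - d.path
          }
        }
      }
  }

  private def unit(path: String, cls: String): CompilationUnit =
    CompilationUnit(path, PackageDecl("demo", List(ClassDecl(cls, Nil))))

  // Three revisions: add A.java, add B.java, delete A.java.
  val example: Repository = Repository(Map(
    "r1" -> Revision("r1", Nil, List(Diff("A.java", Some(unit("A.java", "A"))))),
    "r2" -> Revision("r2", List("r1"), List(Diff("B.java", Some(unit("B.java", "B"))))),
    "r3" -> Revision("r3", List("r2"), List(Diff("A.java", None)))
  ))
}
```

Storing only diffs per revision keeps the model small; a full snapshot is derived on demand by replaying the history.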

Page 18: Model-based Analysis of Large Scale Software Repositories

1) Models of Source Repositories: Scalability
SrcRepo is based on EMF-Fragments (https://github.com/markus1978/emf-fragments)

[Figure: technology stack]
• map/reduce (Hadoop)
• “Share Nothing” nodes cluster
• DFS (HDFS)
• key-value store (EMF resources) (HBase)
• structured data (EMF model)
• model transformations
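To illustrate how a measurement can run over model fragments held in a key-value store, here is a minimal in-memory sketch of the map/reduce pattern. It uses plain Scala collections, not Hadoop or HBase; the `Fragment` type and all function names are assumptions made for illustration.

```scala
object MapReduceSketch {
  // Hypothetical fragment content: (class name, number of methods) pairs.
  type Fragment = List[(String, Int)]

  // Generic map/reduce over a key-value store, here just an in-memory Map.
  def mapReduce[K, V, K2, V2](store: Map[K, V])(mapper: (K, V) => Seq[(K2, V2)])(
      reducer: (V2, V2) => V2): Map[K2, V2] =
    store.toSeq
      .flatMap { case (k, v) => mapper(k, v) } // map phase: emit intermediate pairs
      .groupBy(_._1)                           // shuffle: group pairs by key
      .map { case (k2, pairs) => k2 -> pairs.map(_._2).reduce(reducer) } // reduce phase

  // Average number of methods per class, computed from per-fragment partial sums.
  def avgMethodsPerClass(store: Map[String, Fragment]): Double = {
    val mapper: (String, Fragment) => Seq[(String, (Int, Int))] =
      (_, fragment) => fragment.map { case (_, methods) => ("all", (methods, 1)) }
    val reducer: ((Int, Int), (Int, Int)) => (Int, Int) =
      (a, b) => (a._1 + b._1, a._2 + b._2)
    mapReduce(store)(mapper)(reducer).get("all") match {
      case Some((methods, classes)) => methods.toDouble / classes
      case None                     => 0.0
    }
  }
}
```

Because the reducer only adds partial (sum, count) pairs, fragments can be processed independently on different nodes before the results are combined.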

Page 19: Model-based Analysis of Large Scale Software Repositories

2) Scala for queries and transformations: Syntax (internal DSL: from OCL to Scala)


Filip Krikava: Enriching EMF Models with Scala (quick overview), Eclipse Summit, Oct 24 2012

Page 20: Model-based Analysis of Large Scale Software Repositories

2) Scala for Queries: Syntax

def exists(predicate: (E) => Boolean): Boolean
def forAll(predicate: (E) => Boolean): Boolean

def select(predicate: (E) => Boolean): Collection[E]
def reject(predicate: (E) => Boolean): Collection[E]
def collect[R](expr: (E) => R): Collection[R]
def collectAll[R](expr: (E) => Collection[R]): Collection[R]
def closure(expr: (E) => Collection[E]): Collection[E]

def aggregate[R](expr: (E) => R, start: () => R, aggr: (R, R) => R): R
def sum(expr: (E) => Double): Double
def product(expr: (E) => Double): Double
def max(expr: (E) => Double): Double
def min(expr: (E) => Double): Double
def average(expr: (E) => Double): Double
...

def run(runnable: (E) => Unit): Unit
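A few of these operators could be implemented over an in-memory list as follows. This is a minimal sketch, not the SrcRepo implementation: the wrapper class `Query` replaces the slide's `Collection[E]`, and `closure` is assumed to include the starting elements and to visit each element once.

```scala
object QuerySketch {
  // Hypothetical wrapper with immediate (eager) evaluation.
  final class Query[E](val elems: List[E]) {
    def exists(p: E => Boolean): Boolean = elems.exists(p)
    def forAll(p: E => Boolean): Boolean = elems.forall(p)
    def select(p: E => Boolean): Query[E] = new Query[E](elems.filter(p))
    def reject(p: E => Boolean): Query[E] = new Query[E](elems.filterNot(p))
    def collect[R](f: E => R): Query[R] = new Query[R](elems.map(f))
    // Like collect, but flattens the nested results into one collection.
    def collectAll[R](f: E => Query[R]): Query[R] =
      new Query[R](elems.flatMap(e => f(e).elems))

    // Transitive closure, depth first; `seen` guards against cycles.
    def closure(f: E => Query[E]): Query[E] = {
      @annotation.tailrec
      def loop(todo: List[E], seen: List[E]): List[E] = todo match {
        case Nil                           => seen.reverse
        case e :: rest if seen.contains(e) => loop(rest, seen)
        case e :: rest                     => loop(f(e).elems ::: rest, e :: seen)
      }
      new Query[E](loop(elems, Nil))
    }

    def aggregate[R](expr: E => R, start: () => R, aggr: (R, R) => R): R =
      elems.foldLeft(start())((acc, e) => aggr(acc, expr(e)))
    def sum(expr: E => Double): Double = aggregate[Double](expr, () => 0.0, _ + _)
    def average(expr: E => Double): Double =
      if (elems.isEmpty) 0.0 else sum(expr) / elems.size
  }
}
```

Here `sum` and `average` are defined through `aggregate`, which suggests that the remaining numeric operators on the slide can be derived the same way.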

Page 21: Model-based Analysis of Large Scale Software Repositories

2) Scala for Queries: Syntax

■ example SrcRepo query: “average number of methods per class”

def avgMethodsPerClass(self: Model) = {
  // all packages, including nested ones
  val packages = self.getOwnedPackages().closure((p) => p.getOwnedPackages())
  // all classes of these packages, including inner classes;
  // collectAll flattens the per-package class collections
  val classes = packages.collectAll((p) => p.getOwnedClasses()).closure((c) => c.getInnerClasses())
  classes.average((c) => c.getOwnedMethods().size())
}
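The same query can be tried out on a toy model with standard Scala collections. This is a sketch under assumptions: `Clazz`, `Pkg`, and `Model` are hypothetical stand-ins for the MoDisco model classes, and `closure` is a local helper that assumes an acyclic containment tree.

```scala
object AvgMethodsExample {
  // Hypothetical toy model; not the MoDisco Java metamodel.
  final case class Clazz(name: String, methods: List[String], inner: List[Clazz] = Nil)
  final case class Pkg(name: String, classes: List[Clazz], subPackages: List[Pkg] = Nil)
  final case class Model(packages: List[Pkg])

  // Transitive closure over a containment tree, including the start elements.
  def closure[A](start: List[A])(next: A => List[A]): List[A] =
    start.flatMap(a => a :: closure(next(a))(next))

  def avgMethodsPerClass(model: Model): Double = {
    val packages = closure(model.packages)(_.subPackages)
    // flatMap plays the role of collectAll: one flat list of classes
    val classes = closure(packages.flatMap(_.classes))(_.inner)
    if (classes.isEmpty) 0.0
    else classes.map(_.methods.size).sum.toDouble / classes.size
  }

  // Package "a" holds class A (2 methods, 1 inner class with 1 method);
  // sub-package "a.b" holds class B (3 methods).
  val example: Model = Model(List(
    Pkg("a",
      List(Clazz("A", List("m1", "m2"), List(Clazz("A.Inner", List("m3"))))),
      List(Pkg("a.b", List(Clazz("B", List("m4", "m5", "m6"))))))
  ))
}
```

For `example`, the three classes have 2, 1, and 3 methods, so the query yields 6 / 3 = 2.0.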

Page 22: Model-based Analysis of Large Scale Software Repositories

2) Scala and internal DSLs: Semantics

■ Three different semantics, one interface
■ immediate collection
■ lazy iterator
■ Map/Reduce database
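"One interface, different semantics" can be sketched with a trait and two implementations. This is an illustrative assumption, not the framework's design: the trait `Query` and the classes `Immediate` and `Lazy` are hypothetical, and the Map/Reduce variant is left out because it needs a database backend.

```scala
object SemanticsSketch {
  // The shared interface: queries are written once against this trait.
  trait Query[E] {
    def select(p: E => Boolean): Query[E]
    def collect[R](f: E => R): Query[R]
    def sum(f: E => Double): Double
  }

  // Immediate semantics: every step materializes a new list.
  final class Immediate[E](elems: List[E]) extends Query[E] {
    def select(p: E => Boolean): Query[E] = new Immediate[E](elems.filter(p))
    def collect[R](f: E => R): Query[R] = new Immediate[R](elems.map(f))
    def sum(f: E => Double): Double = elems.map(f).sum
  }

  // Lazy semantics: steps are only chained; elements flow when sum is called.
  final class Lazy[E](elems: () => Iterator[E]) extends Query[E] {
    def select(p: E => Boolean): Query[E] = new Lazy[E](() => elems().filter(p))
    def collect[R](f: E => R): Query[R] = new Lazy[R](() => elems().map(f))
    def sum(f: E => Double): Double = elems().map(f).sum
  }

  // A query that is unaware of the evaluation strategy behind it.
  def sumOfEvenSquares(q: Query[Int]): Double =
    q.select(_ % 2 == 0).collect(n => n * n).sum(_.toDouble)
}
```

Both `new Immediate(List(1, 2, 3, 4))` and `new Lazy(() => Iterator(1, 2, 3, 4))` can be passed to `sumOfEvenSquares` unchanged; only the point at which work happens differs.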

Page 23: Model-based Analysis of Large Scale Software Repositories

Example Analysis


Page 24: Model-based Analysis of Large Scale Software Repositories

First Example Case Study: Design Structure Matrices (DSM) and Propagation Cost

Alan MacCormack, John Rusnak, Carliss Baldwin: Exploring the Structure of Complex Software Designs, Management Science (INFORMS), 2006
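Propagation cost can be computed from a dependency matrix as sketched below. The sketch assumes that the visibility matrix is the transitive closure of the direct dependencies including the diagonal, and that propagation cost is the density of that matrix; the function names and the Boolean matrix encoding are choices made here for illustration.

```scala
object PropagationCost {
  // deps(i)(j) == true means element i depends directly on element j.
  def visibility(deps: Vector[Vector[Boolean]]): Vector[Vector[Boolean]] = {
    val n = deps.length
    // Start with direct dependencies plus the diagonal (paths of length 0).
    val reach = Array.tabulate(n, n)((i, j) => i == j || deps(i)(j))
    // Warshall's algorithm: allow paths through intermediate element k.
    for (k <- 0 until n; i <- 0 until n; j <- 0 until n)
      if (reach(i)(k) && reach(k)(j)) reach(i)(j) = true
    reach.map(_.toVector).toVector
  }

  // Share of (i, j) pairs where a change to j may affect i.
  def propagationCost(deps: Vector[Vector[Boolean]]): Double = {
    val n = deps.length
    if (n == 0) 0.0
    else visibility(deps).map(_.count(b => b)).sum.toDouble / (n * n)
  }

  // Chain A -> B -> C: A sees A, B, C; B sees B, C; C sees only itself.
  val chain: Vector[Vector[Boolean]] = Vector(
    Vector(false, true, false),
    Vector(false, false, true),
    Vector(false, false, false)
  )
}
```

For `chain`, 6 of the 9 matrix cells are set, so the propagation cost is 6/9, or about 0.67.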

Page 26: Model-based Analysis of Large Scale Software Repositories

Second Example Case Study: Detecting Cross-Cutting Concerns

Silvia Breu, Thomas Zimmermann, Christian Lindig: Mining Eclipse for Cross-Cutting Concerns, MSR '06, Shanghai, 2006

■ The same set of methods called from different locations within the same transaction (commits in a small time window by the same committer) indicates the introduction of a cross-cutting concern.
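The heuristic stated above can be sketched in a few lines. This is a simplified illustration, not the algorithm of the cited paper: the `Commit` type, the way commits are grouped into transactions (same committer, gap between consecutive commits within `window`), and the `minLocations` threshold are all assumptions.

```scala
object CrossCuttingSketch {
  // addedCalls holds (calling location, called method) pairs introduced by the commit.
  final case class Commit(committer: String, time: Long, addedCalls: List[(String, String)])

  // Group commits into transactions: same committer, consecutive commits at most `window` apart.
  def transactions(commits: List[Commit], window: Long): List[List[Commit]] =
    commits.groupBy(_.committer).values.toList.flatMap { own =>
      own.sortBy(_.time).foldLeft(List.empty[List[Commit]]) {
        case (current :: done, c) if c.time - current.head.time <= window =>
          (c :: current) :: done // extend the running transaction
        case (done, c) =>
          List(c) :: done        // start a new transaction
      }
    }

  // Sets of called methods that were added at >= minLocations locations in one transaction.
  def candidates(commits: List[Commit], window: Long, minLocations: Int): Set[Set[String]] =
    transactions(commits, window).flatMap { tx =>
      val callsByLocation: Map[String, Set[String]] =
        tx.flatMap(_.addedCalls).groupBy(_._1).map { case (loc, cs) => loc -> cs.map(_._2).toSet }
      callsByLocation.groupBy(_._2).collect {
        case (calls, locations) if locations.size >= minLocations => calls
      }
    }.toSet

  val example: List[Commit] = List(
    Commit("alice", 0L, List(("Foo.a", "lock"), ("Foo.a", "unlock"))),
    Commit("alice", 60L, List(("Bar.b", "lock"), ("Bar.b", "unlock"))),
    Commit("bob", 30L, List(("Baz.c", "lock"))),
    Commit("alice", 10000L, List(("Qux.d", "lock"), ("Qux.d", "unlock")))
  )
}
```

With a window of 200 and `minLocations = 2`, only the pair `lock`/`unlock` added to `Foo.a` and `Bar.b` in alice's first transaction is reported; her later commit falls outside the window.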

Page 30: Model-based Analysis of Large Scale Software Repositories

Summary

[Figure: VCS → Model → Metrics → Better Understanding (Software Engineering)]

1) Reverse engineering to create AST-level models of software and its evolution (VCS → Model)
2) Transformations based on MSR algorithms to derive implicit dependencies (Model → Model)
2) Queries to perform measurements based on structural, causal, and implicit dependencies (Model → Metrics)
Statistical analysis (Metrics → Better Understanding)