A DATA-DRIVEN JOURNEY THROUGH RESEARCH ON SOFTWARE ENGINEERING
Mario Sangiorgio
MOTIVATION
Getting a better idea of what's going on in the software engineering research community
through a quantitative approach
RELATED WORKS
• C. Ghezzi - Keynote at ICSE 2008
Reflections on 40+ years of software engineering research and beyond
• L. Briand - Keynote at ICSM 2011
Useful software engineering research: leading a double agent life
• D. Rosenblum - Keynote at ASE 2012
Whither software engineering research?
SUBJECTS OF OUR STUDY
researchers
affiliations
geographical areas
research topics
DATA
ACADEMIC LITERATURE
SELECTED PUBLICATIONS
REPRESENTATIVENESS
AUTHORITATIVENESS
DATA SOURCES
Articles published and their authors
Citations, authors and affiliation details
COMPLETE XML DATABASE
APIs
COLLECTED DATA
Venue      Number of papers   From   To
TSE        3043               1975   2012
TOSEM      295                1992   2012
ICSE       2907               1976   2012
ASE        1116               1997   2012
ESEC/FSE   416                1987   2012
TOTAL      7777               1975   2012
9865 researchers
278794 citations
ANALYSIS
AUTHOR ANALYSIS
Who published the most?
Are there sub-communities?
MOST PROLIFIC AUTHORS
Number of papers per author, overall and per venue
Software Engineering   ICSE           ASE          ESEC/FSE       TSE          TOSEM
Basili 60              Boehm 28       Xie 24       Clarke 8       Basili 33    Notkin 13
Notkin 56              Basili 26      Grundy 18    D. Jackson 8   Briand 26    Rothermel 8
Kramer 49              Osterweil 23   Hosking 16   Ernst 7        Weyuker 18   Roman 6
Harrold 46             Kramer 21      Egyed 16     Notkin 7       Knight 17    Wolf 6
Xie 46                 Notkin 21      Lo 16        Uchitel 7      Kramer 16    Harrold 6
SUB-COMMUNITY DETECTION
For each venue we consider its most prolific authors
We compute the set similarity (Jaccard index) between all pairs of venues
$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
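A minimal sketch of this similarity computation. It assumes the five names per venue read off the MOST PROLIFIC AUTHORS table; the slides do not say how many top authors the study actually used, so the numbers printed here are illustrative only. The names `jaccard`, `top_authors` and `similarity` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Assumption: top-5 authors per venue, taken from the table above.
top_authors = {
    "ICSE": {"Boehm", "Basili", "Osterweil", "Kramer", "Notkin"},
    "ASE": {"Xie", "Grundy", "Hosking", "Egyed", "Lo"},
    "ESEC/FSE": {"Clarke", "D. Jackson", "Ernst", "Notkin", "Uchitel"},
    "TSE": {"Basili", "Briand", "Weyuker", "Knight", "Kramer"},
    "TOSEM": {"Notkin", "Rothermel", "Roman", "Wolf", "Harrold"},
}

# One similarity value per unordered pair of venues.
similarity = {
    (v1, v2): jaccard(top_authors[v1], top_authors[v2])
    for v1, v2 in combinations(top_authors, 2)
}

for (v1, v2), s in similarity.items():
    print(f"{v1:9s} {v2:9s} {s:.2f}")
```

Turning each similarity into a distance (for instance 1 - J) gives a matrix that a multidimensional scaling routine can place on a plane, which is presumably how a plot with mds[,1] and mds[,2] axes is obtained.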
SUB-COMMUNITIES
[Figure: two-dimensional MDS plot (axes mds[,1] and mds[,2]) placing the venues TSE, TOSEM, ICSE, ASE and FSE according to the similarity of their top authors]
TOPIC ANALYSIS
What is the topic of a paper?
What are the hot topics in software engineering?
How have they evolved?
CITATION NETWORK
Papers in the dataset
Internal citations
Complete citations
Citations from specific venues
EXAMPLE
What is the topic of the yellow paper?
Topic     Direct citations
Topic A   2
Topic B   0
General   1
What is the topic of the general paper?
Topic     Direct citations
Topic A   2
Topic B   1
General   1
Topic profile
Topic A   66%
Topic B   33%
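The example can be sketched in a few lines. This is one reading of the slides, not necessarily the exact procedure used in the study: citations to topic-specific venues count directly, while each citation to a general paper is spread according to that paper's own topic profile (here the general paper is assumed to be entirely about Topic B). The function name `topic_profile` and its parameters are hypothetical.

```python
def topic_profile(direct, general_profiles=()):
    """Normalize topic counts into a profile that sums to 1.

    direct: number of citations to topic-specific venues, per topic.
    general_profiles: topic profiles of the cited general papers;
        each one spreads a single citation over its own topics.
    """
    counts = dict(direct)
    for profile in general_profiles:
        for topic, share in profile.items():
            counts[topic] = counts.get(topic, 0) + share
    total = sum(counts.values())
    if total == 0:
        return {}  # no usable citations, so no profile
    return {t: c / total for t, c in counts.items() if c > 0}


# Yellow paper: 2 citations to Topic A, 0 to Topic B, 1 to a general
# paper whose own profile is assumed to be 100% Topic B.
yellow = topic_profile({"Topic A": 2, "Topic B": 0}, [{"Topic B": 1.0}])
for topic, share in yellow.items():
    print(f"{topic}: {share:.0%}")
```

Under these assumptions the profile is two thirds Topic A and one third Topic B, matching the 66% / 33% shown on the slide.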
SOFTWARE ENGINEERING TOPICS
Topic                    Fraction of papers
Programming Languages    9.34%
Formal Methods           8.49%
Software Reliability     6.13%
Distributed Systems      5.96%
Software Maintenance     5.92%
Testing                  4.64%
Software Quality         4.53%
Models                   4.36%
Software Architectures   4.36%
TOPICS IN THE ’70S
Topic                    Fraction of papers
Programming Languages    16.71%
Performance              7.95%
Operating Systems        7.29%
Database Systems         6.84%
Formal Methods           6.65%
Software Architectures   6.14%
Knowledge Engineering    5.69%
Distributed Systems      4.94%
Software Maintenance     4.18%
By far the most represented
Topics from other fields
TOPICS IN THE ’80S
Topic                     Fraction of papers
Programming Languages     10.48%
Distributed Systems       9.30%
Knowledge Engineering     8.47%
Software Reliability      6.68%
Formal Methods            6.51%
Information Systems       5.55%
Software Maintenance      5.04%
Models                    4.35%
Artificial Intelligence   3.74%
Significant rise
Other fields, related to distributed systems
Not only code
TOPICS IN THE ’90S
Topic                    Fraction of papers
Formal Methods           8.29%
Programming Languages    8.13%
Distributed Systems      6.80%
Software Maintenance     6.55%
Software Architectures   5.34%
Software Quality         4.80%
Knowledge Engineering    4.67%
Models                   4.65%
Information Systems      4.40%
Change of the most published topic
Focus on software quality
TOPICS IN THE 2000S
Topic                    Fraction of papers
Formal Methods           9.93%
Programming Languages    8.37%
Testing                  6.86%
Software Maintenance     6.58%
Software Reliability     6.22%
Software Quality         5.72%
Models                   4.80%
Empirical Studies        4.76%
Software Architectures   4.38%
Analysis of open source repositories
Still a lot of emphasis on software quality
NEED FOR A FINER ANALYSIS
SOLUTION: sliding window instead of fixed subdivision
Topics change constantly, not once in a decade
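A minimal sketch of the sliding-window idea. It simplifies by giving each paper a single topic rather than a topic profile, and the window width is a free parameter because the slides do not state the value used. The function name `sliding_share` and the toy data are made up for illustration.

```python
def sliding_share(papers, topic, width=5):
    """Share of papers on `topic` in each trailing window of `width` years.

    papers: iterable of (year, topic) pairs.
    Returns {end_year: share}, where the window for end_year covers
    the years end_year - width + 1 through end_year inclusive.
    """
    papers = list(papers)
    if not papers:
        return {}
    first = min(year for year, _ in papers)
    last = max(year for year, _ in papers)
    shares = {}
    for end in range(first + width - 1, last + 1):
        window = [t for year, t in papers if end - width + 1 <= year <= end]
        if window:  # skip windows that contain no papers
            shares[end] = sum(t == topic for t in window) / len(window)
    return shares


# Toy data, not taken from the study.
papers = [
    (2000, "Testing"), (2000, "Formal Methods"),
    (2001, "Testing"), (2002, "Models"),
    (2003, "Testing"), (2003, "Testing"),
]
print(sliding_share(papers, "Testing", width=3))
```

Moving the window one year at a time gives one value per year, so a rise or decline shows up when it happens rather than only at decade boundaries.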
[Figures: one plot per topic showing its sliding-window share of papers over time (x-axis: years 1975 to 2005; y-axis: 0 to 0.18)]
TESTING
EMPIRICAL STUDIES
SERVICES
DISTRIBUTED SYSTEMS
PROGRAMMING LANGUAGES
PER-VENUE INSIGHTS
Venue      Peculiarities
TSE        Biased towards empirical works
TOSEM      More focused on formal aspects
ICSE       Balanced with respect to other venues
ESEC/FSE   Formal, with interests in testing, modeling and requirements engineering
ASE        Interests in program analysis and automated reasoning
AFFILIATION ANALYSIS
Where do the most prolific authors work?
How much research is done in industry?
AFFILIATION PROFILE
Author     Affiliation
Author A   1
Author B   2
Author C   2
Affiliation profile
Affiliation 1   33%
Affiliation 2   66%
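A minimal sketch of the affiliation profile, assuming every author of a paper carries equal weight. The second function sums the shares over many papers; that such a sum is how fractional paper counts per affiliation arise is an assumption, not something the slides state. Both function names are hypothetical.

```python
from collections import Counter


def affiliation_profile(author_affiliations):
    """Map each affiliation to the fraction of the paper's authors it hosts.

    author_affiliations: {author: affiliation} for a single paper.
    """
    counts = Counter(author_affiliations.values())
    total = sum(counts.values())
    return {aff: n / total for aff, n in counts.items()}


def papers_per_affiliation(papers):
    """Sum the per-paper shares, giving fractional paper counts."""
    totals = Counter()
    for authors in papers:
        for aff, share in affiliation_profile(authors).items():
            totals[aff] += share
    return dict(totals)


paper = {
    "Author A": "Affiliation 1",
    "Author B": "Affiliation 2",
    "Author C": "Affiliation 2",
}
print(affiliation_profile(paper))
```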
MOST PROLIFIC AFFILIATIONS
Affiliation                             Papers
IBM                                     186.32
Carnegie Mellon University              166.52
University of Texas, Austin             122.62
University of Maryland                  106.83
Microsoft                               101.63
AT&T Bell Laboratories                  101.37
University of California, Irvine        98.17
Georgia Institute of Technology         94.75
Massachusetts Institute of Technology   93.24
University of Virginia                  81.55
ALL FROM THE USA
PER-VENUE INSIGHTS
Venue      Peculiarities
TSE        The venue with the largest industrial contribution
TOSEM      European universities among the top contributors
ICSE       Balanced mix of the contributors seen in the other venues
ESEC/FSE   Despite ESEC, there is no bias towards Europe
ASE        Industrial contribution is less relevant. Some affiliations appear only in its top list.
Is Europe more formal?
Is it linked to the presence of empirical works?
It is representative
INDUSTRY VS ACADEMIA
[Figure: share of contributions from Industry and Academia over time (x-axis: years 1970 to 2005; y-axis: 0 to 1.00)]
GEOGRAPHICAL ANALYSIS
Where does the contribution come from?
GEOGRAPHICAL AREAS
North America
Europe
Asia & Oceania
Africa
South America
LOCATION OF A PAPER
Affiliation profile
Affiliation 1   20%
Affiliation 2   30%
Affiliation 3   50%
Locations
Affiliation 1   North America
Affiliation 2   Europe
Affiliation 3   Europe
Location profile
North America   20%
Europe          80%
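The location profile follows from the affiliation profile by summing the shares of affiliations that lie in the same area. A minimal sketch with the numbers from the slide; the function name `location_profile` and the dictionary layout are hypothetical.

```python
def location_profile(affiliation_profile, locations):
    """Aggregate affiliation shares by geographical area.

    affiliation_profile: {affiliation: share of the paper}.
    locations: {affiliation: geographical area}.
    """
    profile = {}
    for affiliation, share in affiliation_profile.items():
        area = locations[affiliation]
        profile[area] = profile.get(area, 0.0) + share
    return profile


affiliations = {"Affiliation 1": 0.20, "Affiliation 2": 0.30, "Affiliation 3": 0.50}
locations = {
    "Affiliation 1": "North America",
    "Affiliation 2": "Europe",
    "Affiliation 3": "Europe",
}
for area, share in location_profile(affiliations, locations).items():
    print(f"{area}: {share:.0%}")
```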
GEOGRAPHICAL DISTRIBUTION
[Figure: share of contributions per geographical area over time (x-axis: years 1970 to 2005; y-axis: 0 to 1.00); areas: Europe, North America, South America, Asia & Oceania, Africa]
CONCLUSION
Academic literature contains a lot of information about a scientific community
With data mining techniques we can unveil it and get some interesting insights
QUESTIONS?