Page 1: Mining Email Social Networks in OSS

Mining Email Social Networks in OSS

Christian Bird, Prem Devanbu, Alex Gourley, and Michael Gertz
Department of Computer Science

Anand Swaminathan
Graduate School of Management

University of California, Davis

Page 2: Mining Email Social Networks in OSS


Motivation

• The social process is an important but hard-to-study aspect of any software engineering effort

• It can be studied in many stable and mature OSS projects

• Nearly all communication takes place over the Internet

• Records of both communication and development activity are freely available

Page 3: Mining Email Social Networks in OSS


Apache Communication and Development (since 1996)

• 100,000+ messages on dev mailing list

• 70,000 CVS commits to files

Page 4: Mining Email Social Networks in OSS


It is widely believed that OSS communities form a hierarchy

Can we use social network analysis to examine these OSS communities?

Image from Socialization in an Open Source Community, Nicolas Ducheneaut

Page 5: Mining Email Social Networks in OSS

Social Networks

• A network consisting of actors and their social ties to each other.

Network of who dated who in high school.

Courtesy of Mark Newman

Page 6: Mining Email Social Networks in OSS

Related Work

• Xu, Gao, Christley, and Madey looked at developers who worked on the same projects.

• Crowston & Howison treated the co-occurrence of developers on a bug report as a social link.

• Lopez, Gonzalez-Barahona, & Robles created networks of developers and modules via CVS data.

• We believe that a response to an email indicates a strong social link.

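[Figure: how each approach links two developers, Alice and Bob]

• Shared project (Python): both contribute; undirected link

• Bug report: one submits and the other resolves; undirected link

• Source file (foo.c): both commit; undirected link

• Mailing list: one posts and the other responds; directed link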

Page 7: Mining Email Social Networks in OSS


Issues with Mailing List Analysis

• Extracting conversation threads

• Rationalizing Timestamps

• Identifying targets in a broadcast medium

• Resolving Email Aliases

• Extracting Content

Page 8: Mining Email Social Networks in OSS

Email Aliases

• 2,544 different email address aliases have been used on the Apache dev mailing list since 1996.

• Many of these email addresses belong to the same people.

• The following email addresses were all used by Joe Orton.

[email protected]@[email protected]@[email protected]

Page 9: Mining Email Social Networks in OSS

Email Alias Analysis

1. Preprocess name and address.
– Remove commas and reorder the name (“orton, joe” -> “joe orton”)
– Normalize whitespace and remove punctuation and common prefixes/suffixes (Mr., jr., etc.)
– Remove common email terms (list, admin, root)

2. Use heuristics and fuzzy matching (Levenshtein edit distance) to determine which email aliases are similar.
– name-name: “joe orton” vs. “joe e. orton”
– email-email: “[email protected]” vs. “[email protected]”
– name-email: “joe orton” vs. “[email protected]”

3. Manually post-process the aliases marked as similar to remove the many false positives.

4. Use a similar process to map CVS accounts to email aliases.

Email addresses contain a <name, address> tuple. Often the name is empty.
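Steps 1 and 2 above could look roughly like the following. This is a minimal sketch, not the authors' tool: the function names, the stop lists beyond the examples given on the slide, and the length-scaled similarity score are all assumptions made for illustration.

```python
import re

# Hypothetical stop lists; the slide only names "Mr.", "jr.", "list", "admin", "root".
TITLES = {"mr", "mrs", "ms", "dr", "jr", "sr"}
EMAIL_TERMS = {"list", "admin", "root"}


def normalize_name(raw: str) -> str:
    """Lowercase, reorder "last, first", drop punctuation, titles, and email terms."""
    name = raw.strip().lower()
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    name = re.sub(r"[^a-z0-9\s]", " ", name)  # punctuation becomes whitespace
    words = [w for w in name.split() if w not in TITLES and w not in EMAIL_TERMS]
    return " ".join(words)


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical (an assumed scoring rule)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Pairs whose similarity exceeds some threshold would then be queued for the manual review in step 3; any particular threshold is a tuning choice that the slides do not specify.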

Page 10: Mining Email Social Networks in OSS


Alias Results

• 2,544 email aliases used

• 2,008 unique “identities” used

• Many of the high-volume participants had a large number of aliases

Page 11: Mining Email Social Networks in OSS

Creating the Email Social Network

• Each email message has a message ID.

• A response message contains an “In-Reply-To” header, which includes the message ID of the message being answered.

• If Joe posts a message and Bob responds, this indicates information flow, and we create a directed tie from Joe to Bob.

• We have built a tool that creates a directed, valued adjacency matrix of participants from our mailing list database for any time period.
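The construction described above can be sketched as follows. This is an illustrative sketch, not the authors' tool: the message dictionaries and their keys (`id`, `sender`, `in_reply_to`) are hypothetical, senders are assumed to be already resolved to unique identities, and skipping self-replies is an assumption.

```python
from collections import defaultdict


def build_reply_network(messages):
    """Return {(poster, responder): reply_count} for a list of message dicts.

    Each dict has 'id' and 'sender', plus an optional 'in_reply_to' holding
    the id of the message it answers (hypothetical field names).
    """
    sender_of = {m["id"]: m["sender"] for m in messages}
    ties = defaultdict(int)
    for m in messages:
        parent = m.get("in_reply_to")
        if parent not in sender_of:
            continue  # thread starter, or the parent is missing from the archive
        poster, responder = sender_of[parent], m["sender"]
        if poster != responder:  # assumption: replies to oneself are ignored
            ties[(poster, responder)] += 1  # directed tie: poster -> responder
    return dict(ties)


messages = [
    {"id": "<1>", "sender": "joe"},
    {"id": "<2>", "sender": "bob", "in_reply_to": "<1>"},
    {"id": "<3>", "sender": "joe", "in_reply_to": "<2>"},
    {"id": "<4>", "sender": "bob", "in_reply_to": "<1>"},
    {"id": "<5>", "sender": "ann", "in_reply_to": "<99>"},  # parent not archived
]
ties = build_reply_network(messages)
```

The resulting dictionary is a sparse form of the directed, valued adjacency matrix: each key is a (row, column) pair and each value is the number of replies. Restricting the input list to a date range would give the network for that period.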

Page 12: Mining Email Social Networks in OSS


Intro to Social Network Metrics

• In-degree – The number of links whose head is connected to a particular actor

• Out-degree – The number of links whose tail is connected to a particular actor

• Geodesic – A shortest path between two actors

• Betweenness – The number of geodesics that a particular actor lies on.
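As a small illustration of the two degree definitions, the sketch below counts links per actor in a toy directed network. The edge list and names are made up for the example, and each distinct link is counted once regardless of how many messages it represents, which is an assumption.

```python
from collections import Counter


def degrees(ties):
    """Count in- and out-degree from (tail, head) pairs of a directed network."""
    in_degree, out_degree = Counter(), Counter()
    for tail, head in ties:
        out_degree[tail] += 1  # the link leaves `tail`
        in_degree[head] += 1   # the link's arrowhead points at `head`
    return in_degree, out_degree


# Toy network: a link (x, y) points from x to y.
toy_ties = [("joe", "bob"), ("bob", "joe"), ("joe", "ann")]
in_degree, out_degree = degrees(toy_ties)
```

Here joe has out-degree 2 and in-degree 1, while ann has in-degree 1 and out-degree 0.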

Page 13: Mining Email Social Networks in OSS

Example

[Figure: example network of numbered actors, with nodes annotated as High Out-Degree, High Betweenness, and High In-Degree]

Page 14: Mining Email Social Networks in OSS

Betweenness more formally

For a given vertex i:

$$B(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}$$

• Where σst is the number of geodesics between s and t

• And σst(i) is the number of those paths passing through vertex i

• Normalizing the values so that all betweenness scores sum to 1 is common
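The definition can be turned into code fairly directly. The sketch below is a brute-force illustration for small directed, unweighted graphs, not the method used in the study; the adjacency-dictionary format and the function names are assumptions. It relies on the fact that vertex i lies on a geodesic from s to t exactly when d(s,i) + d(i,t) = d(s,t), in which case σst(i) = σsi · σit.

```python
from collections import deque


def shortest_path_counts(graph, source):
    """BFS from source: distance and number of geodesics to each reachable vertex."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in dist:              # first time w is reached
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:     # v precedes w on some geodesic
                sigma[w] += sigma[v]
    return dist, sigma


def betweenness(graph):
    """B(i) = sum over ordered pairs s != i != t of sigma_st(i) / sigma_st."""
    vertices = set(graph) | {w for targets in graph.values() for w in targets}
    info = {v: shortest_path_counts(graph, v) for v in vertices}
    scores = {v: 0.0 for v in vertices}
    for s in vertices:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for i in vertices:
                if i in (s, t) or i not in dist_s:
                    continue
                dist_i, sigma_i = info[i]
                if t in dist_i and dist_s[i] + dist_i[t] == dist_s[t]:
                    scores[i] += sigma_s[i] * sigma_i[t] / sigma_s[t]
    return scores


def normalize(scores):
    """Scale scores so they sum to 1 (all zeros if there is no betweenness at all)."""
    total = sum(scores.values())
    return {v: (x / total if total else 0.0) for v, x in scores.items()}


# Diamond: two geodesics from a to d, one through b and one through c.
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
raw = betweenness(diamond)
```

In the diamond, b and c each carry half of the single a-to-d pair, so each has a raw score of 0.5. For larger networks a faster algorithm such as Brandes' would normally be preferred over this cubic loop.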

Page 15: Mining Email Social Networks in OSS


Everybody likes a pretty picture!

This is the social network of some of the most active participants on the Apache developer mailing list. Each link indicates at least 150 messages between participants.

Ryan Bloom has high betweenness in this network. Of the participants shown, he has the highest number of source file commits.

Page 16: Mining Email Social Networks in OSS

The distributions of in-degree and out-degree both exhibit a power-law character

Page 17: Mining Email Social Networks in OSS

Status of Developers vs. Non-Developers

| Metric | Developer | Non-Developer |
|---|---|---|
| Betweenness | 0.0114 | 0.000140 |
| Out-degree | 0.00666 | 0.000451 |
| In-degree | 0.00794 | 0.000367 |

Largest difference is in betweenness

Page 18: Mining Email Social Networks in OSS

Correlation between communication and development

| | Changes | Src Changes | Doc Changes | Out-degree | In-degree | Betweenness |
|---|---|---|---|---|---|---|
| Changes | 1 | | | | | |
| Src Changes | 0.789 | 1 | | | | |
| Doc Changes | 0.932 | 0.514 | 1 | | | |
| Out-degree | 0.520 | 0.712 | 0.308 | 1 | | |
| In-degree | 0.474 | 0.679 | 0.263 | 0.971 | 1 | |
| Betweenness | 0.553 | 0.757 | 0.327 | 0.955 | 0.917 | 1 |

• High correlation between betweenness and source file changes

• Lower correlation between betweenness and document file changes

• Similar relationship for in- and out-degree.

Page 19: Mining Email Social Networks in OSS


Observations from the network

• The mailing list activity reflects a typical social network.

• Developers are the “key social brokers”.

• More active developers tend to be more important.

• Results robust: Postgres showed similar results.

Page 20: Mining Email Social Networks in OSS


Topics of future research

• Visualization of software and social data

• Who becomes a developer?

• Relationship between communication and collaboration networks

• Network Evolution

• Conway’s Law

Page 21: Mining Email Social Networks in OSS

[Figure: Average In-Degree over time; x-axis: Months, y-axis: Avg In-Degree]