Mining Email Social Networks
Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, Anand Swaminathan
University of California, Davis
Presented By: Arnamoy Bhattacharyya
Communication & Co-ordination (C&C) activities are central to large software projects
Difficult to observe and study in traditional (closed-source, commercial) settings
The email archives of OSS projects provide a useful trace of the communication and co-ordination activities of the participants
CHATTERERS & CHANGERS
A mailing list in an OSS project is a public forum
Anyone can post messages to the list.
Posted messages are visible to all the mailing list subscribers.
Posters include developers, bug-reporters, contributors (who submit patches, but don't have commit privileges) and ordinary users.
A response b to a message a is an indication that:
the sender of b (Sb) found that the sender of a (Sa) had something interesting to say
It is also an indication of Sa’s status, i.e., Sb indicates that s/he found Sa's email worth reading, and worthy of response.
However, the vast majority of individuals participating on the email list sent very few messages, and received very few replies to their messages
OF DOGS AND DEVELOPERS
“On the Internet, nobody knows you're a dog.” - Peter Steiner
The same individual can use different email aliases
For example, developer Ian Holsman uses 7 different email aliases
Ignoring these aliases would confound later steps of data analysis
Unmasking Aliases
Most emails include a header that identifies the sender, of this form:
From: "Bill Stoddard" <[email protected]>
Crawl messages and extract all headers to produce a list of <name, email> identifiers (IDs)
Execute a clustering algorithm that measures the similarity between every pair of IDs
Manually post-process the resulting clusters to resolve the remaining aliases
Set the cluster similarity threshold quite low: it is easier to split big clusters than to unify two disparate clusters from a very large set.
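The header-extraction step could be sketched roughly as follows, using Python's standard `email.utils.parseaddr`. The addresses are hypothetical placeholders (the real ones are obfuscated in this transcript), and lower-casing the address is an assumption, not something the slides state.

```python
from email.utils import parseaddr

def extract_id(from_header):
    """Return a (name, email) pair from the value of a From: header."""
    name, address = parseaddr(from_header)
    # Lower-casing the address is an assumption made for this sketch.
    return name.strip(), address.strip().lower()

# Hypothetical header values, for illustration only.
headers = [
    '"Bill Stoddard" <bill@example.org>',
    'Bill Stoddard <BILL@example.org>',
    'stoddard@example.com',
]

# The set removes exact duplicates; near-duplicates are left for clustering.
ids = sorted({extract_id(h) for h in headers})
```

Here the first two headers collapse into one ID, while the third (no display name) remains a separate ID for the clustering step to consider.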
THE CLUSTERING ALGORITHM
1. Normalize name
→ Remove all punctuation and suffixes ("jr")
→ Turn each run of whitespace into a single space
→ Remove generic terms like "admin" and "support" from the name
→ Split the name into first name and last name (using whitespace and commas as cues)
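A minimal sketch of this normalization step, under stated assumptions: only "jr", "admin" and "support" come from the slide; the other suffixes, the lower-casing, and the "Last, First" handling are illustrative guesses, and `normalize_name` is a hypothetical helper name.

```python
import re

GENERIC_TERMS = {"admin", "support"}    # examples given on the slide
SUFFIXES = {"jr", "sr", "ii", "iii"}    # only "jr" is from the slide; the rest are assumed

def normalize_name(raw):
    """Return (full, first, last) for a raw display name."""
    # Use a comma as a cue: "Last, First" becomes "First Last".
    if "," in raw:
        last, _, first = raw.partition(",")
        raw = f"{first} {last}"
    # Lower-case and remove punctuation.
    cleaned = re.sub(r"[^\w\s]", "", raw.lower())
    # split() also collapses runs of whitespace.
    words = [w for w in cleaned.split()
             if w not in GENERIC_TERMS and w not in SUFFIXES]
    full = " ".join(words)
    first = words[0] if words else ""
    last = words[-1] if len(words) > 1 else ""
    return full, first, last
```

For instance, `normalize_name("Smith, John Jr.")` yields `("john smith", "john", "smith")`.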
2. Name similarity:
Use a scoring algorithm between:
→ The full names
→ The first names and last names separately
Consider names similar if the full names are similar, or if both first and last names are similar
e.g. Andy Smith <-> Andrew Smith
Deepa Patel !<-> Deepa Ratnaswamy
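The slides do not name the scoring algorithm, so the sketch below substitutes `difflib.SequenceMatcher` from the Python standard library as a stand-in. The 0.75 threshold and the `names_similar` helper are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity in [0, 1]; a stand-in for the unspecified scoring algorithm."""
    return SequenceMatcher(None, a, b).ratio()

def names_similar(name_a, name_b, threshold=0.75):
    """name_a and name_b are (first, last) pairs of normalized names."""
    first_a, last_a = name_a
    first_b, last_b = name_b
    full_score = ratio(f"{first_a} {last_a}", f"{first_b} {last_b}")
    # Both parts must be similar, so take the weaker of the two scores.
    parts_score = min(ratio(first_a, first_b), ratio(last_a, last_b))
    return full_score >= threshold or parts_score >= threshold
```

With these settings the two slide examples behave as described: the Andy/Andrew Smith pair is treated as similar, while the Deepa Patel/Deepa Ratnaswamy pair is not, because a shared first name alone is not enough.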
3. Name-email similarity:
→ If the email contains both first and last names: match
Arnamoy Bhattacharyya <-> [email protected]
→ If the email contains the initial of one part of the name and the entirety of the other part: match
Erin Bird <-> ebird
Erin Bird <-> erinb
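A rough sketch of the two name-email rules. The `example.org` addresses are hypothetical, and the sketch covers only the two initial-plus-name orderings shown in the examples (first initial + last name, first name + last initial); other orderings would be an extension.

```python
def name_matches_email(first, last, address):
    """Apply the two name-email rules to the part of the address before '@'."""
    base = address.split("@", 1)[0].lower()
    first, last = first.lower(), last.lower()
    if not first or not last:
        return False
    # Rule 1: the address contains both the first and the last name.
    if first in base and last in base:
        return True
    # Rule 2: initial of one part plus the entirety of the other part.
    return (first[0] + last) in base or (first + last[0]) in base
```

Because the rules use substring containment, they are deliberately permissive, which fits the low-threshold, split-later strategy described earlier.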
4. Email similarity:
→ If the Levenshtein edit distance between two email address bases (the part before the "@", not including the domain) is small: match
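The edit-distance check can be sketched with the standard dynamic-programming formulation of Levenshtein distance. The slides do not say what counts as "small", so `max_distance=2` is an assumed cut-off, and the addresses in the usage note are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def emails_similar(addr_a, addr_b, max_distance=2):
    """Compare only the address bases (before '@'); max_distance is an assumption."""
    base_a = addr_a.split("@", 1)[0].lower()
    base_b = addr_b.split("@", 1)[0].lower()
    return levenshtein(base_a, base_b) <= max_distance
```

For example, `ebird@example.org` and `e.bird@example.com` have bases one edit apart, so they would be flagged as similar regardless of the differing domains.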
5. Cumulative ID similarity:
→ The similarity between two IDs is the maximum of all the similarities above
e.g.
Name similarity: 3
Name-email similarity: 5
Email similarity: 2
The cumulative similarity is 5; if the threshold is 4, the pair is considered a match
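The following sketch combines the individual scores with `max` and then groups IDs whose cumulative score reaches the threshold. The slides do not describe how clusters are actually formed, so the union-find grouping (connected components) and the "greater than or equal to" comparison are assumptions made for illustration.

```python
def cumulative_similarity(scores):
    """Overall ID similarity: the maximum of the individual scores."""
    return max(scores)

def cluster(ids, pair_scores, threshold):
    """Group IDs; pair_scores maps (id_a, id_b) to a list of similarity scores."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), scores in pair_scores.items():
        if cumulative_similarity(scores) >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

# The slide's example scores for one pair, plus two made-up pairs.
pair_scores = {("A", "B"): [3, 5, 2], ("A", "C"): [1, 2, 1], ("B", "C"): [0, 3, 2]}
clusters = cluster(["A", "B", "C"], pair_scores, threshold=4)
```

With a threshold of 4, only the pair scoring 5 is merged, giving the clusters `[["A", "B"], ["C"]]`.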
The vast majority of people send only one message, and there are some who send a great many
Out-degree - # of different people from whom an individual has received responses
Higher out-degree <-> higher status
In-degree - # of different people to whom an individual has replied
Indicates the level of engagement of an individual in the mailing list and the breadth of his/her interests
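Using the definitions above, both degrees can be computed from a list of (responder, original sender) pairs. This is a minimal sketch; the input format, the names, and the choice to ignore self-replies are assumptions.

```python
from collections import defaultdict

def degrees(replies):
    """replies: iterable of (responder, original_sender) pairs, one per reply."""
    responders_of = defaultdict(set)  # person -> distinct people who replied to them
    replied_to = defaultdict(set)     # person -> distinct people they replied to
    for responder, original in replies:
        if responder == original:
            continue  # ignoring self-replies is an assumption of this sketch
        responders_of[original].add(responder)
        replied_to[responder].add(original)
    people = set(responders_of) | set(replied_to)
    out_degree = {p: len(responders_of.get(p, ())) for p in people}
    in_degree = {p: len(replied_to.get(p, ())) for p in people}
    return out_degree, in_degree

# Hypothetical reply data.
replies = [("bob", "alice"), ("carol", "alice"), ("bob", "alice"), ("alice", "bob")]
out_degree, in_degree = degrees(replies)
```

Sets are used so that repeated replies between the same two people count once, matching the "# of different people" wording.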
The distributions show a small-world character
High correlation (0.97) between messages sent and replies received (out-degree)
The correlation may not be causal; possible explanations:
1. People who post only relevant messages receive many responses
2. Only people who receive replies from several people keep sending messages (survival effect)
Each link indicates at least 150 messages sent
C&C ACTIVITY AND DEVELOPMENT ACTIVITY
How does email activity relate to software development activity?
Among the 73 committers:
1. A correlation of 0.80 between the number of messages sent by an individual and the number of source changes they make:
more software development work <-> more C&C activity
2. A correlation of 0.57 between the number of messages sent by an individual and the number of document changes they make:
source code activities require much more co-ordination effort than documentation effort
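The slides do not say which correlation coefficient produced these figures. As one possibility, a Pearson correlation between per-person message counts and change counts could be computed as sketched below; the function and the sample numbers are illustrative assumptions, not data from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up counts for four hypothetical committers.
messages_sent = [120, 45, 300, 10]
source_changes = [80, 30, 150, 5]
r = pearson(messages_sent, source_changes)
```

A value of `r` near 1 would indicate that committers who send more messages also tend to make more changes.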
Are developers more likely to play the role of gatekeepers or brokers in the complete email social network?
Betweenness (BW): how often a person lies on the shortest paths between other pairs of people in the network
High betweenness <-> the person is a kind of broker, or gatekeeper
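Betweenness can be computed with Brandes' algorithm. The sketch below is an unnormalized version for a directed, unweighted graph; the slides do not say which variant or tool was used, so the graph representation and the example network are assumptions.

```python
from collections import deque

def betweenness(graph):
    """graph: dict mapping each node to a set of successor nodes (directed, unweighted)."""
    nodes = set(graph) | {w for nbrs in graph.values() for w in nbrs}
    bw = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bw[w] += delta[w]
    return bw

# Hypothetical network: three people who only communicate through "hub".
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
scores = betweenness(star)
```

In this example every shortest path between two of the outer people passes through "hub", so it receives all of the betweenness, which is the broker or gatekeeper pattern described above.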
Developers are higher in status than non-developers
Relative Status of Developers
Do the most active developers have the highest status among developers?
Source changes are not as highly correlated with document changes <-> not all developers are engaged in both to the same degree
Source changes show the strongest rank correlation with social network status <-> the most active developers play the strongest role of communicators, brokers, and gatekeepers
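A rank correlation of this kind could be computed with Spearman's coefficient, sketched below using the closed-form formula that holds when there are no tied values. Whether the study used Spearman's coefficient specifically is an assumption, and the sample data are made up.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; this sketch assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Made-up values for four hypothetical developers.
source_changes = [150, 80, 30, 5]
betweenness_scores = [0.40, 0.25, 0.10, 0.01]
rho = spearman(source_changes, betweenness_scores)
```

Because only the ordering matters, the coefficient is 1 whenever the most active developer also has the highest status, the second most active the second highest, and so on.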
Conclusion
The level of activity on the mailing list is strongly correlated with source code change activity, and to a lesser extent with document change activity.
Social network measures such as in-degree, out-degree and betweenness indicate that developers who actually commit changes play much more significant roles in the email community than non-developers.
Even within the select group of developers, there is a strong correlation between the social network importance and level of source code change activity.
Questions?