Mining Email Social Networks

Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, Anand Swaminathan (University of California, Davis). Presented by: Arnamoy Bhattacharyya

Upload: arnamoy10

Post on 29-Nov-2014

Page 1: Mining Email Social Networks

Mining Email Social Networks

Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, Anand Swaminathan

University of California, Davis

Presented By: Arnamoy Bhattacharyya

Page 4: Mining Email Social Networks

Communication & Co-ordination (C&C) activities are central to large software projects

Difficult to observe and study in traditional (closed-source, commercial) settings

The email archives of OSS projects provide a useful trace of the participants' communication and co-ordination activities

Page 8: Mining Email Social Networks

CHATTERERS & CHANGERS

A mailing list in an OSS project is a public forum

Anyone can post messages to the list.

Posted messages are visible to all the mailing list subscribers.

Posters include developers, bug reporters, contributors (who submit patches but don't have commit privileges) and ordinary users.

Page 11: Mining Email Social Networks

A response b to a message a is an indication that:

the sender of b (Sb) found that the sender of a (Sa) had something interesting to say

It is also an indication of Sa's status, i.e., Sb indicates that s/he found Sa's email worth reading and worthy of a response.

However, the vast majority of individuals participating on the email list sent very few messages and received very few replies to their messages.

Page 15: Mining Email Social Networks

OF DOGS AND DEVELOPERS

"On the Internet, nobody knows you're a dog" - Peter Steiner

The same individual can use different email aliases

Developer Ian Holsman uses 7 different email aliases

Ignoring these aliases would confound later steps of data analysis

Page 18: Mining Email Social Networks

Unmasking Aliases

Most emails include a header that identifies the sender, of this form:

From: "Bill Stoddard" <[email protected]>

Crawl messages and extract all headers to produce a list of <Name,email> identifiers (IDs)

Execute a clustering algorithm that measures the similarity between every pair of IDs

Manually post-process the resulting clusters to remove further aliases

Set the cluster similarity threshold quite low: it is easier to split big clusters than to unify two disparate clusters from a very large set.
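
The header-extraction step could be sketched roughly as follows, using Python's standard `email` package. This is only an illustration, not the authors' crawler: the function name `extract_id`, the lowercasing of addresses, and the sample message (with a made-up `example.org` address) are all assumptions.

```python
from email import message_from_string
from email.utils import parseaddr


def extract_id(raw_message):
    """Return the (name, email) identifier found in a message's From header."""
    msg = message_from_string(raw_message)
    name, address = parseaddr(msg.get("From", ""))
    # Assumption: addresses are compared case-insensitively.
    return name, address.lower()


# Hypothetical message; the sender and address are invented for illustration.
raw = 'From: "Jane Doe" <JDoe@example.org>\nSubject: hello\n\nbody text\n'
ids = {extract_id(raw)}  # a set removes IDs that are repeated verbatim
```

Collecting the pairs in a set only removes exact repeats; resolving different IDs that belong to one person is the job of the clustering step.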

Page 19: Mining Email Social Networks

THE CLUSTERING ALGORITHM

1. Normalize name

→ Remove all punctuation and suffixes (e.g. "jr")

→ Turn all whitespace into a single space

→ Remove generic terms like "admin" and "support" from the name

→ Split the name into first name and last name (using whitespace and commas as cues)
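
A minimal sketch of the normalization step, under stated assumptions: names are lowercased (not mentioned on the slide), "sr" is added to the suffix list, a comma is read as "Last, First", and the first and last remaining words are taken as first and last name. `normalize_name` is a hypothetical helper name.

```python
import re

GENERIC_TERMS = {"admin", "support"}  # the two examples given on the slide
SUFFIXES = {"jr", "sr"}               # "jr" is from the slide; "sr" is an assumed addition


def normalize_name(name):
    """Return a (first, last) pair after the normalization steps above."""
    # A comma is treated as a cue for "Last, First": swap the two parts.
    if "," in name:
        last, _, first = name.partition(",")
        name = first + " " + last
    # Drop punctuation and lowercase; split() below collapses all whitespace.
    name = re.sub(r"[^\w\s]", "", name).lower()
    words = [w for w in name.split() if w not in GENERIC_TERMS | SUFFIXES]
    if not words:
        return "", ""
    if len(words) == 1:
        return words[0], ""
    # Assumption: any middle names are ignored.
    return words[0], words[-1]
```

For instance, `normalize_name("Smith, John Jr.")` and `normalize_name("John   Smith")` both give `("john", "smith")`.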

Page 20: Mining Email Social Networks

THE CLUSTERING ALGORITHM

2. Name Similarity:

Use a scoring algorithm between –

→ The full names

→ The first name and last name separately

→ Consider names similar if the full names are similar, or if both first and last names are similar

e.g. Andy Smith <-> Andrew Smith

Deepa Patel !<-> Deepa Ratnaswamy
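
The slide does not say which scoring algorithm is used, so the sketch below substitutes `difflib.SequenceMatcher` from the Python standard library as a stand-in. The 0.75 cutoff and the names `score` and `names_similar` are assumptions chosen only to reproduce the two examples above.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.75  # assumed cutoff, picked only for this illustration


def score(a, b):
    """Similarity in [0, 1] between two strings (stand-in scoring algorithm)."""
    return SequenceMatcher(None, a, b).ratio()


def names_similar(id_a, id_b, threshold=THRESHOLD):
    """id_a and id_b are normalized (first, last) pairs."""
    first_a, last_a = id_a
    first_b, last_b = id_b
    full = score(first_a + " " + last_a, first_b + " " + last_b)
    # Both parts must be similar, so the weaker of the two scores decides.
    parts = min(score(first_a, first_b), score(last_a, last_b))
    return full >= threshold or parts >= threshold
```

With these settings, Andy Smith and Andrew Smith are judged similar through their full names, while Deepa Patel and Deepa Ratnaswamy are not, because the last names differ too much.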

Page 21: Mining Email Social Networks

THE CLUSTERING ALGORITHM

3. Name-email Similarity:

→ If the email contains both first and last names – match

Arnamoy Bhattacharyya <-> [email protected]

→ If the email contains the initial of one part of the name and the entirety of the other part – match

Erin Bird <-> ebird

Erin Bird <-> erinb
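
One possible reading of these two rules, as a sketch: the function name `name_matches_email`, the case-insensitive comparison, and the `example.org` addresses used in the usage note are assumptions. The second rule is implemented only in the "initial + other part" order that the slide's examples show.

```python
def name_matches_email(first, last, address):
    """True if the address base looks derived from the given name."""
    base = address.split("@")[0].lower()
    first, last = first.lower(), last.lower()
    if not first or not last:
        return False
    # Rule 1: the address contains both the first and the last name.
    if first in base and last in base:
        return True
    # Rule 2: initial of one part plus the whole of the other part,
    # e.g. "ebird" or "erinb" for Erin Bird.
    return (first[0] + last) in base or (first + last[0]) in base
```

For example, `name_matches_email("Erin", "Bird", "ebird@example.org")` and `name_matches_email("Erin", "Bird", "erinb@example.org")` both match, while `"esmith@example.org"` does not.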

Page 22: Mining Email Social Networks

THE CLUSTERING ALGORITHM

4. Email Similarity:

→ If the Levenshtein edit distance between two email address bases (the part before the "@", excluding the domain) is small – match
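
For reference, the Levenshtein distance can be computed with the standard dynamic-programming recurrence, sketched below. The slide only says the distance must be "small", so the default `max_distance=1` is an assumption, as are the function names.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def emails_similar(addr_a, addr_b, max_distance=1):
    """Compare only the address bases, i.e. the text before the '@'."""
    base_a = addr_a.split("@")[0].lower()
    base_b = addr_b.split("@")[0].lower()
    return levenshtein(base_a, base_b) <= max_distance
```

Because domains are ignored, the same base at two different providers is at distance 0 and counts as a match.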

Page 23: Mining Email Social Networks

THE CLUSTERING ALGORITHM

5. Cumulative ID similarity:

→ The similarity between two IDs is the maximum of all the similarities above

E.g.

Name similarity – 3

Name-email similarity – 5

Email similarity – 2

The cumulative similarity is 5, so with a threshold of 4 the pair would be considered a match
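
The worked example can be expressed as a short sketch. How matched pairs are merged into clusters is not described on the slides, so the single-link grouping (via union-find) in `cluster` is an assumption, and all function names are hypothetical.

```python
def cumulative_similarity(name_sim, name_email_sim, email_sim):
    """The similarity of two IDs is the maximum of the individual measures."""
    return max(name_sim, name_email_sim, email_sim)


def cluster(ids, similarity, threshold):
    """Group IDs so that any pair scoring >= threshold shares a cluster.

    ids is a list; similarity(a, b) returns the cumulative score of a pair.
    Assumption: clusters are formed transitively (single-link).
    """
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# The example from the slide: scores 3, 5 and 2 against a threshold of 4.
example_score = cumulative_similarity(3, 5, 2)
is_match = example_score >= 4
```

A low threshold makes `cluster` merge aggressively, which fits the stated preference for splitting oversized clusters by hand afterwards.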

Page 24: Mining Email Social Networks
Page 25: Mining Email Social Networks

The vast majority of people send only one message, while a few send a great many

Page 26: Mining Email Social Networks
Page 27: Mining Email Social Networks

Out-degree: the number of different people from whom an individual has received responses

Higher out-degree <-> higher status

Page 29: Mining Email Social Networks

In-degree: the number of different people to whom an individual has replied

Indicates the level of engagement of an individual in the mailing list and the breadth of his/her interests

The distributions show a small-world character
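
Given a list of reply relationships, the in-degree and out-degree as defined in this deck could be computed along these lines. The input format (pairs of replier and original sender), the function name `degrees`, and the decision to skip self-replies are assumptions; the names in the sample data are invented.

```python
from collections import defaultdict


def degrees(replies):
    """replies: iterable of (replier, original_sender) pairs.

    Returns (in_degree, out_degree) dicts using the deck's definitions:
    out-degree counts distinct people who replied to a person,
    in-degree counts distinct people a person has replied to.
    """
    responders = defaultdict(set)  # person -> people who replied to them
    replied_to = defaultdict(set)  # person -> people they replied to
    for replier, original in replies:
        if replier == original:
            continue  # assumption: replies to one's own message are ignored
        responders[original].add(replier)
        replied_to[replier].add(original)
    in_degree = {p: len(s) for p, s in replied_to.items()}
    out_degree = {p: len(s) for p, s in responders.items()}
    return in_degree, out_degree


# Invented sample: bob replies to alice twice, carol once, alice replies to bob.
sample = [("bob", "alice"), ("carol", "alice"), ("bob", "alice"), ("alice", "bob")]
in_deg, out_deg = degrees(sample)
```

Repeated replies between the same two people count once, since both measures count distinct people rather than messages.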

Page 30: Mining Email Social Networks

High correlation between the number of messages sent and the number of replies received (out-degree): 0.97
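
The slide does not state which correlation coefficient was used. Assuming a Pearson coefficient, it could be computed as in this sketch, where `pearson` is a hypothetical helper and no real project data is involved.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant.
    return cov / sqrt(var_x * var_y)
```

Here `xs` would hold each person's message count and `ys` the matching out-degree, one entry per person.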

Page 31: Mining Email Social Networks

The correlation may not mean what it seems:

1. People who post only relevant messages receive many responses

2. Only people who receive replies from several people keep sending messages (survival effect)

Page 32: Mining Email Social Networks

Each link indicates at least 150 messages sent

Page 34: Mining Email Social Networks

C&C ACTIVITY AND DEVELOPMENT ACTIVITY

How does email activity relate to software development activity?

73 committers:

1. A correlation of 0.80 between the number of messages sent by an individual and the number of source changes they make

more software development work <-> more C&C activity

2. A correlation of 0.57 between the number of messages sent by an individual and the number of document changes they make

source code activities require much more co-ordination effort than documentation activities

Page 37: Mining Email Social Networks

Are developers more likely to play the role of gatekeepers or brokers in the complete email social network?

Betweenness (BW):

High betweenness <-> the person is a kind of broker, or gatekeeper
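
Betweenness counts, for each person, the share of shortest paths between other pairs that pass through that person. The sketch below uses Brandes' algorithm on an unweighted, undirected graph; whether the study treated the email network as directed or weighted is not stated on the slides, so that choice and the function name are assumptions.

```python
from collections import deque


def betweenness(graph):
    """graph: dict mapping each node to the set of its neighbours (undirected).

    Returns unnormalized betweenness centrality for every node.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}
```

In a star-shaped network, where every conversation between leaves passes through one hub, the hub gets all the betweenness and the leaves get none, which is the broker or gatekeeper pattern.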

Page 39: Mining Email Social Networks

Developers are higher in status than non-developers

Page 42: Mining Email Social Networks

Relative Status of Developers

Do the most active developers have the highest status among developers?

Source changes are not as highly correlated with document changes <-> not all developers are engaged in both to the same degree

Source changes show the strongest rank correlation with social network status <-> the most active developers play the strongest role of communicators, brokers, and gatekeepers
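
The slide mentions a rank correlation without naming it. Assuming Spearman's coefficient (Kendall's tau would be another candidate), it could be sketched as the Pearson correlation of the ranks, with ties given their average rank. `ranks` and `spearman` are hypothetical helper names.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)
```

A rank correlation only asks whether developers are ordered the same way by both measures, so it is insensitive to the heavy skew in raw commit and message counts.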

Page 45: Mining Email Social Networks

Conclusion

The level of activity on the mailing list is strongly correlated with source code change activity, and to a lesser extent with document change activity.

Social network measures such as in-degree, out-degree and betweenness indicate that developers who actually commit changes play much more significant roles in the email community than non-developers.

Even within the select group of developers, there is a strong correlation between social network importance and the level of source code change activity.

Page 46: Mining Email Social Networks

Questions?