Yuqiang Han¹, Yao Wan¹˒³, Liang Chen², Guandong Xu³, and Jian Wu¹
¹ Zhejiang University, Hangzhou, China
² Sun Yat-Sen University, Guangzhou, China
³ University of Technology, Sydney, Australia
May 25, 2017
Exploiting Geographical Location for Team Formation in Social Coding Sites
PAKDD 2017, May 23-26, 2017, Jeju, South Korea
2
Outline
❑ Introduction
  ▪ Background
  ▪ Motivation
  ▪ Challenge
❑ Model
  ▪ Communication Cost
  ▪ Geographical Proximity Cost
  ▪ Combined Cost
  ▪ GA-based Optimization
❑ Dataset
❑ Experiments
❑ Conclusion and Future Work
3
Background
❖ What is team formation?
Given a task and a set of experts (organized in a network), find the subset of experts that can effectively perform the task.
❖ Applications
✓ Collaboration networks (e.g., scientists, developers)
✓ Organizational structure of companies
✓ Team-based hiring
4
Background
T = {C++, Python, Graphics, Algorithms}
A: {Algorithms}
B: {C++}
C: {Python, Graphics}
D: {C++, Algorithms}
E: {Python, Graphics, C++}
[Figure: network connecting the experts A, B, C, D and E]
What's the best team we can recommend?
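The skill-coverage part of this example can be checked with a short brute-force sketch. It uses only the task and skill sets listed on the slide and ignores the network edges, which are not recoverable from the transcript. The function name `smallest_covering_teams` is a hypothetical helper, not something from the talk.

```python
from itertools import combinations

# Skills required by the task T and the skills of each expert, as on the slide.
task = {"C++", "Python", "Graphics", "Algorithms"}
experts = {
    "A": {"Algorithms"},
    "B": {"C++"},
    "C": {"Python", "Graphics"},
    "D": {"C++", "Algorithms"},
    "E": {"Python", "Graphics", "C++"},
}

def smallest_covering_teams(task, experts):
    """Return all teams of minimum size whose pooled skills cover the task."""
    names = sorted(experts)
    for size in range(1, len(names) + 1):
        teams = [
            team for team in combinations(names, size)
            if task <= set().union(*(experts[name] for name in team))
        ]
        if teams:
            return teams
    return []

teams = smallest_covering_teams(task, experts)
print(teams)  # [('A', 'E'), ('C', 'D'), ('D', 'E')]
```

Several teams of two cover the task equally well, which is why an additional criterion such as communication cost is needed to rank them.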
5
Background
❖ What is Social Coding?
It is an approach to software development focusing on effective collaboration.
6
Background
❖ Geographical proximity
Geographical proximity is playing an increasingly important role in many domains, such as knowledge production and technological innovation, in spite of rapid development in telecommunications technology.
7
Motivation
❖ Team formation in social coding sites
➢ Developers (defining the set $V$, with $|V| = n$)
  ➢ Every developer $i$ is associated with a set of skills $S_i$
  ➢ and a geographical location $g_i$
➢ Projects
  ➢ Every project $P$ is associated with a set of skills required for completing the project
➢ A social coding network of developers ($G = (V, E, w)$)
  ➢ Weight on the edge indicates communication cost
8
Motivation
[Figure: example social coding network of developers a, b, c, d and e with edge weights 0.2, 0.2, 0.3, 0.3 and 0.1; legend: Location, Developer, Skill, Project]
❖ Given a project and a social coding network of developers, find the subset (team) of developers such that
I. each skill in the project is covered by the specified number of developers
II. each developer covers one and only one skill
III. the communication cost and the geographical proximity cost are as low as possible
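Constraints I and II can be expressed as a small feasibility check. The sketch below is illustrative only: the function `is_feasible`, its argument layout and the toy data are assumptions, not part of the talk.

```python
from collections import Counter

def is_feasible(assignment, required, skills_of):
    """Check constraints I and II for a candidate team.

    assignment: developer -> the single skill that developer covers
                (constraint II holds by construction, one value per key)
    required:   skill -> number of developers the project needs for it
    skills_of:  developer -> set of skills the developer actually has
    """
    # A developer may only be assigned a skill he or she actually has.
    if any(skill not in skills_of[dev] for dev, skill in assignment.items()):
        return False
    # Constraint I: each skill is covered by the specified number of developers.
    return Counter(assignment.values()) == Counter(required)

# Hypothetical toy data.
skills_of = {"a": {"python"}, "b": {"python", "web"}, "c": {"web"}}
required = {"python": 1, "web": 1}

print(is_feasible({"a": "python", "c": "web"}, required, skills_of))     # True
print(is_feasible({"a": "python", "b": "python"}, required, skills_of))  # False
```

Constraint III is the optimization objective rather than a feasibility condition, so it is not checked here.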
9
Challenge
How to define the communication cost?
How to define the geographical proximity cost?
How to combine the communication cost and the geographical proximity cost?
10
Communication Cost
❖ Communication cost between two developers
The communication cost is the sum of weights on the shortest path between two developers in social coding networks.
The lower the communication cost is, the more easily they can collaborate with each other.
In social coding networks such as GitHub, the weights of edges are defined as
$w(u, v) = 1 - \frac{|N_u \cap N_v|}{|N_u \cup N_v|}$
where $N_u$ and $N_v$ are the sets of projects in which $u$ and $v$, respectively, are listed as contributors.
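The edge weight is one minus the Jaccard similarity of the two developers' project sets. A minimal sketch follows. The function name `edge_weight` and the project identifiers are hypothetical, and returning 1.0 when both sets are empty is an assumption, since the slide does not cover that case.

```python
def edge_weight(projects_u, projects_v):
    """Return 1 - |N_u ∩ N_v| / |N_u ∪ N_v| for two sets of project ids."""
    union = projects_u | projects_v
    if not union:
        # Assumption: with no known projects, treat the cost as maximal.
        return 1.0
    return 1.0 - len(projects_u & projects_v) / len(union)

# Two developers sharing 2 of 4 distinct projects.
n_u = {"p1", "p2", "p3"}
n_v = {"p2", "p3", "p4"}
print(edge_weight(n_u, n_v))  # 0.5
```

Developers with identical project sets get weight 0, and developers with no project in common get weight 1.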
11
Communication Cost
❖ Communication cost of a team
Given a social coding network $G$ whose edges are weighted by the communication cost between two developers, and a team $T$ of developers from $G$, the communication cost of $T$ is defined as
$SCC(T) = \sum_{i=1}^{|T|} \sum_{j=i+1}^{|T|} cc(e_i, e_j)$
where $cc(e_i, e_j)$ is the communication cost between developers $e_i$ and $e_j$.
Kargar, Mehdi, and Aijun An. "Discovering top-k teams of experts with/without a leader in social networks."
[Figure: example team with developers A, B, C and E; edge weights shown: 0.5, 0.3, 0.2, 0.4]
SCC = 0.2 + 0.6 + 0.9 + 0.4 + 0.7 + 0.3 = 3.1
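The sum of pairwise shortest-path costs can be sketched as follows. The topology of the slide's figure is not recoverable from the transcript, so the path-shaped network below is a hypothetical one, chosen only because its six pairwise distances match the six terms in the sum above. Floyd-Warshall is used here for brevity; the talk does not say which shortest-path algorithm was used.

```python
from itertools import combinations

def all_pairs_shortest_paths(nodes, edges):
    """Floyd-Warshall on an undirected weighted graph; edges maps (u, v) -> weight."""
    inf = float("inf")
    dist = {u: {v: (0.0 if u == v else inf) for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def scc(team, dist):
    """Sum of communication costs over all unordered pairs of team members."""
    return sum(dist[u][v] for u, v in combinations(team, 2))

# Hypothetical network: A -0.2- B -0.4- C -0.3- E (not the slide's actual figure).
nodes = ["A", "B", "C", "E"]
edges = {("A", "B"): 0.2, ("B", "C"): 0.4, ("C", "E"): 0.3}
dist = all_pairs_shortest_paths(nodes, edges)
print(round(scc(nodes, dist), 2))  # 3.1
```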
12
Geographical Proximity Cost
❖ Geographical proximity cost of a team
Given a team $T$ of experts, each having a location code, the geographical proximity cost of team $T$ is defined as
$SGP(T) = \sum_{i=1}^{|T|} \sum_{j=i+1}^{|T|} gp(e_i, e_j)$
where $gp(e_i, e_j)$ is the geographical proximity cost between developers $e_i$ and $e_j$.
The geographical proximity cost of two developers is defined as
$gp(u, v) = \begin{cases} 0, & \text{if } u \text{ and } v \text{ are in the same region} \\ 1, & \text{otherwise} \end{cases}$
It is related to the differences in culture, work habits and so on.
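A minimal sketch of the two definitions above. The developer ids and the use of country codes as location codes are assumptions made for illustration; the slide only says that each expert has a location code.

```python
from itertools import combinations

def gp(region_u, region_v):
    """Pairwise geographical proximity cost: 0 for the same region, 1 otherwise."""
    return 0 if region_u == region_v else 1

def sgp(team, region_of):
    """Sum of gp over all unordered pairs of team members."""
    return sum(gp(region_of[u], region_of[v]) for u, v in combinations(team, 2))

# Hypothetical team: two developers in one region, one in another.
region_of = {"d1": "CN", "d2": "CN", "d3": "AU"}
print(sgp(["d1", "d2", "d3"], region_of))  # 2
```

Under this definition, SGP counts the pairs of team members located in different regions, so a fully co-located team has cost 0.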
13
Combined Cost
❖ The combined cost function
Given a social coding network and a trade-off $\lambda$ between the communication cost and the geographical proximity cost, we define the combined cost of the team $T$ as
$ComCost(T) = (1 - \lambda) \times SCC(T) + \lambda \times SGP(T)$
The parameter $\lambda$, varying from 0 to 1, indicates the trade-off between communication cost and geographical proximity cost.
Team formation by minimizing the combined cost is NP-hard.
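The combined cost is a convex combination of the two team-level costs. In the sketch below, `com_cost` is a hypothetical helper name, the value SCC = 3.1 is taken from the earlier worked example, and SGP = 2 is an assumed value used only for illustration.

```python
import math

def com_cost(scc_value, sgp_value, lam=0.5):
    """Combined cost (1 - lam) * SCC + lam * SGP, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * scc_value + lam * sgp_value

# SCC = 3.1 from the earlier example; SGP = 2 is an assumed value.
print(com_cost(3.1, 2, lam=0.5))
print(com_cost(3.1, 2, lam=0.0))  # communication cost only
print(com_cost(3.1, 2, lam=1.0))  # geographical proximity cost only
```

Setting lam to 0 ignores location entirely, and setting it to 1 ignores the network, so the two extremes recover the single-objective problems.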
14
GA-based Optimization
❖ Genetic algorithm based optimization
[Figure: encoding example, a chromosome with slots for Skill 1 to Skill 4 filled by developers d1 to d8]
[Figure: flowchart with the steps Encoding, Initial population generation, Evaluation, Termination criterion? (Yes: Solution set; No: Selection, Crossover, Mutation, Evaluation)]
Fitness function = Combined cost function
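The loop in the flowchart can be sketched as below. The slides name the steps but not the operators, so tournament selection, one-point crossover, single-slot mutation and the duplicate penalty are all assumptions, as are the function names and the toy data. For brevity the sketch assumes one developer per skill. The default parameter values are the ones listed on the experimental setup slide.

```python
import random
from itertools import combinations

def ga_team_formation(skills, candidates, cost, pop_size=200, generations=100,
                      p_crossover=0.2, p_mutation=0.8, seed=0):
    """Minimise cost(team), where team[i] is the developer assigned to skills[i]."""
    rng = random.Random(seed)

    def random_team():
        return [rng.choice(candidates[s]) for s in skills]

    def fitness(team):
        # Assumed penalty: discourage reusing one developer for two skills.
        duplicates = len(team) - len(set(team))
        return cost(team) + 1e6 * duplicates

    def tournament(population):
        a, b = rng.sample(population, 2)
        return a if fitness(a) <= fitness(b) else b

    population = [random_team() for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            child = list(tournament(population))
            if len(skills) > 1 and rng.random() < p_crossover:
                # One-point crossover keeps every slot tied to its skill.
                other = tournament(population)
                cut = rng.randrange(1, len(skills))
                child = child[:cut] + other[cut:]
            if rng.random() < p_mutation:
                # Mutation: re-draw the developer for one random skill slot.
                i = rng.randrange(len(skills))
                child[i] = rng.choice(candidates[skills[i]])
            offspring.append(child)
        population = offspring
        best = min(population + [best], key=fitness)
    return best, fitness(best)

# Hypothetical toy instance: flat communication cost of 0.5 per pair.
skills = ["python", "web", "db"]
candidates = {"python": ["d1", "d2"], "web": ["d2", "d3"], "db": ["d4", "d5"]}
region = {"d1": "CN", "d2": "CN", "d3": "AU", "d4": "CN", "d5": "AU"}

def toy_cost(team, lam=0.5):
    pairs = list(combinations(team, 2))
    scc_value = 0.5 * len(pairs)
    sgp_value = sum(region[u] != region[v] for u, v in pairs)
    return (1 - lam) * scc_value + lam * sgp_value

best_team, best_cost = ga_team_formation(skills, candidates, toy_cost)
print(best_team, best_cost)
```

On this toy instance the only duplicate-free team located in a single region is d1, d2, d4, so that is the team the search should settle on.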
15
Dataset
[Figure (a): Top 10 countries with the largest number of developers.]
[Figure (b): Distribution of location diversity, considering the composition of teams.]
GitHub: 36,701 developers, 3,532,453 projects, 1,610,072 edges.
Observation: in most teams (nearly 55%), the developers come from only one or two countries.
16
Experiments
❖ Experimental setup
Parameter                       Value
Population size                 200
Number of generations           100
Crossover probability           0.2
Mutation probability            0.8
Number of skills $k$            10
Trade-off $\lambda$             0.5
Iterations per experiment       10
17
Experiments
❖ Evaluation metrics
1. Communication cost: reveals the efficiency of communication between developers
2. Geographical proximity cost: reveals how close the developers of the team are in terms of geographical location
3. Combined cost: reveals the combined effect of communication cost and geographical proximity cost
❖ Performance comparison
1. Random Algorithm
2. Approximation Rare Algorithm
3. Minimum Cost Contribution Rare Algorithm
18
Experiments
❖ Experimental results
Analysis:
1. The proposed model achieves better performance because it considers the geographical proximity during the process of finding an optimal team.
2. The proposed model achieves better performance because it has a larger search space.
3. The proposed model achieves better performance because it considers both costs.
19
Experiments
❖ Impact of number of skills
To study the impact of the number of skills on the performance, we vary $k$ over {2, 4, 6, 8, 10}. For each $k$, we generate 10 random projects and take the average result. The proposed model always achieves better performance.
20
Conclusion and Future Work
❖ Conclusion
1. Exploit the geographical location of developers to boost the performance of team formation in social coding sites.
2. Incorporate the communication cost and geographical proximity cost into a unified objective function and employ a genetic algorithm to optimize it.
3. Comprehensive experiments on a real-world dataset illustrate the effectiveness of the proposed approach.
❖ Future Work
1. Investigate the impact of social media on the performance of team formation.
2. Exploit the interaction patterns for the accurate interpretation of link strength between developers.