Future Collaboration Prediction in Co-authorship Network
Roopashree N
Post Graduate Student
Department of CSE, BMS College of Engineering
Bangalore, India
Umadevi V
Associate Professor
Department of CSE, BMS College of Engineering
Bangalore, India
Abstract: Social networking is in widespread use today. Co-authorship
networks, which record research collaborations, are an important class
of social networks. Research collaborations often yield good results,
but organizing a research group is a tedious task. Every researcher
wants to collaborate with experts whose skills complement his or her
own. Although abundant research has been conducted on finding future
collaborators or links, few methods are able to find effective
relationships among them. In this article, we propose a method that
makes link predictions in co-authorship networks using a supervised
approach. The model extracts features from the network's node and
topological structure that can be good indicators of future
collaborations. The proposed method was evaluated on a synthetic
network as well as a real social network, NetScience. Our experiments
demonstrated the effectiveness of the method.
Keywords—SVM; Co-authorship network; Future
collaboration
I. Introduction
Social Network Analysis (SNA) has evolved into a key research area
that has attracted considerable attention in recent years. It focuses
on relationships between individuals, also referred to as social
entities. A social network can be defined as a network of
interactions whose nodes represent people or other entities
embedded in a social context, and whose edges signify the
interaction, collaboration, or influence between entities, driven by
mutual interests intrinsic to the group.
In general, social networks are extremely rich in
content, and they contain a very large amount of linkage data
which can be leveraged for analysis. The linkage data
constitutes the graph structure of the social network. The
availability of massive amounts of data has given a new
impetus towards a scientific and statistically robust study in
the field of social networks. This data-centric thrust has led to
a significant amount of research, which has been unique in its
statistical and computational focus in analyzing large amounts
of online social network data.
SNA helps to identify highly peripheral people, who
essentially represent untapped expertise and thus
underutilized resources for the group.
Social networks are highly dynamic; they grow and
change quickly over time through the addition of new edges,
signifying the appearance of new connections in the
underlying social structure. Understanding the mechanisms
that drive this volatility is a fundamental and complex question
that is still not well understood. However, an important problem
that can be addressed is predicting future associations and the
factors driving them. This problem is known as link prediction,
and it is a key research direction within social network analysis.
Prediction of imminent links in a co-authorship graph
is an important research direction, since it is conceptually and
structurally identical to the realistic social network problem in
which scientists in a community interact to achieve a common goal.
More formally, the link prediction task can be
formulated as follows (based upon the definition in Liben-Nowell
and Kleinberg [1]): We are given a social network $G = (V, E)$ in
which an edge $e = (u, v) \in E$ represents some form of
interaction between its endpoints at a particular time $t(e)$.
Multiple interactions can be recorded by parallel edges or by
using a complex timestamp for an edge. For times $t \le t'$, we
assume that $G[t, t']$ denotes the subgraph of $G$ restricted to
the edges with timestamps between $t$ and $t'$. In a supervised
training setup for link prediction, we can choose a training
interval $[t_0, t_0']$ and a test interval $[t_1, t_1']$, where
$t_0' < t_1$. The link prediction task is then to output a list of
edges not present in $G[t_0, t_0']$ that are predicted to appear in
the network $G[t_1, t_1']$.
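The interval-based formulation above can be illustrated with a small sketch. The edge list, the year values, and the helper name `subgraph` are hypothetical and chosen only for illustration; they are not taken from the paper's datasets.

```python
from itertools import combinations

# Hypothetical timestamped co-authorship edges: (author_u, author_v, year).
edges = [
    ("a", "b", 2001), ("b", "c", 2002), ("c", "d", 2003),
    ("a", "c", 2005), ("b", "d", 2006),
]

def subgraph(edges, t_start, t_end):
    """Return the undirected edge set of G[t_start, t_end]."""
    return {frozenset((u, v)) for u, v, t in edges if t_start <= t <= t_end}

train = subgraph(edges, 2001, 2003)  # plays the role of G[t0, t0']
test = subgraph(edges, 2004, 2006)   # plays the role of G[t1, t1']

# Candidate pairs: node pairs of the training graph that are not yet linked.
nodes = sorted(set().union(*train))
candidates = {frozenset(p) for p in combinations(nodes, 2)} - train

# Ground truth for evaluation: candidates that do appear in the test interval.
new_links = candidates & test
```

Here the predictor would be asked to rank or classify the three unlinked pairs, and the two pairs in `new_links` are the ones it should recover.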
The prime goal of this work is to propose an efficient
technique for designing a link predictor in networks where
nodes represent researchers and links represent
collaborations. This goal raises the following challenges:
• Extracting node and topological features and using
them in combination to obtain better results.
• Link prediction datasets are characterized by a large
imbalance in the distribution of class labels, i.e., the
number of existing edges is often far smaller than the
number of node pairs known not to be linked.
• The proposed method must compute efficiently when
scaled to a large network consisting of a large number
of nodes and edges.
2014 3rd International Conference on Eco-friendly Computing and Communication Systems
978-1-4799-7002-5/14 $31.00 © 2014 IEEE
DOI 10.1109/ICECCS.2014.45
II. Related Work
A. Background
The earliest and most basic link prediction model, proposed by
Liben-Nowell and Kleinberg [1], works explicitly on a social
network. Every vertex in the graph represents a person, and an edge
between two vertices represents an interaction between the persons.
Multiplicity of interactions can be modelled explicitly by allowing
parallel edges or by adopting a suitable weighting scheme for the
edges. The learning paradigm in this setup typically extracts
the similarity between a pair of vertices using various graph-based
similarity metrics and uses the ranking of the similarity
scores to predict a link between two vertices. The authors
concentrate mostly on the performance of various graph-based
similarity metrics for the link prediction task.
Recent methods were surveyed by Al Hasan and Zaki [2], covering a
variety of link prediction techniques ranging from feature-based
classification and kernel-based methods to matrix factorization
and probabilistic graphical models. These methods vary with
respect to model complexity, prediction performance,
scalability, and generalization ability. The survey considers the
traditional (non-Bayesian) models, which extract a set of features
to train a binary classification model. Al Hasan et al. also
presented another work on link prediction using supervised
learning [3], in which many features are identified and their
effectiveness is evaluated. They also compare different classes of
supervised learning algorithms in terms of their performance
metrics. This work addresses how to construct a dataset for a
machine learning algorithm. The selected features were based on
both node and structural attributes, resulting in improved
accuracy. They experimented on two co-authorship network datasets
using most of the well-known supervised algorithms and ranked the
features by importance. Their results suggest that a small set of
features can yield good performance.
According to Narang et al. [4], a link prediction
heuristic should take into account not only how close two
nodes are in a network, but also their ability to send and receive
information or to influence each other. This is determined by
the nature of the flow taking place on the network, i.e., the
process by which information is transmitted from one node to
another. How easily two nodes can interact with or influence each
other therefore depends on the nature of the flow that mediates
their interactions. They show that different types of flows
ultimately lead to different notions of network proximity. They
measure the performance of different heuristics on the missing
link prediction task in a variety of real-world social,
technological and biological networks, and show that heuristics
based on random-walk-type processes outperform the popular
Adamic-Adar and common neighbors heuristics in many networks.
While the newly defined heuristics did not beat existing ones in
the missing link prediction task, the work motivated these
heuristics in terms of a flow-based framework.
The effects of social influence and homophily were considered by
Gong et al. [5], who suggest that both network structure and node
attribute information should inform the tasks of link prediction
and node attribute inference. They used the SAN (Social-Attribute
Network) framework with several leading supervised and unsupervised
link prediction algorithms and demonstrated performance
improvement for each algorithm on both link prediction and
attribute inference. They made the novel observation that
attribute inference can help inform link prediction, i.e., link
prediction accuracy is improved by first inferring missing
attributes. They comprehensively evaluated these algorithms and
compared them with other existing algorithms using a novel,
large-scale Google+ dataset, which they made publicly available.
Another challenge in the link prediction problem is to effectively
combine the information from network structure with rich node and
edge attribute data. Backstrom and Leskovec [6] developed an
algorithm based on supervised random walks that combines
information from the network structure with node and edge attribute
information. The algorithm learns to assign strengths to edges in
the network so that a random walker is more likely to visit the
nodes to which new links will be created in the future. Their
approach outperformed state-of-the-art unsupervised approaches as
well as approaches based on feature extraction.
Previous works proposed in the literature for link
prediction, whether supervised or unsupervised, have used large
feature sets. An approach with a minimal number of features can
improve the performance of the algorithm. In this work, we propose
a supervised learning approach that uses a minimal number of
features for link prediction in social networks.
Support Vector Machine (SVM) is one of the
supervised learning approaches that can be applied for
prediction. LibSVM [7], a library for SVMs, was used for link
prediction in this work.
III. Proposed Method for Link Prediction
The system architecture of our approach for future
collaboration prediction in co-authorship networks is shown in
Fig. 1. The main tasks of this approach are as follows:
• Constructing an adjacency matrix from the dataset
• Extraction of features
• Feature Set Construction
• Building training model using SVM
• Testing the model
Fig.1. System Architecture of the proposed method for link prediction
A. Construction of adjacency matrix
Generally, the dataset contains information in the form
of edge pairs representing collaborations between authors.
The network in the problem space can be modelled as a graph
represented by an adjacency matrix in the solution space.
Formally, link prediction takes as input a partially observed
graph $G \in \{0, 1, ?\}^{n \times n}$, where 0 denotes a known
non-existing link, 1 denotes a known present link, and ?
denotes an unknown link. Our goal is to make predictions for
the unknown links.
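As a minimal sketch of this step, the following builds a symmetric 0/1 adjacency matrix from a list of edge pairs. The function name `adjacency_matrix` and the sample edge pairs are hypothetical; the paper does not specify its implementation.

```python
def adjacency_matrix(edge_pairs):
    """Build a symmetric 0/1 adjacency matrix from undirected edge pairs."""
    nodes = sorted({n for pair in edge_pairs for n in pair})
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for u, v in edge_pairs:
        i, j = index[u], index[v]
        matrix[i][j] = 1  # collaboration u - v
        matrix[j][i] = 1  # undirected graph, so mirror the entry
    return nodes, matrix

# Hypothetical edge pairs (author ids), for illustration only.
edge_pairs = [(1, 2), (1, 3), (2, 3), (3, 4)]
nodes, A = adjacency_matrix(edge_pairs)
```

For a large sparse network, an adjacency-set or sparse-matrix representation would likely be preferable to a dense list of lists.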
B. Feature Extraction
A multitude of topological features can be used for a pair
of nodes. In this paper, the features documented in [2] were
chosen for co-author relationship prediction.
1. Node Neighborhood based Features
Common neighbors. Common neighbors is a measure that
considers the intersection of the neighbor sets of two nodes u and v.
Using the size of the common neighbor set is an attestation to the
network transitivity property: as the number of common neighbors
increases, the likelihood that the two nodes will be linked also
increases.
Common neighbors $(u, v) = |\Gamma(u) \cap \Gamma(v)|$
$\Gamma(u)$ denotes the set of neighbors of node u.
$\Gamma(u) \cap \Gamma(v)$ denotes the set of common neighbors of nodes u and v.
$|\Gamma(u) \cap \Gamma(v)|$ denotes the cardinality of the set of common neighbors.
Jaccard's coefficient. Jaccard's coefficient is a normalized
measure of common neighbors. It calculates the ratio of
common neighbors to all neighbors of the two nodes, and can also
be used to compare the similarity and diversity of the neighbor sets.
Jaccard coefficient $(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$
Adamic/Adar. Adamic/Adar, a weighted version of common
neighbors, assigns greater weight to neighbors that are not
shared with many others. This means the contribution of a
common neighbor to the score is weighted in proportion to the
rarity of the neighbor.
Adamic/Adar $(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|}$
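The three neighborhood-based scores can be sketched directly from their definitions. The small graph below and the function names are hypothetical, and the adjacency-set representation is an assumption made for brevity.

```python
import math

# Hypothetical undirected co-authorship graph as adjacency sets.
graph = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"b"},
}

def common_neighbors(graph, u, v):
    """Number of neighbors shared by u and v."""
    return len(graph[u] & graph[v])

def jaccard(graph, u, v):
    """Shared neighbors divided by all neighbors of u or v."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0

def adamic_adar(graph, u, v):
    """Sum of 1 / log(degree) over the common neighbors of u and v."""
    # A common neighbor z is adjacent to both u and v, so its degree is at
    # least 2 and log(degree) is never zero in a simple graph.
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v])
```

For the pair ("a", "b"), the common neighbors are c and d, the union of the neighbor sets is {c, d, e}, and both common neighbors have degree 3.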
2. Vertex Feature Aggregation
Preferential attachment. Preferential attachment was
introduced to explain the power-law degree distribution in
complex real-world networks.
The preferential attachment concept is akin to the
well-known rich-get-richer model: a node with a higher degree
is more likely to acquire more links in the future, i.e., nodes
with higher degree attract more of the new links introduced to
the network.
Preferential attachment score $(u, v) = |\Gamma(u)| \cdot |\Gamma(v)|$,
i.e., the product of the degrees of u and v.
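As a brief illustration, the preferential attachment score is simply the product of the two node degrees. The graph and the function name below are hypothetical.

```python
# Hypothetical undirected graph as adjacency sets.
graph = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}

def preferential_attachment(graph, u, v):
    """Product of the degrees of u and v."""
    return len(graph[u]) * len(graph[v])

score = preferential_attachment(graph, "a", "b")  # degree 2 times degree 3
```

Unlike the neighborhood-based scores, this value needs only the two degrees, so it is cheap to compute even for distant pairs.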
3. Path based features
Shortest path. The fact that a friend of a friend can
become a friend suggests that the path distance between two
nodes in a social network can influence the formation of a link
between them. The shorter the distance, the more likely a link
is to form between the nodes.
Katz. It is a variant of shortest path distance, but generally
works better for link prediction. Katz defines a measure
that sums over all paths between two nodes, damped
exponentially by length so that short paths count more heavily.
Katz $(u, v) = \sum_{l=1}^{\infty} \beta^{l} \cdot |\text{paths}^{\langle l \rangle}_{u,v}|$
where $\text{paths}^{\langle l \rangle}_{u,v}$ is the set of all paths of length l from u to v.
Katz generally works much better than the shortest path since
it is based on the ensemble of all paths between the nodes u
and v. The parameter $\beta$ ($\le 1$) can be used to regularize this
feature. A small value of $\beta$ considers only the shorter paths,
in which case this feature behaves much like the features that are
based on the node neighborhood.
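Both path-based features can be sketched on an adjacency-set graph. Two assumptions are made here that the paper does not state: the Katz sum is truncated at a maximum length `max_len`, and "paths" are counted as walks (nodes may repeat), as in the usual matrix-power formulation. The function names and default values are hypothetical.

```python
from collections import deque

def shortest_path_length(graph, u, v):
    """Breadth-first search hop count from u to v, or None if unreachable."""
    seen, queue = {u}, deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == v:
            return dist
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None

def katz(graph, u, v, beta=0.05, max_len=5):
    """Truncated Katz score: sum over l of beta**l * (#walks of length l)."""
    counts = {u: 1}  # number of walks of the current length from u to each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for node, c in counts.items():
            for nb in graph[node]:
                nxt[nb] = nxt.get(nb, 0) + c
        counts = nxt
        score += (beta ** length) * counts.get(v, 0)
    return score

# Hypothetical path graph a - b - c.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

On this path graph, the only walk of length at most 3 from a to c has length 2, so with `beta=0.5` and `max_len=3` the score is 0.5 squared.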
C. Feature Set Construction
For link prediction, each data point corresponds to a pair of
vertices with a label denoting their link status (conceptually 1
if the link exists and 0 otherwise), so the chosen features should
represent some form of proximity between the pair of vertices.
In our feature set, the class labels used are -1 and +1, where -1
denotes the non-existence of a link and +1 denotes the
existence of a link. Once the features are calculated for all
node pairs in the graph, a feature vector consisting of the
feature-based scores and the class label is obtained for each
node pair. A sample feature set representation is shown in
Table 1.
TABLE 1: FEATURE VECTOR REPRESENTATION
The edge-pairs column contains the node pairs obtained
from the dataset. The remaining columns hold the class label
and the extracted feature values for each pair.
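Since LibSVM reads sparse text input of the form `<label> <index>:<value> ...` with 1-based feature indices, one row of such a feature set could be serialized as in the sketch below. The helper name, the feature order, and the sample values are hypothetical and not taken from Table 1.

```python
def to_libsvm_line(label, features):
    """Format one labelled feature vector as a LIBSVM-style text line."""
    parts = [f"{label:+d}"]  # class label written as +1 or -1
    parts += [f"{i}:{value:g}" for i, value in enumerate(features, start=1)]
    return " ".join(parts)

# Hypothetical rows: (edge pair, label, feature scores in a fixed order).
rows = [
    (("a", "b"), +1, [2, 0.5, 6]),
    (("a", "e"), -1, [0, 0.0, 2]),
]
lines = [to_libsvm_line(label, feats) for _, label, feats in rows]
```

The edge pair itself is not written to the file; it only identifies which row each prediction refers to.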
D. Building link predictive model
The constructed feature set was used to train the model.
A fraction of the feature vectors, 70% of the total, was used
for training. For our experiment, LibSVM [7] was used: the
feature set is provided as input to LibSVM, which outputs a
predictive model containing the attributes and other
information required for prediction.
E. Testing the model
The trained model obtained from SVM is then tested for its
performance. The remaining 30% of the feature vectors, held out
from training, were used for testing. SVM testing outputs a set
of predicted labels.
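A 70/30 partition of the feature vectors might be produced as in the following sketch. Shuffling before the split and the fixed seed are assumptions made here for reproducibility; the paper does not state how the rows were assigned to the two parts.

```python
import random

def split_70_30(rows, seed=0):
    """Shuffle the rows and return (training 70%, testing 30%)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # assumption: random assignment
    cut = int(0.7 * len(rows))
    return rows[:cut], rows[cut:]

# Hypothetical feature-vector ids standing in for real rows.
train_rows, test_rows = split_70_30(range(10))
```

The two parts are disjoint, so no test pair is seen during training.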
IV. Results
The observations were made on two datasets: synthetic data and
NetScience [8] data. Results were evaluated using four
performance metrics, namely Accuracy, Precision, Recall and
F1-score.
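For reference, the four metrics can be computed from true and predicted labels as sketched below, treating +1 (link exists) as the positive class. The function name and the sample label lists are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {-1, +1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels: 2 true positives, 1 each of TN, FP and FN.
metrics = evaluate([1, 1, 1, -1, -1], [1, 1, -1, 1, -1])
```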
I. Synthetic Data
A graph with 10 nodes was considered to determine the
performance of the proposed approach.
A. Characteristics of Synthetic dataset
The characteristics of the 10-node synthetic dataset are
as follows.
Characteristics           Values
Number of nodes           10
Number of edge pairs      16
TABLE 2: CHARACTERISTICS OF DATASET CONTAINING 10 NODES
Fig. 2. Graph of 10-node network
Fig. 2 shows the graph of the synthetic data. The link
prediction experiment was conducted on the synthetic data, and
the performance measures were calculated from the results
obtained.
Metrics       Values
Accuracy      93%
Precision     93%
Recall        93%
F1-score      0.9333
TABLE 3: PERFORMANCE MEASURE OBTAINED FOR SYNTHETIC DATA
Table 3 shows the performance metrics values obtained for the
synthetic data.
II. Real Dataset
In this experiment, the real dataset used was NetScience, a
co-authorship network of scientists working on network theory
and experiment, as compiled by M. Newman in May 2006 [8].
A. Characteristics of NetScience dataset
Characteristics     Values
Number of nodes     1588
Number of edges     2743
TABLE 4: CHARACTERISTICS OF NETSCIENCE DATA
The data was divided into several smaller datasets
based on the number of positive and negative class samples, and
three experiments were conducted.
To examine the effect of class imbalance, the ratio of
positive to negative samples was varied across the
experiments.
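One way to obtain a chosen number of negative samples is to draw unlinked node pairs at random, as in the sketch below. Uniform sampling, the fixed seed, and the function name are assumptions made for illustration; the paper does not describe how its negative samples were selected.

```python
import random
from itertools import combinations

def sample_negatives(nodes, positive_pairs, k, seed=0):
    """Draw k node pairs that are not linked (assumption: uniform sampling)."""
    positives = {frozenset(p) for p in positive_pairs}
    non_edges = [p for p in combinations(sorted(nodes), 2)
                 if frozenset(p) not in positives]
    return random.Random(seed).sample(non_edges, k)

# Hypothetical 5-node network with 2 existing links.
positive_pairs = [(0, 1), (1, 2)]
negatives = sample_negatives(range(5), positive_pairs, k=2)
```

Changing `k` relative to the number of positive pairs gives the balanced and imbalanced settings used in the three experiments.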
Experiment 1 was conducted with an equal number of positive
and negative samples. Table 5 shows the number of training and
test samples considered for Experiment 1.
Characteristics                Values
Number of positive samples     2743
Number of negative samples     2743
Number of training samples     3840
Number of testing samples      1646
TABLE 5: SAMPLE STATISTICS FOR EXPERIMENT 1
Experiment 2 was conducted with more positive than negative
samples. Table 6 shows the number of training and test samples
considered for Experiment 2.
Characteristics                Values
Number of positive samples     2743
Number of negative samples     1371
Number of training samples     2879
Number of testing samples      1235
TABLE 6: SAMPLE STATISTICS FOR EXPERIMENT 2
Experiment 3 was conducted with fewer positive than negative
samples. Table 7 shows the number of training and test samples
considered for Experiment 3.
Characteristics                Values
Number of positive samples     2743
Number of negative samples     5486
Number of training samples     5760
Number of testing samples      2469
TABLE 7: SAMPLE STATISTICS FOR EXPERIMENT 3
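The training and testing counts reported for the three experiments are consistent with a 70/30 split of the combined positive and negative samples, with the training count rounded down. The rounding rule is an inference from the reported numbers rather than something the paper states; the short check below reproduces the counts under that assumption.

```python
# (positive, negative) sample counts reported for Experiments 1 to 3.
experiments = {
    1: (2743, 2743),
    2: (2743, 1371),
    3: (2743, 5486),
}

def split_sizes(positives, negatives):
    """Return (train, test) sizes for a 70/30 split, training rounded down."""
    total = positives + negatives
    train = (7 * total) // 10  # integer arithmetic avoids float rounding
    return train, total - train

sizes = {k: split_sizes(*v) for k, v in experiments.items()}
```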
Table 8 shows the performance metric values obtained for
the three experiments on NetScience data.
Metrics           Experiment 1    Experiment 2    Experiment 3
Accuracy (%)      99.6            99.6            99.8
Precision (%)     99.6            99.5            99.5
Recall (%)        99.7            99              99
F1-Score (%)      99.7            99.2            99.2
TABLE 8: PERFORMANCE MEASURE OBTAINED FOR NETSCIENCE DATA
The supervised SVM algorithm performed well in co-author
relationship prediction with a limited number of features.
In terms of all four evaluation measures, collaborations were
easier to predict for authors with a higher degree of
collaboration than for less productive authors.
V. Conclusion
In this work, the classical problem of link prediction was
considered: predicting which edges, given a snapshot of a social
network, are most likely to appear in the future. There have been
numerous research attempts to address the problem of link
prediction using supervised learning methods. However, the
knowledge gained was not sufficient for accurate link prediction.
Links, or future collaborations, can be predicted accurately by
selecting appropriate features that extract network-related
information. The features selected for this work included node
neighborhood, vertex aggregation, and path-based features. In
this work, an approach using a limited feature set was proposed
for building a link prediction model in co-authorship networks.
The proposed approach was tested on synthetic data and real
network data.
In conclusion, the selection of appropriate features was
helpful in predicting links with better results. We observed
that the samples selected for testing must be balanced, with an
appropriate number of positive and negative samples.
The proposed approach could be used to identify latent yet
potentially successful collaborations, which
would facilitate the development of research collaborations.
References
[1] Liben-Nowell, David, and Jon Kleinberg. "The link prediction problem for social networks." Journal of the American Society for Information Science and Technology 58.7 (2007): 1019-1031.
[2] Al Hasan, Mohammad, and Mohammed J. Zaki. "A survey of link prediction in social networks." Social Network Data Analytics. Springer US, 2011. 243-275.
[3] Al Hasan, Mohammad, et al. "Link prediction using supervised learning." SDM'06: Workshop on Link Analysis, Counter-terrorism and Security. 2006.
[4] Narang, Kanika, Kristina Lerman, and Ponnurangam Kumaraguru. "Network flows and the link prediction problem." Proceedings of the 7th Workshop on Social Network Mining and Analysis. ACM, 2013.
[5] Gong, Neil Zhenqiang, et al. "Jointly predicting links and inferring attributes using a social-attribute network (SAN)." arXiv preprint arXiv:1112.3265 (2011).
[6] Backstrom, Lars, and Jure Leskovec. "Supervised random walks: predicting and recommending links in social networks." Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM, 2011.
[7] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[8] NetScience dataset: M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).