
Future Collaboration Prediction in Co-authorship Network

Roopashree N

Post Graduate Student

Department of CSE, BMS College of Engineering

Bangalore, India

[email protected]

Umadevi V

Associate Professor

Department of CSE, BMS College of Engineering

Bangalore, India

[email protected]

Abstract: Social networking is in widespread use in the present era. Co-authorship networks, which show research collaborations, are an important class of social networks. Research collaborations often yield good results, but organizing a research group is a tedious task. Every researcher wants to collaborate with the best expertise complementing their own. Although abundant research has been conducted to find future collaborators or links, very little of it is able to find effective relationships among them. In this article, we propose a method that makes link predictions in co-authorship networks using a supervised approach. The model extracts features from the network's node and topological structure which can be good indicators of future collaborations. The proposed method was evaluated on a synthetic network as well as on a real social network, NetScience. Our experiments demonstrated the efficiency of the method.

Keywords: SVM; Co-authorship network; Future collaboration

I. Introduction

Social Network Analysis (SNA) has evolved as one of the key research areas and has attracted a considerable amount of attention in recent years. It mainly focuses on relationships between individuals, also referred to as social entities. A social network can be defined as a network of interactions whose nodes represent people or other entities embedded in a social context, and whose edges signify the interaction, collaboration, or influence between entities, driven by mutual interests intrinsic to the group.

In general, social networks are extremely rich in content, and they contain a very large amount of linkage data which can be leveraged for analysis. The linkage data constitutes the graph structure of the social network. The availability of massive amounts of data has given a new impetus towards a scientific and statistically robust study in the field of social networks. This data-centric thrust has led to a significant amount of research, which has been unique in its statistical and computational focus in analyzing large amounts of online social network data.

SNA helps to identify highly peripheral people who essentially represent untapped expertise and, thus, underutilized resources for the group.

Social networks are highly dynamic; they grow and change quickly over time through the addition of new edges, signifying the appearance of new connections in the underlying social structure. The mechanisms that drive this volatility raise a fundamental and complex question that is still not well understood. However, an important problem that can be addressed is to predict future associations and the factors driving those associations. This problem is known as link prediction, which is a key research direction within social network analysis.

Prediction of imminent links in a co-authorship graph is an important research direction, since it is conceptually and structurally identical to the realistic social network problem in which the scientists in a community interact to achieve a common goal.

More formally, the link prediction task can be formulated as follows (based upon the definition in Liben-Nowell and Kleinberg [1]): Given a social network $G(V, E)$ in which an edge $e = (u, v) \in E$ represents some form of interaction between its endpoints at a particular time $t(e)$. Multiple interactions can be recorded by parallel edges or by using a complex timestamp for an edge. For times $t \leq t'$, we assume that $G[t, t']$ denotes the subgraph of $G$ restricted to the edges with time-stamps between $t$ and $t'$. In a supervised training setup for link prediction, we can choose a training interval $[t_0, t_0']$ and a test interval $[t_1, t_1']$ where $t_0' < t_1$. The link prediction task is then to output a list of edges not present in $G[t_0, t_0']$ which are predicted to appear in the network $G[t_1, t_1']$.
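The training and test intervals in this formulation can be illustrated with a minimal Python sketch. It assumes edges are stored as `(author_u, author_v, timestamp)` triples with integer years as timestamps; this layout, the helper names `subgraph` and `new_links`, and the toy data are hypothetical and not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str, int]  # (author_u, author_v, timestamp); assumed layout


def subgraph(edges: Iterable[Edge], start: int, end: int) -> Set[Tuple[str, str]]:
    """Undirected edge set of the subgraph restricted to [start, end] (inclusive)."""
    return {tuple(sorted((u, v))) for u, v, t in edges if start <= t <= end}


def new_links(edges: List[Edge], t0: int, t0_end: int,
              t1: int, t1_end: int) -> Set[Tuple[str, str]]:
    """Edges present in the test interval but absent from the training interval."""
    if not t0_end < t1:
        raise ValueError("training interval must end before the test interval starts")
    return subgraph(edges, t1, t1_end) - subgraph(edges, t0, t0_end)


# Toy data: a-b and b-c collaborate early; a-c first appears in the test interval.
toy_edges = [("a", "b", 2001), ("b", "c", 2002), ("a", "c", 2004), ("b", "a", 2005)]
targets = new_links(toy_edges, 2001, 2003, 2004, 2005)
print(targets)  # the pairs a link predictor should ideally rank highly
```

In this toy case only the pair (a, c) is a new link, because a and b had already collaborated during the training interval.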

The prime goal of this work is to propose an efficient technique for designing a link predictor in networks where nodes represent researchers and links represent collaborations. This goal raises the following challenges:

• To extract the node and topological features and use them in combination to obtain a better result.

• Link prediction datasets are characterized by a large imbalance in the distribution of class labels, i.e., the number of existing edges is often much smaller than the number of edges known not to exist.

• It is vital that the proposed method remains computationally efficient when scaled to a large network consisting of a large number of nodes and edges.

2014 3rd International Conference on Eco-friendly Computing and Communication Systems

978-1-4799-7002-5/14 $31.00 © 2014 IEEE

DOI 10.1109/ICECCS.2014.45


II. Related Work

A. Background

The earliest and most basic link prediction model, proposed by Liben-Nowell and Kleinberg [1], works explicitly on a social network. Every vertex in the graph represents a person and an edge between two vertices represents the interaction between the persons. Multiplicity of interactions can be modelled explicitly by allowing parallel edges or by adopting a suitable weighting scheme for the edges. The learning paradigm in this setup typically extracts the similarity between a pair of vertices by various graph-based similarity metrics and uses the ranking on the similarity scores to predict the link between two vertices. They concentrate mostly on the performance of various graph-based similarity metrics for the link prediction task.

Recent methods and techniques were surveyed by Al Hasan and Zaki [2], covering a variety of link prediction techniques ranging from feature-based classification and kernel-based methods to matrix factorization and probabilistic graphical models. These methods vary with respect to model complexity, prediction performance, scalability, and generalization ability. The survey considers the traditional (non-Bayesian) models, which extract a set of features to train a binary classification model.

Al Hasan et al. also presented a work on link prediction using supervised learning [3], in which many features were identified, computed, and assessed for effectiveness. They also compare different classes of supervised learning algorithms in terms of their performance metrics, and describe how to construct a dataset for a machine learning algorithm. The selected features were based on both node and structural attributes, which resulted in improved accuracy. They experimented on two co-authorship network datasets using most of the well-known supervised algorithms and ranked the features. A small set of features is known to yield better performance results.

According to Narang et al. [4], a link prediction heuristic should take into account not only how close two nodes are in a network, but also their ability to send and receive information or to influence each other. This is determined by the nature of the flow taking place on the network, i.e., the process by which information is transmitted from one node to another. How easily two nodes can interact with or influence each other therefore also depends on the nature of the flow that mediates their interactions. They show that different types of flows ultimately lead to different notions of network proximity. They measure the performance of different heuristics on the missing link prediction task in a variety of real-world social, technological and biological networks, and show that heuristics based on random walk-type processes outperform the popular Adamic-Adar and number-of-common-neighbors heuristics in many networks. While the newly defined heuristics did not beat existing ones in the missing link prediction task, the work motivated these heuristics in terms of a flow-based framework.

The effects of social influence and homophily, considered by Gong et al. [5], suggest that both network structure and node attribute information should inform the tasks of link prediction and node attribute inference. They used the SAN (Social-Attribute Network) framework with several leading supervised and unsupervised link prediction algorithms and demonstrated performance improvement for each algorithm on both link prediction and attribute inference. They made the novel observation that attribute inference can help inform link prediction, i.e., link prediction accuracy is improved by first inferring missing attributes. They comprehensively evaluated these algorithms and compared them with other existing algorithms using a novel, large-scale Google+ dataset, which they made publicly available.

Another challenge in the link prediction problem is to combine effectively the information from network structure with rich node and edge attribute data. Backstrom and Leskovec [6] developed an algorithm based on supervised random walks that combines the information from the network with edge attribute information. The algorithm assigns strengths to edges in the network such that the random walker is more likely to visit the nodes to which new links will be created in the future. Their approach outperformed state-of-the-art unsupervised approaches as well as approaches based on feature extraction.

Previous works on link prediction in the literature, whether supervised or unsupervised, have used large feature sets. An approach with a minimal number of features is expected to improve the performance of the algorithm. In this work we propose a supervised learning approach which uses a minimal number of features for link prediction in social networks.

Support Vector Machine (SVM) is one of the supervised learning approaches that can be applied for prediction. LibSVM [7], a library for SVM, was used for link prediction in this work.

III. Proposed Method for Link Prediction

The system architecture of our approach for future collaboration prediction in co-authorship networks is shown in Fig. 1. The main tasks of this approach are as follows:

• Constructing an adjacency matrix from the dataset

• Extraction of features

• Feature Set Construction

• Building training model using SVM

• Testing the model


Fig.1. System Architecture of the proposed method for link prediction

A. Construction of adjacency matrix

Generally the dataset contains information in the form of edge-pairs representing the collaboration between authors. The network in the problem space can be modelled as a graph, represented as an adjacency matrix in the solution space.

Formally, the input to link prediction is a partially observed graph $G \in \{0, 1, ?\}^{n \times n}$ for a network of $n$ nodes, where 0 denotes a known non-existing link, 1 denotes a known present link, and ? denotes an unknown link. Our goal is to make predictions for the unknown links.
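As an illustration of this step, the following Python sketch builds a symmetric 0/1 adjacency matrix from a list of edge-pairs. The function name `build_adjacency` and the toy author labels are hypothetical. The sketch also assumes that every pair not listed in the input is stored as 0, so the distinction between known non-links and unknown links is not modelled here.

```python
from typing import Dict, List, Tuple


def build_adjacency(edge_pairs: List[Tuple[str, str]]) -> Tuple[List[str], List[List[int]]]:
    """Build a symmetric 0/1 adjacency matrix from undirected edge pairs."""
    nodes = sorted({n for pair in edge_pairs for n in pair})
    index: Dict[str, int] = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edge_pairs:
        if u == v:
            continue  # skip self-loops: an author is not their own co-author
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1  # co-authorship is undirected
    return nodes, matrix


# Toy chain of collaborations a-b, b-c, c-d.
nodes, adj = build_adjacency([("a", "b"), ("b", "c"), ("c", "d")])
for name, row in zip(nodes, adj):
    print(name, row)
```

A dense list-of-lists matrix is only practical for small graphs; for a network with thousands of nodes a sparse representation would likely be preferable.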

B. Feature Extraction

A multitude of topological features can be used for a pair

of nodes. In this paper, the features documented in [2] were

chosen for co-author relationship prediction.

1. Node Neighborhood based Features

Common neighbors. Common neighbors is a measure that considers the intersection of the neighbor sets of two nodes u and v. Using the size of the set of common neighbors is an attestation to the network transitivity property: as the number of common neighbors increases, the likelihood that the two nodes will be linked becomes higher.

$\text{Common neighbors}(u, v) = |\Gamma(u) \cap \Gamma(v)|$

$\Gamma(u)$ denotes the set of neighbors of node u.

$\Gamma(u) \cap \Gamma(v)$ denotes the set of common neighbors of nodes u and v.

$|\Gamma(u) \cap \Gamma(v)|$ denotes the cardinality of the set of common neighbors.

Jaccard's coefficient. Jaccard's coefficient is a normalized measure of common neighbors. It calculates the ratio of common neighbors to all the neighbors of the two nodes, and can also be used to compare the similarity and diversity of neighbor sets.

$\text{Jaccard coefficient}(u, v) = \dfrac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$

Adamic/Adar. Adamic/Adar, a weighted version of common neighbors, assigns greater weight to neighbors that are not shared with many others. This means the contribution of a common neighbor to the score is weighted in proportion to the rarity of the neighbor.

$\text{Adamic/Adar}(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \dfrac{1}{\log |\Gamma(z)|}$

where $z$ ranges over the common neighbors of u and v.
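The three neighborhood-based scores can be sketched in a few lines of Python. The sketch assumes the graph is stored as a dictionary mapping each node to its set of neighbors; the type alias `Graph`, the function names, and the toy graph are illustrative and not part of the paper's implementation.

```python
import math
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # node -> set of neighbors (undirected); assumed layout


def common_neighbors(g: Graph, u: str, v: str) -> int:
    """Number of neighbors shared by u and v."""
    return len(g[u] & g[v])


def jaccard(g: Graph, u: str, v: str) -> float:
    """Shared neighbors divided by all neighbors of either node."""
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0


def adamic_adar(g: Graph, u: str, v: str) -> float:
    """Sum of 1 / log(degree) over shared neighbors; rare neighbors weigh more."""
    # For distinct u and v, a shared neighbor has degree >= 2, so log(degree) > 0.
    return sum(1.0 / math.log(len(g[z])) for z in g[u] & g[v])


# Toy graph with edges a-c, a-d, b-c, b-d, b-e, d-e.
toy: Graph = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
print(common_neighbors(toy, "a", "b"))
print(jaccard(toy, "a", "b"))
print(adamic_adar(toy, "a", "b"))
```

For the pair (a, b), the shared neighbors are c (degree 2) and d (degree 3), so c contributes more to the Adamic/Adar score than d.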

2. Vertex Feature Aggregation

Preferential attachment. Preferential attachment was introduced to explain the power-law degree distribution in complex real-world networks.

The preferential attachment concept is akin to the well-known rich-get-richer model: a node with a higher degree is more likely to gain links in the future, i.e., nodes with higher degree attract more of the links that are newly introduced to the network.

$\text{Preferential attachment score}(u, v) = |\Gamma(u)| \cdot |\Gamma(v)|$
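A minimal Python sketch of this score, again assuming a hypothetical dictionary-of-neighbor-sets representation and a made-up star-shaped toy graph:

```python
from typing import Dict, Set


def preferential_attachment(g: Dict[str, Set[str]], u: str, v: str) -> int:
    """Product of the degrees of u and v."""
    return len(g[u]) * len(g[v])


# Star graph: "a" is a hub connected to b, c and d.
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
low = preferential_attachment(star, "b", "c")  # two degree-1 nodes
hub = preferential_attachment(star, "a", "b")  # pair involving the hub
print(low, hub)
```

Pairs that involve the hub receive a higher score than pairs of low-degree nodes, which reflects the rich-get-richer intuition.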

3. Path based features

Shortest path. The fact that the friends of a friend can become a friend suggests that the path distance between two nodes in a social network can influence the formation of a link between them. The shorter the distance, the more likely a link is to form between the nodes.

Katz. Katz is a variant of shortest path distance, but works better than the former for link prediction. It defines a measure that sums over all paths between two nodes, damped exponentially by length so that short paths count more heavily.

$\text{Katz}(u, v) = \sum_{l=1}^{\infty} \beta^{l} \cdot |\text{paths}^{\langle l \rangle}_{u,v}|$

where $\text{paths}^{\langle l \rangle}_{u,v}$ is the set of all paths of length $l$ from u to v. Katz generally works much better than the shortest path since it is based on the ensemble of all paths between the nodes u and v. The parameter $\beta$ ($\leq 1$) can be used to regularize this feature. A small value of $\beta$ considers only the shorter paths, in which case this feature behaves very much like the features based on the node neighborhood.
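Both path-based features can be sketched in Python as follows. The shortest path uses breadth-first search. The Katz score is approximated by truncating the infinite sum at a maximum length and by counting walks (paths that may revisit nodes), which is how the measure is commonly computed. The truncation limit `max_len`, the default `beta`, and all function names are assumptions made for this illustration, not values from the paper.

```python
from collections import deque
from typing import Dict, Optional, Set

Graph = Dict[str, Set[str]]  # node -> set of neighbors (undirected); assumed layout


def shortest_path_length(g: Graph, u: str, v: str) -> Optional[int]:
    """Breadth-first search; returns None when v cannot be reached from u."""
    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == v:
            return dist
        for nxt in g[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def katz(g: Graph, u: str, v: str, beta: float = 0.05, max_len: int = 5) -> float:
    """Truncated Katz: sum over l = 1..max_len of beta**l * (walks of length l)."""
    counts: Dict[str, int] = {u: 1}  # counts[x] = walks of the current length from u to x
    score = 0.0
    for length in range(1, max_len + 1):
        step: Dict[str, int] = {}
        for node, c in counts.items():
            for nxt in g[node]:
                step[nxt] = step.get(nxt, 0) + c
        counts = step
        score += (beta ** length) * counts.get(v, 0)
    return score


# Toy path graph a - b - c, plus an isolated node x.
path: Graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "x": set()}
print(shortest_path_length(path, "a", "c"))
print(katz(path, "a", "c", beta=0.5, max_len=2))
```

In the toy graph, a and c are two hops apart and there is exactly one walk of length 2 between them, so the truncated Katz score with beta = 0.5 is 0.5 squared.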


C. Feature Set Construction

For link prediction, each data point corresponds to a pair of vertices with a label denoting their link status, i.e., whether a link exists or not, so the chosen features should represent some form of proximity between the pair of vertices.

The class labels used are -1 and +1, where -1 denotes the non-existence of a link and +1 denotes the existence of a link. Once the features are calculated for all node pairs in the graph, a feature vector consisting of all the feature-based scores and the class label is obtained for each node pair. A sample feature set representation is shown in Table 1.

TABLE 1: FEATURE VECTOR REPRESENTATION

The edge-pairs column contains the edge pairs obtained from the dataset. The remaining columns hold the class label and the extracted feature values for each edge pair.
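A labeled feature vector of this kind can be written in the sparse text format that LIBSVM reads, where each line is a label followed by ascending `index:value` pairs starting at index 1. The sketch below is illustrative only: the helper name `to_libsvm_line`, the feature order, and the feature values are made up and do not come from the paper.

```python
from typing import Sequence


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Format one labeled feature vector as '<label> 1:<v1> 2:<v2> ...'."""
    if label not in (-1, 1):
        raise ValueError("label must be -1 (no link) or +1 (link)")
    sign = "+1" if label == 1 else "-1"
    parts = [f"{i}:{value:g}" for i, value in enumerate(features, start=1)]
    return " ".join([sign] + parts)


# Hypothetical feature order: common neighbors, Jaccard, Adamic/Adar,
# preferential attachment, shortest path, Katz. The values are invented.
line = to_libsvm_line(+1, [2, 0.5, 1.25, 6, 2, 0.0025])
print(line)
```

Note that the `g` format keeps about six significant digits, which is a simplification chosen for readability.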

D. Building link predictive model

The constructed feature set was used to train the model. A fraction of the feature vectors, i.e., 70% of all feature vectors, was used for training. The feature set is input to the SVM function in order to obtain the prediction model.

For our experiment, LibSVM [7] was used to train and obtain a prediction model. The constructed feature set is provided as input to LibSVM, which outputs a predictive model containing the attributes and other information required for prediction.

E. Testing the model

The trained model obtained from SVM is then tested for its performance. The remaining fraction of the feature vectors, i.e., the 30% held back from training, was used for testing. SVM testing outputs a set of predicted labels.
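The 70%/30% division of feature vectors can be sketched as below. Whether the samples were shuffled before splitting is not stated in the paper, so the seeded shuffle and the helper name `split_70_30` are assumptions made for this illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_70_30(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and cut them into 70% training and 30% testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = (len(shuffled) * 7) // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# 5486 placeholder samples, e.g. 2743 positive and 2743 negative pairs.
train_part, held_out = split_70_30(range(5486))
print(len(train_part), len(held_out))
```

With 5486 samples this cut gives 3840 training and 1646 testing samples, which agrees with the sample counts reported for Experiment 1.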

IV. Results

The observations were made on two datasets: synthetic data and NetScience [8] data. Results were evaluated by four performance metrics, namely Accuracy, Precision, Recall and F1-score.
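The four metrics follow from the counts of true and false positives and negatives. The sketch below uses the standard definitions and assumes labels in {-1, +1} with +1 as the positive class; the function name `evaluate` and the toy label vectors are illustrative.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {-1, +1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
scores = evaluate([1, 1, 1, -1, -1, -1, -1, 1], [1, 1, -1, -1, -1, 1, -1, 1])
print(scores)
```

On heavily imbalanced link prediction data, accuracy alone can look high even for a poor predictor, which is one reason to report precision, recall and F1 alongside it.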

I. Synthetic Data

A graph with 10 nodes was considered to determine the performance of the proposed approach.

A. Characteristics of Synthetic dataset

Characteristics of the synthetic 10-node network are as follows.

Number of nodes          10
Number of edge-pairs     16

TABLE 2: CHARACTERISTICS OF DATASET CONTAINING 10 NODES

Fig. 2. Graph of 10-node network

Fig. 2 shows the graph of the synthetic data. The link prediction experiment was conducted on the synthetic data and the results were observed. Based on the results, the performance measures were calculated.

Metrics      Values
Accuracy     93%
Precision    93%
Recall       93%
F1-score     0.9333

TABLE 3: PERFORMANCE MEASURE OBTAINED FOR SYNTHETIC DATA

Table 3 shows the performance metrics values obtained for the

synthetic data.

II. Real-time Dataset

In this experiment, the real dataset used was NetScience, a co-authorship network of scientists working on network theory and experiment, as compiled by M. Newman in May 2006 [8].


A. Characteristics of NetScience dataset

Characteristics      Values
Number of nodes      1588
Number of edges      2743

TABLE 4: CHARACTERISTICS OF NETSCIENCE DATA

The data was divided into several smaller datasets based on the number of positive and negative classes, and three experiments were conducted.

In order to check for the effect of class imbalance, a small variation in the number of positive and negative classes is considered.

Experiment 1 was conducted with an equal number of positive and negative classes. Table 5 shows the number of training and test samples considered for Experiment 1.


Number of Positive classes     2743
Number of Negative classes     2743
Number of Training data        3840
Number of Testing data         1646

TABLE 5: SAMPLE STATISTICS FOR EXPERIMENT-1

Experiment 2 was conducted with a larger number of positive classes and a smaller number of negative classes. Table 6 shows the number of training and test samples considered for Experiment 2.





Number of Testing data 1235

TABLE 6: SAMPLE STATISTICS FOR EXPERIMENT-2

Experiment 3 was conducted with a smaller number of positive classes and a larger number of negative classes. Table 7 shows the number of training and test samples considered for Experiment 3.





Number of Testing data 2469

TABLE 7: SAMPLE STATISTICS FOR EXPERIMENT-3

Table 8 shows the performance metric values obtained for the three experiments on NetScience data.

Metrics          Experiment 1   Experiment 2   Experiment 3
Accuracy (%)     99.6           99.6           99.8
Precision (%)    99.6           99.5           99.5
Recall (%)       99.7           99             99
F1-Score (%)     99.7           99.2           99.2

TABLE 8: PERFORMANCE MEASURE OBTAINED FOR NETSCIENCE DATA

The supervised algorithm SVM performed well in co-author relationship prediction with a limited number of features.

Collaborations were easier to predict for authors with a higher degree of collaboration than for less productive authors, in terms of all four evaluation measures.

V. Conclusion

In this work, the classical problem of link prediction was considered, in which we predict the edges that are most likely to appear in the future, given a snapshot of a social network. There have been numerous research attempts to address the problem of link prediction using supervised learning methods. However, the knowledge gained was not sufficient for accurate link prediction. Links, or future collaborations, can be predicted accurately by selecting appropriate features which extract network-related information. The features selected for this work included node and vertex features. In this work, an approach using a limited feature set was proposed for building a link prediction model in co-authorship networks. The proposed approach was tested on synthetic data and real network data.


In conclusion, it is emphasized that the selection of appropriate features was helpful in predicting links with better results. We observed that the samples selected for testing must be balanced, with an appropriate number of positive and negative classes. The proposed approach could be used to identify latent yet potentially successful collaborations, which would facilitate the development of research collaborations.

References

[1] Liben-Nowell, David, and Jon Kleinberg. "The link prediction problem for social networks." Journal of the American Society for Information Science and Technology 58.7 (2007): 1019-1031.

[2] Al Hasan, Mohammad, and Mohammed J. Zaki. "A survey of link prediction in social networks." Social Network Data Analytics. Springer US, 2011. 243-275.

[3] Al Hasan, Mohammad, et al. "Link prediction using supervised learning." SDM'06: Workshop on Link Analysis, Counter-terrorism and Security. 2006.

[4] Narang, Kanika, Kristina Lerman, and Ponnurangam Kumaraguru. "Network flows and the link prediction problem." Proceedings of the 7th Workshop on Social Network Mining and Analysis. ACM, 2013.

[5] Gong, Neil Zhenqiang, et al. "Jointly predicting links and inferring attributes using a social-attribute network (SAN)." arXiv preprint arXiv:1112.3265 (2011).

[6] Backstrom, Lars, and Jure Leskovec. "Supervised random walks: predicting and recommending links in social networks." Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM, 2011.

[7] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

[8] NetScience dataset: M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).
