
    Future Collaboration Prediction in Co-authorship Network

    Roopashree N

    Post Graduate Student

    Department of CSE, BMS College of Engineering

    Bangalore, India

    [email protected]

    Umadevi V

    Associate Professor

    Department of CSE, BMS College of Engineering

    Bangalore, India

    [email protected]

     Abstract  — Social networking is now in widespread use. Co-authorship networks, which capture research collaborations, are an important class of social networks. Research collaborations often yield good results, but organizing a research group is a tedious task. Every researcher wants to collaborate with experts whose skills complement their own. Although abundant research has been conducted to find future collaborators or links, few methods are able to find effective relationships among them. In this article, we propose a method that makes link predictions in co-authorship networks using a supervised approach. The model extracts features from the network's node and topological structure that can be good indicators of future collaborations. The proposed method was evaluated on a synthetic network as well as a real social network, NetScience. Our experiments demonstrated the efficiency of the method.

     Keywords—SVM; Co-authorship network; Future

    collaboration

    I.  Introduction

    Social Network Analysis (SNA) has evolved into one of the key research areas and has attracted a considerable amount of attention in recent years. It mainly focuses on relationships between individuals, also referred to as social entities. A social network can be defined as a network of interactions whose nodes represent people or other entities embedded in a social context, and whose edges signify the interaction, collaboration, or influence between entities, driven by mutual interests intrinsic to the group.

    In general, social networks are extremely rich in content, and they contain a very large amount of linkage data which can be leveraged for analysis. The linkage data constitutes the graph structure of the social network. The availability of massive amounts of data has given a new impetus towards a scientific and statistically robust study in the field of social networks. This data-centric thrust has led to a significant amount of research, which has been unique in its statistical and computational focus in analyzing large amounts of online social network data.

    SNA helps to identify highly peripheral people, who essentially represent untapped expertise and thus underutilized resources for the group.

    Social networks are highly dynamic; they grow and change quickly over time through the addition of new edges, signifying the appearance of new connections in the underlying social structure. Understanding the mechanisms that drive this volatility is a fundamental and complex question that is still not well understood. However, an important problem that can be addressed is to predict future associations and the factors driving those associations. This problem is known as link prediction, a key research direction within social network analysis.

    Prediction of imminent links in a co-authorship graph is an important research direction, since it is conceptually and structurally identical to the realistic social network problem in which the scientists in a community interact to achieve a common goal.

    More formally, the link prediction task can be formulated as follows (based upon the definition in Liben-Nowell and Kleinberg [1]): We are given a social network $G = (V, E)$ in which an edge $e = (u, v) \in E$ represents some form of interaction between its endpoints at a particular time $t(e)$. Multiple interactions can be recorded by parallel edges or by using a complex timestamp for an edge. For times $t \le t'$, we assume that $G[t, t']$ denotes the subgraph of $G$ restricted to the edges with timestamps between $t$ and $t'$. In a supervised training setup for link prediction, we can choose a training interval $[t_0, t_0']$ and a test interval $[t_1, t_1']$ where $t_0' < t_1$. The link prediction task is then to output a list of edges not present in $G[t_0, t_0']$ that are predicted to appear in the network $G[t_1, t_1']$.
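    As an illustration of this formulation, the following minimal sketch computes the set of edges that a predictor would ideally output: those absent from the training interval but present in the test interval. The `(u, v, timestamp)` record layout, the function names, and the toy data are assumptions made for this example and do not come from the paper.

```python
def edges_in_interval(edges, start, end):
    """Undirected edge set of the subgraph restricted to [start, end]."""
    return {
        frozenset((u, v))
        for u, v, t in edges
        if start <= t <= end and u != v
    }


def new_links(edges, train, test):
    """Edges absent from the training interval that appear in the test interval."""
    g_train = edges_in_interval(edges, *train)
    g_test = edges_in_interval(edges, *test)
    return g_test - g_train


# Hypothetical co-authorship records: (author, author, year).
toy = [
    ("a", "b", 2001),
    ("b", "c", 2002),
    ("a", "c", 2004),
    ("a", "b", 2005),  # repeat collaboration, already known from training
    ("c", "d", 2005),
]
future = new_links(toy, train=(2001, 2003), test=(2004, 2005))
```

    Here `future` contains the pairs a-c and c-d; the repeated a-b collaboration is excluded because it already exists in the training interval.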

    The prime goal of this work is to propose an efficient technique for designing a link predictor in networks where nodes represent researchers and links represent collaborations. This goal poses the following challenges:

    •  Extracting the node and topological features and using them in combination to obtain a better result.

    •  Link prediction datasets are characterized by a large imbalance in the distribution of class labels, i.e., the number of existing edges is often less than the number of edges known not to exist.

    •  The proposed method must compute efficiently when scaled to a large network consisting of a large number of nodes and edges.
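    The scale of the class imbalance can be checked with simple arithmetic. The sketch below uses the NetScience node and edge counts reported in Table 4 and assumes a simple undirected graph without self-loops; the function name is hypothetical.

```python
def pair_counts(num_nodes, num_edges):
    """Return (all unordered node pairs, pairs without an edge)."""
    total_pairs = num_nodes * (num_nodes - 1) // 2
    return total_pairs, total_pairs - num_edges


# Counts from Table 4: 1588 nodes, 2743 edges.
total, negatives = pair_counts(1588, 2743)
ratio = negatives / 2743  # non-edges per existing edge
```

    Under these assumptions there are 1,260,078 candidate pairs, of which 1,257,335 have no edge, i.e. roughly 458 negative pairs for every positive one.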

    2014 3rd International Conference on Eco-friendly Computing and Communication Systems

    978-1-4799-7002-5/14 $31.00 © 2014 IEEE

    DOI 10.1109/ICECCS.2014.45


    II.  Related Work

     A.  Background

    The earliest and most basic link prediction model, proposed by Liben-Nowell and Kleinberg [1], works explicitly on a social network. Every vertex in the graph represents a person, and an edge between two vertices represents an interaction between the persons. Multiplicity of interactions can be modelled explicitly by allowing parallel edges or by adopting a suitable weighting scheme for the edges. The learning paradigm in this setup typically extracts the similarity between a pair of vertices using various graph-based similarity metrics and uses the ranking of the similarity scores to predict the link between two vertices. They concentrate mostly on the performance of various graph-based similarity metrics for the link prediction task.

    Recent methods and techniques were surveyed by Al Hasan and Zaki [2]. The survey covers a variety of link prediction techniques, ranging from feature-based classification and kernel-based methods to matrix factorization and probabilistic graphical models. These methods vary with respect to model complexity, prediction performance, scalability, and generalization ability. The survey considers the traditional (non-Bayesian) models, which extract a set of features to train a binary classification model. Al Hasan et al. also presented a work on link prediction using supervised learning [3], in which many features were identified and their effectiveness was evaluated. They compare different classes of supervised learning algorithms in terms of their performance metrics and describe how to construct a dataset for a machine learning algorithm. The selected features were based on both node and structural attributes, which improved accuracy. They experimented on two co-authorship network datasets using most of the well-known supervised algorithms and ranked the features. Their results suggest that a small set of features can yield better performance.

    According to Narang et al. [4], a link prediction heuristic should take into account not only how close two nodes are in a network, but also their ability to send and receive information or to influence each other. This is determined by the nature of the flow taking place on the network, i.e., the process by which information is transmitted from one node to another. How easily two nodes can interact with or influence each other therefore depends on the nature of the flow that mediates their interactions. They show that different types of flows ultimately lead to different notions of network proximity. They measure the performance of different heuristics on the missing link prediction task in a variety of real-world social, technological, and biological networks, and show that heuristics based on random-walk-type processes outperform the popular Adamic-Adar and common-neighbors heuristics in many networks. While the newly defined heuristics did not beat existing ones in the missing link prediction task, the work motivated these heuristics in terms of a flow-based framework.

    The effects of social influence and homophily, considered by Gong et al. [5], suggest that both network structure and node attribute information should inform the tasks of link prediction and node attribute inference. They used the SAN (Social-Attribute Network) framework with several leading supervised and unsupervised link prediction algorithms and demonstrated performance improvement for each algorithm on both link prediction and attribute inference. They made the novel observation that attribute inference can help inform link prediction, i.e., link prediction accuracy is improved by first inferring missing attributes. They comprehensively evaluated these algorithms and compared them with other existing algorithms using a novel, large-scale Google+ dataset, which they made publicly available.

    Another challenge in the link prediction problem is to combine effectively the information from network structure with rich node and edge attribute data. Backstrom and Leskovec [6] developed an algorithm based on supervised random walks that combines information from the network structure with edge attribute information. The algorithm assigns strengths to the edges of the network so that a random walker is more likely to visit the nodes to which new links will be created in the future. Their approach outperformed state-of-the-art unsupervised approaches as well as approaches based on feature extraction.

    Previous works in the literature on link prediction, whether supervised or unsupervised, have used large feature sets. An approach with a minimal number of features can improve the performance of the algorithm. In this work we propose a supervised learning approach that uses a minimal number of features for link prediction in social networks.

    Support Vector Machine (SVM) is a supervised learning approach that can be applied for prediction. LibSVM [7], a library for SVMs, was used for link prediction in this work.

    III.  Proposed Method for Link Prediction

    The system architecture of our approach for future collaboration prediction in co-authorship networks is shown in Fig. 1. The main tasks of this approach are as follows:

    •  Constructing an adjacency matrix from the dataset

    •  Extraction of features

    •  Feature set construction

    •  Building training model using SVM

    •  Testing the model


    Fig.1. System Architecture of the proposed method for link prediction

     A.  Construction of adjacency matrix

    Generally the dataset contains information in the form of edge pairs representing collaborations between authors. The network in the problem space can be modeled as a graph represented by an adjacency matrix in the solution space. Formally, link prediction takes as input a partially observed graph $G \in \{0, 1, ?\}^{n \times n}$, where $n$ is the number of nodes, 0 denotes a known non-existing link, 1 denotes a known present link, and ? denotes an unknown link. Our goal is to make predictions for the unknown links.
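    A minimal sketch of this step, assuming the dataset is a list of undirected author pairs, could look as follows. The function name and author names are hypothetical, and for simplicity every pair not listed is stored as 0 rather than as an explicit unknown entry.

```python
def build_adjacency(edge_pairs):
    """Build a symmetric 0/1 adjacency matrix from undirected edge pairs.

    Returns the sorted node list (row/column order) and the matrix.
    """
    nodes = sorted({name for pair in edge_pairs for name in pair})
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for u, v in edge_pairs:
        i, j = index[u], index[v]
        if i != j:  # ignore self-loops
            matrix[i][j] = 1
            matrix[j][i] = 1
    return nodes, matrix


# Hypothetical edge pairs: alice-bob and bob-carol have co-authored.
nodes, A = build_adjacency([("alice", "bob"), ("bob", "carol")])
```

    For this toy input, the rows of `A` follow the order `["alice", "bob", "carol"]`, and only the alice-bob and bob-carol entries are set to 1.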

     B.  Feature Extraction

    A multitude of topological features can be used for a pair

    of nodes. In this paper, the features documented in [2] were

    chosen for co-author relationship prediction.

    1.   Node Neighborhood based Features

    Common neighbors. Common neighbors is a measure that considers the intersection of the neighbor sets of two nodes $u$ and $v$. Using the number of common neighbors is an attestation to the network transitivity property. As the number of common neighbors increases, the likelihood that the two nodes will be linked increases.

    Common neighbors $(u, v) = |\Gamma(u) \cap \Gamma(v)|$

    $\Gamma(u)$ denotes the set of neighbors of node $u$.

    $\Gamma(u) \cap \Gamma(v)$ denotes the set of common neighbors of nodes $u$ and $v$.

    $|\Gamma(u) \cap \Gamma(v)|$ denotes the cardinality of the set of common neighbors.

    Jaccard's coefficient. Jaccard's coefficient is a normalized measure of common neighbors. It calculates the ratio of common neighbors to all the neighbors of the two nodes, and can also be used to compare the similarity and diversity of the neighbor sets.

    Jaccard coefficient $(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$

    Adamic/Adar. Adamic/Adar, a weighted version of common neighbors, assigns greater weight to neighbors that are not shared with many others. The contribution of a common neighbor to the score is thus weighted in proportion to the rarity of the neighbor.

    Adamic/Adar $(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|}$
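    The three neighborhood-based scores can be sketched on a small graph as follows. The graph is stored as a dictionary mapping each node to its neighbor set; this representation, the function names, and the toy graph are assumptions for illustration, and the natural logarithm is assumed for Adamic/Adar.

```python
import math


def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbors divided by all neighbors of either node."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over the shared neighbors."""
    return sum(
        1.0 / math.log(len(adj[z]))
        for z in adj[u] & adj[v]
        if len(adj[z]) > 1  # guard against log(1) = 0
    )


# Toy undirected graph with edges a-c, a-d, b-c, b-d, b-e, d-e.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
cn = common_neighbors(adj, "a", "b")
jc = jaccard(adj, "a", "b")
aa = adamic_adar(adj, "a", "b")
```

    For the pair (a, b), the shared neighbors are c (degree 2) and d (degree 3), so the lower-degree neighbor c contributes more to the Adamic/Adar score than d does.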

    2.   Vertex Feature Aggregation

    Preferential attachment. Preferential attachment was introduced to explain the power-law degree distribution in complex real-world networks. The concept is akin to the well-known rich-get-richer model: a node with a higher degree is more likely to acquire more links in the future, i.e., nodes with higher degree attract more of the links newly introduced to the network.

    Preferential attachment score $(u, v) = |\Gamma(u)| \cdot |\Gamma(v)|$, where $\Gamma(u)$ denotes the set of neighbors of node $u$.
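    As a small illustration, the score is simply the product of the two node degrees. The dictionary-of-neighbor-sets representation and the toy graph below are assumptions for this sketch.

```python
def preferential_attachment(adj, u, v):
    """Product of the degrees of u and v."""
    return len(adj[u]) * len(adj[v])


# Toy undirected graph with edges a-c, a-d, b-c, b-d, b-e, d-e.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
pa_ab = preferential_attachment(adj, "a", "b")  # degrees 2 and 3
pa_bd = preferential_attachment(adj, "b", "d")  # degrees 3 and 3
```

    The pair of two degree-3 nodes receives the higher score, reflecting the rich-get-richer intuition.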

    3.   Path based features 

    Shortest path. The fact that friends of a friend can become friends suggests that the path distance between two nodes in a social network can influence the formation of a link between them. The shorter the distance, the more likely a link is to form between the nodes.
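    For an unweighted graph, the shortest path distance can be obtained with a breadth-first search. The following is a minimal sketch under that assumption; the function name, the graph representation, and the choice to return `None` for unreachable pairs are illustrative and not taken from the paper.

```python
from collections import deque


def shortest_path_length(adj, source, target):
    """Hop count of the shortest path, or None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Toy graph: a chain a-b-c-d plus an isolated node e.
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": set(),
}
d_ac = shortest_path_length(adj, "a", "c")
d_ad = shortest_path_length(adj, "a", "d")
d_ae = shortest_path_length(adj, "a", "e")
```

    In this toy graph, a and c are two hops apart and would be considered more likely to link than a and d, which are three hops apart.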

    Katz. It is a variant of the shortest path distance, but generally works better for link prediction. Katz defines a measure that sums over all paths between two nodes, damping exponentially by length so that short paths count more heavily.

    Katz $(u, v) = \sum_{l=1}^{\infty} \beta^{l} \cdot |\text{paths}^{(l)}_{u,v}|$

    where $\text{paths}^{(l)}_{u,v}$ is the set of all paths of length $l$ from $u$ to $v$. Katz generally works much better than the shortest path since it is based on the ensemble of all paths between the nodes $u$ and $v$. The parameter $\beta$ ($\le 1$) can be used to regularize this feature. A small value of $\beta$ considers only the shorter paths, in which case this feature behaves much like the features based on the node neighborhood.
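    One common way to approximate the Katz score is to truncate the sum at a maximum length and use powers of the adjacency matrix, whose entries count walks of a given length. The sketch below follows that approach; the truncation length, the damping value, and the use of walks rather than simple paths are assumptions of this example, not details given in the paper.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def katz_scores(adjacency, beta=0.1, max_len=3):
    """Truncated Katz: sum of beta**l * (number of length-l walks), l = 1..max_len."""
    n = len(adjacency)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adjacency]  # adjacency ** 1
    for length in range(1, max_len + 1):
        weight = beta ** length
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = mat_mul(power, adjacency)
    return scores


# Toy chain 0-1-2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
K = katz_scores(A, beta=0.1, max_len=3)
```

    With these settings, nodes 0 and 2 are connected only by one walk of length 2, so their score is 0.1 squared; nodes 0 and 1 get 0.1 for the direct edge plus a small contribution from two walks of length 3.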


    C.  Feature Set Construction

    For link prediction, each data point corresponds to a pair of vertices with a label denoting their link status, i.e., whether a link exists between them or not, so the chosen features should represent some form of proximity between the pair of vertices.

    In this work, the class labels used are -1 and +1, where -1 denotes the non-existence of a link and +1 denotes its existence. Once the features are calculated for all node pairs in the graph, a feature vector consisting of the feature scores and the class label is obtained for each node pair. A sample feature set representation is shown in Table 1.

    TABLE 1: FEATURE VECTOR REPRESENTATION

    The edge-pairs column contains the edge pairs obtained from the dataset. The remaining columns contain the class label and the extracted feature values for each edge pair.
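    The construction of such rows can be sketched as follows. Each row holds a node pair, its class label (+1 or -1), and a list of feature scores. The row layout, the function names, the toy graph, and the restriction to two features are assumptions made to keep the example short.

```python
from itertools import combinations


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])


def feature_rows(adj, feature_fns):
    """One row per unordered node pair: (pair, label, feature values)."""
    rows = []
    for u, v in combinations(sorted(adj), 2):
        label = 1 if v in adj[u] else -1  # +1: link exists, -1: no link
        features = [fn(adj, u, v) for fn in feature_fns]
        rows.append(((u, v), label, features))
    return rows


# Toy undirected graph with edges a-b, a-c, b-c, c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
rows = feature_rows(adj, [common_neighbors, preferential_attachment])
```

    For example, the row for the linked pair (a, b) is `(("a", "b"), 1, [1, 4])`, and the row for the unlinked pair (a, d) is `(("a", "d"), -1, [1, 2])`.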

     D.  Building link predictive model

    The constructed feature set was used to train the model. A fraction of the feature vectors, 70% of the total, was used for training. The feature set is the input to the SVM training function, which produces the prediction model.

    For our experiment, LibSVM [7] was used to train and obtain the prediction model. Given the feature set as input, LibSVM outputs a predictive model containing the attributes and other information required for prediction.
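    A sketch of how the feature vectors might be prepared for LibSVM is given below. LibSVM reads one sample per line in the sparse text form `label index:value ...` with indices starting at 1. The helper names, the shuffling with a fixed seed, and the rounding down of the 70% cut are assumptions; the paper does not state how the split was drawn.

```python
import random


def to_libsvm_line(label, features):
    """Format one sample as '<label> 1:<v1> 2:<v2> ...'."""
    parts = [f"{label:+d}"]
    parts += [f"{i}:{value:g}" for i, value in enumerate(features, start=1)]
    return " ".join(parts)


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and cut it 70% / 30% (train size rounded down)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


line = to_libsvm_line(1, [2, 0.5, 6])
train_rows, test_rows = split_70_30(list(range(10)))
```

    Files written in this format can be passed to LibSVM's `svm-train` tool to obtain a model file and to `svm-predict` to obtain predicted labels for the held-out part.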

     E.  Testing the model

    The trained model obtained from the SVM is then tested for its performance. The remaining 30% of the feature vectors, held out from training, was used for testing. SVM testing outputs a set of predicted labels.

    IV.  Results

    The observations were made on two datasets: a synthetic dataset and the NetScience [8] dataset. Results were evaluated using four performance metrics, namely accuracy, precision, recall, and F1-score.
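    For reference, the four metrics can be computed from true and predicted labels as in the following sketch, which uses the standard definitions with +1 as the positive class. The function name and the toy labels are illustrative only and are not the paper's data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: two true positives, one miss, one false alarm, two true negatives.
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, -1, -1, 1]
metrics = binary_metrics(y_true, y_pred)
```

    When precision and recall are equal, as in this toy case, the F1-score equals both of them.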

     I.  Synthetic Data

    A graph with 10 nodes was considered to determine the performance of the proposed approach.

     A.  Characteristics of Synthetic dataset

    The characteristics of the synthetic 10-node network are as follows.

    Number of nodes         10
    Number of edge-pairs    16

    TABLE 2: CHARACTERISTICS OF DATASET CONTAINING 10 NODES

    Fig. 2. Graph of 10-node network

    Fig. 2 shows the graph of the synthetic data. The link prediction experiment was conducted on the synthetic data, and the performance measures were calculated from the results obtained.

    Metrics      Values
    Accuracy     93%
    Precision    93%
    Recall       93%
    F1-score     0.9333

    TABLE 3: PERFORMANCE MEASURE OBTAINED FOR SYNTHETIC DATA

    Table 3 shows the performance metrics values obtained for the

    synthetic data.

     II.   Real-world Dataset

    In this experiment, the real dataset used was NetScience, a co-authorship network of scientists working on network theory and experiment, as compiled by M. Newman in May 2006 [8].


     A.  Characteristics of NetScience dataset

    Characteristics    Values
    Number of nodes    1588
    Number of edges    2743

    TABLE 4: CHARACTERISTICS OF NETSCIENCE DATA

    The data was divided into several smaller datasets based on the number of positive and negative classes, and three experiments were conducted.

    To check for the effect of class imbalance, the number of positive and negative classes was varied across the experiments.

    Experiment 1 was conducted with an equal number of positive and negative classes. Table 5 shows the number of training and test samples considered for Experiment 1.

    Characteristics               Values
    Number of Positive classes    2743
    Number of Negative classes    2743
    Number of Training data       3840
    Number of Testing data        1646

    TABLE 5: SAMPLE STATISTICS FOR EXPERIMENT-1

    Experiment 2 was conducted with more positive than negative classes. Table 6 shows the number of training and test samples considered for Experiment 2.

    Characteristics               Values
    Number of Positive classes    2743
    Number of Negative classes    1371
    Number of Training data       2879
    Number of Testing data        1235

    TABLE 6: SAMPLE STATISTICS FOR EXPERIMENT-2

    Experiment 3 was conducted with fewer positive than negative classes. Table 7 shows the number of training and test samples considered for Experiment 3.

    Characteristics               Values
    Number of Positive classes    2743
    Number of Negative classes    5486
    Number of Training data       5760
    Number of Testing data        2469

    TABLE 7: SAMPLE STATISTICS FOR EXPERIMENT-3
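    The training and testing counts in Tables 5 to 7 are consistent with a 70/30 split of the combined positive and negative samples. The short check below assumes the training size is rounded down; the paper does not state the rounding rule, and the function name is hypothetical.

```python
def split_sizes(num_positive, num_negative):
    """Train/test sizes for a 70/30 split, training size rounded down."""
    total = num_positive + num_negative
    train = total * 7 // 10
    return train, total - train


# (positive, negative) counts from Tables 5, 6 and 7.
experiments = {1: (2743, 2743), 2: (2743, 1371), 3: (2743, 5486)}
sizes = {name: split_sizes(*counts) for name, counts in experiments.items()}
```

    Under this assumption the computed sizes are (3840, 1646), (2879, 1235), and (5760, 2469), matching the three tables.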

    Table 8 shows the performance metric values obtained for the three experiments on NetScience data.

    Metrics          Experiment 1    Experiment 2    Experiment 3
    Accuracy (%)     99.6            99.6            99.8
    Precision (%)    99.6            99.5            99.5
    Recall (%)       99.7            99              99
    F1-Score (%)     99.7            99.2            99.2

    TABLE 8: PERFORMANCE MEASURE OBTAINED FOR NETSCIENCE DATA

    The supervised SVM algorithm performed well in co-author relationship prediction with a limited number of features.

    Collaborations were easier to predict for authors with a higher degree of collaboration than for less productive authors, in terms of all four evaluation measures.

    V.  Conclusion

    In this work, the classical problem of link prediction was considered, in which we predict the edges that are most likely to occur in the future given a snapshot of a social network. There have been numerous research attempts to address the problem of link prediction using supervised learning methods. However, the knowledge gained was not sufficient for accurate link prediction. Links, or future collaborations, can be predicted accurately by selecting appropriate features that extract network-related information. The features selected for this work included node and vertex features. In this work, an approach using a limited feature set was proposed for building a link prediction model in co-authorship networks. The proposed approach was tested on synthetic data and real network data.


    In conclusion, the selection of appropriate features was helpful in predicting the links with better results. We observed that the samples selected for testing should contain a balanced number of positive and negative classes. The proposed approach could be used to identify latent yet potentially successful collaborations, which would facilitate the development of research collaborations.

    References

    [1]  Liben-Nowell, David, and Jon Kleinberg. "The link prediction problem for social networks." Journal of the American Society for Information Science and Technology 58.7 (2007): 1019-1031.

    [2]  Al Hasan, Mohammad, and Mohammed J. Zaki. "A survey of link prediction in social networks." Social Network Data Analytics. Springer US, 2011. 243-275.

    [3]  Al Hasan, Mohammad, et al. "Link prediction using supervised learning." SDM'06: Workshop on Link Analysis, Counter-terrorism and Security. 2006.

    [4]  Narang, Kanika, Kristina Lerman, and Ponnurangam Kumaraguru. "Network flows and the link prediction problem." Proceedings of the 7th Workshop on Social Network Mining and Analysis. ACM, 2013.

    [5]  Gong, Neil Zhenqiang, et al. "Jointly predicting links and inferring attributes using a social-attribute network (SAN)." arXiv preprint arXiv:1112.3265 (2011).

    [6]  Backstrom, Lars, and Jure Leskovec. "Supervised random walks: predicting and recommending links in social networks." Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM, 2011.

    [7]  Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

    [8]  NetScience dataset: M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).
