Coauthor Prediction for Junior Researchers
Shuguang Han, Daqing He, Peter Brusilovsky, Zhen Yue
School of Information Sciences, University of Pittsburgh
135 North Bellefield Ave. Pittsburgh 15213 United States
{shh69, dah44, peterb, zhy18}@pitt.edu
Abstract. Research collaboration can bring in different perspectives and generate
more productive results. However, finding an appropriate collaborator can be
difficult due to a lack of sufficient information. Link prediction is a related
technique for collaborator discovery, but its focus has been mostly on core
authors who have relatively more publications. We argue that junior researchers
actually need more help in finding collaborators. Thus, in this paper, we focus on
coauthor prediction for junior researchers. Most previous work on coauthor
prediction considered global network features and local network features
separately, or tried to combine local network features and content features. We
found a significant improvement by simply combining local and global network
features. We further developed a regularization based approach to incorporate
multiple features simultaneously. Experimental results demonstrated that this
approach outperformed the simple linear combination of multiple features. We
further showed that content features, which have proved useful in link
prediction, can be easily integrated into our regularization approach.
Keywords: Coauthor prediction; link prediction; social network; expert search
1 Introduction
Identifying and maintaining appropriate collaboration relations is critical in a
researcher’s academic life [18] because collaboration can bring together diverse
expertise on the same research problem and generate more influential results. The
link prediction techniques developed in the social network research community [12]
can help predict future collaborations and make researchers aware of possible
coauthors. However, most of this research considered only core authors [12,20] who
have at least a certain number of publications in both the training dataset and the
testing dataset (three in [12], and five in [20]). Considering the skewed
distribution of the number of publications per author [13], such selection criteria
cut off a large proportion of authors. Conclusions drawn from the core authors
may not hold for the remaining authors, because predicting from sparse data is more
difficult [15]. Besides, prediction in current work is done at the global level: the
top-k ranked pairs among all candidate pairs are selected as the predicted links
(k is the number of links in the testing dataset). Global level prediction offers no
control over generating predictions for particular individuals; however, we believe
that prediction is more useful when it targets individuals, especially junior
authors. They usually do not have many coauthors, and are more eager to form new
connections.
Data sparseness is recognized as a major problem for the prediction of coauthors
for junior researchers. To relieve data sparseness, content information was commonly
used. Related techniques in expert search [1] utilize content information to find rele-
vant experts. In the recommender system domain, a hybrid method combining both
content information and social network feature was often used to solve the cold start
problem [14,11,19]. Previous research that considered social network information
used either local network features (e.g. direct connections) or global network
features (e.g. shortest path), but not both. Our experiments showed that combining
the local and global network features significantly improves the prediction
performance, whether or not content information is added.
Since multiple features are used in this task, the next question is how to combine
them effectively. Linear combination is a simple solution, but it is difficult to
scale different scores and tune the parameters. An alternative method is to treat the
prediction as a binary classification problem [16,20,3] based on multiple features.
However, to train a binary classifier, we need both positive examples (real coauthor
pairs) and negative examples (real non-coauthor pairs). Sampling negative examples
is difficult because not observing a coauthor link does not imply that two authors
are a real non-coauthor pair; the link may be missing simply because the coverage of
the dataset is limited.
To sum up, the focus of this paper is to predict coauthors for junior researchers.
Multiple features including local network features, global network features and
content features are considered to improve the prediction performance. In the
remainder of this paper, we first review related work in Section 2. Then, in Section
3, a new approach that combines multiple sources of evidence using a regularization
framework is proposed. We describe the datasets and evaluation metrics in Section 4,
and discuss the empirical results in Section 5. In the final section, we summarize
our findings and propose future directions.
2 Related Works
In the literature, coauthor prediction has been modeled as a similarity measuring prob-
lem, a recommendation problem, or a classification problem. When viewed as a simi-
larity measuring problem, the similarities between any two authors are calculated, and
then the author pairs are ranked and those in top positions are chosen as the predicted
links [12]. The core of this approach is to define the vertex similarity [5]. Authors
with high vertex similarity are assumed to have high probabilities of collaboration.
Network topological features are usually used to measure the vertex similarity. Both
local network measures such as common neighbors, Jaccard similarity, Adamic/Adar and
preferential attachment, and global network measures such as the shortest path,
SimRank and the Katz index have been used before. All of these measures are described
and compared in [12].
Coauthor prediction can also be viewed as a personalized recommendation prob-
lem. The Collaborative Filtering (CF) method was extended in [19] for people-to-
people recommendation; however, CF suffers from the cold-start problem when data
is sparse. This problem is particularly important in our task because junior
researchers usually lack coauthor information. A hybrid method that combines
both social network information and content information can be adopted to relieve the
data sparseness. The combination can be a simple linear combination [11,4], a
regularization based combination [14], or a filtering based combination [17].
Other researchers [3,20,16] found that besides local and global network topological
features, other features can also help improve the prediction performance. For exam-
ple, the authors’ keywords matching, the publication classification code matching
[3,16,11] and the meta-path in heterogeneous information networks [20] were all
found useful. In order to combine multiple features, the coauthor prediction was mod-
eled as a binary classification problem.
Expert search in the information retrieval domain is also related. Expert search
techniques were not well studied until TREC’s expert finding task [6], in which
researchers are required to build an algorithm that ranks candidates based on their
relevance to user-issued queries. The widely adopted method for expert search is to
construct expert profiles using their previous publications or co-occurrence
texts [1]. Expert search does not model users’ social context, which makes it less
useful than social network based methods [11]. However, combining expert profiles
and social context information performs better than using either separately.
3 Methodology
3.1 Problem Definition
The prediction task is formalized as follows: we divide the dataset into the training
dataset $D_{train}$ and the testing dataset $D_{test}$. The division criteria are
described in Section 4. The test documents $D_J \subseteq D_{test}$ are defined as
those documents with junior researchers as the first authors. Each document $d$ in
$D_J$ is further represented by a triple $\langle a, C, q \rangle$ describing $d$:
$a$ is the first author ($a$ is a junior researcher), $C$ represents the remaining
authors of the document, and $q$ represents the metadata such as the title and/or
abstract. The junior researchers are defined as those people who published at least
one first-author paper in $D_{test}$, and at least one but no more than five papers
in $D_{train}$. Our goal is to predict the collaborations between $a$ and the rest of
the authors $C$. However, if $a$ and any author in $C$ are coauthors in $D_{train}$,
then that coauthor link is not included in our prediction because we are predicting
new coauthor links. $q$ is used to simulate $a$’s topic interest in document $d$, and
we assume that $a$ already knows this information before he/she wants to build
connections with the authors in $C$.
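The selection of test cases described above can be sketched as follows. This is a minimal illustration, not the authors' code: it reads the definition as "one to five papers in the training period and a first-author paper in the testing period", and the paper representation (dicts with an ordered `authors` list and a `title`) and the function name `select_test_cases` are assumptions made for the example.

```python
from collections import Counter

def select_test_cases(train_papers, test_papers, max_train_papers=5):
    """Return (first_author, new_coauthors, title) for each qualifying test paper.

    Each paper is a dict with an ordered "authors" list (first author first)
    and an optional "title". These field names are hypothetical.
    """
    # Number of training-period papers per author.
    paper_counts = Counter(a for p in train_papers for a in set(p["authors"]))
    # Coauthor links already observed in the training period.
    known_links = {
        frozenset((u, v))
        for p in train_papers
        for u in p["authors"]
        for v in p["authors"]
        if u != v
    }
    cases = []
    for p in test_papers:
        first, others = p["authors"][0], p["authors"][1:]
        # Junior researcher: between 1 and max_train_papers training papers.
        if not 1 <= paper_counts[first] <= max_train_papers:
            continue
        # Keep only links that are new relative to the training period.
        new_coauthors = [c for c in others
                         if frozenset((first, c)) not in known_links]
        if new_coauthors:
            cases.append((first, new_coauthors, p.get("title", "")))
    return cases

# Toy data (hypothetical): "j" is junior, "n" never appears in training.
train = [{"authors": ["j", "s"]}, {"authors": ["s", "t"]}]
test = [
    {"authors": ["j", "s", "t"], "title": "new paper"},
    {"authors": ["n", "s"], "title": "unseen first author"},
]
cases = select_test_cases(train, test)
```

In the toy data only the link between "j" and "t" is kept: the link between "j" and "s" already exists in the training period, and "n" has no training papers at all.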
3.2 Baseline Models
As baselines, we adopted two link prediction measures, the Adamic/Adar index and the
Katz index, to represent the best practices using local network topology features
and global network topology features, respectively. We also adopted the Balog
Model 2, which serves as the best practice among content-based methods. Besides, we
considered the standard Collaborative Filtering algorithm, which has been found to
be an effective method in recommender systems. We adopt the similarity measuring
approach for link prediction, the core of which is to rank each candidate $c$ based
on his/her similarity with author $a$.
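The Adamic/Adar index [10] (AA) is a typical local network feature based method. In
our task, we compute the similarity between candidate $c$ and author $a$, i.e.
$s_{AA}(c, a)$, using Formula (1). $\Gamma(x)$ denotes the set of neighbors of
author $x$, and $|\Gamma(x)|$ denotes the size of $\Gamma(x)$.
$$ s_{AA}(c, a) = \sum_{z \in \Gamma(c) \cap \Gamma(a)} \frac{1}{\log |\Gamma(z)|} \tag{1} $$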
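As an illustration of the Adamic/Adar index, the sketch below computes the score on a toy coauthor network. It uses the standard definition (a sum of 1/log(degree) over common neighbors); the dictionary-of-sets representation and the name `adamic_adar` are assumptions made for this example.

```python
import math

def adamic_adar(neighbors, c, a):
    """Adamic/Adar score between candidate c and author a.

    neighbors maps each author to the set of their coauthors
    (an undirected network, so links appear in both directions).
    """
    common = neighbors.get(c, set()) & neighbors.get(a, set())
    score = 0.0
    for z in common:
        degree = len(neighbors[z])
        # A common neighbor of two distinct authors has degree >= 2,
        # so log(degree) is strictly positive here.
        score += 1.0 / math.log(degree)
    return score

# Toy coauthor network (hypothetical data).
toy_neighbors = {
    "a": {"x", "y"},
    "c": {"x", "y", "z"},
    "x": {"a", "c"},
    "y": {"a", "c", "z"},
    "z": {"c", "y"},
}
aa_score = adamic_adar(toy_neighbors, "c", "a")
```

Here "x" (two coauthors) contributes more than "y" (three coauthors), which reflects the idea that a shared neighbor with few connections is stronger evidence than a shared hub.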
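The Katz index [9] (Katz) takes the global network structure into account. It is
defined as a weighted sum over all paths between candidate $c$ and author $a$, which
is computed using Formula (2). $\mathrm{paths}_{c,a}^{\langle l \rangle}$ is the set
of all length-$l$ paths between $c$ and $a$. $\beta$ is the damping factor that
controls the weight of the paths.
$$ s_{Katz}(c, a) = \sum_{l=1}^{\infty} \beta^{l} \cdot |\mathrm{paths}_{c,a}^{\langle l \rangle}| \tag{2} $$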
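A truncated version of the Katz index can be sketched as below. This is only an approximation for illustration: it cuts the sum off at `max_length` (an assumed parameter) and counts paths as walks in which nodes may repeat, as powers of the adjacency matrix do. It is not the Gauss-Southwell procedure cited later in the paper.

```python
def katz_scores(neighbors, a, beta=0.05, max_length=5):
    """Truncated Katz scores from author a to every reachable author.

    Counts walks of length 1..max_length starting at a and weights each
    length-l walk by beta**l.
    """
    # walks[v] = number of length-l walks from a to v, starting with l = 0.
    walks = {a: 1}
    scores = {}
    for length in range(1, max_length + 1):
        next_walks = {}
        for v, count in walks.items():
            for u in neighbors.get(v, ()):
                next_walks[u] = next_walks.get(u, 0) + count
        walks = next_walks
        for v, count in walks.items():
            scores[v] = scores.get(v, 0.0) + (beta ** length) * count
    return scores

# Toy path network a - x - c (hypothetical data).
path_network = {"a": {"x"}, "x": {"a", "c"}, "c": {"x"}}
katz = katz_scores(path_network, "a", beta=0.1, max_length=4)
```

Unlike a purely local measure, "c" receives a non-zero score even though it is not directly linked to "a", and the small damping factor makes short walks dominate.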
In the content-based baseline model Balog Model 2 (ES) [1], the content similarity
is calculated between the topic interest of $a$ and that of candidate $c$ for paper
$d$ using Formula (3). The topic interest of $a$ is represented by the bag-of-words
in $q$ and is used to mimic the user query in ES. $p(q \mid d')$ is estimated using
the standard language modeling approach in information retrieval, and
$p(d' \mid c)$ is the association between author $c$ and document $d'$. In this
paper, we used the uniform association for multi-authored papers, so each author
receives the same weight of association regardless of author order.
$$ s_{ES}(c, a) = p(q \mid c) = \sum_{d'} p(q \mid d')\, p(d' \mid c) \tag{3} $$
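The document-centric scoring idea behind this baseline can be sketched as follows. Several details are assumptions made for the example and are not taken from the paper: the query likelihood uses Jelinek-Mercer smoothing with a weight `lam`, the author-document association is 1 for every author of a document and 0 otherwise, and the names `query_likelihood` and `expert_score` are illustrative.

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms, coll_counts, coll_size, lam=0.5):
    """p(q | d) under a smoothed unigram language model (assumed smoothing)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    p = 1.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0
        p_coll = coll_counts[t] / coll_size
        p *= (1 - lam) * p_doc + lam * p_coll
    return p

def expert_score(query_terms, candidate, docs, lam=0.5):
    """Sum of p(q | d) over the documents the candidate authored.

    docs is a list of (authors, terms) pairs; every author of a document
    gets the same association weight regardless of author order.
    """
    coll_counts = Counter(t for _, terms in docs for t in terms)
    coll_size = sum(coll_counts.values())
    score = 0.0
    for authors, terms in docs:
        if candidate in authors:
            score += query_likelihood(query_terms, terms,
                                      coll_counts, coll_size, lam)
    return score

# Toy collection (hypothetical data).
docs = [
    ({"u", "v"}, ["link", "prediction", "network"]),
    ({"w"}, ["image", "retrieval", "network"]),
]
query = ["link", "prediction"]
score_u = expert_score(query, "u", docs)
score_w = expert_score(query, "w", docs)
```

Because the score is a product of per-term probabilities, the values are small in absolute terms, which is why such scores need rescaling before they are combined with network based scores.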
The fourth baseline is the user-based Collaborative Filtering (CF) algorithm [2].
The traditional scenario of CF consists of users, items and users’ ratings on items.
However, in the case of people-to-people recommendation, the user and the item are
both people and there are no explicit ratings on items. In order to apply CF to
coauthor prediction, we treat people as both the users and the items, and the number
of papers two people coauthored as their rating on each other, i.e. the users’
ratings on items. Using the simple weighted average aggregation, the similarity
between $c$ and $a$ is calculated using Formula (4), in which $N_k(a)$ is the set of
the $k$ nearest neighbors of $a$. $r_{u,c}$ is $u$’s ($u \in N_k(a)$) rating on $c$,
i.e. the number of coauthored papers of $u$ and $c$. $sim(a, u)$ measures the
similarity of the ratings on items between users $a$ and $u$, which is calculated by
the cosine similarity of their coauthors (see Formula (5), in which $\mathbf{r}_a$
is the vector of $a$’s ratings). $\kappa$ is the normalization term.
$$ s_{CF}(c, a) = \kappa \sum_{u \in N_k(a)} sim(a, u)\, r_{u,c} \tag{4} $$
$$ sim(a, u) = \frac{\mathbf{r}_a \cdot \mathbf{r}_u}{\|\mathbf{r}_a\|\, \|\mathbf{r}_u\|} \tag{5} $$
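A compact sketch of user-based CF on coauthor counts is given below. It assumes the normalization term is one over the sum of absolute similarities of the selected neighbors, which is the usual choice in [2] but is not spelled out here; the names `cosine` and `cf_score` and the nested-dict rating format are illustrative.

```python
import math

def cosine(r_u, r_v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    dot = sum(value * r_v.get(key, 0) for key, value in r_u.items())
    norm_u = math.sqrt(sum(x * x for x in r_u.values()))
    norm_v = math.sqrt(sum(x * x for x in r_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def cf_score(ratings, a, c, k=5):
    """Predicted rating of author a on candidate c.

    ratings[u][v] is the number of papers u and v coauthored.
    """
    sims = [(cosine(ratings[a], ratings[u]), u) for u in ratings if u != a]
    # Keep the k most similar users as the neighborhood of a.
    top = sorted(sims, key=lambda pair: pair[0], reverse=True)[:k]
    norm = sum(abs(s) for s, _ in top)
    if norm == 0:
        return 0.0
    return sum(s * ratings[u].get(c, 0) for s, u in top) / norm

# Toy coauthor counts (hypothetical data).
ratings = {
    "a": {"x": 2},
    "u": {"x": 1, "c": 3},
    "v": {"y": 1},
    "x": {"a": 2, "u": 1},
    "c": {"u": 3},
    "y": {"v": 1},
}
cf_c = cf_score(ratings, "a", "c", k=2)
cf_y = cf_score(ratings, "a", "y", k=2)
```

Author "u" shares the coauthor "x" with "a", so the coauthors of "u" (here "c") are recommended to "a", while "y" gets no score because its only coauthor has nothing in common with "a".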
3.3 Multiple Objective Optimization using Regularization
Each of the baseline models considers only one type of feature. Since the
combination of multiple features has proven useful in many works [11,4,20,3,16],
the next problem is how to combine multiple features more effectively. The simple
linear combination works well only when the combined features are independent. As
mentioned in [12], when the damping factor $\beta$ is small, Katz is very similar to
neighborhood based approaches such as AA, which means these two features are not
independent of each other. Therefore, we propose to use a regularization based
approach as suggested in [7].
Our first regularization based combination approach is named AAN, in which the
local network feature based method AA is set as the base, and the objective is to
combine features from global networks and/or content information. For each document
$d$ in the test set, we need to rank the candidates for $a$ based on their
similarity score vector $f$. $f$ is initialized as a zero vector and is updated
according to the objective function defined in Formula (6), in which $f^{*}$ denotes
the final score vector, $f_{AA}$ is the vector of AA scores, $L = I - A$ ($A$ is the
adjacency matrix of the coauthor network) is the difference matrix, and
$\|\cdot\|$ denotes the L2 norm of a vector. The term $f^{T} L f$ helps propagate
local similarity scores through the global network while $\|f - f_{AA}\|^{2}$ ensures
that the final scores do not go far away from $f_{AA}$, and $\mu$ is the importance
parameter. To minimize the objective function, we set the derivative of $\Omega(f)$
with respect to $f$ to 0, and the closed-form solution is shown in Formula (7).
However, computing the inverse of a matrix is time consuming. An alternative is to
use the power iteration method as suggested in [7]. In each iteration, we update the
score using Formula (8), and the final solution is $f^{*} = \lim_{t \to \infty} f^{(t)}$.
$$ \Omega(f) = \mu\, f^{T} L f + (1 - \mu)\, \|f - f_{AA}\|^{2}, \qquad f^{*} = \arg\min_{f} \Omega(f) \tag{6} $$
$$ f^{*} = (1 - \mu)\,(I - \mu A)^{-1} f_{AA} \tag{7} $$
$$ f^{(t+1)} = \mu A f^{(t)} + (1 - \mu)\, f_{AA} \tag{8} $$
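The power iteration mentioned above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes an objective of the form mu * f^T (I - A) f + (1 - mu) * ||f - base||^2, whose stationary point satisfies f = mu * A f + (1 - mu) * base, and it assumes a symmetrically normalized adjacency matrix so that the iteration converges for mu < 1. The names `normalize`, `propagate`, `base` and `mu` are illustrative.

```python
import math

def normalize(neighbors):
    """Symmetric normalization: S[u][v] = 1 / sqrt(deg(u) * deg(v))."""
    return {
        u: {v: 1.0 / math.sqrt(len(nbrs) * len(neighbors[v])) for v in nbrs}
        for u, nbrs in neighbors.items()
    }

def propagate(S, base, mu=0.5, tol=1e-12, max_iter=10000):
    """Iterate f <- mu * S f + (1 - mu) * base from a zero vector."""
    f = {u: 0.0 for u in S}
    for _ in range(max_iter):
        new_f = {
            u: mu * sum(w * f[v] for v, w in S[u].items())
               + (1 - mu) * base.get(u, 0.0)
            for u in S
        }
        delta = max(abs(new_f[u] - f[u]) for u in S)
        f = new_f
        if delta < tol:
            break
    return f

# Toy path network a - x - c; only "x" has a non-zero base (local) score.
network = {"a": {"x"}, "x": {"a", "c"}, "c": {"x"}}
S = normalize(network)
f_star = propagate(S, {"x": 1.0}, mu=0.5)
```

After propagation "c" has a positive score although its base score was zero, which is the effect the regularization term is meant to achieve: local evidence spreads along the network while the fitting term keeps the result close to the base scores.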
For comparison, we also propose a linear combination model, AANL, that combines
local and global network features. We compute two different similarity scores: the
Adamic/Adar score and the Katz score. The two scores are then combined using
Formula (9), in which a weight parameter indicates the importance of the Katz
score.
(9)
In order to introduce the second regularization based combination approach AANE,
we first define a simple linear combination model AAE (shown in Formula 10), which
incorporates content information with AA. AANE then incorporates both content and
global network information with AA. The objective function in AANE is defined
in Formula (11). The closed-form solution of Formula (11) is Formula (12). The power
iteration method can also be used for AANE to optimize the objective function.
(10)
‖ ‖
‖ ‖
(11)
Where,
(12)
4 Dataset and Evaluation Design
The dataset used in this study contains 151,165 ACM hosted conference papers that
were published between 2000 and 2011 in the ACM Digital Library. Each paper in
the dataset includes a title and an abstract. The authors of these papers were disam-
biguated using the ACM author identifiers (In the ACM Digital Library, each author
is assigned a unique identifier number). In total, there are 209,592 unique authors.
Coauthor relations are extracted to create a coauthor network. A link between two
authors is added if they co-published at least one paper.
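The network construction described above amounts to linking every pair of authors who share at least one paper. A minimal sketch, assuming each paper is given as a list of author identifiers (the function name `build_coauthor_network` is illustrative):

```python
from itertools import combinations

def build_coauthor_network(papers):
    """Map each author to the set of authors they have published with."""
    neighbors = {}
    for authors in papers:
        unique_authors = set(authors)
        for u in unique_authors:
            # Authors of single-author papers still appear as isolated nodes.
            neighbors.setdefault(u, set())
        for u, v in combinations(unique_authors, 2):
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors

# Toy author lists (hypothetical identifiers).
network = build_coauthor_network([["a", "b", "c"], ["c", "d"], ["e"]])
```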
The dataset is divided into three parts according to publishing time for evaluation:
T1 = [2000, 2003], T2 = [2004, 2007] and T3 = [2008, 2011]. There are 3,760 papers in
T2 and 5,914 papers in T3 that have junior researchers as the first author. These
papers were selected for evaluation. T2 is the testing set when T1 is used as the
training set, while T2 is the training set when T3 is used as the testing set. The
two datasets used for evaluation are therefore named T1-T2 and T2-T3.
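The time-based split can be expressed as a simple bucketing by publication year. The sketch below assumes each paper carries a `year` field (a hypothetical field name) and uses the three periods given in the text.

```python
PERIODS = {"T1": (2000, 2003), "T2": (2004, 2007), "T3": (2008, 2011)}

def split_by_period(papers):
    """Group papers into T1, T2 and T3 by their publication year."""
    buckets = {name: [] for name in PERIODS}
    for paper in papers:
        for name, (start, end) in PERIODS.items():
            if start <= paper["year"] <= end:
                buckets[name].append(paper)
    return buckets

# Toy papers (hypothetical data).
toy_papers = [{"id": 1, "year": 2001}, {"id": 2, "year": 2004},
              {"id": 3, "year": 2011}, {"id": 4, "year": 2007}]
parts = split_by_period(toy_papers)
# The two evaluation settings pair a training period with the next period.
settings = [("T1", "T2"), ("T2", "T3")]
```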
Two evaluation metrics were used. The first metric is the accuracy within the top-10
positions (WTP), which examines whether the correct coauthor is ranked within the
top-10 positions. However, the exact ranking position is lost in this case: if two
algorithms both rank the correct coauthor within the top-10 positions, we cannot
distinguish their performance using WTP. Therefore, a second evaluation metric, the
mean reciprocal rank (MRR) [22], was also used as it reflects the exact ranking position.
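Both metrics are easy to compute from a ranked candidate list. The sketch below is one possible reading, assuming each test case consists of a ranked list and a set of true coauthors, that a case counts as a WTP hit if any true coauthor appears in the top 10, and that the reciprocal rank uses the first true coauthor found; the paper does not state how cases with several true coauthors are handled.

```python
def within_top_k(ranked, relevant, k=10):
    """1.0 if any true coauthor appears in the first k positions, else 0.0."""
    return 1.0 if any(c in relevant for c in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / position of the first true coauthor, or 0.0 if none is ranked."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / position
    return 0.0

def evaluate(cases, k=10):
    """Average WTP and MRR over (ranked_list, true_coauthors) pairs."""
    n = len(cases)
    wtp = sum(within_top_k(r, rel, k) for r, rel in cases) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in cases) / n
    return wtp, mrr

# Toy cases (hypothetical data): hits at rank 2 and rank 1, then a miss.
toy_cases = [
    (["p", "q", "r"], {"q"}),
    (["p", "q", "r"], {"p"}),
    (["p", "q", "r"], {"z"}),
]
wtp, mrr = evaluate(toy_cases)
```

The first two cases count equally for WTP, but MRR rewards the second one more because the true coauthor is ranked first.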
5 Result Analysis and Discussion
5.1 Parameter Selection
AA and ES were implemented directly as these two algorithms have no explicit
parameters that need to be tuned. For the other algorithms, parameters were tuned
and the setting with the best performance was selected. When the performances on WTP
and MRR conflicted, the parameter with the better performance on WTP was chosen.
For the user-based Collaborative Filtering (CF) algorithm, as shown in Formula
(4), we tried different values of $k$ (1, 3, 5, 7 and 9), and finally chose 5 because
it has the best performance in terms of both MRR and WTP. This means that the 5
nearest neighbors were selected as the similar users. For the Katz index, we follow
the Gauss-Southwell algorithm [8]. A set of damping factor values was compared,
including 0.1, 0.05, 0.005 and 0.0005, and the one with the best performance on both
MRR and WTP was selected. In AANL, the weight of the Katz score is set to 0.95
because it gives the best performance on both WTP and MRR.
For the remaining three models AAN, AAE and AANE, the combination parameters range
from 0 to 1. We tried values from 0 to 1 with a step of 0.1 and chose, separately for
T1-T2 and T2-T3, the setting with the best performance for each model. In AAE, the
ES scores are usually small; therefore, we heuristically multiply them by 1000 so
that they can be combined with the other scores.
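The tuning procedure described in this section is a plain grid search with WTP as the primary criterion. A minimal sketch, assuming a hypothetical callback `evaluate(param)` that runs a model with the given parameter and returns a `(wtp, mrr)` tuple:

```python
def grid_search(evaluate, step=0.1):
    """Return the parameter in [0, 1] with the best (WTP, MRR) result."""
    n_steps = int(round(1.0 / step))
    candidates = [round(i * step, 10) for i in range(n_steps + 1)]
    # Tuples compare element by element, so WTP decides first and MRR
    # only breaks ties, mirroring the rule that WTP wins on conflict.
    return max(candidates, key=evaluate)

def toy_evaluate(param):
    """Made-up scores for illustration only; not results from the paper."""
    wtp = 0.10 if param in (0.3, 0.7) else 0.05
    mrr = param / 10
    return (wtp, mrr)

best = grid_search(toy_evaluate)
```

In the toy scores two settings tie on WTP, and the one with the higher MRR is returned.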
5.2 Comparative Evaluation of Eight Models
The result analysis on each metric consists of two parts: a bar chart showing how
each model performed and a statistical test revealing the significance of the
experimental results. The non-parametric Wilcoxon signed-rank test was used since
the normality assumption was not satisfied. The following results compare eight
models: AA (Formula 1), Katz (Formula 2), ES (Formula 3), CF (Formula 4), AAE
(Formula 10), AANL (Formula 9), AAN (Formula 8) and AANE (Formula 12).
The evaluation results on WTP are shown in Figure 1 and the results on MRR are shown
in Figure 2. We found that the four proposed hybrid models (AAE, AANL, AAN and
AANE) are all significantly better than the single feature based models (ES, CF, AA
and Katz) on both WTP and MRR. This suggests that different features reveal
different aspects of the data, and combining them can improve the performance.
Previous studies considered either only the local network feature or only the global
network feature. The fact that AAN and AANL perform significantly better than
both AA and Katz indicates that combining the local and global network features
can improve the prediction accuracy. We also found that the regularization
based model AAN is significantly better than the linear combination model AANL. This
indicates that the regularization based approach is better suited to combining
multiple features than the simple linear combination. Among all eight
models, AANE performs the best. This indicates that incorporating all three features
together using the regularization based approach produces the best prediction accuracy.
Fig. 1. WTP evaluation with standard errors (bar chart of ES, CF, AA, Katz, AAE, AANL, AAN and AANE on T1-T2 and T2-T3)
Fig. 2. MRR evaluation with standard errors (bar chart of ES, CF, AA, Katz, AAE, AANL, AAN and AANE on T1-T2 and T2-T3)
Some previous works found that combining content and network features improves
the prediction. In our results, AAE is significantly better than the content only
model (ES) and the network only models (CF, AA and Katz), which confirms the previous
findings. In addition, we found that combining the local network feature and the
global network feature helps more than combining the local network feature with the
content feature. This is supported by the fact that AAN and AANL are significantly
better than AAE on both WTP and MRR. The reason that AANL is better than AAE
only on WTP is that MRR measures the exact rank positions while using content
information can avoid ranking the right candidate in extremely low positions. We also
found that the pure network feature based methods, i.e. CF, AA and Katz, are
significantly better (p<0.001) than the pure content based method ES in terms of WTP.
Content matching focuses on finding potential coauthors by matching topic interests,
but people are unlikely to form coauthor relationships if they cannot reach
each other in the network even though they share similar topic interests. However,
ES seems to be superior to CF on MRR, which may suggest that although it is unable
to rank the right candidates in the top positions, it is also unlikely to rank the
right candidates in extremely low positions.
AA and Katz show conflicting performance on WTP and MRR. We found that in the
T1-T2 dataset, Katz is significantly better than AA on WTP, while it seems to be
worse than AA on MRR in T2-T3. We think that AA suffers from the data
sparseness problem because it only considers the local network feature, whereas
Katz, which only includes the global network feature, introduces considerable noise.
6 Conclusion
In this paper, we look into the coauthor prediction problem for junior researchers,
who were largely ignored in previous works. The global network, the local network
and the content-based features were found to be useful in previous works on link
prediction. We proposed two regularization based models to combine multiple features
and optimize them simultaneously. Compared with the four baseline models that each
consider only a single feature, our proposed models performed significantly better in
prediction accuracy. A particularly interesting finding is that propagating features
from the local network to the global network improves the performance significantly
compared with the model that combines content and local network features. This
indicates that although many previous works focused on combining the content feature
and the local network features, they did not take full advantage of the network
features because they did not take the global network feature into account. Most
importantly, the results show that our proposed regularization approach is better
than simple linear combination and can be easily extended to combine more
features. As a next step, we will further explore propagation methods for combining
multiple features, such as random walk or belief propagation.
References
1. Balog K, Azzopardi L, de Rijke M (2006) Formal models for expert finding in enterprise
corpora. Paper presented at the Proceedings of the 29th annual international ACM SIGIR
conference on Research and development in information retrieval.
2. Breese JS, Heckerman D, Kadie C (1998) Empirical analysis of predictive algorithms for
collaborative filtering. Paper presented at the Proceedings of the Fourteenth conference on
Uncertainty in artificial intelligence, Madison, Wisconsin.
3. Wang C, Satuluri V, Parthasarathy S Local Probabilistic Models for Link Prediction.
In: Seventh IEEE International Conference on Data Mining, 2008.
4. Chen H-H, Gou L, Zhang X, Giles CL (2011) CollabSeer: a search engine for
collaboration discovery. Paper presented at the Proceedings of the 11th annual
international ACM/IEEE joint conference on Digital libraries, Ottawa, Ontario, Canada.
5. Chen H-H, Gou L, Zhang X, Giles CL (2012) Discovering missing links in networks using
vertex similarity measures. Paper presented at the Proceedings of the 27th Annual ACM
Symposium on Applied Computing, Trento, Italy.
6. Craswell N, de Vries AP, Soboroff I Overview of the TREC-2005 enterprise track. In:
Proceedings of the 14th Text REtrieval Conference, 2005.
7. Deng H, Han J, Lyu MR, King I (2012) Modeling and exploiting heterogeneous
bibliographic networks for expertise ranking. Paper presented at the Proceedings of the
12th ACM/IEEE-CS joint conference on Digital Libraries, Washington, DC, USA.
8. Bonchi F, Esfandiar P, Gleich DF, Greif C, Lakshmanan LVS (2011)
Fast Matrix Computations for Pairwise and Columnwise Commute Times and Katz
Scores. Internet Mathematics 8 (1-2)
9. Katz L (1953) A new status index derived from sociometric analysis. Psychometrika 18
(1):39-43
10. Adamic LA, Adar E (2002) Friends and Neighbors on the Web. Social Networks 25:211-230
11. Lee D, Brusilovsky P, Schleyer T (2011) Recommending Future Collaborators using
Social Features and MeSH terms. Paper presented at the Proceedings of the 74th Annual
Meeting of the American Society for Information Science and Technology.
12. Liben-Nowell D, Kleinberg J (2003) The link prediction problem for social networks.
Paper presented at the Proceedings of the twelfth international conference on Information
and knowledge management, New Orleans, LA, USA.
13. Lotka AJ (1926) The frequency distribution of scientific productivity. Journal of the
Washington Academy of Sciences 16 (12):317–324
14. Ma H, Zhou D, Liu C, Lyu MR, King I (2011) Recommender systems with social
regularization. Paper presented at the Proceedings of the fourth ACM international
conference on Web search and data mining, Hong Kong, China.
15. Shang M-S, Lü L, Zeng W, Zhang Y-C, Zhou T (2009) Relevance is
more significant than correlation: Information filtering on sparse data. EPL 88 (6)
16. Al Hasan M, Chaoji V, Salem S, Zaki M Link Prediction Using Supervised
Learning. In: Proc. of SDM 06 workshop on Link Analysis, Counterterrorism and
Security, 2006.
17. Chaiwanarom P, Ichise R, Lursinsap C (2010) Finding potential research
collaborators in four degrees of separation. Paper presented at the ADMA.
18. Kahn RL, Prager DJ (1994) Interdisciplinary collaborations are a scientific and social
imperative. The Scientist
19. Cai X, Bain M, Krzywicki A, Wobcke W, Kim YS, Compton P,
Mahidadia A (2011) Collaborative Filtering for People to People Recommendation in
Social Networks. LNCS 6464:476-485
20. Sun Y, Barber R, Gupta M, Aggarwal CC, Han J Co-Author Relationship
Prediction in Heterogeneous Bibliographic Networks. In: International Conference on
Advances in Social Networks Analysis and Mining, 2011. pp 121-128