[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...

Predicting Social Network Measures using Machine Learning Approach

Radosław Michalski, Przemysław Kazienko, Dawid Król
Institute of Informatics, Wrocław University of Technology, Wrocław, Poland


Abstract—The link prediction problem in social networks, defined as the task of predicting whether a link between two particular nodes will appear in the future, is still a broadly researched topic in the field of social network analysis. This paper, however, addresses a different but related problem: instead of forecasting individual links, it predicts the values of key network measures, which is a more time-saving approach. Two machine learning techniques were examined: time series forecasting and classification. Both were tested on two real-life social network datasets.

Keywords-social network, social network analysis, social network measures, time series forecasting, classification

I. INTRODUCTION

The problem of predicting the future in social networks may be considered crucial for the owners and managers of these networks. Typical questions about the dynamics of a social network are: how likely is it that the network will grow dynamically or collapse instantly, what will the node degree distribution look like, or how many connected components will exist in the network? Answering them usually requires two steps: first, predicting the future image of the network, that is, all of its nodes and links, and second, analyzing the predicted network by calculating all the overall measures needed. Both steps are time consuming, which is why a different approach is proposed in this paper: using machine learning tools to predict how particular network measures will change in the following period. In that case there is no need to obtain a precise image of all network components, so the resources needed for calculation are significantly reduced, but the question remains whether the results are still valuable.

II. RELATED WORK

The classic link prediction problem in social networks was originally defined almost ten years ago by Liben-Nowell and Kleinberg [1]. In the original formulation, the task was to find the probability that a link between two nodes will exist in the future. Generally, link prediction can be seen as the problem of completing an adjacency matrix that represents the structure of the network at time t+1, given previous snapshots of the network. In that case only the values of the matrix are predicted, not the network as a whole, which narrows the problem of whole-network prediction but still leaves a hard task to solve.

One of the standard approaches to the link prediction problem is to regard it as a binary classification of the elements of the adjacency matrix [2]. A variety of models were built and evaluated to solve it [3]. Some were based on structural properties of the graph [1], [4], others used node attribute information [5], and others used both attributes and structural features [3]. Another approach was presented in [6], where the authors used graph evolution rules (GER) to predict links not only between existing nodes but also between old and new nodes.

The authors of this paper propose not to predict the existence of a link in the next snapshot of the network, but to bypass that problem for a specific reason: we are not interested in a particular view of the network but in knowledge about chosen global structural measures such as centrality, link count, density, average distance and so on [7]. The authors therefore decided to use classical machine learning tools and techniques to predict the precise values of those measures or to classify them. To the authors' knowledge, there is no research directly addressing the prediction of measures in social networks, although a lot of research exists on time-series prediction and classification for other kinds of data. Thus this paper checks whether it is possible to correctly and efficiently find the values of chosen social network measures or, if that fails, to approximate their changes.

III. PROBLEM DEFINITION. THE CONCEPT OF THE SOLUTION

To define the problem more formally, the concept of the temporal social network [8] should be defined first. For the purposes of this article, a temporal social network TSN is a list of consecutive social networks SN(V,E) created in particular time windows, where V is the set of vertices and E is the set of directed edges <x,y>, x,y∈V:

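\[
\begin{aligned}
TSN &= \langle T_1, T_2, \ldots, T_m \rangle, \quad m \in \mathbb{N},\\
T_i &= SN_i(V_i, E_i), \quad i = 1, 2, \ldots, m,\\
E_i &= \langle x, y \rangle : x, y \in V_i, \quad i = 1, 2, \ldots, m
\end{aligned}
\tag{1}
\]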

Then the problem of predicting the values of social network measures may be defined as the task of finding the value of a particular measure Mk, or the values of selected measures M = <M1, M2, …, Mn>, in the timeframe tm+1 for a given temporal network TSN as defined in (1).

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE

DOI 10.1109/ASONAM.2012.183


IV. EXPERIMENT SETUP

A. Datasets

The authors examined two real-life datasets, creating four temporal social networks (TSNs) from them. These are described below.

1) University Social Network

The university social network was built from e-mail server logs obtained from Wrocław University of Technology. The whole dataset consists of e-mails sent from February 2006 to October 2007 and comprises 5,845 nodes (distinct e-mail addresses of university employees) and 149,344 edges (internal e-mails sent).

From these data two temporal social networks were built: the first with timeframes of fifteen days (40 timeframes) and the second with timeframes of seven days (86 timeframes). In both cases the time windows were non-overlapping. These TSNs were labeled TSN11 and TSN12, respectively.
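The construction of non-overlapping timeframes can be illustrated with a minimal Python sketch. The function name `build_tsn`, the event layout (timestamp, sender, recipient) and the sample data are assumptions made for illustration only; they are not taken from the original experiment.

```python
from datetime import datetime, timedelta


def build_tsn(events, start, window_days):
    """Split (timestamp, sender, recipient) events into non-overlapping windows.

    Returns a list of snapshots; each snapshot is a (V, E) pair where V is a
    set of nodes and E is a set of directed edges (x, y). Windows without any
    e-mail are kept as empty snapshots so that indices stay aligned with time.
    """
    width = timedelta(days=window_days)
    buckets = {}
    for ts, x, y in events:
        if ts < start:
            continue  # ignore events before the observed period
        index = (ts - start) // width  # zero-based window index
        nodes, edges = buckets.setdefault(index, (set(), set()))
        nodes.update((x, y))
        edges.add((x, y))
    if not buckets:
        return []
    return [buckets.get(i, (set(), set())) for i in range(max(buckets) + 1)]


# Hypothetical sample data, seven-day windows starting on 1 February 2006.
events = [
    (datetime(2006, 2, 1), "a", "b"),
    (datetime(2006, 2, 3), "b", "c"),
    (datetime(2006, 2, 9), "a", "c"),
    (datetime(2006, 2, 23), "c", "a"),
]
tsn = build_tsn(events, datetime(2006, 2, 1), 7)
```

Changing `window_days` from 7 to 15 yields the coarser variant of the same network; the underlying events stay identical.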

The authors decided to build a directed graph in which the weight of the edge from node i to node j is:

\[
w_{ij} = \frac{\sum e_{ij}}{\sum e_{i}} \tag{2}
\]

where \sum e_{ij} is the number of e-mails sent by node i to node j and \sum e_{i} is the total number of e-mails sent by member i.
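As a small illustration of the weight in (2), the following Python sketch counts e-mails per sender and per ordered pair and divides the two. The function name `edge_weights` and the sample list are hypothetical.

```python
from collections import Counter


def edge_weights(emails):
    """Compute the weight of every directed edge (i, j).

    emails is an iterable of (sender, recipient) pairs, one pair per e-mail.
    The weight is the number of e-mails from i to j divided by the total
    number of e-mails sent by i.
    """
    emails = list(emails)
    pair_counts = Counter(emails)
    sent_totals = Counter(sender for sender, _ in emails)
    return {(i, j): count / sent_totals[i] for (i, j), count in pair_counts.items()}


# Hypothetical sample: "a" sends three e-mails, "b" sends one.
weights = edge_weights([("a", "b"), ("a", "b"), ("a", "c"), ("b", "a")])
```

With this definition the weights of all edges leaving one node sum to 1, so the weight expresses the share of a member's outgoing correspondence directed at a given recipient.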

2) Privately Held Company Social Network

Another dataset was obtained from a privately held manufacturing company and also consists of e-mail server logs. The correspondence was exchanged between company employees from January 2010 to September 2010. The network consists of 177 nodes and 5,166 edges. This network was also split in two ways to build the TSNs: into seven-day periods (38 timeframes, TSN21) and three-day periods (90 timeframes, TSN22).

B. Measures

For each Ti in every TSN, the measures most important from the global network point of view were calculated. They are presented in Table I and described in more detail in [7]. The ORA software was used to calculate the metrics [10].

TABLE I. METRICS CALCULATED

Measure label Measure name

M1 Density

M2 Link count

M3 Network centralization in-degree

M4 Network centralization out-degree

M5 Average distance

M6 Network centralization betweenness

M7 Node count

M8 Transitivity

M9 Density clustering coefficient
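Three of the measures in Table I can be sketched directly from a snapshot's node and edge sets. The sketch below assumes the common definition of density for a directed graph without self-loops; the exact definitions used by ORA may differ, and the function name and dictionary keys are illustrative only.

```python
def basic_measures(nodes, edges):
    """Return density (M1), link count (M2) and node count (M7) of a snapshot.

    Density is assumed to be |E| / (|V| * (|V| - 1)), the share of possible
    directed links that are present.
    """
    n = len(nodes)
    links = len(edges)
    possible = n * (n - 1)
    density = links / possible if possible else 0.0
    return {"M1": density, "M2": links, "M7": n}


# Hypothetical snapshot: a directed triangle.
triangle = basic_measures({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("c", "a")})
```

Computing such values for every timeframe of a TSN yields one time series per measure, which is the input to the prediction methods described next.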

C. Time-Series Forecasting Approach

The goal of the first part of the experiment was to forecast the precise value of a network measure Mi by analysing the available time series. To make the results more representative, the length of the learning set was limited to thirty and twenty-five elements, respectively, and the learning sets built this way were moved across all TSNs; the results were then averaged for every learning set type. Two parameters were measured to evaluate the base learners: prediction error (mean absolute percentage error) and prediction time. Table II presents the forecasters used in the experiment.

TABLE II. TIME-SERIES FORECASTERS EVALUATED

Base learner label Base learner name

BL1 Linear regression

BL2 Multilayer perceptron

BL3 Additive regression

BL4 M5P

BL5 Decision stump

BL6 SVM
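The moving learning set and the error measure can be sketched as follows. This is a simplified stand-in, not the WEKA setup used in the experiment: it fits a straight line to the time index by ordinary least squares, whereas WEKA's forecasters work on lagged variables. The function names and sample series are hypothetical, and the sketch assumes a window of at least two elements and non-zero actual values.

```python
def fit_line(values):
    """Ordinary least squares fit of values against their index 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rolling_mape(series, window):
    """Slide a learning set of `window` elements across the series.

    For every position the next element is forecast from the fitted line,
    and the mean absolute percentage error is returned as a fraction.
    """
    errors = []
    for start in range(len(series) - window):
        train = series[start:start + window]
        actual = series[start + window]
        slope, intercept = fit_line(train)
        forecast = intercept + slope * window  # one step beyond the window
        errors.append(abs((actual - forecast) / actual))
    return sum(errors) / len(errors)


# Hypothetical series: the first is exactly linear, the second jumps at the end.
linear_error = rolling_mape([10, 12, 14, 16, 18, 20], window=3)
jump_error = rolling_mape([1, 1, 1, 2], window=3)
```

Averaging over all window positions mirrors the way results are averaged for each learning set length in the experiment.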

D. Regular Classification Approach

After conducting time-series forecasting, the authors decided to perform classification on the same TSNs. The goal was the same, but instead of forecasting precise numeric values of the measures, classes were defined into which the measure values could fall, based on the average change of the value of a particular measure, recalculated for every TSN. The method for creating the classes and for further evaluation is as follows:

− for each measure in every TSN, the average value of the change of the measure, Aki, is calculated,

− for each Aki two groups of classes are defined, GC1 with eleven classes and GC2 with three classes, as presented in Table III,

− two types of learning sets were created, each containing five elements: the first one (L1) contained only five values of the evaluated measure, while the second one (L2) contained a matrix of the values of all measures,

− for every Li the classifier was trained and then evaluated. Results were averaged for every Li, because Li was moved across the TSN.

Classification was performed using the algorithms presented in Table IV. 10-fold cross-validation was used for validation, as it is the most commonly used method [11].
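The steps above can be sketched in Python under several stated assumptions: Aki is taken to be the mean absolute change between consecutive timeframes, the label of a learning set is the class of the change that follows it, and the class boundaries follow the multiples of Aki listed in Table III, with the last class collecting every change of at least 2.0*Aki. All function names and sample values are hypothetical, and the sketch requires a non-zero average change.

```python
def average_change(series):
    """Mean absolute change between consecutive timeframes (assumed Aki)."""
    changes = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(changes) / len(changes)


def class_index(change, avg_change, classes_per_avg):
    """Zero-based class of |change| measured in units of avg_change.

    classes_per_avg=5 gives eleven classes of width 0.2*Aki,
    classes_per_avg=1 gives three classes of width 1.0*Aki.
    """
    last = 2 * classes_per_avg  # open-ended class for changes >= 2.0*Aki
    ratio = abs(change) / avg_change
    return min(int(ratio * classes_per_avg), last)


def labelled_windows(measures, target, classes_per_avg, length=5):
    """Build (L1, L2, label) triples by sliding a window over the TSN.

    measures maps a measure label to its list of values per timeframe.
    L1 holds `length` values of the target measure; L2 holds the same
    timeframes for all measures, one row per timeframe.
    """
    series = measures[target]
    avg = average_change(series)
    names = sorted(measures)
    rows = []
    for start in range(len(series) - length):
        stop = start + length
        l1 = series[start:stop]
        l2 = [[measures[name][t] for name in names] for t in range(start, stop)]
        label = class_index(series[stop] - series[stop - 1], avg, classes_per_avg)
        rows.append((l1, l2, label))
    return rows


# Hypothetical values of two measures over seven timeframes.
measures = {
    "M2": [10, 12, 11, 15, 15, 16, 20],
    "M7": [5, 5, 6, 6, 7, 7, 8],
}
coarse = labelled_windows(measures, "M2", classes_per_avg=1)
fine = labelled_windows(measures, "M2", classes_per_avg=5)
```

The resulting triples could then be passed to any classifier; in this sketch the average change is computed over the whole series, which is a simplification.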


TABLE III. GROUPS OF CLASSES DEFINITION

Group Values

GC1 [0*Aki, 0.2*Aki)
GC1 [0.2*Aki, 0.4*Aki)
GC1 [0.4*Aki, 0.6*Aki)
GC1 [0.6*Aki, 0.8*Aki)
GC1 [0.8*Aki, 1.0*Aki)
GC1 [1.0*Aki, 1.2*Aki)
GC1 [1.2*Aki, 1.4*Aki)
GC1 [1.4*Aki, 1.6*Aki)
GC1 [1.6*Aki, 1.8*Aki)
GC1 [1.8*Aki, 2.0*Aki)
GC1 ≥ 2.0*Aki

GC2 [0*Aki, 1.0*Aki)
GC2 [1.0*Aki, 2.0*Aki)
GC2 ≥ 2.0*Aki

TABLE IV. CLASSIFICATION ALGORITHMS EVALUATED

Algorithm label Algorithm name

C1 Naïve Bayes

C2 J48 decision tree

C3 Decision stump

C4 SMO

C5 Multilayer perceptron

C6 SVM

For both time-series forecasting and classification, the experiments were conducted in the WEKA Data Mining Software [9].

V. RESULTS

A. Time-Series Forecasting

Results for time-series forecasting are presented in Figure 1. Errors were calculated as mean absolute percentage error and time is given in milliseconds. In most cases a longer time-series period provides better forecasting results. It should also be noted that slower algorithms do not necessarily provide better results. The length of the time window used to create the TSNs is also important for forecasting, especially for small windows (TSN21 vs. TSN22).

Figure 1. Comparison of forecasting accuracy for chosen measures and algorithms

B. Classification

Results for classification are presented in Figure 2. The percentage of correctly classified elements (accuracy) and the classification time (in milliseconds) were measured, but because classification times were similar, only the accuracy results are presented. As described in Section IV, experiments were also performed with just a single measure in the learning set; however, those results were generally unacceptable. Overall, it seems easier to predict crucial changes than slight ones.

[Figure 1 plot. Vertical axis: Error (MAPE), 0 to 2. Horizontal axis: Measures M1 to M9. Series: BL1, TSN21, 30 el.; BL5, TSN21, 30 el.; BL1, TSN22, 30 el.; BL5, TSN22, 30 el.; BL1, TSN21, 25 el.; BL5, TSN11, 25 el.; BL1, TSN22, 25 el.; BL5, TSN22, 25 el.]

Figure 2. Comparison of classification accuracy using two datasets, two classification algorithms and two ranges of classes

VI. CONCLUSIONS AND FUTURE WORK

The purpose of the paper was to show how suitable various machine learning approaches are for predicting global social network measures, in particular time-series forecasting and regular classification. To summarize the results briefly, the accuracy is far from expected; however, some measures, e.g. average distance or total link count, are predicted with better results than others. Generally, the results for those measures are quite stable across all evaluated algorithms, which leads to the following conclusion: there is no clear winner among the algorithms. Hence, it may be reasonable to use the faster ones, since the results will remain similar.

Two scenarios were studied: (i) eleven more detailed classes reflecting changes more precisely and (ii) only three classes corresponding to changes in the measures more roughly. The experimental results for the former were rather unsatisfactory, with accuracy below 25%. For three classes, a simpler problem, the results were much better, with accuracy of about 60%. This suggests choosing a small number of classes that represent only crucial changes in the measures, rather than classes corresponding to smaller, detailed differences in their values.

As future work, it would be interesting to find the correlation between changes in network measures and general network changes, e.g. represented as a graph edit distance (GED) [12] or any other social network evolution measure [13]. By calculating the distance between two snapshots of a network, it could be checked whether there is some regularity in the changes of chosen network measures that correlates with the GED results.

ACKNOWLEDGMENT

The work was partially supported by a fellowship co-financed by the European Union within the European Social Fund, and by the Polish Ministry of Science and Higher Education, the research project 2010-13. Calculations were carried out at the Wroclaw Centre for Networking and Supercomputing (http://www.wcss.wroc.pl), grant No 177.

REFERENCES

[1] D. Liben-Nowell and J. Kleinberg, "The link prediction problem for social networks," in Proceedings of the twelfth international conference on Information and knowledge management, ser. CIKM '03. New York, NY, USA: ACM, 2003, pp. 556-559.

[2] H. Kashima, T. Kato, Y. Yamanishi, M. Sugiyama, and K. Tsuda, "Link propagation: A fast semi-supervised learning algorithm for link prediction," in Proceedings of the 2009 SIAM International Conference on Data Mining (SDM), 2009, pp. 1099-1110.

[3] L. Getoor and C. P. Diehl, "Link mining: a survey," SIGKDD Explor. Newsl., vol. 7, no. 2, pp. 3-12, Dec. 2005.

[4] G. Rossetti, M. Berlingerio, F. Giannotti, "Scalable Link Prediction on Multidimensional Networks," Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pp.979-986, Dec. 2011.

[5] A. Popescul, R. Popescul, and L. H. Ungar, "Statistical relational learning for link prediction," in Workshop on Learning Statistical Models from Relational Data at the International Joint Conference on Artificial Intelligence, 2003.

[6] B. Bringmann, M. Berlingerio, F. Bonchi, and A. Gionis, "Learning and predicting the evolution of social networks," Intelligent Systems, IEEE, vol. 25, no. 4, pp. 26-35, Jul. 2010.

[7] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press, 1997.

[8] D. Kempe, J. Kleinberg, and A. Kumar, "Connectivity and inference problems for temporal networks," pp. 504-513, 2000.

[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations, vol. 11, no. 1, 2009.

[10] K. M. Carley, "Ora: Organization risk analyzer," CMU CASOS, Tech. Rep. January, 2004.

[11] R. Kohavi, "A study of Cross-Validation and bootstrap for accuracy estimation and model selection," in IJCAI, 1995, pp. 1137-1145.

[12] A. Sanfeliu and K.-S. Fu, "A distance measure between attributed relational graphs for pattern recognition." IEEE transactions on systems, man, and cybernetics, vol. 13, no. 3, pp. 353-362, 1983.

[13] R. Michalski, S. Palus, P. Bródka, P. Kazienko, and K. Juszczyszyn, “Modelling social network evolution,” in Social Informatics, ser. Lecture Notes in Computer Science, A. Datta, S. Shulman, B. Zheng, S.-D. Lin, A. Sun, and E.-P. Lim, Eds. Springer, Berlin / Heidelberg, 2011, vol. 6984, pp. 283–286.

[Figure 2 plot. Vertical axis: Accuracy, 0 to 1. Horizontal axis: Measures M1, M4, M5. Series: GC1, C1, TSN12; GC1, C4, TSN12; GC1, C1, TSN21; GC1, C4, TSN21; GC2, C1, TSN12; GC2, C4, TSN12; GC2, C1, TSN21; GC2, C4, TSN21.]
