# [ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...


Predicting Social Network Measures using Machine Learning Approach

Radosław Michalski, Przemysław Kazienko, Dawid Król

Institute of Informatics, Wrocław University of Technology, Wrocław, Poland

radoslaw.michalski@pwr.wroc.pl, kazienko@pwr.wroc.pl, dawid.krol.pwr@gmail.com

Abstract: The link prediction problem in social networks, defined as the task of predicting whether a link between two particular nodes will appear in the future, is still a broadly researched topic in the field of social network analysis. This paper, however, addresses a different but related problem: instead of forecasting individual links, it predicts the values of key network measures, which is a more time-saving approach. Two machine learning techniques were examined: time series forecasting and classification. Both were tested on two real-life social network datasets.

Keywords: social network, social network analysis, social network measures, time series forecasting, classification

I. INTRODUCTION

Predicting the future of a social network may be considered crucial for its owners and managers. Typical questions about the dynamics of a social network are: how likely is the network to grow rapidly or collapse suddenly, what will the node degree distribution look like, and how many connected components will exist in the network? Answering them usually requires two steps: first, predicting the future image of the network, that is, all of its nodes and links; and second, analyzing the predicted network by calculating all the global measures needed. Both steps are time-consuming, which is why a different approach is proposed in this paper: using machine learning tools to predict how particular network measures will change in the following period. In that case there is no need to obtain a precise image of all network components, so the resources needed for calculation are significantly reduced; the question that remains is whether the results are still valuable.

II. RELATED WORK

The classic link prediction problem in social networks was originally defined almost ten years ago by Liben-Nowell and Kleinberg [1]. In the original formulation, the goal was to find the probability that a link between two nodes will exist in the future. Generally, link prediction can be seen as the problem of completing the adjacency matrix that represents the structure of the network at time t+1, given previous snapshots of the network. In that case only the values of the matrix are predicted rather than the network as a whole, which narrows the problem of whole-network prediction but still leaves a hard task to solve.

One of the standard approaches to the link prediction problem is to regard it as a binary classification of the elements of the adjacency matrix [2]. A variety of models were built and evaluated to solve the problem [3]. Some were based on structural properties of the graph [1], [4], others used node attribute information [5], and still others used both attributes and structural features [3]. Another approach was presented in [6], where the authors used graph evolution rules (GER) to predict links not only between existing nodes but also between old and new nodes.

The authors of this paper propose not to predict the existence of a link in the next snapshot of the network, but to bypass that problem altogether, for a precise reason: we are not interested in a particular view of the network but in discovering knowledge about chosen global structural measures of the network, such as centrality, link count, density, average distance and so on [7]. To that end, the authors decided to use classical machine learning tools and techniques to predict the precise values of those measures or to classify them. Unfortunately, there is no research directly concerning the prediction of measures in social networks, although a lot of research can be found on time-series prediction and classification for other kinds of data. Thus, this paper checks whether it is possible to correctly and efficiently find the values of chosen social network measures or, if that fails, to approximate their changes.

III. PROBLEM DEFINITION. THE CONCEPT OF THE SOLUTION

To define the problem more formally, the concept of the temporal social network [8] should be defined first. For the purpose of this article, a temporal social network TSN is a list of successive social networks, each created in a particular time window and denoted SN(V, E), where V is the set of vertices and E is the set of directed edges ⟨x, y⟩ with x, y ∈ V:

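$$
\begin{aligned}
TSN &= \langle T_1, T_2, \ldots, T_m \rangle, \quad m \in \mathbb{N}, \\
T_i &= SN(V_i, E_i), \quad i = 1, 2, \ldots, m, \\
E_i &= \{ \langle x, y \rangle : x, y \in V_i \}, \quad i = 1, 2, \ldots, m.
\end{aligned}
\tag{1}
$$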

Then the problem of predicting the values of social network measures may be defined as the task of finding the value of a particular measure $M_k$, or the values of a set of selected measures $M$, in the next time frame $t_{m+1}$ for a given temporal social network TSN as defined in (1).

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE. DOI 10.1109/ASONAM.2012.183


IV. EXPERIMENT SETUP

A. Datasets

The authors examined two real-life datasets, from which four temporal social networks TSN were created. They are described below.

1) University Social Network

The university social network was built from the e-mail server logs of Wrocław University of Technology. The dataset covers e-mails sent from February 2006 to October 2007 and comprises 5,845 nodes (distinct e-mail addresses of university employees) and 149,344 edges (internal e-mails sent).

Two temporal social networks were built from this data: the first with time frames fifteen days long (40 time frames) and the second with time frames seven days long (86 time frames). In both cases the time windows were non-overlapping. These TSNs were labeled TSN11 and TSN12, respectively.
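The windowing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_timeframes` and the `(timestamp, sender, recipient)` event layout are hypothetical, and the first window is assumed to start at the earliest e-mail, which the paper does not specify.

```python
from datetime import datetime, timedelta


def split_into_timeframes(events, window_days):
    """Group (timestamp, sender, recipient) events into consecutive,
    non-overlapping windows of `window_days` days.

    Returns one edge list per time frame; frames without e-mails stay empty.
    Assumption: the first window starts at the earliest timestamp.
    """
    if not events:
        return []
    events = sorted(events, key=lambda e: e[0])
    start = events[0][0]
    width = timedelta(days=window_days)
    # timedelta // timedelta yields an int: the index of the last window.
    n_frames = (events[-1][0] - start) // width + 1
    frames = [[] for _ in range(n_frames)]
    for timestamp, sender, recipient in events:
        frames[(timestamp - start) // width].append((sender, recipient))
    return frames


# Toy log, not taken from the datasets used in the paper.
log = [
    (datetime(2006, 2, 1), "a", "b"),
    (datetime(2006, 2, 10), "b", "c"),
    (datetime(2006, 2, 16), "a", "c"),
    (datetime(2006, 3, 20), "c", "a"),
]
frames_15 = split_into_timeframes(log, window_days=15)
```

Calling the same function with `window_days=7` would produce the finer-grained variant of the network from the same log.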

The authors built a directed graph in which the weight $w_{ij}$ of the edge from node i to node j is defined as follows:

$$
w_{ij} = \frac{e_{ij}}{e_i} \tag{2}
$$

where $e_{ij}$ is the number of e-mails sent by node i to node j and $e_i$ is the total number of e-mails sent by member i.
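As a minimal sketch, and assuming the weight is the share of a sender's e-mails that go to a given recipient (the ratio of the two quantities named above), the edge weights of one time frame could be computed as follows. The function name `edge_weights` and the list-of-pairs input are illustrative choices, not part of the paper.

```python
from collections import Counter


def edge_weights(messages):
    """Return {(i, j): weight} for a list of (sender, recipient) pairs.

    Assumption: weight = (e-mails sent from i to j) / (all e-mails sent by i).
    """
    sent_pair = Counter(messages)                 # e-mails from i to j
    sent_total = Counter(i for i, _ in messages)  # all e-mails sent by i
    return {(i, j): n / sent_total[i] for (i, j), n in sent_pair.items()}


# Toy data: "a" sends three e-mails, two of them to "b".
weights = edge_weights([("a", "b"), ("a", "b"), ("a", "c"), ("b", "a")])
```

Under this assumption the weights of all edges leaving a node sum to one, so the weight expresses how a member distributes their outgoing correspondence rather than its absolute volume.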

2) Privately Held Company Social Network

Another dataset was obtained from a privately held manufacturing company; it also consists of e-mail server logs. The correspondence was exchanged between company employees from January 2010 to September 2010. The network consists of 177 nodes and 5,166 edges. This network was also split in two ways to build the TSNs: into seven-day periods (38 time frames, TSN21) and three-day periods (90 time frames, TSN22).

B. Measures

For each $T_i$ in every TSN, the most important measures from the global network point of view were calculated. They are presented in Table I and described in more detail in [7]. The ORA software was used to calculate the metrics [10].

TABLE I. METRICS CALCULATED

| Measure label | Measure name |
| --- | --- |
| M1 | Density |
| M2 | Link count |
| M3 | Network centralization in-degree |
| M4 | Network centralization out-degree |
| M5 | Average distance |
| M6 | Network centralization betweenness |
| M7 | Node count |
| M8 | Transitivity |
| M9 | Density clustering coefficient |
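Three of the measures in Table I (node count, link count and density) are simple enough to illustrate directly. The sketch below is not the ORA implementation: it assumes a directed graph without self-loops, counts only nodes that appear in at least one edge, and uses the common definition of density as links divided by n(n-1). The function name `basic_measures` is hypothetical.

```python
def basic_measures(edges):
    """Compute node count, link count and density for one time frame.

    Assumptions: directed graph, repeated e-mails collapse into one link,
    self-loops are ignored, density = links / (n * (n - 1)).
    """
    links = {(u, v) for u, v in edges if u != v}
    nodes = {node for link in links for node in link}
    n = len(nodes)
    density = len(links) / (n * (n - 1)) if n > 1 else 0.0
    return {"node_count": n, "link_count": len(links), "density": density}


# Toy frame: three nodes, three distinct directed links out of six possible.
measures = basic_measures([("a", "b"), ("b", "a"), ("a", "c"), ("a", "b")])
```

Applying such a function to every time frame of a TSN yields one time series per measure, which is the input both approaches described in the following sections work on.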

C. Time-Series Forecasting Approach

The goal of the first part of the experiment was to forecast the precise value of a network measure $M_i$ by analysing the available time series. To make the results more representative, the length of the learning set was limited to thirty and twenty-five elements, respectively, and the learning sets built this way were moved across all TSNs; the results were then averaged for each learning set type. Two parameters were measured to evaluate the base learners: prediction error (mean absolute percentage error) and prediction time. Table II presents the forecasters used in the experiment.
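The evaluation scheme described above, a fixed-length learning set moved across the series with the error averaged over all positions, can be sketched as follows. This is an illustration only: it uses a plain least-squares trend line over the time index as a stand-in forecaster, which is an assumption and not necessarily how the linear regression base learner in the paper was configured. The function names are hypothetical.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sliding_forecast_mape(series, window):
    """Move a learning set of `window` values across `series`, forecast the
    next value each time and return the mean absolute percentage error.

    Assumptions: window >= 2 and no actual value equals zero.
    """
    if window < 2 or len(series) <= window:
        raise ValueError("need window >= 2 and a series longer than the window")
    errors = []
    for start in range(len(series) - window):
        train = series[start:start + window]
        actual = series[start + window]
        slope, intercept = fit_line(train)
        predicted = slope * window + intercept  # one step past the learning set
        errors.append(abs((actual - predicted) / actual))
    return 100.0 * sum(errors) / len(errors)


# Toy series, not data from the paper.
mape_linear = sliding_forecast_mape([10, 12, 14, 16, 18, 20], window=3)
mape_jump = sliding_forecast_mape([10, 10, 10, 20], window=3)
```

A perfectly linear series gives an error of zero, while the single jump in the second toy series is missed by half of its value.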

TABLE II. TIME-SERIES FORECASTERS EVALUATED

| Base learner label | Base learner name |
| --- | --- |
| BL1 | Linear regression |
| BL2 | Multilayer perceptron |
| BL3 | Additive regression |
| BL4 | M5P |
| BL5 | Decision stump |
| BL6 | SVM |

D. Regular Classification Approach

After conducting time-series forecasting, the authors decided to perform classification on the same TSNs. The goal was the same; however, instead of forecasting precise numeric values of the measures, classes were defined for each measure into which its values could fall, based on the average change of the value of that measure, recalculated for every TSN. The method for creating the classes and the further evaluation is as follows:

- for each measure in every TSN, the average change of the measure, $A_{ki}$, is calculated,

- for each $A_{ki}$, two groups of classes are defined, GC1 consisting of three classes and GC2 consisting of eleven classes; both are presented in Table III,

- two types of learning sets were created, each containing five elements: the first (L1) contained only five values of the evaluated measure, whereas the second (L2) contained a matrix of the values of all measures,

- for every $L_i$ the classifier was trained and later evaluated.

The results were averaged for every $L_i$, because $L_i$ was moved across the TSN. Classification was performed using the algorithms presented in Table IV. 10-fold cross-validation, the most commonly used method, was used for validation [11].
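The construction of the two learning set types can be sketched as below. Several points are assumptions made for illustration: the average change is taken as the mean absolute difference between consecutive time frames, each example is paired with the signed change to the next time frame (which would then be mapped onto a class using the boundaries from Table III, not reproduced here), and the names `average_change` and `learning_sets` are hypothetical.

```python
def average_change(series):
    """Mean absolute difference between consecutive values.
    Assumption: this is what 'average change of the measure' refers to."""
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(diffs) / len(diffs)


def learning_sets(measures, target, window=5):
    """Build sliding learning sets for one target measure.

    measures: dict mapping a measure name to its values per time frame
              (all lists must have the same length).
    Returns (l1, l2): l1 holds windows of the target measure only, l2 holds
    a matrix with the same window taken from every measure. Each example is
    paired with the change of the target in the following time frame.
    """
    values = measures[target]
    l1, l2 = [], []
    for start in range(len(values) - window):
        stop = start + window
        next_change = values[stop] - values[stop - 1]
        l1.append((values[start:stop], next_change))
        l2.append(([series[start:stop] for series in measures.values()], next_change))
    return l1, l2


# Toy series, not data from the paper.
toy = {"density": [1, 2, 3, 4, 5, 6, 7], "link_count": [10, 20, 30, 40, 50, 60, 70]}
l1, l2 = learning_sets(toy, target="density")
```

The L2 variant gives the classifier the recent history of every measure, so it can exploit dependencies between measures that the single-series L1 variant cannot see.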


TABLE III. GROUPS OF CLASSES DEFINITION

GC1

Class Values
