[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...

8
Classification Analysis in Complex Online Social Networks Using Semantic Web Technologies Marek Opuszko Department of Business Informatics Faculty of Economics and Business Administration Friedrich-Schiller-University of Jena Jena, Germany 07743 Email: [email protected] Telephone: (0049) 3641–943314 Johannes Ruhland Department of Business Informatics Faculty of Economics and Business Administration Friedrich-Schiller-University of Jena Jena, Germany 07743 Email: [email protected] Telephone: (0049) 3641–943310 Abstract—The Semantic Web enables people and computers to interact and exchange information. Based on Semantic Web technologies, different machine learning applications have been designed. Particularly important is the possibility to create complex metadata descriptions for any problem domain, based on pre-defined ontologies. In this paper we evaluate the use of a semantic similarity measure based on pre-defined ontologies as an input for a classification analysis in the context of social network analysis. A link prediction between actors of two real world social networks is performed, which could serve as a recommendation system. The social networks involve different types of relations and nodes. We measure the prediction performance based on a semantic similarity measure as well as traditional approaches. The findings demonstrate that the prediction accuracy based on the semantic similarity is comparable to traditional approaches and shows that data mining on complex social networks using ontology-based metadata can be considered as a very promising approach. I. I NTRODUCTION Todays online social networks like Facebook, Orkut, QZone, etc. connect multiple millions of people and allow its users to share information, express themselves and interact in various ways. The overwhelming size 1 and the organisation of the resulting flow of information is a major challenge of the Web 2.0 [1] and the analysis of social web data. Contemporary so- cial network applications comprise very rich and diverse data reaching from relational data like friend relations or discussion group memberships to profile data like age, gender, etc. Classi- cal social network analysis algorithms usually cannot entirely represent this kind of data without some loss of information [2]. Social networking sites often comprise multiple relations and multiple types of nodes simultaneously. Users often belong to multiple social networks leaving information artifacts on various platforms and applications in an unstructured way. According to Mika [3] the idea of the Semantic Web is to extend unstructured information with machine processable descriptions to provide missing knowledge if required. The Semantic Web frameworks and the corresponding technologies 1 Facebook alone shared around 10% of the world population with 721 million users in November 2011: https://www.facebook.com/notes/facebook- data-team/anatomy-of-facebook/10150388519243859 represent a possible answer to this problem, using rich typed graph models (e.g. RDF 2 ) and schema definition frameworks (RDFS 2 and OWL 2 ). The vision of the Semantic Web, as coined by Berners-Lee and colleagues [4], is a common framework in which data is stored and shared in a machine-processable way. The content of the Semantic Web is represented by formal ontologies, providing shared conceptualizations of specific domains [5]. Emerged as an extension from the WWW, this technology provides a universally usable tool to model any problem domain. Although the Semantic Web focuses on data stored on the web, this also implies the simultaneous representation of network and feature vector data (in other words, a flat file) in one consistent data representation in the context of (social) network analysis. The possibility to represent any problem domain including multiple network relations and multiple network node types in a flexible, formalized way, quickly gained attention in the data mining and machine learning community. The question arose how data, once stored in the Semantic Web format, could be analyzed directly, avoiding a potentially costly transformation process [6] [7]. Soon after its development, the concept of the Semantic Web led to the emergence of new disciplines, for instance Semantic Web Mining [8]. In the history of machine learning a variety of algorithms for different problem domains has been developed, such as clustering, classification and pattern recognition. Still, the incorporation of all background knowledge available, with other words, the best data representation, is a major issue in data mining [9]. Through the rich type graph models, Semantic Web technologies promise to enhance the incorporation of all knowledge available. Besides, a growing number of recent decision making problems have to take into consideration very different kinds of data simultaneously, i.e. (social) network data and ego-centered (feature vector 3 ) data [10]. Usually 2 Semantic Web, W3C, http://www.w3.org/2001/sw/ 3 In traditional social network analysis literature these terms are usually referred to as actor attributes. We will, however, use the term feature vector, as our datasets include different node types and the term actor attributes could be misunderstood as users’ or people’s attributes only. 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 978-0-7695-4799-2/12 $26.00 © 2012 IEEE DOI 10.1109/ASONAM.2012.179 1064 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 978-0-7695-4799-2/12 $26.00 © 2012 IEEE DOI 10.1109/ASONAM.2012.179 1032

Upload: j

Post on 23-Dec-2016

217 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

Classification Analysis in Complex Online SocialNetworks Using Semantic Web Technologies

Marek Opuszko

Department of Business Informatics

Faculty of Economics and Business Administration

Friedrich-Schiller-University of Jena

Jena, Germany 07743

Email: [email protected]

Telephone: (0049) 3641–943314

Johannes Ruhland

Department of Business Informatics

Faculty of Economics and Business Administration

Friedrich-Schiller-University of Jena

Jena, Germany 07743

Email: [email protected]

Telephone: (0049) 3641–943310

Abstract—The Semantic Web enables people and computersto interact and exchange information. Based on Semantic Webtechnologies, different machine learning applications have beendesigned. Particularly important is the possibility to createcomplex metadata descriptions for any problem domain, basedon pre-defined ontologies. In this paper we evaluate the use of asemantic similarity measure based on pre-defined ontologies as aninput for a classification analysis in the context of social networkanalysis. A link prediction between actors of two real world socialnetworks is performed, which could serve as a recommendationsystem. The social networks involve different types of relationsand nodes. We measure the prediction performance based on asemantic similarity measure as well as traditional approaches.The findings demonstrate that the prediction accuracy based onthe semantic similarity is comparable to traditional approachesand shows that data mining on complex social networks usingontology-based metadata can be considered as a very promisingapproach.

I. INTRODUCTION

Todays online social networks like Facebook, Orkut, QZone,

etc. connect multiple millions of people and allow its users to

share information, express themselves and interact in various

ways. The overwhelming size1 and the organisation of the

resulting flow of information is a major challenge of the Web

2.0 [1] and the analysis of social web data. Contemporary so-

cial network applications comprise very rich and diverse data

reaching from relational data like friend relations or discussion

group memberships to profile data like age, gender, etc. Classi-

cal social network analysis algorithms usually cannot entirely

represent this kind of data without some loss of information

[2]. Social networking sites often comprise multiple relations

and multiple types of nodes simultaneously. Users often belong

to multiple social networks leaving information artifacts on

various platforms and applications in an unstructured way.

According to Mika [3] the idea of the Semantic Web is

to extend unstructured information with machine processable

descriptions to provide missing knowledge if required. The

Semantic Web frameworks and the corresponding technologies

1Facebook alone shared around 10% of the world population with 721million users in November 2011: https://www.facebook.com/notes/facebook-data-team/anatomy-of-facebook/10150388519243859

represent a possible answer to this problem, using rich typed

graph models (e.g. RDF2) and schema definition frameworks

(RDFS2 and OWL2).

The vision of the Semantic Web, as coined by Berners-Lee

and colleagues [4], is a common framework in which data is

stored and shared in a machine-processable way. The content

of the Semantic Web is represented by formal ontologies,

providing shared conceptualizations of specific domains [5].

Emerged as an extension from the WWW, this technology

provides a universally usable tool to model any problem

domain. Although the Semantic Web focuses on data stored

on the web, this also implies the simultaneous representation

of network and feature vector data (in other words, a flat file)

in one consistent data representation in the context of (social)

network analysis.

The possibility to represent any problem domain including

multiple network relations and multiple network node types in

a flexible, formalized way, quickly gained attention in the data

mining and machine learning community. The question arose

how data, once stored in the Semantic Web format, could be

analyzed directly, avoiding a potentially costly transformation

process [6] [7]. Soon after its development, the concept of the

Semantic Web led to the emergence of new disciplines, for

instance Semantic Web Mining [8].

In the history of machine learning a variety of algorithms

for different problem domains has been developed, such as

clustering, classification and pattern recognition. Still, the

incorporation of all background knowledge available, with

other words, the best data representation, is a major issue in

data mining [9]. Through the rich type graph models, Semantic

Web technologies promise to enhance the incorporation of all

knowledge available. Besides, a growing number of recent

decision making problems have to take into consideration very

different kinds of data simultaneously, i.e. (social) network

data and ego-centered (feature vector3) data [10]. Usually

2Semantic Web, W3C, http://www.w3.org/2001/sw/3In traditional social network analysis literature these terms are usually

referred to as actor attributes. We will, however, use the term feature vector,as our datasets include different node types and the term actor attributes couldbe misunderstood as users’ or people’s attributes only.

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE

DOI 10.1109/ASONAM.2012.179

1064

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE

DOI 10.1109/ASONAM.2012.179

1032

Page 2: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

Case 1

Variable C

Variable B

Like: Deep Purple

Case 2

Variable C

Variable B

Like: AC/DC

Root Category:

Music

Sub Category: Rock music

Case 3

Variable C

Variable B

Like: Mozart

Sub Category: 70s Rock Music

Sub Category: 80s Rock Music

Sub Category: Classical Music

Sub Category Classical era

Music

Sub Category: Baroque music

Instance of Relation/Attribute Subclass of

Raw Data

Ontology

Fig. 1. Exemplary Metadata and corresponding Ontology

both data are treated separately, facing a possible loss of

background knowledge leading to a potential lack of analysis

accuracy.

Figure 1 shows a simple example of Semantic Web metadata

and a corresponding ontology. Focusing on the first variable

(representing an outcome of musical preference) of the three

exemplary cases, a traditional feature vector format would

most likely end up in three dichotomous variables for the

musical preference. Beside the laborious creation of a single

variable for every musical preference outcome, the relation

between the outcomes of this variable will generally not be

captured in a traditional file format. As a result, it might be

omitted by traditional distance metrics like euclidean distance

or the jaccard index. Although hard to operationalize, a

different distance between AC/DC, Deep Purple and Mozartin terms of musical distance seems to exist. As seen in Figure

1, an ontology may be able to capture this information. Figure

1 further highlights that Semantic Web data is basically a

graph containing all information. In contrast to the majority of

traditional machine learning algorithms, where data is usually

processed in the form of feature vector data, stored in an n-by-

p data matrix containing n objects and p variables (features)

for each object. According to the different formats, traditional

machine learning algorithms might not be able to process

Semantic Web data directly. We propose to examine traditional

methods, extend them or create new ones in order exploit the

possible benefits of the rich type data format. Although metrics

directly gained from Semantic Web data have been introduced

[11] and researchers started exploiting the rich typed Semantic

Web graphs for social network analysis [1], further evaluations

in comparison to traditional approaches are needed.

In this paper we will focus on the similarity of (or distance

between) objects in a social network. We will investigate

the usefulness of a Semantic Web based similarity measure

directly calculated from the underlying RDF representation of

the Semantic Web data, in comparison to traditional methods.

Two real-world datasets (a sports community and a facebook4

student community) comprising both social network and ego-

centered data, are used to perform a link prediction between

actors of the social networks. In the context of social media,

a working system could be used to suggest possible acquain-

tances in social networks, suggest products for advertisement

or propose information artifacts (e.g. videos, music, discussion

groups) to social media users. On the other hand, this could

also include the avoidance of unwanted or inadequate informa-

tion and media. A further goal of the paper is to demonstrate

and to assess the applicability of Semantic Web technologies

to support social network analysis on complex datasets.

II. RELATED WORK

The expressive power of Semantic Web technologies has

been used in various fields and reach from geo-spatial ap-

plications using semantic geo-catalogues [12] to information

retrieval in the mental health domain [13]. In order to process

Semantic Web data, two approaches have been introduced to

exploit the wealth of machine learning algorithms available.

One possibility is to preprocess and transform the data to work

with traditional methods, i.e. Instance Extraction. Instance ex-

traction is, however, a non-trivial process and requires a lot of

domain knowledge. In an RDF graph, for example, all data is

interconnected and all relations can be made explicit. Grimnes

and colleagues [14] describe the process as the extraction of

the relevant subgraph to a single resource. So the question

of relevance in the problem domain has to be answered.

Furthermore, if the data has to be preprocessed to be handled

by traditional methods, additional costly effort is required

to transform the data into an adequate format. Besides, new

sources of error usually arise during a transformation.

Another approach is to change existing algorithms to work

on graph-based relational data or, if that is not possible, to

create new ones. This approach is in fact of growing interest

as new data sources in the form of linked data increasingly

become available. Huang and Colleagues [15] defined a sta-

tistical learning framework based on a relational graphical

model on which machine learning is possible. Moreover, we

can observe that an increasing number of organizations and

companies store data in a graph-based relational manner, so a

direct utilization of the data is strongly demanded.

Following the second approach, researchers developed

methods to measure the similarity between any two objects

in ontology-based metadata to later serve as an input for

traditional data mining algorithms e.g. hierarchical clustering.

An advantage of this approach is that any traditional algorithm

processing a distance measure can be applied directly, avoiding

a transfer step. Maedche and Zacharias [11] introduced a

promising approach by defining a set of similarity measures to

compare ontology-based metadata. A generalized framework

based on the work of Maedche and Zacharias [11] has been

introduced by Lula and Paliwoda-Pkekosz [16]. Grimnes and

colleagues [14] investigated on Instance Based Clustering of

4www.facebook.com

10651033

Page 3: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

Semantic Web resources and showed that the ontology-based

distance measure introduced by Maedche and Zacharias [11]

performs well for cluster analysis in comparison to other

distance measures, such as Feature Vector Distance Measure

and Graph Based Distance Measure. The overall performance

is, however, hard to evaluate as it differs according to the

quality measure Grimne and colleagues used [14]. As the work

related to the semantic similarity proposed by Maedche and

Zacharias [11] mainly focused on clustering, we propose an

evaluation of this approach for other methods. Furthermore,

Emde and Wettschereck [7] used relational instance-based

learning for clustering analysis. Both approaches did not

include ontological background knowledge.

In the field of social network analysis, however, the use

of Semantic Web technologies is a growing research field.

In this context, Ereto and colleagues [1] extended traditional

operators using a semantic web framework SemSNA. They

included the semantics of a traditional graph-based represen-

tation thus being able to deal with the various relations and

interactions when analyzing such data. The result is a complex

graph of the social network additionally comprising vertices

representing graph-based metrics like diameter, centralities or

path calculations. Although the SemSNA approach enriches the

representation of social networks, it focuses on the traditional

network metrics and omits the incorporation of other relations

outside the scope of traditional social network analysis. Pao-

lillo and Wright [17] worked on modeling profile data in the

FOAF5 format in order to identify groups of interest based

on the LiveJournal6 network. Middleton and colleagues [18]

explored an ontological approach to user profiling to use in

recommender systems. They could increase the recommenda-

tion accuracy by 7-15% using ontological inference.

The recent development in this field shows the potential

of this approach. Based on these findings, we will evaluate a

similarity measure derived from the Semantic Web background

in a classification analysis and measure its performance. A

classification analysis has the major advantage that the predic-

tion accuracy can be measured directly, thus allowing precise

conclusions. After recent research solved important issues to

perform data mining on Semantic Web data, the question

remains what drawbacks exist in comparison to classical

methods that work on feature vector data. Disadvantages of the

universal modeling feature, that Semantic Web technologies

offer could be loss of accuracy and increased computational

time. Moreover, the methods have been mainly tested on data

from the Semantic Web background and instances have been

extracted from Semantic Web data, whereas this paper rather

follows an opposite approach.

We will perform a link prediction analysis on two social

network/media datasets. Link prediction is an important task

in network science and has numerous applications in various

fields e.g. recommending friends on social networking sites

[19] or predicting coauthors in large coauthorship networks

5http://xmlns.com/foaf/spec/6www.livejournal.com

[20]. Kautz and colleagues [21], for instance, investigated

in social networks on how to find companions, assistants or

colleagues. As link prediction is merely an exemplary task to

complete in this paper, we will not illustrate further issues of

link prediction here and refer to Lichtenwalter et al. [22] for a

good overview of the recent development in this field. We will

use the graph-based baseline predictor common neighbors as

reference for a state of the art link prediction method. Common

neighbors has been widely used for link prediction and usually

ranks among the best predictors [19] or [20]. As stated by

Lichtenwalter et al. [22], common neighbors is the number

of network neighbors that actor x and y have in common. In

collaboration networks, for instance, Newman [23] showed a

correlation between the number of common neighbors and the

probability for a future collaboration.

In detail, the task examined in this paper is to predict a

relation between two arbitrary actors of a social network. The

prediction is based on the similarity between those two actors.

In order to perform the link prediction we created three differ-

ent input metrics. The first metric is created from a Semantic

Web representation (RDF) of all datasets according to the

ontology-based similarity measure proposed by Maedche and

Zacharias [11]. Secondly, a traditional feature vector from the

profile data of the social network users is created and finally

a graph-based baseline predictor common neighbors is used.

III. DATA PREPARATION AND MODELING

Different standards for a semantic description have been

introduced in the past, including different levels of expres-

siveness. These standards offer a formal way to specify shared

vocabularies that are used to create statements about resources.

Possible standards are the Resource Description Framework

(RDF) [24], the Resource Description Framework Schema

RDF(S) [25] and the Web Ontology Language (OWL) [26]. A

short overview of ontology languages for the Semantic Web

to represent knowledge give Pulido and colleagues [27]. We

will use a RDF graph representation to represent the social

networks. The networks are typed using existing ontologies

like the FOAF ontology to represent persons’ profiles. As

the existing ontologies do not include all properties necessary

to model our problem domain, we extended them with the

missing properties.

Two real world datasets are used to evaluate our approach.

Both datasets comprise network as well as ego centered data

of two different online communities. In order to assess the

quality of an analysis based on technologies derived from the

Semantic Web, we firstly created a flat file (feature vector)

representation of both datasets comprising the profile data of

the players/users. Secondly, datasets in a RDF representation

were created including the modeling of ontologies adequate to

the specific problem domains. We extracted a social network

from both datasets. These social networks are used to perform

a link prediction between actors of the social networks. The

link prediction is evaluated based on three different input

metric combinations. We will compare a semantic similarity

metric derived from the RDF graph, the baseline predictor

10661034

Page 4: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

Player

Person Tournament

Fun Master

Category

age gender

year

points lat lng

lat lng subclass of

subclass of

plays with

plays

Player B Tourn. 2

22

Player A

Cup

A+

subclass of

2009

Ontology

Metadata Tourn. 1

Instance of Relation/Attribute Subclass of

Fig. 2. Ontology and example metadata of dataset one

common neighbors as well as non-network metrics from the

users’ profiles, formatted as a traditional feature vector.

The first dataset comprises data of a nationwide beach-

volleyball online community. The Beach Volleyball League

concerned provided us with anonymized data of 2262 Players

and 359 hobby tournaments in the years 2002 to 2010,

gathered from the community web-site’s database. Each tour-

nament is characterized by the geographical location, the skill

level and the date. Figure 2 shows the resulting ontology

including some sample data. The ontologies were created

using the FOAF and WSG 847 vocabulary to model players’

attributes and the relations between them. The figure depicts

a sample snapshot of two players and two tournaments and

the corresponding relations. As seen in Figure 2, players

attend tournaments and therefore form a two-mode social

network through their attendance. The feature vector, i.e. ego-

centered, data has been extracted directly from the player table

and includes: gender, age, geographical location (latitude,

longitude), points achieved in last season and overall points.

As the players form a social network through their partici-

pation in tournaments, we transformed the two-mode network

into an undirected one-mode network, only considering the

players’ relationships if they have co-attended a tournament.

We used this player network to evaluate the prediction. The

relation between players is weighted with the number of co-

attendances. Players without any relation were excluded. The

resulting network included 2, 175 vertices and 105, 922 edges.

The network can be characterized as a dense network in the

context of social networks with a density of .044 and an

average degree of 97.39, meaning that on average, every player

has a relation with nearly 100 other players. An average local

clustering coefficient of .727 and a diameter of 5 characterize

the network as a small world network as stated by Watts and

Strogatz [28]. Furthermore, the degree distribution among all

vertices depicts a scale-free net-work, following [29].

7http://www.w3.org/2003/01/geo/

User

Person

Like

Category

hometown

url

timezone

location

about

category id category name

subclass of

Knows user

Has Like

Ontology

Instance of Relation/Attribute Subclass of

Group

id name

Has Category

Has Category

Is in group

Is subcategory of

Is parent Category of

gender

Fig. 3. Ontology of dataset two (facebook)

The second dataset has been extracted from a survey among

692 Facebook users of university freshman, four months

after the semester start. A Facebook application has been

programmed where students were able to get information

about campus and university issues. We asked the students to

grant us access to their personal profiles including their friend

list, group memberships, event attendances and like relations.

Figure 3 highlights the resulting ontology. Facebook users

form social relations through becoming a friend. To become

friends in this context not necessarily means a strong social

relationship. The term acquaintance is surely more adequate

here. Users on Facebook can further join groups where users

with common interest can interchange information. Users on

Facebook are able to create and attend events. Typical events

are music concerts or cultural events of any kind as well as

personal events like private birthday parties. A very interesting

concept is the possibility for every Facebook user to likeany object (referenced through an URL) on the web and in

particular on Facebook. The like (or I like) function can be

read in two directions. Users show with this function that they

may actually like something, e.g. movie stars, music bands or

writers and books, on the other hand, users simultaneously

share the object they like with their friends or with other

Facebook users. The second functionality may be interpreted

as a recommendation rather than an actual evaluation. As a

summary, Facebook likes can be read both, as a set of personal

attitudes and interests and also as information that users share

among each other. The underlying idea of the decision support

system in this work is that users who share similar groups,

events and likes, are more likely to be connected as Facebook

friends.

As seen in Figure 3, users have ego-centered data such

as gender, hometown, time-zone, actual location, country and

also relational data like friend relations or memberships in

groups and items users like on Facebook. Groups and likesare characterized by variables like a unique URL or a name.

10671035

Page 5: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

Likes and groups are attached to a certain category whereas

categories can be recursively related to other categories. The

categories are derived from the Facebook website and reach

from root categories like Interests, Music or TV to sub granu-

lar categories like Local Businesses. Dataset two comprises

45, 108 relations to 24, 648 unique likes. There are 6, 894memberships in 5, 993 unique groups. The social network is

formed by the friend relation on Facebook. Based on this

user-to-user relation, a social network is created. The network

shows 3, 028 connections between the 692 users, leading to

an average vertex degree (number of friends) of 8.75. The

resulting network is less dense than the network in dataset

one, showing a higher network diameter with a value of 13, a

lower clustering coefficient .349 and a lower general density

of .013.

A RDF graph containing all relations has been created for

every dataset. The semantic similarity measures following the

approach of [11] are calculated based on this RDF graph

as a base for the later classification analysis. The semantic

similarity metric between two instances (in this case users)

Ii and Ij is a weighted combination of three dimensions of

similarity: taxonomy similarity TS, relational similarity RS and

the attribute similarity AS.

The taxonomy similarity AS is determined by the corre-

sponding concepts of Ii and Ij , and their position in the

concept hierarchy (or concept taxonomy, as part of an on-

tology). In other words, the similarity is reflected by the

amount of overlap in the super-classes of each instance’s class,

as Grimnes and colleagues [14] point out. Referring to the

example in Figure 1, the pairing AC/DC and Deep Purple will

have a higher taxonomy similarity than the pairing AC/DC and

Mozart, according to the fact that the first pairing shares more

common superconcepts.

The relational similarity RS is computed on the basis of the

relations of Ii and Ij to other objects. Maedche and Zacharias

[11] made the assumption that two instances are more likely

similar, if they both have relations to a third instance, rather

than if they have relations to different instances. Hence, the

similarity of two instances depends on the similarity of the in-

stances they have relations to. To finally calculate the relational

similarity between two instances, the similarity is recursively

applied again, including the taxonomy and attribute similarity

of the related instances. As this may lead to infinite cycles,

a maximum recursion depth has to be defined. According to

dataset two, two users in Facebook form a relational similarity

RS through their relations to any object (friend, like, group

etc.) on Facebook and the relations between those objects.

Finally the attribute similarity AS focuses on similar at-

tributes to assess the similarity between two instances. Unlike

relations, the properties take literal values. Since attributes can

be very different, such as names, text, date or geographical

information, adequately comparing them is quite difficult. For

this reason, attribute similarity AS is based on type-specific

distance measures such as Levenshtein edit distance [30] for

literals and simple numeric difference for numbers etc.

Since this is just a very brief overview of the algorithm,

we refer to Maedche and Zacharias [11] for further details on

the calculation. In contrast to Maedche and Zacharias [11],

who used a weighted combined metric, we will evaluate each

dimension TS,RS and AS separately.

The feature-vector metrics included all player/user infor-

mation for a pairing per tuple. For the feature-vector metrics

of dataset one, we further added the sum of the overall

league points a player achieved so far (separated into three

classes according to different regional leagues) and computed

possible interaction variables, namely gender difference and

age difference. Finally, each tuple in the feature vector metrics

comprises the information about the two pairing players/users

and the outcome class. Due to the analysis layout, the outcome

variable is dichotomous in the form of L (Link) vs. NL (No

Link), respectively 1 vs. 0. We further calculated the commonneighborhood as the number of shared neighbors based on

the two social networks to compare the introduced metrics

with a baseline predictor. To finally create samples for the

prediction task, 1,000 player/user pairings have been chosen

randomly from the datasets. The 1,000 pairings included 500

pairings where the two players/users actually had a relation

and 500 pairings where no relation existed. This forms two

groups in our outcome class, L and NL. In each sample both

groups are equally distributed. Nonetheless, it should be kept

in mind that the networks only showed 4.4% (dataset one)

and 1.3% (dataset two) of all possible links. In real world

networks, however, the NL group will be very likely highly

overrepresented. The procedure has been performed identically

for dataset two, here comprising 1,000 user pairings in each

group.

IV. RESULTS

Before performing the classification analysis, it is advisable

to comprehensively examine the statistical properties of the

semantic similarity metric, due to the fact, that they have

been mainly tested for clustering applications and the three

dimensions were usually combined to one metric [14] or

[11]. On this account we performed an ANOVA to assess

the discriminative power of each metric dimension TS,RSand AS. We will later perform a binary logistic regression,

a discriminant analysis and use a decision tree to evaluate

the predictive power of the different metrics. Figure 4 shows

boxplots of all semantic similarity dimensions grouped by the

outcome class in dataset one.

The plot highlights obvious mean differences for each di-

mension in the outcome class. The relational similarity RS in

particular shows a distinct mean difference, having comparable

standard deviation dimensions. Taxonomy similarity TS also

shows a mean difference whereas the standard deviation of

both groups is quite similar. The attribute similarity ASshows a very high standard deviation, possibly leading to false

predictions in some regions, if the prediction would rely on

this element only.

However, the increased weight for the relational similarity

when combining all three measures, as proposed by Maedche

and Zacharias [11], seems reasonable as it promises the highest

10681036

Page 6: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

����� ����� �����

����������� ������������������������� ������

�����

���

������

���

�����

����

���� ���� ����

�������� ��!"#�$���%������ �!

Fig. 4. Boxplots of semantic similarity components in outcome class ofdataset one (TS = taxonomy similarity, RS = relational similarity, AS =attribute similarity). The class NL here represents the class of pairings withno link, whereas L represents the class with an existing link between the twopairing players.

discriminative power when visually analyzing the boxplots in

Figure 4.

The similarities of the Facebook dataset (dataset two)

showed a very similar result. However, taxonomy similarity

TS did not show a significant difference in the outcome class

in this dataset, due to the fact that there are practically no

taxonomic differences for users on Facebook, as there are no

classes, categories or hierarchies for users.

To evaluate the visual impressions statistically, a one-way

ANOVA on group differences for each similarity measure was

performed. The results depict a significant group difference for

every semantic similarity dimension, except TS in dataset two.

Again, relational similarity RS (μ = .52, σ = .06 in group

NL and μ = .64, σ = .08 in group L in dataset one, μ = .01,

σ = .02 in group NL and μ = .13, σ = .13 in group L in

dataset two) shows the highest effect size regarding Cohen’s d(1.76 for dataset one, 1.30 for dataset two) as already assumed

by the analysis of the boxplot charts [31]. Worth mentioning is

that all semantic similarity dimensions (besides TS in dataset

two) show a significant mean difference of p < .01 in the

outcome groups. These findings statistically prove the ability

of the semantic similarity metric to discriminate the outcome

groups. The high values of Cohen’s d effect size measure

further depict a genuine difference between linked and not

linked pairs in terms of their semantic similarity.

After having laid the statistical foundations, the questions of

the predictive power of the semantic similarity metric (TS,RSand AS) is examined. The results are shown in Table I. First

Accuracy (dataset one) Method Input metric(s) TP TN Overall

Logistic binary regression

Feature Vector* 81.6% 79.7% 80.8%

Semantic Similarity 94.6% 92.6% 93.6% Common neighbors 96% 95% 95.6%

Discriminant analysis (cross validated)

Feature Vector* 73.3% 87.1% 79.1% Semantic Similarity 95.8% 89.2% 92.5% Common neighbors 98.4% 74.5% 86.4%

Decision Tree (C 4.5) (10Fold cross validated)

Feature Vector 85.3% 81.9% 83.5% Semantic Similarity 98.6% 93.4% 96.09% Common neighbors 90.0% 90.4% 90.20%

Accuracy (dataset two) Logistic binary regression

Feature Vector* 44.6% 59.8% 52.2% Semantic Similarity 70.3% 94.6% 82.4% Common neighbors 60.9% 97.9% 79.4%

Discriminant analysis (cross validated)

Feature Vector* 44.5% 59.6% 52.0% Semantic Similarity 59.6% 98.7% 79.1% Common neighbors 60.9% 97.9% 79.4%

Decision Tree (C 4.5) (10Fold cross validated)

Feature Vector* 40.9% 71.0% 55.9% Semantic Similarity 67.9% 96.8% 82.35% Common neighbors 60.9% 97.9% 79.4%

* Due to missing values, not all cases could be included leading to slightly unequal class distributions.

TABLE IRESULTS OF CLASS PREDICTION USING DIFFERENT PREDICTION

METHODS (TP = TRUE POSITIVE RATE, TN = TRUE NEGATIVE RATE);THE SEMANTIC SIMILARITY METRIC COMPRISES TS ,RS AND AS

of all, a logistic binary regression (cut value = .5) has been

performed for each dataset and all metrics. The regression

showed a significant relation for all three predictor variables

TS, RS and AS.

As seen in Table I the logistic binary regression leads to

a predictive accuracy with 93.6% in dataset one and 82.4%in dataset two for the semantic similarity. In dataset two, the

regression preforms best with an overall accuracy of 82.4 for

the semantic similarity metric.

The analysis of the feature-vector metrics shows the lowest

predictive accuracy in both datasets, regardless the method

used. The common neighborhood metric and the semantic

similarity metric show comparable results, indicating a slight

advantage for the semantic similarity metric. Note worthily

is the consistently high predictive accuracy of the semantic

similarity metric regardless the method used in both datasets.

The predictive accuracy is quite similar in both groups L and

NL in dataset one. Interestingly, this is not the case in dataset

two. Here we can observe very high true negative rates up

to 98.7%. The true positive rates in contast, never reach a

level over 68%. This shows that using the semantic similarity

metric and performing a discriminant analysis, almost every

non-friend could be identified correctly in the sample.

As it is vital to investigate the nature of the relationship

of the predictor variable, in particular the dimensions of

the semantic similarity metric, a discriminant analysis was

conducted. Discriminant analysis is used to examine how

to best possible separate a set of groups (here L vs. NL)

using several predictors. Relational similarity RS showed,

with a structure matrix value of .634, that it is the most

10691037

Page 7: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

important predictor for dataset one in differentiating the two

groups, followed by the attribute similarity AS with a value

of .385. Again, taxonomy similarity TS seems to have a low

contribution to the group differentiating with a value of −.088.

As expected, the prediction results of the discriminant analysis

are comparable to the binary logistic regression, since both

methods are different ways of achieving the same result. It

should be noted that within the classification of the feature

vector metrics not all instances could be included as the

dataset suffered from missing values in several variables. That

is also why a decision tree classifier (C 4.5, [32]) has been

tested on both datasets. This classification method shows a

high predictive accuracy in both datasets when relying on the

semantic similarity metric, as seen in Table I. In summary,

the metrics lead to quite different results in the two datasets,

emphasizing the influence of the underlying data. The semantic

similarity metric seems to benefits from a more comprehen-

sive information, i.e. incorporation of background knowledge,

which leads to an advantage in prediction accuracy.

V. CONCLUSIONS AND FUTURE WORK

In this paper we investigated one important challenge for

performing data mining and machine learning directly on

Semantic Web data in the context of social network analysis.

We created a RDF graph from existing data and computed sim-

ilarities between instances of that graph. We further calculated

traditional metrics to compare both approaches. We evaluated

the methods by performing a classification analysis in the form

of a link prediction between any two actors of two real world

social networks. The results show that the semantic similarity

based on the RDF graph performs note worthily. We conclude

that the incorporation of all available background knowledge

yields to an improvement of existing methods.

Nevertheless, the results depict that the prediction heavily

relies on the data structure. The results of the baseline pre-

dictor common neighborhood also show that Semantic Web

technologies not necessarily outperform traditional methods.

However, the ontologies used in this paper were rather simple

and served to examine a new approach in social network anal-

ysis. The results indicate that analyses on highly interrelated

data might benefit from Semantic Web technologies.

Another advantage is the expandability of Semantic Web

data approach. New ontologies can be easily attached to

already existing metadata or more heavyweight ontologies

could be merged with an existing one [33]. Furthermore, the

accessibility to free and comprehensive ontologies is growing

rapidly. These ontologies could enhance the analysis process.

To give one example, a similarity analysis of customers with

hobbies in the field of music or movies, could be improved

by music8 and movie9 ontologies, which move the analysis

from treating this information in a way of nominal values to

a relational graph, where each instance is interconnected. In a

traditional layout, these values would be mainly compared in

8http://www.musicontology.com9http://www.movieontology.org

the way equal or not equal , whereas the use of ontologies may

offer an answer to how related two values are. Additionally,

Semantic Web technologies offer possibilities to automatically

derive new data from existing one through inference engines.

Since this approach could further improve the predictive power

of analyses, it should be considered for future work. Again, we

would like to emphasize that the main purpose of the analysis

in this paper is not to find the optimal predictor for one specific

problem but, to investigate the potential of Semantic Web

technologies in the field of social network analysis.

Nonetheless, the study revealed several unanswered ques-

tion and issues in this field. A major disadvantage is the

question of interpretability. As seen in the result section, no

conclusions can be made, what predictor in the original RDF

graph of the semantic similarity had the biggest influence on

its predictive power. In the case when causal explanations are

needed, this inevitably leads to further effort for the analyst.

In many areas, however, this is an important issue. Reviewing

the analysis in this paper, the results of the semantic similarity

metric give almost no explanation what exactly made two

players or two users similar, despite the fact that their relationto each other had the highest discriminative power. However,

this also applies for the common neighborhood metric.

Another issue is the weighted combination of the three

components of the similarity measure, proposed by Maedche

and Zacharias [11]. The results of this study show that it is

advisable to treat each component separately. If the problem

domain, including the ontology, is more or less static, the

optimal weights for a single problem (e.g. classification) could

be assessed in an optimization process for a representative

sample of metadata, if available. Moreover, the ontology used

in this paper was rather simple in comparison to existing

ontologies in other domains. The similarity measure should

therefore be evaluated upon complex ontologies and very large

datasets to gain a more precise picture. Nonetheless, this paper

improved upon past research by showing the potential of

Semantic Web technologies as a basis for representing and

analyzing social network data.

REFERENCES

[1] G. Ereteo, M. Buffa, F. Gandon, and O. Corby, “Analysis of a realonline social network using semantic web frameworks,” in Proceedingsof the 8th International Semantic Web Conference, ser. ISWC ’09.Berlin, Heidelberg: Springer-Verlag, 2009, pp. 180–195. [Online].Available: http://dx.doi.org/10.1007/978-3-642-04930-9 12

[2] G. Ereteo, M. Buffa, F. Gandon, P. Grohan, M. Leitzelman, andP. Sander, “A State of the Art on Social Network Analysis and itsApplications on a Semantic Web,” in Proc. SDoW2008 (Social Dataon the Web), Workshop held with the 7th International Semantic WebConference, Karlsruhe, Germany, October 2008.

[3] P. Mika, Social networks and the Semantic Web, ser.Semantic web and beyond. Springer, 2007. [Online]. Available:http://books.google.de/books?id=tcYJ8a32nFUC

[4] T. Berners-Lee, J. Hendler, and O. Lassila, “The Se-mantic Web (Berners-Lee et. al 2001),” Scientific Ameri-can, vol. 284, pp. 28–37, May 2001. [Online]. Avail-able: http://www.sciam.com/print version.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21

[5] T. R. Gruber, “Toward Principles for the Design of Ontologies Usedfor Knowledge Sharing,” International Journal of Human-Computer

10701038

Page 8: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

Studies, vol. 43, no. 5-6, pp. 907–928, Nov. 1995. [Online]. Available:http://tomgruber.org/writing/onto-design.htm

[6] A. Delteil, C. Faron-Zucker, and R. Dieng, “Learning Ontologiesfrom RDF annotations,” in IJCAI 2001 Workshop on OntologyLearning, Proceedings of the Second Workshop on Ontology LearningOL 2001, Seattle, USA, August 4, 2001 (Held in conjunction withthe 17th International Conference on Artificial Intelligence IJCAI2001), ser. CEUR Workshop Proceedings, A. Maedche, S. Staab,C. Nedellec, and E. H. Hovy, Eds., vol. 38. CEUR-WS.org, 2001.[Online]. Available: http://dx.doi.org/http://SunSITE.Informatik.RWTH-Aachen.DE/Publications/CEUR-WS/Vol-38/delteil ol.pdf

[7] W. Emde and D. Wettschereck, “Relational Instance BasedLearning,” in Machine Learning - Proceedings 13th InternationalConference on Machine Learning, L. Saitta, Ed. MorganKaufmann Publishers, 1996, pp. 122 – 130. [Online]. Available:http://citeseer.ist.psu.edu/emde96relational.html

[8] B. Berendt, A. Hotho, and S. Gerd, “Towards Semantic WebMining,” in Proceedings of the First International Semantic WebConference on The Semantic Web, ser. ISWC ’02. London,UK, UK: Springer-Verlag, 2002, pp. 264–278. [Online]. Available:http://portal.acm.org/citation.cfm?id=646996.711414

[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques,2nd ed. Morgan Kaufmann, Jan. 2006. [Online]. Available:http://www.worldcat.org/isbn/1558609016

[10] P. Gloor, J. S. Krauss, S. Nann, K. Fischbach, and D. Schoder, “WebScience 2.0: Identifying Trends Through Semantic Social NetworkAnalysis,” Social Science Research Network Working Paper Series,Nov. 2008. [Online]. Available: http://ssrn.com/abstract=1299869

[11] A. Maedche and V. Zacharias, “Clustering Ontology-Based Metadata inthe Semantic Web,” in Proceedings of the 6th European Conference onPrinciples of Data Mining and Knowledge Discovery, ser. PKDD ’02.London, UK: Springer-Verlag, 2002, pp. 348–360. [Online]. Available:http://portal.acm.org/citation.cfm?id=645806.670157

[12] F. Farazi, V. Maltese, F. Giunchiglia, and A. Ivanyukovich, “Afaceted ontology for a semantic geo-catalogue,” in Proceedings ofthe 8th extended semantic web conference on The semanic web:research and applications - Volume Part II, ser. ESWC’11. Berlin,Heidelberg: Springer-Verlag, 2011, pp. 169–182. [Online]. Available:http://dl.acm.org/citation.cfm?id=2017936.2017950

[13] M. Hadzic and E. Chang, “Advances in web semantics i,” T. S. Dillon,E. Chang, R. Meersman, and K. Sycara, Eds. Berlin, Heidelberg:Springer-Verlag, 2009, ch. Web Semantics for Intelligent and DynamicInformation Retrieval Illustrated within the Mental Health Domain,pp. 260–275. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-89784-2 10

[14] G. A. Grimnes, P. Edwards, and A. Preece, “Instance based clustering ofsemantic web resources,” in Proceedings of the 5th European semanticweb conference on The semantic web: research and applications, ser.ESWC’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 303–317.[Online]. Available: http://portal.acm.org/citation.cfm?id=1789425

[15] Y. Huang, V. Tresp, and H.-P. Kriegel, “Multivariate predictionfor learning in relational graphs,” 2009. [Online]. Available:http://snap.stanford.edu/nipsgraphs2009/papers/huang-paper.pdf

[16] P. Lula and G. Paliwoda-Pkekosz, “An ontology-based clusteranalysis framework,” in Proceedings of the first internationalworkshop on Ontology-supported business intelligence, ser. OBI’08. New York, NY, USA: ACM, 2008. [Online]. Available:http://doi.acm.org/10.1145/1452567.1452574

[17] J. C. Paolillo and E. Wright, Social network analysis on the semanticweb: Techniques and challenges for visualizing foaf. Springer, 2005,pp. 229–242. [Online]. Available: http://www.blogninja.com/vsw-draft-paolillo-wright-foaf.pdf

[18] S. E. Middleton, N. Shadbolt, and D. D. Roure, “Ontological userprofiling in recommender systems,” ACM Transactions on InformationSystems (TOIS), vol. 22, no. 1, pp. 54–88, January 2004. [Online].Available: http://eprints.ecs.soton.ac.uk/8926/

[19] J. Chen, W. Geyer, C. Dugan, M. Muller, and I. Guy, “Makenew friends, but keep the old: recommending people on socialnetworking sites,” in Proceedings of the 27th international conferenceon Human factors in computing systems, ser. CHI ’09. NewYork, NY, USA: ACM, 2009, pp. 201–210. [Online]. Available:http://dx.doi.org/10.1145/1518701.1518735

[20] D. Liben-Nowell and J. Kleinberg, “The link-prediction problem for

social networks,” J. Am. Soc. Inf. Sci., vol. 58, no. 7, pp. 1019–1031,2007. [Online]. Available: http://dx.doi.org/10.1002/asi.20591

[21] H. Kautz, B. Selman, and M. Shah, “Referral Web: combining socialnetworks and collaborative filtering,” Commun. ACM, vol. 40, no. 3, pp.63–65, Mar. 1997.

[22] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla, “New perspectivesand methods in link prediction,” in KDD ’10: Proceedings of the16th ACM SIGKDD international conference on Knowledge discoveryand data mining. New York, NY, USA: ACM, 2010, pp. 243–252.[Online]. Available: http://dx.doi.org/10.1145/1835804.1835837

[23] M. E. J. Newman, “Clustering and preferential attachment in growingnetworks,” Physical Review E, vol. 64, no. 2, pp. 025 102+, Jul. 2001.[Online]. Available: http://dx.doi.org/10.1103/PhysRevE.64.025102

[24] G. Klyne and J. J. Carroll, “Resource Description Framework (RDF):Concepts and Abstract Syntax,” http://www.w3.org/TR/rdf-concepts/,Feb. 2004. [Online]. Available: http://www.w3.org/TR/rdf-concepts/

[25] D. Brickley and R. V. Guha, “RDF Vocabulary Description Language1.0: RDF Schema,” http://www.w3.org/TR/rdf-schema/, Feb. 2004.[Online]. Available: http://www.w3.org/TR/rdf-schema/

[26] M. Smith, C. Welty, and D. McGuinness, “OWL web ontology guide.”http://www.w3.org/TR/owl-guide/, 2004.

[27] J. R. G. Pulido, M. A. G. Ruiz, R. Herrera, E. Cabello,S. Legrand, and D. Elliman, “Ontology languages for thesemantic web: A never completely updated review,” Know.-BasedSyst., vol. 19, no. 7, pp. 489–497, 2006. [Online]. Available:http://dx.doi.org/10.1016/j.knosys.2006.04.013

[28] D. J. Watts and S. H. Strogatz, “Collective dynamics of ’small-world’networks.” Nature, vol. 393, no. 6684, pp. 440–442, Jun. 1998.[Online]. Available: http://dx.doi.org/10.1038/30918

[29] A.-L. Barabasi and R. Albert, “Emergence of Scaling in RandomNetworks,” Science, vol. 286, no. 5439, pp. 509–512, Oct. 1999.[Online]. Available: http://dx.doi.org/10.1126/science.286.5439.509

[30] V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions,Insertions and Reversals,” Soviet Physics Doklady, vol. 10, pp. 707+,Feb. 1966. [Online]. Available: http://adsabs.harvard.edu/cgi-bin/nph-bib query?bibcode=1966SPhD...10..707L

[31] J. Cohen, Statistical power analysis for the behavioral sciences : JacobCohen., 2nd ed. Lawrence Erlbaum, Jan. 1988. [Online]. Available:http://www.worldcat.org/isbn/0805802835

[32] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kauf-mann Series in Machine Learning), 1st ed. Morgan Kaufmann, Oct.1992. [Online]. Available: http://www.worldcat.org/isbn/1558602380

[33] N. F. Noy and M. A. Musen, “The prompt suite: interactivetools for ontology merging and mapping,” Int. J. Hum.-Comput.Stud., vol. 59, pp. 983–1024, December 2003. [Online]. Available:http://dl.acm.org/citation.cfm?id=965943.965952

10711039