[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...
TRANSCRIPT
Classification Analysis in Complex Online SocialNetworks Using Semantic Web Technologies
Marek Opuszko
Department of Business Informatics
Faculty of Economics and Business Administration
Friedrich-Schiller-University of Jena
Jena, Germany 07743
Email: [email protected]
Telephone: (0049) 3641–943314
Johannes Ruhland
Department of Business Informatics
Faculty of Economics and Business Administration
Friedrich-Schiller-University of Jena
Jena, Germany 07743
Email: [email protected]
Telephone: (0049) 3641–943310
Abstract—The Semantic Web enables people and computersto interact and exchange information. Based on Semantic Webtechnologies, different machine learning applications have beendesigned. Particularly important is the possibility to createcomplex metadata descriptions for any problem domain, basedon pre-defined ontologies. In this paper we evaluate the use of asemantic similarity measure based on pre-defined ontologies as aninput for a classification analysis in the context of social networkanalysis. A link prediction between actors of two real world socialnetworks is performed, which could serve as a recommendationsystem. The social networks involve different types of relationsand nodes. We measure the prediction performance based on asemantic similarity measure as well as traditional approaches.The findings demonstrate that the prediction accuracy based onthe semantic similarity is comparable to traditional approachesand shows that data mining on complex social networks usingontology-based metadata can be considered as a very promisingapproach.
I. INTRODUCTION
Todays online social networks like Facebook, Orkut, QZone,
etc. connect multiple millions of people and allow its users to
share information, express themselves and interact in various
ways. The overwhelming size1 and the organisation of the
resulting flow of information is a major challenge of the Web
2.0 [1] and the analysis of social web data. Contemporary so-
cial network applications comprise very rich and diverse data
reaching from relational data like friend relations or discussion
group memberships to profile data like age, gender, etc. Classi-
cal social network analysis algorithms usually cannot entirely
represent this kind of data without some loss of information
[2]. Social networking sites often comprise multiple relations
and multiple types of nodes simultaneously. Users often belong
to multiple social networks leaving information artifacts on
various platforms and applications in an unstructured way.
According to Mika [3] the idea of the Semantic Web is
to extend unstructured information with machine processable
descriptions to provide missing knowledge if required. The
Semantic Web frameworks and the corresponding technologies
1Facebook alone shared around 10% of the world population with 721million users in November 2011: https://www.facebook.com/notes/facebook-data-team/anatomy-of-facebook/10150388519243859
represent a possible answer to this problem, using rich typed
graph models (e.g. RDF2) and schema definition frameworks
(RDFS2 and OWL2).
The vision of the Semantic Web, as coined by Berners-Lee
and colleagues [4], is a common framework in which data is
stored and shared in a machine-processable way. The content
of the Semantic Web is represented by formal ontologies,
providing shared conceptualizations of specific domains [5].
Emerged as an extension from the WWW, this technology
provides a universally usable tool to model any problem
domain. Although the Semantic Web focuses on data stored
on the web, this also implies the simultaneous representation
of network and feature vector data (in other words, a flat file)
in one consistent data representation in the context of (social)
network analysis.
The possibility to represent any problem domain including
multiple network relations and multiple network node types in
a flexible, formalized way, quickly gained attention in the data
mining and machine learning community. The question arose
how data, once stored in the Semantic Web format, could be
analyzed directly, avoiding a potentially costly transformation
process [6] [7]. Soon after its development, the concept of the
Semantic Web led to the emergence of new disciplines, for
instance Semantic Web Mining [8].
In the history of machine learning a variety of algorithms
for different problem domains has been developed, such as
clustering, classification and pattern recognition. Still, the
incorporation of all background knowledge available, with
other words, the best data representation, is a major issue in
data mining [9]. Through the rich type graph models, Semantic
Web technologies promise to enhance the incorporation of all
knowledge available. Besides, a growing number of recent
decision making problems have to take into consideration very
different kinds of data simultaneously, i.e. (social) network
data and ego-centered (feature vector3) data [10]. Usually
2Semantic Web, W3C, http://www.w3.org/2001/sw/3In traditional social network analysis literature these terms are usually
referred to as actor attributes. We will, however, use the term feature vector,as our datasets include different node types and the term actor attributes couldbe misunderstood as users’ or people’s attributes only.
2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
978-0-7695-4799-2/12 $26.00 © 2012 IEEE
DOI 10.1109/ASONAM.2012.179
1064
2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
978-0-7695-4799-2/12 $26.00 © 2012 IEEE
DOI 10.1109/ASONAM.2012.179
1032
Case 1
Variable C
Variable B
Like: Deep Purple
Case 2
Variable C
Variable B
Like: AC/DC
Root Category:
Music
Sub Category: Rock music
Case 3
Variable C
Variable B
Like: Mozart
Sub Category: 70s Rock Music
Sub Category: 80s Rock Music
Sub Category: Classical Music
Sub Category Classical era
Music
Sub Category: Baroque music
Instance of Relation/Attribute Subclass of
Raw Data
Ontology
Fig. 1. Exemplary Metadata and corresponding Ontology
both data are treated separately, facing a possible loss of
background knowledge leading to a potential lack of analysis
accuracy.
Figure 1 shows a simple example of Semantic Web metadata
and a corresponding ontology. Focusing on the first variable
(representing an outcome of musical preference) of the three
exemplary cases, a traditional feature vector format would
most likely end up in three dichotomous variables for the
musical preference. Beside the laborious creation of a single
variable for every musical preference outcome, the relation
between the outcomes of this variable will generally not be
captured in a traditional file format. As a result, it might be
omitted by traditional distance metrics like euclidean distance
or the jaccard index. Although hard to operationalize, a
different distance between AC/DC, Deep Purple and Mozartin terms of musical distance seems to exist. As seen in Figure
1, an ontology may be able to capture this information. Figure
1 further highlights that Semantic Web data is basically a
graph containing all information. In contrast to the majority of
traditional machine learning algorithms, where data is usually
processed in the form of feature vector data, stored in an n-by-
p data matrix containing n objects and p variables (features)
for each object. According to the different formats, traditional
machine learning algorithms might not be able to process
Semantic Web data directly. We propose to examine traditional
methods, extend them or create new ones in order exploit the
possible benefits of the rich type data format. Although metrics
directly gained from Semantic Web data have been introduced
[11] and researchers started exploiting the rich typed Semantic
Web graphs for social network analysis [1], further evaluations
in comparison to traditional approaches are needed.
In this paper we will focus on the similarity of (or distance
between) objects in a social network. We will investigate
the usefulness of a Semantic Web based similarity measure
directly calculated from the underlying RDF representation of
the Semantic Web data, in comparison to traditional methods.
Two real-world datasets (a sports community and a facebook4
student community) comprising both social network and ego-
centered data, are used to perform a link prediction between
actors of the social networks. In the context of social media,
a working system could be used to suggest possible acquain-
tances in social networks, suggest products for advertisement
or propose information artifacts (e.g. videos, music, discussion
groups) to social media users. On the other hand, this could
also include the avoidance of unwanted or inadequate informa-
tion and media. A further goal of the paper is to demonstrate
and to assess the applicability of Semantic Web technologies
to support social network analysis on complex datasets.
II. RELATED WORK
The expressive power of Semantic Web technologies has
been used in various fields and reach from geo-spatial ap-
plications using semantic geo-catalogues [12] to information
retrieval in the mental health domain [13]. In order to process
Semantic Web data, two approaches have been introduced to
exploit the wealth of machine learning algorithms available.
One possibility is to preprocess and transform the data to work
with traditional methods, i.e. Instance Extraction. Instance ex-
traction is, however, a non-trivial process and requires a lot of
domain knowledge. In an RDF graph, for example, all data is
interconnected and all relations can be made explicit. Grimnes
and colleagues [14] describe the process as the extraction of
the relevant subgraph to a single resource. So the question
of relevance in the problem domain has to be answered.
Furthermore, if the data has to be preprocessed to be handled
by traditional methods, additional costly effort is required
to transform the data into an adequate format. Besides, new
sources of error usually arise during a transformation.
Another approach is to change existing algorithms to work
on graph-based relational data or, if that is not possible, to
create new ones. This approach is in fact of growing interest
as new data sources in the form of linked data increasingly
become available. Huang and Colleagues [15] defined a sta-
tistical learning framework based on a relational graphical
model on which machine learning is possible. Moreover, we
can observe that an increasing number of organizations and
companies store data in a graph-based relational manner, so a
direct utilization of the data is strongly demanded.
Following the second approach, researchers developed
methods to measure the similarity between any two objects
in ontology-based metadata to later serve as an input for
traditional data mining algorithms e.g. hierarchical clustering.
An advantage of this approach is that any traditional algorithm
processing a distance measure can be applied directly, avoiding
a transfer step. Maedche and Zacharias [11] introduced a
promising approach by defining a set of similarity measures to
compare ontology-based metadata. A generalized framework
based on the work of Maedche and Zacharias [11] has been
introduced by Lula and Paliwoda-Pkekosz [16]. Grimnes and
colleagues [14] investigated on Instance Based Clustering of
4www.facebook.com
10651033
Semantic Web resources and showed that the ontology-based
distance measure introduced by Maedche and Zacharias [11]
performs well for cluster analysis in comparison to other
distance measures, such as Feature Vector Distance Measure
and Graph Based Distance Measure. The overall performance
is, however, hard to evaluate as it differs according to the
quality measure Grimne and colleagues used [14]. As the work
related to the semantic similarity proposed by Maedche and
Zacharias [11] mainly focused on clustering, we propose an
evaluation of this approach for other methods. Furthermore,
Emde and Wettschereck [7] used relational instance-based
learning for clustering analysis. Both approaches did not
include ontological background knowledge.
In the field of social network analysis, however, the use
of Semantic Web technologies is a growing research field.
In this context, Ereto and colleagues [1] extended traditional
operators using a semantic web framework SemSNA. They
included the semantics of a traditional graph-based represen-
tation thus being able to deal with the various relations and
interactions when analyzing such data. The result is a complex
graph of the social network additionally comprising vertices
representing graph-based metrics like diameter, centralities or
path calculations. Although the SemSNA approach enriches the
representation of social networks, it focuses on the traditional
network metrics and omits the incorporation of other relations
outside the scope of traditional social network analysis. Pao-
lillo and Wright [17] worked on modeling profile data in the
FOAF5 format in order to identify groups of interest based
on the LiveJournal6 network. Middleton and colleagues [18]
explored an ontological approach to user profiling to use in
recommender systems. They could increase the recommenda-
tion accuracy by 7-15% using ontological inference.
The recent development in this field shows the potential
of this approach. Based on these findings, we will evaluate a
similarity measure derived from the Semantic Web background
in a classification analysis and measure its performance. A
classification analysis has the major advantage that the predic-
tion accuracy can be measured directly, thus allowing precise
conclusions. After recent research solved important issues to
perform data mining on Semantic Web data, the question
remains what drawbacks exist in comparison to classical
methods that work on feature vector data. Disadvantages of the
universal modeling feature, that Semantic Web technologies
offer could be loss of accuracy and increased computational
time. Moreover, the methods have been mainly tested on data
from the Semantic Web background and instances have been
extracted from Semantic Web data, whereas this paper rather
follows an opposite approach.
We will perform a link prediction analysis on two social
network/media datasets. Link prediction is an important task
in network science and has numerous applications in various
fields e.g. recommending friends on social networking sites
[19] or predicting coauthors in large coauthorship networks
5http://xmlns.com/foaf/spec/6www.livejournal.com
[20]. Kautz and colleagues [21], for instance, investigated
in social networks on how to find companions, assistants or
colleagues. As link prediction is merely an exemplary task to
complete in this paper, we will not illustrate further issues of
link prediction here and refer to Lichtenwalter et al. [22] for a
good overview of the recent development in this field. We will
use the graph-based baseline predictor common neighbors as
reference for a state of the art link prediction method. Common
neighbors has been widely used for link prediction and usually
ranks among the best predictors [19] or [20]. As stated by
Lichtenwalter et al. [22], common neighbors is the number
of network neighbors that actor x and y have in common. In
collaboration networks, for instance, Newman [23] showed a
correlation between the number of common neighbors and the
probability for a future collaboration.
In detail, the task examined in this paper is to predict a
relation between two arbitrary actors of a social network. The
prediction is based on the similarity between those two actors.
In order to perform the link prediction we created three differ-
ent input metrics. The first metric is created from a Semantic
Web representation (RDF) of all datasets according to the
ontology-based similarity measure proposed by Maedche and
Zacharias [11]. Secondly, a traditional feature vector from the
profile data of the social network users is created and finally
a graph-based baseline predictor common neighbors is used.
III. DATA PREPARATION AND MODELING
Different standards for a semantic description have been
introduced in the past, including different levels of expres-
siveness. These standards offer a formal way to specify shared
vocabularies that are used to create statements about resources.
Possible standards are the Resource Description Framework
(RDF) [24], the Resource Description Framework Schema
RDF(S) [25] and the Web Ontology Language (OWL) [26]. A
short overview of ontology languages for the Semantic Web
to represent knowledge give Pulido and colleagues [27]. We
will use a RDF graph representation to represent the social
networks. The networks are typed using existing ontologies
like the FOAF ontology to represent persons’ profiles. As
the existing ontologies do not include all properties necessary
to model our problem domain, we extended them with the
missing properties.
Two real world datasets are used to evaluate our approach.
Both datasets comprise network as well as ego centered data
of two different online communities. In order to assess the
quality of an analysis based on technologies derived from the
Semantic Web, we firstly created a flat file (feature vector)
representation of both datasets comprising the profile data of
the players/users. Secondly, datasets in a RDF representation
were created including the modeling of ontologies adequate to
the specific problem domains. We extracted a social network
from both datasets. These social networks are used to perform
a link prediction between actors of the social networks. The
link prediction is evaluated based on three different input
metric combinations. We will compare a semantic similarity
metric derived from the RDF graph, the baseline predictor
10661034
Player
Person Tournament
Fun Master
Category
age gender
year
points lat lng
lat lng subclass of
subclass of
plays with
plays
Player B Tourn. 2
22
Player A
Cup
A+
subclass of
2009
Ontology
Metadata Tourn. 1
Instance of Relation/Attribute Subclass of
Fig. 2. Ontology and example metadata of dataset one
common neighbors as well as non-network metrics from the
users’ profiles, formatted as a traditional feature vector.
The first dataset comprises data of a nationwide beach-
volleyball online community. The Beach Volleyball League
concerned provided us with anonymized data of 2262 Players
and 359 hobby tournaments in the years 2002 to 2010,
gathered from the community web-site’s database. Each tour-
nament is characterized by the geographical location, the skill
level and the date. Figure 2 shows the resulting ontology
including some sample data. The ontologies were created
using the FOAF and WSG 847 vocabulary to model players’
attributes and the relations between them. The figure depicts
a sample snapshot of two players and two tournaments and
the corresponding relations. As seen in Figure 2, players
attend tournaments and therefore form a two-mode social
network through their attendance. The feature vector, i.e. ego-
centered, data has been extracted directly from the player table
and includes: gender, age, geographical location (latitude,
longitude), points achieved in last season and overall points.
As the players form a social network through their partici-
pation in tournaments, we transformed the two-mode network
into an undirected one-mode network, only considering the
players’ relationships if they have co-attended a tournament.
We used this player network to evaluate the prediction. The
relation between players is weighted with the number of co-
attendances. Players without any relation were excluded. The
resulting network included 2, 175 vertices and 105, 922 edges.
The network can be characterized as a dense network in the
context of social networks with a density of .044 and an
average degree of 97.39, meaning that on average, every player
has a relation with nearly 100 other players. An average local
clustering coefficient of .727 and a diameter of 5 characterize
the network as a small world network as stated by Watts and
Strogatz [28]. Furthermore, the degree distribution among all
vertices depicts a scale-free net-work, following [29].
7http://www.w3.org/2003/01/geo/
User
Person
Like
Category
hometown
url
timezone
location
about
category id category name
subclass of
Knows user
Has Like
Ontology
Instance of Relation/Attribute Subclass of
Group
id name
Has Category
Has Category
Is in group
Is subcategory of
Is parent Category of
gender
Fig. 3. Ontology of dataset two (facebook)
The second dataset has been extracted from a survey among
692 Facebook users of university freshman, four months
after the semester start. A Facebook application has been
programmed where students were able to get information
about campus and university issues. We asked the students to
grant us access to their personal profiles including their friend
list, group memberships, event attendances and like relations.
Figure 3 highlights the resulting ontology. Facebook users
form social relations through becoming a friend. To become
friends in this context not necessarily means a strong social
relationship. The term acquaintance is surely more adequate
here. Users on Facebook can further join groups where users
with common interest can interchange information. Users on
Facebook are able to create and attend events. Typical events
are music concerts or cultural events of any kind as well as
personal events like private birthday parties. A very interesting
concept is the possibility for every Facebook user to likeany object (referenced through an URL) on the web and in
particular on Facebook. The like (or I like) function can be
read in two directions. Users show with this function that they
may actually like something, e.g. movie stars, music bands or
writers and books, on the other hand, users simultaneously
share the object they like with their friends or with other
Facebook users. The second functionality may be interpreted
as a recommendation rather than an actual evaluation. As a
summary, Facebook likes can be read both, as a set of personal
attitudes and interests and also as information that users share
among each other. The underlying idea of the decision support
system in this work is that users who share similar groups,
events and likes, are more likely to be connected as Facebook
friends.
As seen in Figure 3, users have ego-centered data such
as gender, hometown, time-zone, actual location, country and
also relational data like friend relations or memberships in
groups and items users like on Facebook. Groups and likesare characterized by variables like a unique URL or a name.
10671035
Likes and groups are attached to a certain category whereas
categories can be recursively related to other categories. The
categories are derived from the Facebook website and reach
from root categories like Interests, Music or TV to sub granu-
lar categories like Local Businesses. Dataset two comprises
45, 108 relations to 24, 648 unique likes. There are 6, 894memberships in 5, 993 unique groups. The social network is
formed by the friend relation on Facebook. Based on this
user-to-user relation, a social network is created. The network
shows 3, 028 connections between the 692 users, leading to
an average vertex degree (number of friends) of 8.75. The
resulting network is less dense than the network in dataset
one, showing a higher network diameter with a value of 13, a
lower clustering coefficient .349 and a lower general density
of .013.
A RDF graph containing all relations has been created for
every dataset. The semantic similarity measures following the
approach of [11] are calculated based on this RDF graph
as a base for the later classification analysis. The semantic
similarity metric between two instances (in this case users)
Ii and Ij is a weighted combination of three dimensions of
similarity: taxonomy similarity TS, relational similarity RS and
the attribute similarity AS.
The taxonomy similarity AS is determined by the corre-
sponding concepts of Ii and Ij , and their position in the
concept hierarchy (or concept taxonomy, as part of an on-
tology). In other words, the similarity is reflected by the
amount of overlap in the super-classes of each instance’s class,
as Grimnes and colleagues [14] point out. Referring to the
example in Figure 1, the pairing AC/DC and Deep Purple will
have a higher taxonomy similarity than the pairing AC/DC and
Mozart, according to the fact that the first pairing shares more
common superconcepts.
The relational similarity RS is computed on the basis of the
relations of Ii and Ij to other objects. Maedche and Zacharias
[11] made the assumption that two instances are more likely
similar, if they both have relations to a third instance, rather
than if they have relations to different instances. Hence, the
similarity of two instances depends on the similarity of the in-
stances they have relations to. To finally calculate the relational
similarity between two instances, the similarity is recursively
applied again, including the taxonomy and attribute similarity
of the related instances. As this may lead to infinite cycles,
a maximum recursion depth has to be defined. According to
dataset two, two users in Facebook form a relational similarity
RS through their relations to any object (friend, like, group
etc.) on Facebook and the relations between those objects.
Finally the attribute similarity AS focuses on similar at-
tributes to assess the similarity between two instances. Unlike
relations, the properties take literal values. Since attributes can
be very different, such as names, text, date or geographical
information, adequately comparing them is quite difficult. For
this reason, attribute similarity AS is based on type-specific
distance measures such as Levenshtein edit distance [30] for
literals and simple numeric difference for numbers etc.
Since this is just a very brief overview of the algorithm,
we refer to Maedche and Zacharias [11] for further details on
the calculation. In contrast to Maedche and Zacharias [11],
who used a weighted combined metric, we will evaluate each
dimension TS,RS and AS separately.
The feature-vector metrics included all player/user infor-
mation for a pairing per tuple. For the feature-vector metrics
of dataset one, we further added the sum of the overall
league points a player achieved so far (separated into three
classes according to different regional leagues) and computed
possible interaction variables, namely gender difference and
age difference. Finally, each tuple in the feature vector metrics
comprises the information about the two pairing players/users
and the outcome class. Due to the analysis layout, the outcome
variable is dichotomous in the form of L (Link) vs. NL (No
Link), respectively 1 vs. 0. We further calculated the commonneighborhood as the number of shared neighbors based on
the two social networks to compare the introduced metrics
with a baseline predictor. To finally create samples for the
prediction task, 1,000 player/user pairings have been chosen
randomly from the datasets. The 1,000 pairings included 500
pairings where the two players/users actually had a relation
and 500 pairings where no relation existed. This forms two
groups in our outcome class, L and NL. In each sample both
groups are equally distributed. Nonetheless, it should be kept
in mind that the networks only showed 4.4% (dataset one)
and 1.3% (dataset two) of all possible links. In real world
networks, however, the NL group will be very likely highly
overrepresented. The procedure has been performed identically
for dataset two, here comprising 1,000 user pairings in each
group.
IV. RESULTS
Before performing the classification analysis, it is advisable
to comprehensively examine the statistical properties of the
semantic similarity metric, due to the fact, that they have
been mainly tested for clustering applications and the three
dimensions were usually combined to one metric [14] or
[11]. On this account we performed an ANOVA to assess
the discriminative power of each metric dimension TS,RSand AS. We will later perform a binary logistic regression,
a discriminant analysis and use a decision tree to evaluate
the predictive power of the different metrics. Figure 4 shows
boxplots of all semantic similarity dimensions grouped by the
outcome class in dataset one.
The plot highlights obvious mean differences for each di-
mension in the outcome class. The relational similarity RS in
particular shows a distinct mean difference, having comparable
standard deviation dimensions. Taxonomy similarity TS also
shows a mean difference whereas the standard deviation of
both groups is quite similar. The attribute similarity ASshows a very high standard deviation, possibly leading to false
predictions in some regions, if the prediction would rely on
this element only.
However, the increased weight for the relational similarity
when combining all three measures, as proposed by Maedche
and Zacharias [11], seems reasonable as it promises the highest
10681036
����� ����� �����
�
�
�
�
����������� ������������������������� ������
�����
���
������
���
�����
����
�
���� ���� ����
�
�
�
�
�������� ��!"#�$���%������ �!
Fig. 4. Boxplots of semantic similarity components in outcome class ofdataset one (TS = taxonomy similarity, RS = relational similarity, AS =attribute similarity). The class NL here represents the class of pairings withno link, whereas L represents the class with an existing link between the twopairing players.
discriminative power when visually analyzing the boxplots in
Figure 4.
The similarities of the Facebook dataset (dataset two)
showed a very similar result. However, taxonomy similarity
TS did not show a significant difference in the outcome class
in this dataset, due to the fact that there are practically no
taxonomic differences for users on Facebook, as there are no
classes, categories or hierarchies for users.
To evaluate the visual impressions statistically, a one-way
ANOVA on group differences for each similarity measure was
performed. The results depict a significant group difference for
every semantic similarity dimension, except TS in dataset two.
Again, relational similarity RS (μ = .52, σ = .06 in group
NL and μ = .64, σ = .08 in group L in dataset one, μ = .01,
σ = .02 in group NL and μ = .13, σ = .13 in group L in
dataset two) shows the highest effect size regarding Cohen’s d(1.76 for dataset one, 1.30 for dataset two) as already assumed
by the analysis of the boxplot charts [31]. Worth mentioning is
that all semantic similarity dimensions (besides TS in dataset
two) show a significant mean difference of p < .01 in the
outcome groups. These findings statistically prove the ability
of the semantic similarity metric to discriminate the outcome
groups. The high values of Cohen’s d effect size measure
further depict a genuine difference between linked and not
linked pairs in terms of their semantic similarity.
After having laid the statistical foundations, the questions of
the predictive power of the semantic similarity metric (TS,RSand AS) is examined. The results are shown in Table I. First
Accuracy (dataset one) Method Input metric(s) TP TN Overall
Logistic binary regression
Feature Vector* 81.6% 79.7% 80.8%
Semantic Similarity 94.6% 92.6% 93.6% Common neighbors 96% 95% 95.6%
Discriminant analysis (cross validated)
Feature Vector* 73.3% 87.1% 79.1% Semantic Similarity 95.8% 89.2% 92.5% Common neighbors 98.4% 74.5% 86.4%
Decision Tree (C 4.5) (10Fold cross validated)
Feature Vector 85.3% 81.9% 83.5% Semantic Similarity 98.6% 93.4% 96.09% Common neighbors 90.0% 90.4% 90.20%
Accuracy (dataset two) Logistic binary regression
Feature Vector* 44.6% 59.8% 52.2% Semantic Similarity 70.3% 94.6% 82.4% Common neighbors 60.9% 97.9% 79.4%
Discriminant analysis (cross validated)
Feature Vector* 44.5% 59.6% 52.0% Semantic Similarity 59.6% 98.7% 79.1% Common neighbors 60.9% 97.9% 79.4%
Decision Tree (C 4.5) (10Fold cross validated)
Feature Vector* 40.9% 71.0% 55.9% Semantic Similarity 67.9% 96.8% 82.35% Common neighbors 60.9% 97.9% 79.4%
* Due to missing values, not all cases could be included leading to slightly unequal class distributions.
TABLE IRESULTS OF CLASS PREDICTION USING DIFFERENT PREDICTION
METHODS (TP = TRUE POSITIVE RATE, TN = TRUE NEGATIVE RATE);THE SEMANTIC SIMILARITY METRIC COMPRISES TS ,RS AND AS
of all, a logistic binary regression (cut value = .5) has been
performed for each dataset and all metrics. The regression
showed a significant relation for all three predictor variables
TS, RS and AS.
As seen in Table I the logistic binary regression leads to
a predictive accuracy with 93.6% in dataset one and 82.4%in dataset two for the semantic similarity. In dataset two, the
regression preforms best with an overall accuracy of 82.4 for
the semantic similarity metric.
The analysis of the feature-vector metrics shows the lowest
predictive accuracy in both datasets, regardless the method
used. The common neighborhood metric and the semantic
similarity metric show comparable results, indicating a slight
advantage for the semantic similarity metric. Note worthily
is the consistently high predictive accuracy of the semantic
similarity metric regardless the method used in both datasets.
The predictive accuracy is quite similar in both groups L and
NL in dataset one. Interestingly, this is not the case in dataset
two. Here we can observe very high true negative rates up
to 98.7%. The true positive rates in contast, never reach a
level over 68%. This shows that using the semantic similarity
metric and performing a discriminant analysis, almost every
non-friend could be identified correctly in the sample.
As it is vital to investigate the nature of the relationship
of the predictor variable, in particular the dimensions of
the semantic similarity metric, a discriminant analysis was
conducted. Discriminant analysis is used to examine how
to best possible separate a set of groups (here L vs. NL)
using several predictors. Relational similarity RS showed,
with a structure matrix value of .634, that it is the most
10691037
important predictor for dataset one in differentiating the two
groups, followed by the attribute similarity AS with a value
of .385. Again, taxonomy similarity TS seems to have a low
contribution to the group differentiating with a value of −.088.
As expected, the prediction results of the discriminant analysis
are comparable to the binary logistic regression, since both
methods are different ways of achieving the same result. It
should be noted that within the classification of the feature
vector metrics not all instances could be included as the
dataset suffered from missing values in several variables. That
is also why a decision tree classifier (C 4.5, [32]) has been
tested on both datasets. This classification method shows a
high predictive accuracy in both datasets when relying on the
semantic similarity metric, as seen in Table I. In summary,
the metrics lead to quite different results in the two datasets,
emphasizing the influence of the underlying data. The semantic
similarity metric seems to benefits from a more comprehen-
sive information, i.e. incorporation of background knowledge,
which leads to an advantage in prediction accuracy.
V. CONCLUSIONS AND FUTURE WORK
In this paper we investigated one important challenge for
performing data mining and machine learning directly on
Semantic Web data in the context of social network analysis.
We created a RDF graph from existing data and computed sim-
ilarities between instances of that graph. We further calculated
traditional metrics to compare both approaches. We evaluated
the methods by performing a classification analysis in the form
of a link prediction between any two actors of two real world
social networks. The results show that the semantic similarity
based on the RDF graph performs note worthily. We conclude
that the incorporation of all available background knowledge
yields to an improvement of existing methods.
Nevertheless, the results depict that the prediction heavily
relies on the data structure. The results of the baseline pre-
dictor common neighborhood also show that Semantic Web
technologies not necessarily outperform traditional methods.
However, the ontologies used in this paper were rather simple
and served to examine a new approach in social network anal-
ysis. The results indicate that analyses on highly interrelated
data might benefit from Semantic Web technologies.
Another advantage is the expandability of Semantic Web
data approach. New ontologies can be easily attached to
already existing metadata or more heavyweight ontologies
could be merged with an existing one [33]. Furthermore, the
accessibility to free and comprehensive ontologies is growing
rapidly. These ontologies could enhance the analysis process.
To give one example, a similarity analysis of customers with
hobbies in the field of music or movies, could be improved
by music8 and movie9 ontologies, which move the analysis
from treating this information in a way of nominal values to
a relational graph, where each instance is interconnected. In a
traditional layout, these values would be mainly compared in
8http://www.musicontology.com9http://www.movieontology.org
the way equal or not equal , whereas the use of ontologies may
offer an answer to how related two values are. Additionally,
Semantic Web technologies offer possibilities to automatically
derive new data from existing one through inference engines.
Since this approach could further improve the predictive power
of analyses, it should be considered for future work. Again, we
would like to emphasize that the main purpose of the analysis
in this paper is not to find the optimal predictor for one specific
problem but, to investigate the potential of Semantic Web
technologies in the field of social network analysis.
Nonetheless, the study revealed several unanswered ques-
tion and issues in this field. A major disadvantage is the
question of interpretability. As seen in the result section, no
conclusions can be made, what predictor in the original RDF
graph of the semantic similarity had the biggest influence on
its predictive power. In the case when causal explanations are
needed, this inevitably leads to further effort for the analyst.
In many areas, however, this is an important issue. Reviewing
the analysis in this paper, the results of the semantic similarity
metric give almost no explanation what exactly made two
players or two users similar, despite the fact that their relationto each other had the highest discriminative power. However,
this also applies for the common neighborhood metric.
Another issue is the weighted combination of the three
components of the similarity measure, proposed by Maedche
and Zacharias [11]. The results of this study show that it is
advisable to treat each component separately. If the problem
domain, including the ontology, is more or less static, the
optimal weights for a single problem (e.g. classification) could
be assessed in an optimization process for a representative
sample of metadata, if available. Moreover, the ontology used
in this paper was rather simple in comparison to existing
ontologies in other domains. The similarity measure should
therefore be evaluated upon complex ontologies and very large
datasets to gain a more precise picture. Nonetheless, this paper
improved upon past research by showing the potential of
Semantic Web technologies as a basis for representing and
analyzing social network data.
REFERENCES
[1] G. Ereteo, M. Buffa, F. Gandon, and O. Corby, “Analysis of a realonline social network using semantic web frameworks,” in Proceedingsof the 8th International Semantic Web Conference, ser. ISWC ’09.Berlin, Heidelberg: Springer-Verlag, 2009, pp. 180–195. [Online].Available: http://dx.doi.org/10.1007/978-3-642-04930-9 12
[2] G. Ereteo, M. Buffa, F. Gandon, P. Grohan, M. Leitzelman, andP. Sander, “A State of the Art on Social Network Analysis and itsApplications on a Semantic Web,” in Proc. SDoW2008 (Social Dataon the Web), Workshop held with the 7th International Semantic WebConference, Karlsruhe, Germany, October 2008.
[3] P. Mika, Social networks and the Semantic Web, ser.Semantic web and beyond. Springer, 2007. [Online]. Available:http://books.google.de/books?id=tcYJ8a32nFUC
[4] T. Berners-Lee, J. Hendler, and O. Lassila, “The Se-mantic Web (Berners-Lee et. al 2001),” Scientific Ameri-can, vol. 284, pp. 28–37, May 2001. [Online]. Avail-able: http://www.sciam.com/print version.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21
[5] T. R. Gruber, “Toward Principles for the Design of Ontologies Usedfor Knowledge Sharing,” International Journal of Human-Computer
10701038
Studies, vol. 43, no. 5-6, pp. 907–928, Nov. 1995. [Online]. Available:http://tomgruber.org/writing/onto-design.htm
[6] A. Delteil, C. Faron-Zucker, and R. Dieng, “Learning Ontologiesfrom RDF annotations,” in IJCAI 2001 Workshop on OntologyLearning, Proceedings of the Second Workshop on Ontology LearningOL 2001, Seattle, USA, August 4, 2001 (Held in conjunction withthe 17th International Conference on Artificial Intelligence IJCAI2001), ser. CEUR Workshop Proceedings, A. Maedche, S. Staab,C. Nedellec, and E. H. Hovy, Eds., vol. 38. CEUR-WS.org, 2001.[Online]. Available: http://dx.doi.org/http://SunSITE.Informatik.RWTH-Aachen.DE/Publications/CEUR-WS/Vol-38/delteil ol.pdf
[7] W. Emde and D. Wettschereck, “Relational Instance BasedLearning,” in Machine Learning - Proceedings 13th InternationalConference on Machine Learning, L. Saitta, Ed. MorganKaufmann Publishers, 1996, pp. 122 – 130. [Online]. Available:http://citeseer.ist.psu.edu/emde96relational.html
[8] B. Berendt, A. Hotho, and S. Gerd, “Towards Semantic WebMining,” in Proceedings of the First International Semantic WebConference on The Semantic Web, ser. ISWC ’02. London,UK, UK: Springer-Verlag, 2002, pp. 264–278. [Online]. Available:http://portal.acm.org/citation.cfm?id=646996.711414
[9] J. Han and M. Kamber, Data Mining: Concepts and Techniques,2nd ed. Morgan Kaufmann, Jan. 2006. [Online]. Available:http://www.worldcat.org/isbn/1558609016
[10] P. Gloor, J. S. Krauss, S. Nann, K. Fischbach, and D. Schoder, “WebScience 2.0: Identifying Trends Through Semantic Social NetworkAnalysis,” Social Science Research Network Working Paper Series,Nov. 2008. [Online]. Available: http://ssrn.com/abstract=1299869
[11] A. Maedche and V. Zacharias, “Clustering Ontology-Based Metadata inthe Semantic Web,” in Proceedings of the 6th European Conference onPrinciples of Data Mining and Knowledge Discovery, ser. PKDD ’02.London, UK: Springer-Verlag, 2002, pp. 348–360. [Online]. Available:http://portal.acm.org/citation.cfm?id=645806.670157
[12] F. Farazi, V. Maltese, F. Giunchiglia, and A. Ivanyukovich, “Afaceted ontology for a semantic geo-catalogue,” in Proceedings ofthe 8th extended semantic web conference on The semanic web:research and applications - Volume Part II, ser. ESWC’11. Berlin,Heidelberg: Springer-Verlag, 2011, pp. 169–182. [Online]. Available:http://dl.acm.org/citation.cfm?id=2017936.2017950
[13] M. Hadzic and E. Chang, “Advances in web semantics i,” T. S. Dillon,E. Chang, R. Meersman, and K. Sycara, Eds. Berlin, Heidelberg:Springer-Verlag, 2009, ch. Web Semantics for Intelligent and DynamicInformation Retrieval Illustrated within the Mental Health Domain,pp. 260–275. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-89784-2 10
[14] G. A. Grimnes, P. Edwards, and A. Preece, “Instance based clustering ofsemantic web resources,” in Proceedings of the 5th European semanticweb conference on The semantic web: research and applications, ser.ESWC’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 303–317.[Online]. Available: http://portal.acm.org/citation.cfm?id=1789425
[15] Y. Huang, V. Tresp, and H.-P. Kriegel, “Multivariate predictionfor learning in relational graphs,” 2009. [Online]. Available:http://snap.stanford.edu/nipsgraphs2009/papers/huang-paper.pdf
[16] P. Lula and G. Paliwoda-Pkekosz, “An ontology-based clusteranalysis framework,” in Proceedings of the first internationalworkshop on Ontology-supported business intelligence, ser. OBI’08. New York, NY, USA: ACM, 2008. [Online]. Available:http://doi.acm.org/10.1145/1452567.1452574
[17] J. C. Paolillo and E. Wright, Social network analysis on the semanticweb: Techniques and challenges for visualizing foaf. Springer, 2005,pp. 229–242. [Online]. Available: http://www.blogninja.com/vsw-draft-paolillo-wright-foaf.pdf
[18] S. E. Middleton, N. Shadbolt, and D. D. Roure, “Ontological userprofiling in recommender systems,” ACM Transactions on InformationSystems (TOIS), vol. 22, no. 1, pp. 54–88, January 2004. [Online].Available: http://eprints.ecs.soton.ac.uk/8926/
[19] J. Chen, W. Geyer, C. Dugan, M. Muller, and I. Guy, “Makenew friends, but keep the old: recommending people on socialnetworking sites,” in Proceedings of the 27th international conferenceon Human factors in computing systems, ser. CHI ’09. NewYork, NY, USA: ACM, 2009, pp. 201–210. [Online]. Available:http://dx.doi.org/10.1145/1518701.1518735
[20] D. Liben-Nowell and J. Kleinberg, “The link-prediction problem for
social networks,” J. Am. Soc. Inf. Sci., vol. 58, no. 7, pp. 1019–1031,2007. [Online]. Available: http://dx.doi.org/10.1002/asi.20591
[21] H. Kautz, B. Selman, and M. Shah, “Referral Web: combining socialnetworks and collaborative filtering,” Commun. ACM, vol. 40, no. 3, pp.63–65, Mar. 1997.
[22] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla, “New perspectivesand methods in link prediction,” in KDD ’10: Proceedings of the16th ACM SIGKDD international conference on Knowledge discoveryand data mining. New York, NY, USA: ACM, 2010, pp. 243–252.[Online]. Available: http://dx.doi.org/10.1145/1835804.1835837
[23] M. E. J. Newman, “Clustering and preferential attachment in growingnetworks,” Physical Review E, vol. 64, no. 2, pp. 025 102+, Jul. 2001.[Online]. Available: http://dx.doi.org/10.1103/PhysRevE.64.025102
[24] G. Klyne and J. J. Carroll, “Resource Description Framework (RDF):Concepts and Abstract Syntax,” http://www.w3.org/TR/rdf-concepts/,Feb. 2004. [Online]. Available: http://www.w3.org/TR/rdf-concepts/
[25] D. Brickley and R. V. Guha, “RDF Vocabulary Description Language1.0: RDF Schema,” http://www.w3.org/TR/rdf-schema/, Feb. 2004.[Online]. Available: http://www.w3.org/TR/rdf-schema/
[26] M. Smith, C. Welty, and D. McGuinness, “OWL web ontology guide.”http://www.w3.org/TR/owl-guide/, 2004.
[27] J. R. G. Pulido, M. A. G. Ruiz, R. Herrera, E. Cabello,S. Legrand, and D. Elliman, “Ontology languages for thesemantic web: A never completely updated review,” Know.-BasedSyst., vol. 19, no. 7, pp. 489–497, 2006. [Online]. Available:http://dx.doi.org/10.1016/j.knosys.2006.04.013
[28] D. J. Watts and S. H. Strogatz, “Collective dynamics of ’small-world’networks.” Nature, vol. 393, no. 6684, pp. 440–442, Jun. 1998.[Online]. Available: http://dx.doi.org/10.1038/30918
[29] A.-L. Barabasi and R. Albert, “Emergence of Scaling in RandomNetworks,” Science, vol. 286, no. 5439, pp. 509–512, Oct. 1999.[Online]. Available: http://dx.doi.org/10.1126/science.286.5439.509
[30] V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions,Insertions and Reversals,” Soviet Physics Doklady, vol. 10, pp. 707+,Feb. 1966. [Online]. Available: http://adsabs.harvard.edu/cgi-bin/nph-bib query?bibcode=1966SPhD...10..707L
[31] J. Cohen, Statistical power analysis for the behavioral sciences : JacobCohen., 2nd ed. Lawrence Erlbaum, Jan. 1988. [Online]. Available:http://www.worldcat.org/isbn/0805802835
[32] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kauf-mann Series in Machine Learning), 1st ed. Morgan Kaufmann, Oct.1992. [Online]. Available: http://www.worldcat.org/isbn/1558602380
[33] N. F. Noy and M. A. Musen, “The prompt suite: interactivetools for ontology merging and mapping,” Int. J. Hum.-Comput.Stud., vol. 59, pp. 983–1024, December 2003. [Online]. Available:http://dl.acm.org/citation.cfm?id=965943.965952
10711039