[IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Classification Analysis in Complex Online Social Networks Using Semantic Web Technologies

Download [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Classification Analysis in Complex Online Social Networks Using Semantic Web Technologies

Post on 23-Dec-2016

215 views

Category:

Documents

2 download

TRANSCRIPT

  • Classication Analysis in Complex Online SocialNetworks Using Semantic Web Technologies

    Marek OpuszkoDepartment of Business Informatics

    Faculty of Economics and Business AdministrationFriedrich-Schiller-University of Jena

    Jena, Germany 07743Email: marek.opuszko@uni-jena.deTelephone: (0049) 3641943314

    Johannes RuhlandDepartment of Business Informatics

    Faculty of Economics and Business AdministrationFriedrich-Schiller-University of Jena

    Jena, Germany 07743Email: j.ruhland@wiwi.uni-jena.deTelephone: (0049) 3641943310

    AbstractThe Semantic Web enables people and computersto interact and exchange information. Based on Semantic Webtechnologies, different machine learning applications have beendesigned. Particularly important is the possibility to createcomplex metadata descriptions for any problem domain, basedon pre-dened ontologies. In this paper we evaluate the use of asemantic similarity measure based on pre-dened ontologies as aninput for a classication analysis in the context of social networkanalysis. A link prediction between actors of two real world socialnetworks is performed, which could serve as a recommendationsystem. The social networks involve different types of relationsand nodes. We measure the prediction performance based on asemantic similarity measure as well as traditional approaches.The ndings demonstrate that the prediction accuracy based onthe semantic similarity is comparable to traditional approachesand shows that data mining on complex social networks usingontology-based metadata can be considered as a very promisingapproach.

    I. INTRODUCTION

    Todays online social networks like Facebook, Orkut, QZone,etc. connect multiple millions of people and allow its users toshare information, express themselves and interact in variousways. The overwhelming size1 and the organisation of theresulting ow of information is a major challenge of the Web2.0 [1] and the analysis of social web data. Contemporary so-cial network applications comprise very rich and diverse datareaching from relational data like friend relations or discussiongroup memberships to prole data like age, gender, etc. Classi-cal social network analysis algorithms usually cannot entirelyrepresent this kind of data without some loss of information[2]. Social networking sites often comprise multiple relationsand multiple types of nodes simultaneously. Users often belongto multiple social networks leaving information artifacts onvarious platforms and applications in an unstructured way.According to Mika [3] the idea of the Semantic Web isto extend unstructured information with machine processabledescriptions to provide missing knowledge if required. TheSemantic Web frameworks and the corresponding technologies

    1Facebook alone shared around 10% of the world population with 721million users in November 2011: https://www.facebook.com/notes/facebook-data-team/anatomy-of-facebook/10150388519243859

    represent a possible answer to this problem, using rich typedgraph models (e.g. RDF2) and schema denition frameworks(RDFS2 and OWL2).

    The vision of the Semantic Web, as coined by Berners-Leeand colleagues [4], is a common framework in which data isstored and shared in a machine-processable way. The contentof the Semantic Web is represented by formal ontologies,providing shared conceptualizations of specic domains [5].Emerged as an extension from the WWW, this technologyprovides a universally usable tool to model any problemdomain. Although the Semantic Web focuses on data storedon the web, this also implies the simultaneous representationof network and feature vector data (in other words, a at le)in one consistent data representation in the context of (social)network analysis.

    The possibility to represent any problem domain includingmultiple network relations and multiple network node types ina exible, formalized way, quickly gained attention in the datamining and machine learning community. The question arosehow data, once stored in the Semantic Web format, could beanalyzed directly, avoiding a potentially costly transformationprocess [6] [7]. Soon after its development, the concept of theSemantic Web led to the emergence of new disciplines, forinstance Semantic Web Mining [8].

    In the history of machine learning a variety of algorithmsfor different problem domains has been developed, such asclustering, classication and pattern recognition. Still, theincorporation of all background knowledge available, withother words, the best data representation, is a major issue indata mining [9]. Through the rich type graph models, SemanticWeb technologies promise to enhance the incorporation of allknowledge available. Besides, a growing number of recentdecision making problems have to take into consideration verydifferent kinds of data simultaneously, i.e. (social) networkdata and ego-centered (feature vector3) data [10]. Usually

    2Semantic Web, W3C, http://www.w3.org/2001/sw/3In traditional social network analysis literature these terms are usually

    referred to as actor attributes. We will, however, use the term feature vector,as our datasets include different node types and the term actor attributes couldbe misunderstood as users or peoples attributes only.

    2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

    978-0-7695-4799-2/12 $26.00 2012 IEEEDOI 10.1109/ASONAM.2012.179

    1064

    2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

    978-0-7695-4799-2/12 $26.00 2012 IEEEDOI 10.1109/ASONAM.2012.179

    1032

  • Case 1

    Variable C

    Variable B

    Like: Deep Purple

    Case 2

    Variable C

    Variable B

    Like: AC/DC

    Root Category:

    Music

    Sub Category: Rock music

    Case 3

    Variable C

    Variable B

    Like: Mozart

    Sub Category: 70s Rock Music

    Sub Category: 80s Rock Music

    Sub Category: Classical Music

    Sub Category Classical era

    Music

    Sub Category: Baroque music

    Instance of Relation/Attribute Subclass of

    Raw Data

    Ontology

    Fig. 1. Exemplary Metadata and corresponding Ontology

    both data are treated separately, facing a possible loss ofbackground knowledge leading to a potential lack of analysisaccuracy.

    Figure 1 shows a simple example of Semantic Web metadataand a corresponding ontology. Focusing on the rst variable(representing an outcome of musical preference) of the threeexemplary cases, a traditional feature vector format wouldmost likely end up in three dichotomous variables for themusical preference. Beside the laborious creation of a singlevariable for every musical preference outcome, the relationbetween the outcomes of this variable will generally not becaptured in a traditional le format. As a result, it might beomitted by traditional distance metrics like euclidean distanceor the jaccard index. Although hard to operationalize, adifferent distance between AC/DC, Deep Purple and Mozartin terms of musical distance seems to exist. As seen in Figure1, an ontology may be able to capture this information. Figure1 further highlights that Semantic Web data is basically agraph containing all information. In contrast to the majority oftraditional machine learning algorithms, where data is usuallyprocessed in the form of feature vector data, stored in an n-by-p data matrix containing n objects and p variables (features)for each object. According to the different formats, traditionalmachine learning algorithms might not be able to processSemantic Web data directly. We propose to examine traditionalmethods, extend them or create new ones in order exploit thepossible benets of the rich type data format. Although metricsdirectly gained from Semantic Web data have been introduced[11] and researchers started exploiting the rich typed SemanticWeb graphs for social network analysis [1], further evaluationsin comparison to traditional approaches are needed.

    In this paper we will focus on the similarity of (or distancebetween) objects in a social network. We will investigatethe usefulness of a Semantic Web based similarity measuredirectly calculated from the underlying RDF representation ofthe Semantic Web data, in comparison to traditional methods.

    Two real-world datasets (a sports community and a facebook4

    student community) comprising both social network and ego-centered data, are used to perform a link prediction betweenactors of the social networks. In the context of social media,a working system could be used to suggest possible acquain-tances in social networks, suggest products for advertisementor propose information artifacts (e.g. videos, music, discussiongroups) to social media users. On the other hand, this couldalso include the avoidance of unwanted or inadequate informa-tion and media. A further goal of the paper is to demonstrateand to assess the applicability of Semantic Web technologiesto support social network analysis on complex datasets.

    II. RELATED WORK

    The expressive power of Semantic Web technologies hasbeen used in various elds and reach from geo-spatial ap-plications using semantic geo-catalogues [12] to informationretrieval in the mental health domain [13]. In order to processSemantic Web data, two approaches have been introduced toexploit the wealth of machine learning algorithms available.One possibility is to preprocess and transform the data to workwith traditional methods, i.e. Instance Extraction. Instance ex-traction is, however, a non-trivial process and requires a lot ofdomain knowledge. In an RDF graph, for example, all data isinterconnected and all relations can be made explicit. Grimnesand colleagues [14] describe the process as the extraction ofthe relevant subgraph to a single resource. So the questionof relevance in the problem domain has to be answered.Furthermore, if the data has to be preprocessed to be handledby traditional methods, additional costly effort is requiredto transform the data into an adequate format. Besides, newsources of error usually arise during a transformation.

    Another approach is to change existing algorithms to workon graph-based relational data or, if that is not possible, tocreate new ones. This approach is in fact of growing interestas new data sources in the form of linked data increasinglybecome available. Huang and Colleagues [15] dened a sta-tistical learning framework based on a relational graphicalmodel on which machine learning is possible. Moreover, wecan observe that an increasing number of organizations andcompanies store data in a graph-based relational manner, so adirect utilization of the data is strongly demanded.

    Following the second approach, researchers developedmethods to measure the similarity between any two objectsin ontology-based metadata to later serve as an input fortraditional data mining algorithms e.g. hierarchical clustering.An advantage of this approach is that any traditional algorithmprocessing a distance measure can be applied directly, avoidinga transfer step. Maedche and Zacharias [11] introduced apromising approach by dening a set of similarity measures tocompare ontology-based metadata. A generalized frameworkbased on the work of Maedche and Zacharias [11] has beenintroduced by Lula and Paliwoda-Pkekosz [16]. Grimnes andcolleagues [14] investigated on Instance Based Clustering of

    4www.facebook.com

    10651033

  • Semantic Web resources and showed that the ontology-baseddistance measure introduced by Maedche and Zacharias [11]performs well for cluster analysis in comparison to otherdistance measures, such as Feature Vector Distance Measureand Graph Based Distance Measure. The overall performanceis, however, hard to evaluate as it differs according to thequality measure Grimne and colleagues used [14]. As the workrelated to the semantic similarity proposed by Maedche andZacharias [11] mainly focused on clustering, we propose anevaluation of this approach for other methods. Furthermore,Emde and Wettschereck [7] used relational instance-basedlearning for clustering analysis. Both approaches did notinclude ontological background knowledge.

    In the eld of social network analysis, however, the useof Semantic Web technologies is a growing research eld.In this context, Ereto and colleagues [1] extended traditionaloperators using a semantic web framework SemSNA. Theyincluded the semantics of a traditional graph-based represen-tation thus being able to deal with the various relations andinteractions when analyzing such data. The result is a complexgraph of the social network additionally comprising verticesrepresenting graph-based metrics like diameter, centralities orpath calculations. Although the SemSNA approach enriches therepresentation of social networks, it focuses on the traditionalnetwork metrics and omits the incorporation of other relationsoutside the scope of traditional social network analysis. Pao-lillo and Wright [17] worked on modeling prole data in theFOAF5 format in order to identify groups of interest basedon the LiveJournal6 network. Middleton and colleagues [18]explored an ontological approach to user proling to use inrecommender systems. They could increase the recommenda-tion accuracy by 7-15% using ontological inference.

    The recent development in this eld shows the potentialof this approach. Based on these ndings, we will evaluate asimilarity measure derived from the Semantic Web backgroundin a classication analysis and measure its performance. Aclassication analysis has the major advantage that the predic-tion accuracy can be measured directly, thus allowing preciseconclusions. After recent research solved important issues toperform data mining on Semantic Web data, the questionremains what drawbacks exist in comparison to classicalmethods that work on feature vector data. Disadvantages of theuniversal modeling feature, that Semantic Web technologiesoffer could be loss of accuracy and increased computationaltime. Moreover, the methods have been mainly tested on datafrom the Semantic Web background and instances have beenextracted from Semantic Web data, whereas this paper ratherfollows an opposite approach.

    We will perform a link prediction analysis on two socialnetwork/media datasets. Link prediction is an important taskin network science and has numerous applications in variouselds e.g. recommending friends on social networking sites[19] or predicting coauthors in large coauthorship networks

    5http://xmlns.com/foaf/spec/6www.livejournal.com

    [20]. Kautz and colleagues [21], for instance, investigatedin social networks on how to nd companions, assistants orcolleagues. As link prediction is merely an exemplary task tocomplete in this paper, we will not illustrate further issues oflink prediction here and refer to Lichtenwalter et al. [22] for agood overview of the recent development in this eld. We willuse the graph-based baseline predictor common neighbors asreference for a state of the art link prediction method. Commonneighbors has been widely used for link prediction and usuallyranks among the best predictors [19] or [20]. As stated byLichtenwalter et al. [22], common neighbors is the numberof network neighbors th...

Recommended

View more >