[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...

Download [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Classification Analysis in Complex Online Social Networks Using Semantic Web Technologies

Post on 23-Dec-2016




2 download

Embed Size (px)


  • Classication Analysis in Complex Online SocialNetworks Using Semantic Web Technologies

    Marek OpuszkoDepartment of Business Informatics

    Faculty of Economics and Business AdministrationFriedrich-Schiller-University of Jena

    Jena, Germany 07743Email: marek.opuszko@uni-jena.deTelephone: (0049) 3641943314

    Johannes RuhlandDepartment of Business Informatics

    Faculty of Economics and Business AdministrationFriedrich-Schiller-University of Jena

    Jena, Germany 07743Email: j.ruhland@wiwi.uni-jena.deTelephone: (0049) 3641943310

    AbstractThe Semantic Web enables people and computersto interact and exchange information. Based on Semantic Webtechnologies, different machine learning applications have beendesigned. Particularly important is the possibility to createcomplex metadata descriptions for any problem domain, basedon pre-dened ontologies. In this paper we evaluate the use of asemantic similarity measure based on pre-dened ontologies as aninput for a classication analysis in the context of social networkanalysis. A link prediction between actors of two real world socialnetworks is performed, which could serve as a recommendationsystem. The social networks involve different types of relationsand nodes. We measure the prediction performance based on asemantic similarity measure as well as traditional approaches.The ndings demonstrate that the prediction accuracy based onthe semantic similarity is comparable to traditional approachesand shows that data mining on complex social networks usingontology-based metadata can be considered as a very promisingapproach.


    Todays online social networks like Facebook, Orkut, QZone,etc. connect multiple millions of people and allow its users toshare information, express themselves and interact in variousways. The overwhelming size1 and the organisation of theresulting ow of information is a major challenge of the Web2.0 [1] and the analysis of social web data. Contemporary so-cial network applications comprise very rich and diverse datareaching from relational data like friend relations or discussiongroup memberships to prole data like age, gender, etc. Classi-cal social network analysis algorithms usually cannot entirelyrepresent this kind of data without some loss of information[2]. Social networking sites often comprise multiple relationsand multiple types of nodes simultaneously. Users often belongto multiple social networks leaving information artifacts onvarious platforms and applications in an unstructured way.According to Mika [3] the idea of the Semantic Web isto extend unstructured information with machine processabledescriptions to provide missing knowledge if required. TheSemantic Web frameworks and the corresponding technologies

    1Facebook alone shared around 10% of the world population with 721million users in November 2011: https://www.facebook.com/notes/facebook-data-team/anatomy-of-facebook/10150388519243859

    represent a possible answer to this problem, using rich typedgraph models (e.g. RDF2) and schema denition frameworks(RDFS2 and OWL2).

    The vision of the Semantic Web, as coined by Berners-Leeand colleagues [4], is a common framework in which data isstored and shared in a machine-processable way. The contentof the Semantic Web is represented by formal ontologies,providing shared conceptualizations of specic domains [5].Emerged as an extension from the WWW, this technologyprovides a universally usable tool to model any problemdomain. Although the Semantic Web focuses on data storedon the web, this also implies the simultaneous representationof network and feature vector data (in other words, a at le)in one consistent data representation in the context of (social)network analysis.

    The possibility to represent any problem domain includingmultiple network relations and multiple network node types ina exible, formalized way, quickly gained attention in the datamining and machine learning community. The question arosehow data, once stored in the Semantic Web format, could beanalyzed directly, avoiding a potentially costly transformationprocess [6] [7]. Soon after its development, the concept of theSemantic Web led to the emergence of new disciplines, forinstance Semantic Web Mining [8].

    In the history of machine learning a variety of algorithmsfor different problem domains has been developed, such asclustering, classication and pattern recognition. Still, theincorporation of all background knowledge available, withother words, the best data representation, is a major issue indata mining [9]. Through the rich type graph models, SemanticWeb technologies promise to enhance the incorporation of allknowledge available. Besides, a growing number of recentdecision making problems have to take into consideration verydifferent kinds of data simultaneously, i.e. (social) networkdata and ego-centered (feature vector3) data [10]. Usually

    2Semantic Web, W3C, http://www.w3.org/2001/sw/3In traditional social network analysis literature these terms are usually

    referred to as actor attributes. We will, however, use the term feature vector,as our datasets include different node types and the term actor attributes couldbe misunderstood as users or peoples attributes only.

    2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

    978-0-7695-4799-2/12 $26.00 2012 IEEEDOI 10.1109/ASONAM.2012.179


    2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

    978-0-7695-4799-2/12 $26.00 2012 IEEEDOI 10.1109/ASONAM.2012.179


  • Case 1

    Variable C

    Variable B

    Like: Deep Purple

    Case 2

    Variable C

    Variable B

    Like: AC/DC

    Root Category:


    Sub Category: Rock music

    Case 3

    Variable C

    Variable B

    Like: Mozart

    Sub Category: 70s Rock Music

    Sub Category: 80s Rock Music

    Sub Category: Classical Music

    Sub Category Classical era


    Sub Category: Baroque music

    Instance of Relation/Attribute Subclass of

    Raw Data


    Fig. 1. Exemplary Metadata and corresponding Ontology

    both data are treated separately, facing a possible loss ofbackground knowledge leading to a potential lack of analysisaccuracy.

    Figure 1 shows a simple example of Semantic Web metadataand a corresponding ontology. Focusing on the rst variable(representing an outcome of musical preference) of the threeexemplary cases, a traditional feature vector format wouldmost likely end up in three dichotomous variables for themusical preference. Beside the laborious creation of a singlevariable for every musical preference outcome, the relationbetween the outcomes of this variable will generally not becaptured in a traditional le format. As a result, it might beomitted by traditional distance metrics like euclidean distanceor the jaccard index. Although hard to operationalize, adifferent distance between AC/DC, Deep Purple and Mozartin terms of musical distance seems to exist. As seen in Figure1, an ontology may be able to capture this information. Figure1 further highlights that Semantic Web data is basically agraph containing all information. In contrast to the majority oftraditional machine learning algorithms, where data is usuallyprocessed in the form of feature vector data, stored in an n-by-p data matrix containing n objects and p variables (features)for each object. According to the different formats, traditionalmachine learning algorithms might not be able to processSemantic Web data directly. We propose to examine traditionalmethods, extend them or create new ones in order exploit thepossible benets of the rich type data format. Although metricsdirectly gained from Semantic Web data have been introduced[11] and researchers started exploiting the rich typed SemanticWeb graphs for social network analysis [1], further evaluationsin comparison to traditional approaches are needed.

    In this paper we will focus on the similarity of (or distancebetween) objects in a social network. We will investigatethe usefulness of a Semantic Web based similarity measuredirectly calculated from the underlying RDF representation ofthe Semantic Web data, in comparison to traditional methods.

    Two real-world datasets (a sports community and a facebook4

    student community) comprising both social network and ego-centered data, are used to perform a link prediction betweenactors of the social networks. In the context of social media,a working system could be used to suggest possible acquain-tances in social networks, suggest products for advertisementor propose information artifacts (e.g. videos, music, discussiongroups) to social media users. On the other hand, this couldalso include the avoidance of unwanted or inadequate informa-tion and media. A further goal of the paper is to demonstrateand to assess the applicability of Semantic Web technologiesto support social network analysis on complex datasets.


    The expressive power of Semantic Web technologies hasbeen used in various elds and reach from geo-spatial ap-plications using semantic geo-catalogues [12] to informationretrieval in the mental health domain [13]. In order to processSemantic Web data, two approaches have been introduced toexploit the wealth of machine learning algorithms available.One possibility is to preprocess and transform the data to workwith traditional methods, i.e. Instance Extraction. Instance ex-traction is, however, a non-trivial process and requires a lot ofdomain knowledge. In an RDF graph, for example, all data isinterconnected and all relations can be made explicit. Grimnesand colleagues [14] describe the process as the extraction ofthe relevant subgraph to a single resource. So the questionof relevance in the problem domain has to be an


View more >