
SUP – Semantic User Profiling

Emanuela Boroș, Alexandru-Lucian Gînscă

UAIC: Faculty of Computer Science, “Alexandru Ioan Cuza” University, Romania {emanuela.boros, lucian.ginsca}@info.uaic.ro

Abstract. We present in this report a model for a user's profile based on multiple social network accounts and influence services. In the modeling process we make use of well-established vocabularies, but we also create our own model, especially for influence-related data. We built a web application that offers an accessible interface to the knowledge base, while also allowing the user to have his social graph semantically modeled.

1 Introduction

Using the information provided by current social networks (Twitter and Facebook), SUP (Semantic User Profiling) is a Web platform for managing user profiles. A user profile is modeled semantically and exposed according to the related standards. SUP also provides means for estimating a user's reputation based on multiple criteria, using social scoring services such as Klout and PeerIndex. The user can view his social graph, which can also be queried through a SPARQL service. The core principle behind the application is an attractive visual presentation of a user's semantic profile.

The remaining principles concern the functional properties of the application. SUP extends a standard CRUD architecture into a sophisticated web application: presentation and data model logic are properly separated (clients provide the user interface, while servers handle storage and application modeling logic), storage is handled by the Virtuoso triple store, data is kept consistent end to end (JSON/JavaScript), communication between client and server is smooth in both directions, and the interfaces remain cleanly encapsulated behind lightweight RESTful web services. The final result is a web application with an effective user experience that brings together the cumulative advances of modern JavaScript and web architecture design patterns: JSON, RDF, AJAX, the REST style, and a thin server architecture.

2 Global Architecture

The primary purpose of this data-driven application is to visualize the data in the most pleasing way possible. A query is passed to the application, which returns a set of matching responses, ordered by relevance and mapped in a standardized way. This process needs a lightweight updater for the web page, which means asynchronous functionality, a creative way of visualizing the updates, end-to-end consistency in data, and a lightweight CRUD-style data provider. To obtain this, the architecture of SUP has been designed following a three-tier approach, as a light model-view-controller. The architecture combines technologies from the JavaScript/jQuery/Ajax and Java worlds. The presentation layer is JavaScript-driven, with Ajax pushing information, while the business and data layers are realized with Java EE technologies. Following this thought, the application takes the best of both worlds: the dynamic, personalized user experience we expect of immersive Web applications and the simple, scalable architecture we expect from RESTful applications. Below we provide further details about the three tiers.

Figure 1: SUP global architecture


Presentation Layer

This layer has been developed as a single web page. The parent page serves the common user of the application, who is looking for a creative way of visualizing personal data, while the child page addresses specialized users, who are looking for a representational state of their SPARQL queries. The communication between the two higher tiers is carried out through Ajax, with the client submitting requests to the logic tier and receiving back JSON data representing the content of the response, which is then parsed and used to trigger the proper interaction in the user interface. The data received from the server is presented in two ways: one for the graph form of data visualization and the other for the raw result of the SPARQL queries, which comes in XML format.

The main keywords for this tier are: HTML, CSS, JavaScript, Ajax, Protovis, Twitter @Anywhere, Facebook JavaScript SDK.

First of all, there is a need to maintain a user's profile data. Pushing more data from the server to the client is a simple way of distributing processing to the clients, which makes the application properly scalable. Because Ajax allows interaction with the server without a full page refresh, the option of a stateful client is back on the table. This has profound implications for the architectural possibilities of dynamic, immersive Web applications. The RESTful services (the Visualization and SPARQL web services) are the data providers for the Ajax updates. The primary response type we use is JSON, because it is human readable and easy to process.

The business and functional components of the application require minimal information from the main social networks used as data providers. This information is obtained using Twitter @Anywhere1 and the Facebook JavaScript SDK2. Twitter @Anywhere is an easy-to-deploy solution for bringing the Twitter communication platform to a web page; it is used to build the "Connect to Twitter" integration. The Facebook JavaScript SDK provides simple client-side functionality for accessing Facebook's API calls. The social plugins are used to obtain an access token for the communication with Facebook.

The creation and population of the graphs needed to visualize the data of every semantic profile is done with Protovis. The common forms of visualization are the social graph and the timeline. They are fed with JSON results once the RESTful services have been provided with query-specific results (this discussion is continued in the next section).

1 https://dev.twitter.com/docs/anywhere/welcome 2 https://developers.facebook.com/docs/reference/javascript/


Business Logic Layer

The business logic of the application is implemented as a collection of Java RESTful Web Services deployed on a Tomcat 6 server. The services are used to forward SPARQL queries and to receive the corresponding responses from the Virtuoso triple store. These responses are processed and prepared for consumption by the user interface. This tier benefits from REST web services, which are lightweight (no complex markup), return human-readable results, and are easy to build, with no toolkits required. We use them in a CRUD style to obtain the data we need for creating semantic profiles.
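For illustration only (this is not the actual SUP service code; the resource path, endpoint URL and class names are assumptions), a minimal JAX-RS resource of this kind could forward a SPARQL query to the Virtuoso SPARQL endpoint through Jena ARQ and return the bindings as JSON:

// Hypothetical sketch of a SPARQL-forwarding service (JAX-RS 1.x + Jena ARQ).
// The endpoint URL and resource path are assumptions, not taken from SUP.
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFormatter;
import java.io.ByteArrayOutputStream;

@Path("/sparql")
public class SparqlService {

    // Assumed location of the Virtuoso SPARQL endpoint.
    private static final String ENDPOINT = "http://localhost:8890/sparql";

    @GET
    @Produces("application/json")
    public String query(@QueryParam("q") String sparql) {
        Query query = QueryFactory.create(sparql);
        QueryExecution exec = QueryExecutionFactory.sparqlService(ENDPOINT, query);
        try {
            ResultSet results = exec.execSelect();
            // Serialize the bindings as SPARQL results in JSON for the Ajax client.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ResultSetFormatter.outputAsJSON(out, results);
            return out.toString();
        } finally {
            exec.close();
        }
    }
}

The Ajax client would then issue a GET request with its SPARQL query in the q parameter and receive the result bindings in JSON form.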

Data Layer

The data tier is mainly represented by a component for accessing and managing the RDF/OWL model. This component queries and manages RDF triples with OpenLink Software's Virtuoso3, a database server that can also store (and, as part of its original specialty, serve as an efficient interface to databases of) relational data and XML. The primary data, which consists of details of users' profiles from different social networks and different scores of their influence in the online medium, is gathered using implementations of commonly used social media and social scoring services: Twitter, Facebook, Klout and PeerIndex.

For Klout and PeerIndex, we created our own API client implementations; they are the main providers for influence score computation. For Twitter, we used Twitter4J4, a library for easy integration with the Twitter service, with built-in OAuth support and zero dependencies, and for Facebook we chose RestFB5, a simple and flexible Facebook Graph API and Old REST API client written in Java.

The reasoning over specific data is explained in the Data Acquisition and Influence model sections.

3 General Model and Vocabularies

Vocabularies. Besides rdf, rdfs, owl and our own vocabulary developed for modeling influence information, we mainly use the foaf and sioc vocabularies.

3 http://docs.openlinksw.com/ 4 http://twitter4j.org/en/index.html 5 http://restfb.com/


Table 1: Used terms sample

SIOC                FOAF                  FOAF (cont.)
sioc:user           foaf:Agent            foaf:birthdate
sioc:follows        foaf:onlineAccount    foaf:firstName
sioc:userAccount    foaf:knows            foaf:lastName
sioc:avatar         foaf:nick             foaf:homepage
sioc:creatorOf      foaf:img
sioc:post           foaf:mbox

In Figure 2, we can see a part of the model, containing information about three users and their friends. The visualization was done with Gravity, using the RDF generated by the Jena API.

Figure 2: Model sample with Gravity
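As a rough sketch of how such a fragment can be assembled with the Jena API (the base URI, resource names and the exact property set are illustrative assumptions, not our actual model):

// Illustrative Jena sketch: a FOAF/SIOC fragment for one user and a friend.
// URIs and property choices are assumptions for the example only.
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;

public class ProfileModelSketch {
    public static void main(String[] args) {
        String FOAF = "http://xmlns.com/foaf/0.1/";
        String SIOC = "http://rdfs.org/sioc/ns#";
        String BASE = "http://example.org/sup/"; // assumed base URI

        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("foaf", FOAF);
        model.setNsPrefix("sioc", SIOC);

        Resource alice = model.createResource(BASE + "alice")
                .addProperty(RDF.type, model.createResource(SIOC + "UserAccount"))
                .addProperty(model.createProperty(FOAF, "nick"), "alice")
                .addProperty(model.createProperty(FOAF, "homepage"),
                             model.createResource("http://example.org/~alice"));

        Resource bob = model.createResource(BASE + "bob")
                .addProperty(RDF.type, model.createResource(SIOC + "UserAccount"))
                .addProperty(model.createProperty(FOAF, "nick"), "bob");

        // foaf:knows / sioc:follows links between the two accounts.
        alice.addProperty(model.createProperty(FOAF, "knows"), bob);
        alice.addProperty(model.createProperty(SIOC, "follows"), bob);

        // RDF/XML output of the kind visualized with Gravity and Welkin.
        model.write(System.out, "RDF/XML");
    }
}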

In Figure 3, there is a visualization of the same snippet of the model, this time with Welkin. A node has been highlighted to show more information.


Figure 3: Model sample with Welkin

4 Data acquisition

Data acquisition concerns the knowledge model of SUP. The raw data is obtained through implementations of the main social network APIs; it is imported directly from the web, mainly from Twitter and Facebook. For Twitter and Facebook data acquisition, we created wrappers around the libraries we use, adapted to our data needs. Both services require the application to be registered in advance in order to acquire consumer keys and consumer secrets.

The Twitter API6 consists of three parts: two REST APIs and a Streaming API. The Twitter REST API is the core API set: it allows developers to access core Twitter data, it contains most of the methods and functions needed to use Twitter data in an application, and it supports three formats for each method: XML, Atom, and JSON. This includes update timelines, status data, and user information. The Search API gives developers methods to interact with Twitter Search and trends data. Our main concern is rate limiting and the output format, which can easily become important issues when using this API. We use Twitter4J, a Java library recognized by Twitter, as a simple implementation of the Twitter REST API. The data extracted with the library consists mainly of personal user information, details about friends and followers, and the latest tweets. The methods through which Twitter offers resources follow this pattern:

Resource URL: https://api.twitter.com/1/users/show.json

6 https://dev.twitter.com/docs


GET followers/ids Returns an array of numeric IDs for every user following the

specified user. This method is powerful when used in conjunction with users/lookup.

GET friends/ids Returns an array of numeric IDs for every user the specified user is

following. This method is powerful when used in conjunction with users/lookup.

GET users/show Returns extended information of a given user, specified by ID or

screen name as per the required id parameter. The author's most recent status will be

returned inline. Users follow their interests on Twitter through both one-way and

mutual following relationships.

The responses we are aiming for have the JSON structure:

{
  "profile_image_url": "http://a3.twimg.com/profile_images/689684365/api_normal.png",
  "location": "San Francisco, CA",
  "follow_request_sent": false,
  "id_str": "6253282",
  "profile_link_color": "0000ff",
  "is_translator": false,
  "contributors_enabled": true,
  "url": "http://dev.twitter.com",
  "favourites_count": 15,
  "id": 6253282
}

The Facebook Graph API7 presents a simple, consistent view of the Facebook social graph, uniformly representing objects in the graph (e.g., people, photos, events, and pages) and the connections between them (e.g., friend relationships, shared content, and photo tags). For Facebook data acquisition, we use the RestFB Java library. RestFB already maps objects to JSON, so the data is received in this format:

{
  "id": "220439",
  "name": "Facebook User",
  "first_name": "Facebook",
  "last_name": "User",
  "link": "https://www.facebook.com/facebook.user",
  "username": "facebook.user",
  "gender": "male",
  "locale": "en_US"
}

For proper usage of this library, we created a wrapper with built-in Facebook Graph specific queries. This way, we minimized the effort of repeatedly creating the same queries. In the end, Facebook provides us with personal data, extended details about friends, and the personal feed.
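Similarly, a minimal sketch of the RestFB calls that such a wrapper encapsulates (assuming a valid access token obtained on the client side; not the actual SUP wrapper code):

// Minimal RestFB sketch: fetch the current user, friend list and feed.
import com.restfb.Connection;
import com.restfb.DefaultFacebookClient;
import com.restfb.FacebookClient;
import com.restfb.types.Post;
import com.restfb.types.User;

public class FacebookWrapperSketch {
    public static void main(String[] args) {
        String accessToken = "ACCESS_TOKEN"; // supplied by the presentation layer
        FacebookClient client = new DefaultFacebookClient(accessToken);

        User me = client.fetchObject("me", User.class);
        Connection<User> friends = client.fetchConnection("me/friends", User.class);
        Connection<Post> feed = client.fetchConnection("me/feed", Post.class);

        System.out.println(me.getName() + " has " + friends.getData().size()
                + " friends on the first page and " + feed.getData().size()
                + " recent feed entries.");
    }
}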

7 https://developers.facebook.com/docs/reference/api/


The process of data acquisition combined with social scores is explained in the figure below.

Figure 4: Data acquisition workflow

5 Influence model

We are interested in discovering features related to a user's influence on a certain social network and to the influence of his friends, and in creating a model using RDFS and OWL for these influence components. We use two services that are known for their work in social network influence analysis, Klout8 and PeerIndex9.

Klout. We included in our model, besides the Klout score, other influence-related concepts that Klout offers. Next, we present the four influence scores that Klout provides. Most of the descriptions were taken from Klout's website and serve the purpose of giving a better understanding of the different notions regarding influence that are being introduced in the model.

8 http://klout.com/ 9 http://www.peerindex.com/


Klout Score: The Klout Score is the measurement of the user’s overall online

influence. The score ranges from 1 to 100 with higher scores representing a wider and

stronger sphere of influence.

Amplification Probability: Klout describes the Amplification Probability as: "the

likelihood that your content will be acted upon. The ability to create content that

compels others to respond and high-velocity content that spreads into networks

beyond your own is a key component of influence."

Network: The network effect that an author has; it is a measure of the influence of the people the author is reaching. Klout describes it as "the influence level of your engaged audience."

True Reach: The True Reach score from Klout measures how many people an author

influences.

In Figure 5, a snippet from the RDF/XML file describing the Klout score is shown.

Figure 5: Klout score in RDF

Next, we will present some of the 17 Klout classes. In our model, the Klout class concept is defined using the owl:oneOf construct, by enumerating the instances (a sketch of this construct is given after the class descriptions below).

Broadcaster: The user broadcasts appreciated content that spreads fast. He is an essential information source in his industry. He has a large and diverse audience.

Celebrity: The user has reached a maximum audience. People share his content in great numbers. He is probably famous in real life and has numerous fans.

Curator: The user highlights the most interesting people, finds the best content on the web and shares it with a wide audience. He is a critical information source.

Feeder: The user's audience relies on him for a steady flow of information about his industry or topic.

Observer: He doesn't share very much, but follows the social web. He prefers observing to sharing.
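As a sketch of the owl:oneOf construct mentioned above (the namespace is an assumption and only three of the seventeen instances are shown):

// Sketch: defining a "KloutClass" concept as an owl:oneOf enumeration with Jena.
// The ontology URI and the subset of instances shown are assumptions.
import com.hp.hpl.jena.ontology.Individual;
import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.RDFList;
import com.hp.hpl.jena.rdf.model.RDFNode;

public class KloutClassSketch {
    public static void main(String[] args) {
        String NS = "http://example.org/sup/influence#"; // assumed namespace
        OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);

        // The class whose instances will form the enumeration.
        OntClass kloutClass = m.createClass(NS + "KloutClass");
        Individual broadcaster = m.createIndividual(NS + "Broadcaster", kloutClass);
        Individual celebrity = m.createIndividual(NS + "Celebrity", kloutClass);
        Individual observer = m.createIndividual(NS + "Observer", kloutClass);

        // owl:oneOf list enumerating the admissible instances.
        RDFList members = m.createList(new RDFNode[] { broadcaster, celebrity, observer });
        m.createEnumeratedClass(NS + "KloutClass", members);

        m.write(System.out, "RDF/XML-ABBREV");
    }
}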

Klout also offers a list of at most five influencers and one of at most five people influenced by the user. We captured this aspect with the isInfluencedBy and influences relations, as seen in Figure 6.


Figure 6: Klout influence relations in RDF

PeerIndex. Although PeerIndex relies on fewer data sources than Klout, we wanted an alternative to the Klout score. Next, we will present descriptions of the four influence scores, as given by PeerIndex.

PeerIndex score: A user’s overall PeerIndex score is a relative measure of his online

authority. The PeerIndex Score reflects the impact of his online activities, and the

extent to which he has built up social and reputational capital on the web.

In Figure 7, a snippet from the RDF/XML file describing the PeerIndex score is

shown.

Figure 7: PeerIndex score in RDF

Authority Score: Authority is a measure of trust, indicating how much others rely on the user's recommendations and opinions, both in general and on particular topics.

PeerIndex calculates the authority in eight benchmark topics for every profile. These

are used to generate the overall Authority Score as well as produce the PeerIndex

Footprint diagram. The Authority Score is a relative positioning against everyone else in each benchmark topic. The rank is a normalized measure against all the other

authorities in the topic area.

Audience Score: The Audience Score is a normalized indication of the user's reach, taking into account the relative size of his audience compared to the size of the audiences of others. In calculating his Audience Score, PeerIndex does not simply use the number of people who follow him, but instead derives it from the number of people who are impacted by his actions and are receptive to what he is saying. If the user has an "audience" consisting of a large number of spam accounts, bots, or inactive accounts, his Audience Score will reflect this.

Activity Score: The Activity Score is the measure of how much the user does that is related to the topic communities he is part of. If he is too active, his topic community members tend to get fatigued and may stop engaging with him; the Activity Score takes this behavior into account. Like the other scores, the Activity Score is calculated relative to the user's communities. If he is part of a community that has a large amount of activity, his level of activity and engagement will need to be higher to achieve the same relative score as in a topic that has less activity.

In Figure 8, we see a visualization of the model with Welkin.10

Figure 8: Influence model visualized with Welkin

6 Topic Semantic Similarity

A user has different topics associated with him, drawn from multiple sources, which give an overview of his most discussed concepts and his interests. In our current implementation, topics are gathered from the Klout and PeerIndex services. While PeerIndex returns a straightforward list of topics for a certain user, Klout has a particular understanding of the concept of "topic". Next, we present Klout's method of finding topics.

10 http://simile.mit.edu/welkin/


Klout topics are gathered from the Twitter stream, and in some cases they seem to have nothing to do with what the user tweets about. Klout looks for specific keywords in the user's tweets that received a certain amount of attention, such as numerous replies to the user's tweet or retweets of that tweet. If the user replies to someone's tweet and the response generates a lot of interest, then Klout will look back to the original tweet for keywords. Once the keywords that draw influence are obtained, Klout uses a dictionary to identify relevant terms. More details regarding this dictionary and how the terms are correlated do not seem to be publicly available. Klout then compares the user's influence on these terms to see if he is generating significant influence within his network. If Klout determines that a user has influence on a specific term, that term will appear on his list of topics. For a better understanding of this process, we give a small example. If a user posts at least 10 tweets about cats each day, but no one ever replies to them, the term "cat" will not appear on his topic list; but if a user publishes a tweet about "war" and this tweet generates tens of replies and gets retweeted many times, then it is most likely that the term "war" will be found in his list of topics.

For computing the semantic similarity between two terms, we use three WordNet semantic similarity algorithms: Wu and Palmer, Resnik, and Lin. Next, we give more details about these measures and present results computed on 5 Klout topics extracted from our knowledge base.

Wu and Palmer measure. The Wu & Palmer measure [3] calculates semantic

similarity by considering the depths of the two synsets in the WordNet taxonomies,

along with the depth of the least common subsumer. The formula is as follows:

sim_{WP}(s_1, s_2) = \frac{2 \cdot depth(lcs(s_1, s_2))}{depth(s_1) + depth(s_2)}

where

s1: the synset of the first term;
s2: the synset of the second term;
lcs(s1, s2): the synset of the least common subsumer.

This means that 0 < sim_{WP}(s1, s2) <= 1. The score can never be zero because the depth of the least common subsumer is never zero; the depth of the root of a taxonomy is one. The score is one if the two input synsets are the same.
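For instance, with purely illustrative depths depth(lcs) = 5, depth(s1) = 6 and depth(s2) = 7 (not actual WordNet values), the score would be 2 * 5 / (6 + 7) ≈ 0.77. In code, once the depths are obtained from WordNet, the measure reduces to a one-line computation:

// Wu & Palmer similarity from synset depths (the depths must come from WordNet).
public final class WuPalmer {
    private WuPalmer() {}

    public static double similarity(int depthLcs, int depthS1, int depthS2) {
        return (2.0 * depthLcs) / (depthS1 + depthS2);
    }

    public static void main(String[] args) {
        // Illustrative depths only, not actual WordNet values.
        System.out.println(similarity(5, 6, 7)); // prints 0.769...
    }
}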

Table 2: Wu and Palmer

Terms       internet  design  web    education  philosophy
internet    1.0       0.631   0.909  0.222      0.21
design      0.631     1.0     0.75   0.8        0.75
web         0.909     0.75    1.0    0.461      0.428
education   0.222     0.8     0.8    1.0        0.8
philosophy  0.21      0.75    0.428  0.8        1.0


Resnik measure. This measure also relies on the idea of a least common subsumer

(LCS), the most specific concept that is a shared ancestor of the two concepts. [4]

The Resnik [1] measure simply uses the Information Content (IC) of the LCS as the similarity value:

sim_{Res}(t_1, t_2) = IC(lcs(t_1, t_2)), \quad IC(t) = -\log\frac{freq(t)}{maxFreq}

where

lcs(t1, t2): the least common subsumer;
freq(t): the frequency of term t in a corpus;
maxFreq: the maximum frequency of a term in the same corpus.

The Resnik measure is considered somewhat coarse, since many different pairs of

concepts may share the same LCS. However, it is less likely to suffer from zero

counts (and resulting undefined values) since in general the LCS of two concepts will

not be a very specific concept.

Table 3: Resnik

Terms       internet  design  web    education  philosophy
internet    10.37     0.631   10.37  0.0        0.0
design      2.49      11.76   2.49   3.39       3.39
web         10.37     2.49    11.76  2.87       0.77
education   0.0       3.39    2.87   10.66      3.39
philosophy  0.0       3.39    0.77   3.39       11.76

Lin measure. The Lin measure [2] augments the information content of the LCS with the sum of the information content of the concepts A and B themselves: it scales the information content of the LCS by this sum.
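Written out with the information content IC used by the Resnik measure, the standard formulation from [2] is:

sim_{Lin}(t_1, t_2) = \frac{2 \cdot IC(lcs(t_1, t_2))}{IC(t_1) + IC(t_2)}

The resulting score lies between 0 and 1, which is consistent with the values reported in Table 4.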


Table 4: Lin

Terms       internet  design  web    education  philosophy
internet    1.0       0.28    0.32   0.0        0.0
design      0.28      1.0     0.27   0.46       0.48
web         0.32      0.27    1.0    0.09       0.09
education   0.0       0.46    0.09   1.0        0.46
philosophy  0.0       0.48    0.09   0.46       1.0

Topic set similarity. For computing the semantic similarity between the topics of

interest of two users using one of the three measures described above, we first

generate the stem of each term, using an open source implementation of the Porter

Stemmer. The final similarity score is obtained using a weighted average over the

maximum score obtained by applying a semantic similarity measure on each

combination of a term from the first user’s topics set and one from the second user’s

topic set.

T1: first user’s topics set;

T2: second user’s topics set;

sim(t1, t2): one of the Wu and Palmer, Resnik or Lin similarity measures.
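The exact weighting scheme is not reproduced here; assuming uniform weights, the description above corresponds to a mean of best matches:

sim(T_1, T_2) = \frac{1}{|T_1|} \sum_{t_1 \in T_1} \max_{t_2 \in T_2} sim(t_1, t_2)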

7 Visualization

We mentioned the use of Protovis11 to create the graphics for visualizing a semantic profile. Protovis is a tool that draws images in the Scalable Vector Graphics (SVG) format, which every modern desktop and mobile browser, including IE 9, can render. We used two types of graphs: a force-directed graph and a timeline. In the case of the force-directed graph, an intuitive approach to network layout is to model the graph as a physical system: nodes are charged particles that repel each other, and links are dampened springs that pull related nodes together. A physical simulation of these forces then determines the node positions; approximation techniques that avoid computing all pairwise forces enable the layout of large numbers of nodes. In addition, interactivity allows the user to direct the layout and jiggle nodes to disambiguate links. A graph of this type has been developed for representing friendship scoring between a user and his friends.

11 http://mbostock.github.com/protovis/docs/


Figure 9: Graph

The timeline represents a common way of showing a user's activity over time. Screenshots are shown below.

Figure 10: Timeline


Figure 11: Sparql Endpoint

8 Use Cases

We distinguish two main types of use cases. The first involves an inexperienced user who just wants to find information about his social graph or about his friends' graphs. The second involves a user with SPARQL knowledge, who can write his own queries and visualize the results in table form, or select one of the predefined queries that generate interactive graphs and modify them.

9 Conclusion

Semantic modeling requires continued involvement from our team, and it is important to keep investigating means to compute influence more accurately. A larger collection of triples would be needed, along with a more complex semantic model. Future work includes completing SUP with the semantic similarity computation between users' topics: the module has been implemented using WordNet-based semantic similarity algorithms, but it is not yet included in the main workflow. In conclusion, we will focus on improving the semantic model and on exploring new ways of properly visualizing the data.


References

1. Philip Resnik. 1995. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95).

2. D. Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, August.

3. Z. Wu and M. Palmer. 1994. Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, Las Cruces, New Mexico.

4. Pedersen, Ted, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity — measuring the relatedness of concepts. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), pages 1024–1025. AAAI Press.