uciad overview

User Centric Integration of Activity Data Mathieu d’Aquin, Stuart Brown, Salman Elahi, Enrico Motta The Open University

Upload: mathieu-daquin

Post on 11-May-2015


DESCRIPTION

Overview of the UCIAD (user centric integration of activity data), presented during the JISC visit - 20/04/2011

TRANSCRIPT

Page 1: UCIAD overview

User Centric Integration of Activity Data

Mathieu d’Aquin, Stuart Brown, Salman Elahi, Enrico Motta

The Open University

Page 2: UCIAD overview

Agenda

• Introduction of the Team

• Objectives and Hypothesis

• Overview of technical realization

• Challenges

• Summary of results so far and dissemination

Page 3: UCIAD overview

Team

• Dr Mathieu d’Aquin – Research fellow, KMi – project director

• Stuart Brown – Web developments and online communities, communication services – member of the steering group, liaison with online services

• Salman Elahi – Research assistant and PhD student, KMi – developer/researcher

• Prof Enrico Motta – Professor of knowledge technologies, KMi – Chair of the steering group

Page 4: UCIAD overview

Objectives and Hypothesis

Hypothesis

1. Taking a user centric point of view can allow different types of analysis of logs/activity data, which are valuable to the organisation and the user

2. Ontologies and ontology-based reasoning can support the integration, consolidation and interpretation of activity data from multiple sources

Page 5: UCIAD overview

Organisation Centric Activity Data

[Diagram: users visit Website 1 to Website 4 of the organisation; each website produces its own logs (Logs 1 to Logs 4); consolidation of the logs feeds analytics, i.e. aggregated stats.]

Page 6: UCIAD overview

At the Open University

• An analytics system building aggregated data from the university’s various websites

• Based on manually defined sitemaps

• Good for website optimization, marketing campaigns, etc.

• But the data being pre-aggregated, it is limited with respect to what it can do

• Limited control

• No user view

Page 7: UCIAD overview

User Centric Activity Data

[Diagram: users visit Website 1 to Website 4 of the organisation; each website produces its own logs (Logs 1 to Logs 4); the logs go through consolidation, integration and interpretation, supported by ontologies, to enable activity analysis for and by individual users.]

Page 8: UCIAD overview

Ontologies

• Formal conceptual models of a domain
– Here, the domain is online user activity

• At the basis of Semantic Web technologies
– Standard languages for expressing ontologies and ontological data (RDF, OWL)
– Tools to manipulate and work with ontologies and semantic data (NeOn Toolkit, OWLIM)
– Many ontologies to reuse (cf. Watson)

• Adhere to a logical formalism
– Enable inferences on the data

Page 9: UCIAD overview

Objectives and Deliverables

• Build the technical infrastructure that can hold traces of activity data as semantic data
– Include triple store with reasoning capability, log parsers for different formats of logs, and renderers as semantic data (RDF)

• Build the ontologies to interpret and reason upon activity data
– Including various aspects of activity data in a way which is extensible

• Tools to support users in analyzing their own activity data
– Recognize a user from the different settings and provide view on his/her own data
– Allow him/her to customize the view, by customizing the ontology

• Test, validate, deploy, distribute

Page 10: UCIAD overview

Technical infrastructure

[Diagram: applications running on Server1, Server2 and Server3 write logs; each log goes through a parser/RDF renderer that produces daily RDF traces; a scheduler/manager loads these traces into the semantic triple store.]

Page 11: UCIAD overview

Technical infrastructure

• Development of parsers for different kinds of log formats
– Currently handle Apache web server log files, parameterized from the Apache configuration
– Easily extensible for dedicated log formats

• Provide a common data structure serialized in RDF by the RDF renderer

• Each server produces a daily extract from the logs in RDF, which is being used to populate the semantic triple store

• The triple store includes multiple repositories and sub-spaces depending on time/user/server
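As a rough illustration of the parser/RDF renderer step, the sketch below parses one line in Apache's "combined" log format and serializes it as N-Triples. It is not the UCIAD code: the namespace `http://example.org/uciad/`, the property names and the function `render_trace` are all hypothetical, and a real parser would be driven by the server's own LogFormat configuration.

```python
import re
from datetime import datetime

# Apache "combined" log format; a real deployment may use another LogFormat.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical namespace, used for illustration only.
NS = "http://example.org/uciad/"


def literal(value):
    """Render a string as an N-Triples plain literal (minimal escaping)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def render_trace(line, trace_id, server):
    """Parse one log line and return the N-Triples lines describing the trace."""
    match = COMBINED.match(line)
    if match is None:
        return []  # unparseable lines are simply skipped in this sketch
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    subject = f"<{NS}trace/{server}/{trace_id}>"
    fields = {
        "ip": match["ip"],
        "agent": match["agent"],
        "method": match["method"],
        "path": match["path"],
        "status": match["status"],
        "time": when.isoformat(),
    }
    return [f"{subject} <{NS}{name}> {literal(value)} ."
            for name, value in fields.items()]


sample = ('192.0.2.1 - - [20/Apr/2011:10:15:32 +0100] "GET /index.html HTTP/1.1" '
          '200 128 "-" "ExampleBot/1.0"')
triples = render_trace(sample, 1, "server1")
print("\n".join(triples))
```

Running such a renderer once a day per server would give the kind of daily RDF extract that the scheduler loads into the triple store.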

Page 12: UCIAD overview

Ontologies

• Key concepts to be represented:
– Actors (human users and robots)
– Sitemaps
– Traces (broad notion of logs)
– Activities

• Reusing existing ontologies
– FOAF: for people and documents
– Time Ontology: for traces
– Action ontology: for traces and activities
– (Planned) OPO: Online presence
– (Planned) SIOC: Online communities

Page 13: UCIAD overview
Page 14: UCIAD overview

Iterative and extensible construction of the ontologies
– Provide a base with actors, sitemaps and traces
– Specific extensions with typologies of activities, depending on user and site
– Dynamically building and integrating

Page 15: UCIAD overview

Tool for analysis

• Need a tool which, given
– A set of ontologies
– A data repository (which can be the overall one, one restricted by time, or one for a given user)
can provide a meaningful and interactive overview of the activity data

• To be used to
– Provide an ontology-specific view of data analytics
– Support the iterative development of the ontologies
– Provide a user centric view of the data

Page 16: UCIAD overview

Tools for analysis

Page 17: UCIAD overview

Example

In the ontology:

/robot.txt is a RobotTXT page

A Spider is a RobotAgent (ActorAgent)

An agent used to access a RobotTXT is a Spider

An AutomaticActivity is a Trace realized by a RobotAgent

Result:

Thousands of traces automatically classified as automatic activities.
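The effect of these definitions can be imitated in plain Python. This is only a sketch of the reasoning, not the OWL inference performed by the triple store: the trace dictionaries and the function `classify` are hypothetical, and the page path is taken as written on the slide (crawlers conventionally request `/robots.txt`).

```python
def classify(traces, robots_path="/robot.txt"):
    """Return (spiders, automatic_activities) for a list of trace dicts.

    Each trace is a dict with an 'agent' and a 'path' key (an assumed shape).
    """
    # Rule: an agent used to access a RobotTXT page is a Spider (a RobotAgent).
    spiders = {t["agent"] for t in traces if t["path"] == robots_path}
    # Rule: an AutomaticActivity is a Trace realized by a RobotAgent.
    automatic = [t for t in traces if t["agent"] in spiders]
    return spiders, automatic


traces = [
    {"agent": "ExampleBot/1.0", "path": "/robot.txt"},
    {"agent": "ExampleBot/1.0", "path": "/index.html"},
    {"agent": "Mozilla/5.0", "path": "/index.html"},
]
spiders, automatic = classify(traces)
print(spiders, len(automatic))
```

A single request for the robots file is enough to mark every other trace from the same agent as automatic, which is why a few short definitions can classify thousands of traces.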

Page 18: UCIAD overview

Example

In the ontology:

UCIAD-Blog and LUCERO-Blog are Blogs (Website)

A BlogPage is a page which is part of a Blog

An activity onBlog is an activity happening on a Blog Page

Result:

Can look specifically at activities happening on a Blog and specialize them (same applies to Wikis, and other types of websites)

Page 19: UCIAD overview

Example

In the ontology:

A SPARQLEndpoint is a specific type of Webpage

AccessingSPARQLEndpoint is an activity on a SPARQLEndpoint

SPARQLQueryParameter is a parameter with the name “query” used in an AccessingSPARQLEndpoint activity

ExecutingSPARQLQuery is an AccessingSPARQLEndpoint activity attached to a SPARQLQueryParameter

Result:

Can explore the specific activity of executing SPARQL queries and its parameters

Can combine: detect the activity of automatically accessing a SPARQL endpoint, i.e. an activity that is both an automatic activity and an access to a SPARQL endpoint.
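A minimal procedural sketch of the same classification, including the combined class, is given below. The endpoint path `/sparql`, the function `activity_types` and the class name `AutomaticallyAccessingSPARQLEndpoint` are assumptions made for illustration; in UCIAD these classes are defined in the ontology and inferred by the reasoner.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical set of paths known to be SPARQL endpoints.
SPARQL_ENDPOINTS = {"/sparql"}


def activity_types(request_uri, is_robot_agent):
    """Return the set of activity classes that apply to one trace."""
    parts = urlsplit(request_uri)
    types = set()
    if is_robot_agent:
        types.add("AutomaticActivity")
    if parts.path in SPARQL_ENDPOINTS:
        types.add("AccessingSPARQLEndpoint")
        # A parameter named "query" makes this an executed SPARQL query.
        if "query" in parse_qs(parts.query):
            types.add("ExecutingSPARQLQuery")
    # Combined class: both an automatic activity and an endpoint access.
    if {"AutomaticActivity", "AccessingSPARQLEndpoint"} <= types:
        types.add("AutomaticallyAccessingSPARQLEndpoint")
    return types


print(sorted(activity_types("/sparql?query=SELECT+%2A+WHERE+%7B%7D", True)))
```

In the ontology-based approach, adding such a combined class is an edit to the ontology rather than to code like this.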

Page 20: UCIAD overview

Next step: User support

• Allow users
– to log-in
– detect setting
– bring up the relevant data
– explore it

• But also,
– to customize the view of the data
– to extend the ontologies to provide a personalized analysis of activity data
– to export (interpreted) activity data for reuse

Page 21: UCIAD overview

User support

[Flowchart: user logs in or registers, then the setting (agent + IP) is detected. Known setting for user: display activity data related to all known settings of the user. Unknown setting: check whether the setting is non-ambiguous. Non-ambiguous: ask “It is the first time you log into UCIAD with this setting (detail), do you want to attach it to your account?” (yes: add setting to known settings; no). Ambiguous: register setting as ambiguous.]
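The login flow can be sketched as a single function. Several points are assumptions rather than facts from the slides: a setting is treated as ambiguous when it is already attached to another account or was flagged before, every branch ends by displaying the user's known settings, and the names `handle_login` and `confirm` are hypothetical.

```python
def handle_login(user, setting, known_settings, ambiguous, confirm):
    """Return the settings whose activity data should be displayed.

    user           -- user identifier
    setting        -- (agent, ip) tuple detected at login
    known_settings -- dict mapping each user to a set of settings
    ambiguous      -- set of settings registered as ambiguous
    confirm        -- callable asking the user whether to attach the setting
    """
    mine = known_settings.setdefault(user, set())
    if setting not in mine:
        # Assumption: ambiguous means claimed by someone else or already flagged.
        claimed_by_other = any(
            setting in settings
            for other, settings in known_settings.items()
            if other != user
        )
        if claimed_by_other or setting in ambiguous:
            ambiguous.add(setting)
        elif confirm(setting):
            mine.add(setting)
    return set(mine)


known = {"alice": {("Firefox", "192.0.2.10")}}
flagged = set()
shown = handle_login("bob", ("Chrome", "192.0.2.20"), known, flagged, lambda s: True)
print(shown)
```

Passing `confirm` as a callable keeps the question to the user separate from the decision logic, so the flow can be exercised without a user interface.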

Page 22: UCIAD overview

User support: data for a user

For a user <u> the SPARQL query

CONSTRUCT {?trace ?p ?y. ?y ?q ?z} WHERE

{<u> actor:hasKnownSetting ?s.

?trace trace:hasSetting ?s.

?trace ?p ?y.

?y ?q ?z}

builds the traces of activities around the known settings of <u>

Used to populate a specific repository with sub-spaces for each registered user
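To make the shape of the result concrete, here is a small in-memory imitation of what the CONSTRUCT template returns: each matching trace's own triples plus the triples describing the values they point to. It is a sketch only. The triples are plain tuples, `trace:hasPage` and the sample identifiers are hypothetical, and unlike a strict SPARQL evaluation it keeps a trace's properties even when their values have no further description.

```python
def user_traces(triples, user):
    """Return the sub-graph of traces around the known settings of a user.

    triples -- set of (subject, predicate, object) tuples
    """
    settings = {o for s, p, o in triples
                if s == user and p == "actor:hasKnownSetting"}
    traces = {s for s, p, o in triples
              if p == "trace:hasSetting" and o in settings}
    # First hop: everything stated about the matching traces.
    one_hop = {t for t in triples if t[0] in traces}
    # Second hop: everything stated about the values the traces point to.
    neighbours = {o for _, _, o in one_hop}
    two_hop = {t for t in triples if t[0] in neighbours}
    return one_hop | two_hop


data = {
    ("u:alice", "actor:hasKnownSetting", "s:1"),
    ("t:1", "trace:hasSetting", "s:1"),
    ("t:1", "trace:hasPage", "p:home"),
    ("p:home", "rdf:type", "BlogPage"),
    ("t:2", "trace:hasSetting", "s:2"),
    ("t:2", "trace:hasPage", "p:other"),
    ("p:other", "rdf:type", "WikiPage"),
}
result = user_traces(data, "u:alice")
print(sorted(result))
```

The returned set is what would be written into the user's own sub-space of the repository.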

Page 23: UCIAD overview

Deployment, test, validation

• At the moment, testing for websites of projects and events hosted on KMi servers:
– sssw.org, sssw09.org, loted.eu, lucero-project.info, uciad.info, data.open.ac.uk, lucero.open.ac.uk, …

• Next level up, websites/systems from the main Open University website:
– www.open.ac.uk, study at the OU, podcasts.open.ac.uk, VLE

• Extend to deployment of instances for specific projects with distributed websites

Page 24: UCIAD overview

Challenges

• Scalability
– OWLIM triple store can handle billions of triples
– But struggles with millions when inference is “on”
– 1 repository without inference with all historical data, 1 with inference with 1 week of data only, and 1 with inference for registered users

• User management and privacy
– Ensuring that the user who logs in from a particular setting is the one having the activity is difficult (e.g., in the case of shared computers)
– Is this really a problem?
– Check ambiguity – ask verification questions – moderate?

• Distribution and IPR
– Code and ontologies under open licenses (small uncertainty regarding code developed in other projects)
– Overall data: privacy issues (is k-anonymity actually applicable? Would it work?)
– Overall data: institutional issues (can we show the traffic on our websites to everybody?)
– User data export: what license?
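The three-repository arrangement described under scalability can be read as a routing rule applied to each incoming trace. The sketch below is an assumption about how such a rule might look: the repository names, the function `target_repositories` and the exact 7-day cut-off are illustrative and do not come from the UCIAD code base.

```python
from datetime import date, timedelta


def target_repositories(trace_date, setting, registered_settings, today):
    """Return the names of the repositories a trace should be loaded into."""
    # Every trace goes to the full historical repository (no inference).
    repos = ["history-no-inference"]
    # Only the last week of data is kept in the general inference repository.
    if today - trace_date < timedelta(days=7):
        repos.append("last-week-inference")
    # Traces from settings attached to registered users get their own repository.
    if setting in registered_settings:
        repos.append("registered-users-inference")
    return repos


today = date(2011, 4, 20)
print(target_repositories(date(2011, 4, 18), "s1", {"s1"}, today))
```

Keeping inference off for the large historical repository and on for the two small ones is the trade-off the slide describes.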

Page 25: UCIAD overview

Summary and dissemination

• Promising initial results
– Can create new ways of analysis at run-time by editing the ontologies!
– Mechanisms to provide personal views on own activity data across websites

• First version of the ontologies: ongoing task

• First version of the tools: test and validate!

• Dissemination
– Blog / Twitter #uciad
– KMi’s internal newsletter (KMi Planet)
– Salman’s paper at the ESWC 2011 PhD symposium: “Personal Semantics: Personal information management in the Web with Semantic Technologies”
– Position paper at the W3C Web tracking and privacy workshop: “Self-Tracking on the Web: Why and How”
– Submission to the Personal Semantic Data workshop at K-CAP 2011

Page 26: UCIAD overview

More info

UCIAD Blog: http://uciad.info

Code base: http://github.com/uciad

Twitter: #uciad

@mdaquin