RDF data clustering

Upload: silvia-giannini

Post on 27-Jan-2015

Page 1: RDF data clustering

Towards a unified framework for distributed data management across the Semantic Web

Silvia Giannini (Supervisor: Prof. Eugenio Di Sciascio)

Dipartimento di Ingegneria Elettrica e dell'Informazione (DEI), Politecnico di Bari, Bari, Italy

[email protected]

8th ICCL Summer School Workshop (ICCL 2013)
Semantic Web - Ontology Languages and Their Use

Dresden, Germany | 26 August, 2013

Page 2: RDF data clustering

The scenario RDF clustering Proposal Preliminary Results Conclusions

Outline

1 The scenario

2 RDF clustering
  Motivations
  State of Art

3 Proposal

4 Preliminary Results

5 Conclusions

Silvia Giannini RDF data clustering

Page 4: RDF data clustering

The Linking Open Data (LOD) project

A global Uniform Resource Identifier for each entity on the web (URIs)

A standardized access mechanism (HTTP URIs)

A machine-readable, open and standardized data format (RDF)

A mechanism for linking different data sources (RDF-links)
  Relationship Links
  Identity Links
  Vocabulary Links

Page 5: RDF data clustering

The Linking Open Data (LOD) project

[Figure: Linking Open Data cloud diagram as of September 2011, by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/. Each node is a published dataset (for example DBpedia, GeoNames, Freebase, MusicBrainz, DrugBank, UniProt); the legend groups datasets into Media, Geographic, Publications, Government, Cross-domain, Life sciences, and User-generated content.]

Page 6: RDF data clustering

RDF: the big picture

DBpedia [1] extract

[Figure: RDF graph in which dbpedia:Dresden is linked to dbpedia:Germany by dbpedia-owl:country and to the value 328.8 by dbpedia-owl:areaTotal]

Graph-structured knowledge representation (data model)

Resource: concrete or abstract entity of the real world, identified by a dereferenceable URI
Description: representation of properties or relationships among resources
Framework: combination of web-based protocols and formal semantics

Facts in triple form: subject - predicate - object
<http://dbpedia.org/resource/Dresden> <http://dbpedia.org/property/country> <http://dbpedia.org/resource/Germany> .

[1] http://dbpedia.org
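A minimal Python sketch of the triple form shown on this slide, assuming plain tuples are an acceptable stand-in for an RDF library. The `Triple` type and the `to_ntriples` helper are hypothetical names introduced only for illustration; the URIs are the ones from the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF fact: subject - predicate - object (all URIs in this sketch)."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a URI-only triple as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# The fact from the slide: Dresden's country is Germany.
fact = Triple(
    "http://dbpedia.org/resource/Dresden",
    "http://dbpedia.org/property/country",
    "http://dbpedia.org/resource/Germany",
)
print(to_ntriples(fact))
```

Literal objects such as 328.8 would need quoting and a datatype, which this sketch deliberately leaves out.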

Page 7: RDF data clustering

RDF: the big picture

DBpedia extract

[Figure: the DBpedia extract shown in two layers. RDF data model layer: dbpedia:Dresden, dbpedia:Germany, and the value 328.8, connected by dbpedia-owl:country and dbpedia-owl:areaTotal. RDF Schema layer: rdf:type, rdfs:domain, and rdfs:range links involving dbpedia-owl:country, dbpedia-owl:PopulatedPlace, dbpedia-owl:Country, and owl:ObjectProperty.]

RDF Schema: Explicit semantics of content and links

Page 9: RDF data clustering

Motivations

RDF Data Management Challenges

LOD cloud statistics: more than 31 billion facts and more than 500 million links, as of October 2011

How to efficiently:

Develop services on top of the RDF data model for
  browsing data;
  query answering;
  supporting expressive search (approximate matching);

Speed up data access and query response times over distributed machines

CLUSTERING

Page 10: RDF data clustering

Motivations

Contributions

Clustering semantic web resources (RDF graphs)

Discovering homogeneous groups of resources

Summarizing the original graph content in a meaningful way

Revealing possible hierarchies of clusters

Identifying a concept description or discriminating features for each cluster

Page 11: RDF data clustering

State of Art

What is a cluster: data-based approach

A set of resources with large intra-cluster similarity and large inter-cluster dissimilarity

Data clustering methods

pairwise distance metric
agglomerative
partitional (K-Means)

- Number or size of clusters to be set

The RDF data model is not suited to the application of traditional data-clustering techniques over real-life RDF datasets!

Page 15: RDF data clustering

State of Art

What is a cluster: graph-based approach

A set of resources with large intra-cluster similarity and large inter-cluster dissimilarity

Graph clustering methods

vertex connectivity

neighborhood similarity

spectral analysis of the adjacency matrix

- Number or size of clusters to be set

[Figure: example of a clustered graph. Source: http://sydney.edu.au/engineering/it/~shhong/img/cluster1.png]

Page 16: RDF data clustering

State of Art

RDF clustering: literature

Instance extraction
Subgraph relevant for a resource representation (SPARQL [2] DESCRIBE query)

1 Immediate Properties
  + simple, quick
  - loss of information

2 Concise Bounded Description (CBD)
  + better body of knowledge
  - domain dependent (use of blank nodes)

3 Depth Limited Crawling
  + stable over input data, with a well-limited subgraph
  - need to find a tradeoff between size and information content (data dependent)

G.A. Grimnes, P. Edwards, and A. Preece. "Instance based clustering of semantic web resources." The Semantic Web: Research and Applications. Springer Berlin Heidelberg, 2008. 303-317.

[2] http://www.w3.org/TR/rdf-sparql-query/
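The three extraction strategies can be sketched over an in-memory set of triples. This is a simplified illustration, not the implementation from the cited paper: it assumes triples are `(subject, predicate, object)` string tuples, follows outgoing links only, recognises blank nodes by the `_:` prefix, and ignores the reification part of CBD. All function names and the toy data are hypothetical.

```python
from collections import defaultdict


def index_by_subject(triples):
    """Build a subject -> set-of-triples index."""
    index = defaultdict(set)
    for t in triples:
        index[t[0]].add(t)
    return index


def immediate_properties(triples, resource):
    """Strategy 1: keep only the triples whose subject is the resource."""
    return {t for t in triples if t[0] == resource}


def concise_bounded_description(triples, resource):
    """Strategy 2 (simplified): expand recursively through blank-node objects only."""
    index = index_by_subject(triples)
    subgraph, stack, seen = set(), [resource], {resource}
    while stack:
        node = stack.pop()
        for s, p, o in index.get(node, ()):
            subgraph.add((s, p, o))
            if o.startswith("_:") and o not in seen:
                seen.add(o)
                stack.append(o)
    return subgraph


def depth_limited_crawl(triples, resource, max_depth):
    """Strategy 3: follow outgoing links for at most max_depth steps."""
    index = index_by_subject(triples)
    subgraph, frontier, seen = set(), {resource}, {resource}
    for _ in range(max_depth):
        next_frontier = set()
        for node in frontier:
            for s, p, o in index.get(node, ()):
                subgraph.add((s, p, o))
                if o not in seen:
                    seen.add(o)
                    next_frontier.add(o)
        frontier = next_frontier
    return subgraph


# Toy data (hypothetical, loosely modelled on SP2Bench-style articles).
triples = {
    ("ex:Article1", "dc:creator", "_:x1"),
    ("ex:Article1", "swrc:pages", "140"),
    ("_:x1", "foaf:name", "Adamanta Schlitt"),
    ("_:x1", "rdf:type", "foaf:Person"),
}
```

With this data, a crawl of depth 1 returns the same triples as the immediate properties, while depth 2 and the simplified CBD both also pull in the description of the blank-node author.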

Page 19: RDF data clustering

State of Art

RDF clustering: literature

Instances distance computation

Comparing two RDF graphs with the resources as root nodes

1 feature-vector based
  mappings: (feature → shortest path; value → set of reachable nodes)
  similarity measure: e.g., Dice coefficient

2 graph based
  conceptual similarity: overlapping of nodes
  relational similarity: overlapping of edges

3 ontology based [3] (well defined ontology and conforming instance data)
  taxonomy similarity: semantic distance between metadata in a concept hierarchy
  relation similarity: similarity of the instances related to the two considered resources
  attribute similarity: similarity of attribute values (numeric, literal, etc.)

Determine the appropriate number of clusters

[3] A. Maedche, and V. Zacharias. "Clustering ontology-based metadata in the semantic web." Principles of Data Mining and Knowledge Discovery. Springer Berlin Heidelberg, 2002. 348-360.
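As an illustration of the feature-vector option, the sketch below computes a Dice coefficient per feature, where each feature is a path of predicates and its value is the set of nodes reachable along that path. Averaging the per-feature scores is an assumption of this sketch, not something stated on the slide or taken from the cited papers; the function names and sample instances are hypothetical.

```python
def dice(a, b):
    """Dice coefficient of two sets: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def feature_similarity(x, y):
    """Average the per-feature Dice coefficient over all features of x and y."""
    features = set(x) | set(y)
    if not features:
        return 1.0
    scores = (dice(x.get(f, set()), y.get(f, set())) for f in features)
    return sum(scores) / len(features)


# Two article instances: feature = path of predicates, value = reachable nodes.
article_a = {
    ("rdf:type",): {"bench:Article"},
    ("swrc:journal",): {"Journal1"},
    ("dc:creator",): {"Paul_Erdoes"},
}
article_b = {
    ("rdf:type",): {"bench:Article"},
    ("swrc:journal",): {"Journal1"},
    ("dc:creator",): {"_:x1"},
}
similarity = feature_similarity(article_a, article_b)
```

Here the two articles agree on type and journal but not on creator, so two of the three features score 1 and the third scores 0.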

Page 21: RDF data clustering

Requirements

Ideal clustering of graph-structured data:

cohesive intra-cluster structure

homogeneous intra-cluster properties

Parameter-free algorithm:

number and size of partitions extracted from data

Page 22: RDF data clustering

How do community detection algorithms behave over RDF(S) graphs?

Community Discovery Algorithms

Graph mining techniques for extracting knowledge from large graphs

Exploit native graph features (topology) of the RDF model

Why:
If two sets of entities are strongly related, they exhibit more connections than other sets of entities

Benefits:

+ Automatically discover the number and size of modules

+ Can handle uncertainty in clustering (overlapping communities)

+ Faster than data-clustering inspired techniques (no instance extraction)

Page 23: RDF data clustering

What is a community

A subgraph of a network whose nodes are more tightly connected with each other than with nodes outside the subgraph.

Similarity: cohesion degree of subsets of vertices

- No overlapping capabilities: $C = \{C_1, \dots, C_n\}$, $C_i \cap C_j = \emptyset \ \forall i, j \in \{1, \dots, n\},\ i \neq j$

In labeled graphs (like RDF graphs), each link models only one specific relation

Overlapping Communities Analysis

Page 24: RDF data clustering

From Node to Link Perspective

Community: a set of nodes with more external than internal connections, i.e., a set of closely interrelated links.

Benefits:

+ Captures multiple memberships between nodes

+ Unifies hierarchical and overlapping clustering

It is always possible to move from a link partition $P = \{P_1, \dots, P_m\}$, $P_i \cap P_j = \emptyset \ \forall i, j \in \{1, \dots, m\},\ i \neq j$, to $m$ node clusters, possibly overlapping.
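The move from a link partition to overlapping node clusters can be sketched in a few lines. The function name and the toy partition below are hypothetical; links are assumed to be `(node, node)` tuples.

```python
def node_clusters(link_partition):
    """Map each group of links to the set of nodes those links touch."""
    return [{node for link in links for node in link} for links in link_partition]


# Toy link partition: the two groups of links are disjoint,
# but both contain links incident to node "c".
link_partition = [
    {("a", "b"), ("b", "c"), ("a", "c")},
    {("c", "d"), ("d", "e"), ("c", "e")},
]
clusters = node_clusters(link_partition)
overlap = clusters[0] & clusters[1]
print(clusters, overlap)
```

Every link belongs to exactly one group, yet node "c" ends up in both node clusters, which is how a disjoint link partition yields overlapping node communities.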

Page 25: RDF data clustering

Datasets

SP2Bench [4]: A SPARQL Performance Benchmark

data generator for creating arbitrarily large DBLP-like RDF documents

mirrors key characteristics and social-world distributions of the original DBLP dataset

publicly available

[4] M. Schmidt, et al. "SP2Bench: SPARQL performance benchmark." Semantic Web Information Management. Springer Berlin Heidelberg, 2010. 371-393.

Page 26: RDF data clustering

Node communities

SP2Bench: 720 triples

[Figure: node communities found in the SP2Bench graph; labelled nodes include Paul_Erdoes, Article, and Person]

V.D. Blondel, et al. "Fast unfolding of communities in large networks." Journal of Statistical Mechanics: Theory and Experiment 2008.10 (2008): P10008.

Tool: Gephi (https://gephi.org)

Page 27: RDF data clustering

Link Communities

Given an undirected graph $G = (V, E)$, the set of neighbors of node $i$ is $N_i = \{j \in V \mid e_{ij} \in E\}$.

Similarity [5]: $S(e_{ik}, e_{jk}) = \frac{|N_i \cap N_j|}{|N_i \cup N_j|}$

Link dendrogram: hierarchical agglomerative algorithm

Optimization of partition density: the cut level optimizes link density inside communities

$D_P = \frac{2}{M} \sum_c m_c \frac{m_c - (n_c - 1)}{(n_c - 2)(n_c - 1)}$,

where $M$ is the total number of links, and $m_c$ and $n_c$ are the numbers of links and nodes in community $c$.

[5] Y.Y. Ahn, J.P. Bagrow, and S. Lehmann. "Link communities reveal multiscale complexity in networks." Nature 466.7307 (2010): 761-764.
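The two quantities on this slide can be checked on small graphs with the sketch below. It follows the neighbor definition as written on the slide, and it assumes that a community with only two nodes contributes zero to the partition density, which avoids the division by zero in the formula. The function names and toy graphs are hypothetical.

```python
def neighbors(edges):
    """Map each node i to N_i = {j : e_ij in E} for an undirected edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def link_similarity(nbrs, i, j):
    """S(e_ik, e_jk): Jaccard index of the neighbor sets of i and j."""
    union = nbrs[i] | nbrs[j]
    return len(nbrs[i] & nbrs[j]) / len(union) if union else 0.0


def partition_density(link_partition):
    """D_P = (2 / M) * sum_c m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1))."""
    total_links = sum(len(links) for links in link_partition)  # M
    if total_links == 0:
        return 0.0
    total = 0.0
    for links in link_partition:
        m_c = len(links)
        n_c = len({node for link in links for node in link})
        if n_c > 2:  # assumption: two-node communities contribute 0
            total += m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1))
    return 2.0 * total / total_links


triangle = [("a", "b"), ("b", "c"), ("a", "c")]
path = [("a", "b"), ("b", "c")]
nbrs = neighbors([("a", "k"), ("b", "k"), ("a", "b")])
```

A single community that forms a clique (the triangle) has partition density 1, while a tree-like community (the path) has density 0, which matches the intent of rewarding dense link groups.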

Page 29: RDF data clustering

RDF clustering [6]

[Figure: examples of the three cluster types, each identified by a signature.
(a) Type 1 cluster. Signature: <subject>
(b) Type 2 cluster. Signature: (<predicate>, <object>)
(c) Type 3 cluster. Signature: {(<predicate_1>, <object_1>), ..., (<predicate_n>, <object_n>)}
Legend: different background colours reveal the hierarchy of clusters; replicated nodes reveal overlapping clusters; some links belong to other clusters.
Labelled nodes and edges include Article1, Article2, Article3, Article13, Article20, Journal1, Paul_Erdoes, Adamanta Schlitt, foaf:Person, bench:Article, dc:creator, dc:title, foaf:name, swrc:pages, swrc:journal, and rdf:type.]

[6] S. Giannini, "RDF Data Clustering." Springer Berlin Heidelberg, 2013. BIS 2013 Workshop, LNBIP 160: 220-231.

Page 31: RDF data clustering

RDF clustering

Cluster of type 1.

Instance extraction (fixed subject)

ex:Article15 swrc:pages 139
ex:Article15 dc:title equalled bewitchment cheaters
ex:Article15 dc:creator ex:node17r3ptqpmx16
ex:Article15 rdfs:seeAlso http://www.skeins.tld/sandwiching/bewitchment.html
ex:Article15 foaf:homepage http://www.sandwiching.tld/cheaters/ri�ed.html

Cluster of type 2.

Aggregation of resources (fixed predicate - fixed object)

ex:Article9 swrc:journal http://localhost/publications/journals/Journal1/1945
ex:Article8 swrc:journal http://localhost/publications/journals/Journal1/1945
ex:Article7 swrc:journal http://localhost/publications/journals/Journal1/1945
ex:Article3 swrc:journal http://localhost/publications/journals/Journal1/1945
ex:Article2 swrc:journal http://localhost/publications/journals/Journal1/1945
ex:Article1 swrc:journal http://localhost/publications/journals/Journal1/1945
ex:Article10 swrc:journal http://localhost/publications/journals/Journal1/1945

Mixed-type clusters

Set of clusters of type 1. (or equivalently, of type 2.)

ex:Article8 dc:creator http://localhost/persons/Paul_Erdoes
ex:Article8 rdf:type http://localhost/vocabulary/bench/Article
ex:Article8 swrc:journal http://localhost/publications/journals/Journal1/1942
ex:Article5 dc:creator http://localhost/persons/Paul_Erdoes
ex:Article5 rdf:type http://localhost/vocabulary/bench/Article
ex:Article5 swrc:journal http://localhost/publications/journals/Journal1/1942
ex:Article4 dc:creator http://localhost/persons/Paul_Erdoes
ex:Article4 rdf:type http://localhost/vocabulary/bench/Article
ex:Article4 swrc:journal http://localhost/publications/journals/Journal1/1942
ex:Article3 dc:creator http://localhost/persons/Paul_Erdoes
ex:Article3 rdf:type http://localhost/vocabulary/bench/Article
ex:Article3 swrc:journal http://localhost/publications/journals/Journal1/1942
ex:Article2 dc:creator http://localhost/persons/Paul_Erdoes
ex:Article2 rdf:type http://localhost/vocabulary/bench/Article
ex:Article2 swrc:journal http://localhost/publications/journals/Journal1/1942
ex:Article1 dc:creator http://localhost/persons/Paul_Erdoes
ex:Article1 rdf:type http://localhost/vocabulary/bench/Article
ex:Article1 swrc:journal http://localhost/publications/journals/Journal1/1942
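To make the three signatures concrete, the sketch below groups triples directly by signature. It is not the link-community algorithm used in the talk; it only shows what each cluster type contains once found. The function names, the `min_size` threshold, and the `dc:title` value are hypothetical; the other triples are taken from the examples above.

```python
from collections import defaultdict


def type1_clusters(triples):
    """Type 1: all triples sharing a fixed subject."""
    clusters = defaultdict(set)
    for s, p, o in triples:
        clusters[s].add((s, p, o))
    return dict(clusters)


def type2_clusters(triples, min_size=2):
    """Type 2: triples sharing a fixed (predicate, object) pair."""
    clusters = defaultdict(set)
    for s, p, o in triples:
        clusters[(p, o)].add((s, p, o))
    return {sig: ts for sig, ts in clusters.items() if len(ts) >= min_size}


def mixed_clusters(triples, min_size=2):
    """Mixed type: subjects sharing the same set of (predicate, object) pairs."""
    per_subject = defaultdict(set)
    for sig, ts in type2_clusters(triples, min_size).items():
        for s, _, _ in ts:
            per_subject[s].add(sig)
    groups = defaultdict(set)
    for s, sigs in per_subject.items():
        groups[frozenset(sigs)].add(s)
    return {sig: subs for sig, subs in groups.items()
            if len(sig) > 1 and len(subs) >= min_size}


ERDOES = "http://localhost/persons/Paul_Erdoes"
ARTICLE = "http://localhost/vocabulary/bench/Article"
JOURNAL = "http://localhost/publications/journals/Journal1/1942"
triples = [
    ("ex:Article1", "dc:creator", ERDOES),
    ("ex:Article1", "rdf:type", ARTICLE),
    ("ex:Article1", "swrc:journal", JOURNAL),
    ("ex:Article1", "dc:title", "hypothetical title"),
    ("ex:Article2", "dc:creator", ERDOES),
    ("ex:Article2", "rdf:type", ARTICLE),
    ("ex:Article2", "swrc:journal", JOURNAL),
]
t1, t2, mixed = type1_clusters(triples), type2_clusters(triples), mixed_clusters(triples)
```

On this data the mixed-type result is a single cluster whose signature holds the three shared (predicate, object) pairs and whose members are the two articles.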

Page 34: RDF data clustering

Advantages and Emerging issues

Tests over datasets of 266, 720, and 5362 triples

Number of clusters obtained: 53, 277, and 3437, respectively

+ Good behaviour in presence of blank nodes

http://localhost/vocabulary/bench/PhDThesis rdfs:subClassOf foaf:Document
http://localhost/vocabulary/bench/Www rdfs:subClassOf foaf:Document
http://localhost/vocabulary/bench/Book rdfs:subClassOf foaf:Document
_:node17rocfnblx296 rdf:_3 misc:UnknownDocument_c
_:node17rocfnblx296 rdf:_2 misc:UnknownDocument_b
_:node17rocfnblx296 rdf:_1 misc:UnknownDocument_a
misc:UnknownDocument_c rdf:type foaf:Document
misc:UnknownDocument_b rdf:type foaf:Document
misc:UnknownDocument_a rdf:type foaf:Document
http://localhost/vocabulary/bench/MastersThesis rdfs:subClassOf foaf:Document

- A post-processing phase is needed (link replication)

If Paul Erdoes is a Person included in a type 2 cluster with signature (rdf:type - prefix:Person), this property will not appear in the type 1 cluster describing the resource Paul_Erdoes
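One possible reading of the link-replication step is sketched below: every triple whose subject already has a type 1 cluster is copied into that cluster, even if the link partition assigned it elsewhere. This is an assumption about how the post-processing could work, not the procedure defined in the talk. The function name and the `foaf:name` triple are hypothetical.

```python
def replicate_links(type1, all_triples):
    """Return type 1 clusters completed with every triple of their subject."""
    completed = {subject: set(cluster) for subject, cluster in type1.items()}
    for s, p, o in all_triples:
        if s in completed:
            completed[s].add((s, p, o))
    return completed


# The rdf:type link was assigned to a type 2 cluster by the link partition,
# so it is missing from the type 1 cluster that describes Paul_Erdoes.
type_link = ("Paul_Erdoes", "rdf:type", "prefix:Person")
name_link = ("Paul_Erdoes", "foaf:name", "Paul Erdoes")
type1 = {"Paul_Erdoes": {name_link}}
completed = replicate_links(type1, [type_link, name_link])
```

The input clusters are left unchanged, so the original link partition remains available for comparison.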

Page 36: RDF data clustering

Conclusions and Future Work

Community detection algorithms are promising candidates for:

clustering semantic web resources

extracting instances from RDF graphs

Ongoing and future work:

A more comprehensive experimental evaluation on different datasets

Analysis of the cut threshold

Better definition of the post-processing phase

Comparison with existing approaches

Combination of (1) graph clustering techniques and (2) reasoning services
  1 Identify communities of closely related resources
  2 Extract a semantic description of them

Experimentation with "property-driven" clustering

Dynamics and evolution of clusters
