wi2015 - clustering of linked open data - the lodex tool

DB Group @ UNIMO Exposing the underlying schema of LOD sources Fabio Benedetti, Sonia Bergamaschi, Laura Po Department of Engineering “Enzo Ferrari” University of Modena & Reggio Emilia The 2015 IEEE/WIC/ACM International Conference on Web Intelligence

Upload: laura-po

Post on 22-Feb-2017


TRANSCRIPT

Big Data Integration

Exposing the underlying schema of LOD sources

Fabio Benedetti, Sonia Bergamaschi, Laura Po
Department of Engineering "Enzo Ferrari"
University of Modena & Reggio Emilia

The 2015 IEEE/WIC/ACM International Conference on Web Intelligence


Linked Open Data

In 2006, Tim Berners-Lee coined the term "Linked Data":
"The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data."


Linked Open Data

5 star open data
publish data on the Web under an open license
make data available as structured data
make data available in a non-proprietary open format
link your data to other data to provide context
use URIs to denote things

Missing: document your data in a top-down fashion


1 star: whatever format.
To understand whether a dataset really contains interesting information, a user has to explore it manually using SPARQL queries.
A great number of datasets are published without any real documentation that could help reveal their structure.
There is no standard for documenting a dataset.
Users with no SPARQL skills are limited in exploring LOD datasets.
Exploring a dataset can be time-consuming without any knowledge of its structure.
Few sources make use of VoID descriptors.

LOD today

The LOD Cloud
more than one thousand interlinked datasets
several billion RDF triples

Each LOD source
widely varying size, from thousands to billions of triples


These numbers are still growing rapidly, encouraged by the Linking Open Data community and by open government data initiatives. As greater amounts of data become available through the LOD cloud, the expected consumption increases, and this encourages new data publications, establishing a virtuous cycle. (For example, DBpedia contains 3 billion RDF triples.)

Our solution - LODeX

A tool for promoting the understanding, navigation and querying of LOD sources

Requirements

portable to the LOD Cloud
provide a synthetic representation of the structure of the dataset (Schema Summary, Clustered Schema Summary)
provide visual query building functionalities hiding the complexity of Semantic Web technologies


Example: finding out information on the World Bank

Datahub is the dataset catalog for LOD.
The World Bank is an international financial institution, part of the United Nations.


Schema Summary and Clustered Schema Summary

Schema Summary

Clustered Schema Summary

The Schema Summary (SS) exposes all the main classes and properties used within the dataset, whether or not they are taken from external vocabularies. The Clustered Schema Summary (CSS) provides a higher-level view of the classes and properties used; it exploits multiple class instantiations to generate clusters of classes and decrease the overall size of the graph.


LODeX

Schema Summary
Clustered Schema Summary

The Semantic Web provides a schema language, RDF Schema (RDFS), and an ontology definition language, OWL, which allow rich semantics to be added to a dataset.
The information contained in the intensional knowledge can be incomplete or absent.
The extensional knowledge is too wide to be explored manually.
Users need a comprehensive view of the source at a high level.

Conclusion and Future Works

A tool for exploring and querying LOD sources + navigation of large LOD sources
Try LODeX at:
http://dbgroup.unimo.it/lodex2
http://www.dbgroup.unimo.it/lodex2/testCluster

Future works

New filtering and clustering techniques
An interactive exploration that starts from the highest level and can be detailed down to the lowest level
Query functionalities on the Clustered Schema Summary (mapping functionalities to convert a visual query on the CSS to a SPARQL query on the LOD endpoint)

This paper represents only the first step towards the construction of a high-level view for very large LOD sources: enable navigation and exploration at different levels of detail, and allow query functionalities only at the base level.

Thanks for your attention!

Come to see the poster!


Reference

F. Benedetti, S. Bergamaschi, L. Po. Exposing the underlying schema of LOD sources. WI 2015.
F. Benedetti, S. Bergamaschi, L. Po. LODeX: A tool for Visual Querying Linked Open Data. ISWC 2015 (Posters & Demonstrations Track).
F. Benedetti, S. Bergamaschi, L. Po. Visual Querying LOD sources with LODeX. K-CAP 2015.
F. Benedetti, S. Bergamaschi, L. Po. A visual summary for linked open data sources. ISWC 2014 (Posters & Demonstrations Track).
F. Benedetti, S. Bergamaschi, L. Po. Online index extraction from linked open data sources. Linked Data for Information Extraction (LD4IE) Workshop held at ISWC 2014.

Acknowledgment


Clustering

Each RDF graph is composed of a set of vertices $V$ and a set of labelled edges $E$. The vertices can be divided into 3 disjoint sets: the URIs $U$, the blank nodes $B$ and the literals $L$.
Two vertices connected by an edge represent a statement. Each statement is stored as a triple $\langle subject, predicate, object \rangle$, where $subject \in (U \cup B)$, $object \in V$ and $predicate \in E$. We can define the whole RDF graph as a set of triples $RG$:
$RG \subseteq (U \cup B) \times E \times V$

The rdf:type property is used to state that a certain resource is an instance of a class. We define the set of classes as $Cs$:
$Cs = \{c \mid \langle i, \text{rdf:type}, c \rangle \in RG \wedge i \in (U \cup B)\}$
We call partial cluster of classes (PC) a set of classes that concur in the multiple instantiation of the same resource:
$PC(i) = \{c \mid \langle i, \text{rdf:type}, c \rangle \in RG \wedge i \in (U \cup B)\}$
and each $PC(i) \subseteq Cs$
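The definitions of the class set and of the partial clusters can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is modelled as a plain set of string triples, the resources and classes are invented, and the function names `classes` and `partial_clusters` are hypothetical, not part of LODeX. Subjects are taken to be URIs or blank nodes by construction.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Toy RDF graph RG as a set of (subject, predicate, object) triples.
# All resources and classes below are invented for illustration.
RG = {
    ("ex:acme", RDF_TYPE, "foaf:Organization"),
    ("ex:acme", RDF_TYPE, "ex:Company"),
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", "foaf:firstName", '"Alice"'),
    ("ex:acme", "ex:ceo", "ex:alice"),
}


def classes(rg):
    """Cs: every c that appears as the object of an rdf:type triple."""
    return {o for (_, p, o) in rg if p == RDF_TYPE}


def partial_clusters(rg):
    """Map each instance i to PC(i), the set of classes it instantiates."""
    pc = defaultdict(set)
    for s, p, o in rg:
        if p == RDF_TYPE:
            pc[s].add(o)
    return {i: frozenset(cs) for i, cs in pc.items()}


Cs = classes(RG)
PC = partial_clusters(RG)
```

Here `ex:acme` is an instance of two classes, so its partial cluster has two elements, while `ex:alice` yields a singleton.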


Clustering (2)

The partial clusters of classes (PC) are sets of classes that concur in the multiple instantiation of the same resource:
$PC(i) = \{c \mid \langle i, \text{rdf:type}, c \rangle \in RG \wedge i \in (U \cup B)\}$

By examining all the instances in an RG graph, we find different PCs. The collection of all the PCs that occur in an RG graph is called the family of PC, $C$:
$C = \{PC(i) : i \in (U \cup B)\}$
$C$ contains a particular family of sets able to generate all the other sets. We call this family the family of supersets ($S$), and we define it as follows:
$S = \{ST \in C : \nexists\, PC \in C \wedge PC \supset ST\}$
For each set $st \in S$, a class $ca \in st$ must be elected to represent the entire set of classes. This class is called the candidate agent of the superset. For each superset, we choose as candidate agent the class with the highest number of instances.
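A minimal sketch of the superset family and the candidate agent election follows. It assumes the partial clusters are already available as a dictionary from instance to a frozenset of classes. The instance and class names, the function names, and the alphabetical tie-break are all assumptions made for this illustration, since the slides do not say how ties are resolved.

```python
from collections import Counter

# Hypothetical PC(i) for five instances (names invented for illustration).
PC = {
    "i1": frozenset({"A", "B"}),
    "i2": frozenset({"A", "B", "C"}),
    "i3": frozenset({"A"}),
    "i4": frozenset({"D"}),
    "i5": frozenset({"A", "B", "C"}),
}


def family(pc):
    """C: the distinct partial clusters that occur in the graph."""
    return set(pc.values())


def supersets(fam):
    """S: the sets of C that are not strictly contained in another set of C."""
    return {st for st in fam if not any(st < other for other in fam)}


def instance_counts(pc):
    """Number of instances of each class, derived from the partial clusters."""
    counts = Counter()
    for cs in pc.values():
        counts.update(cs)
    return counts


def candidate_agent(st, counts):
    """Class of st with the most instances (ties broken alphabetically)."""
    return max(sorted(st), key=lambda c: counts[c])


S = supersets(family(PC))
counts = instance_counts(PC)
agents = {st: candidate_agent(st, counts) for st in S}
```

With this toy input, `{A, B}` and `{A}` are absorbed by `{A, B, C}`, so two supersets remain, and class `A` is elected for the larger one because it has the most instances.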


Schema Summary


Indexes needed to generate a Schema Summary

These indexes belong to the extensional group of the Statistical Indexes [2]:
SC (Subject Class) contains the pairs (p,c) where p is an object property and c is its domain class.
SCl (Subject Class to literal) contains the pairs (p,c) where p is a datatype property and c is its domain class.
OC (Object Class) contains the pairs (p,c) where p is an object property and c is its range class.
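The three indexes can be sketched as counters computed over the instance data. This is an illustrative in-memory version, not the extraction procedure used by LODeX, which works against a SPARQL endpoint. The `Literal` wrapper, the toy triples, the `(class, property)` key order and the meaning of the count (number of matching triples) are assumptions.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

RDF_TYPE = "rdf:type"


@dataclass(frozen=True)
class Literal:
    """Hypothetical marker that distinguishes literals from resources."""
    value: str


# Toy instance data (invented for illustration).
TRIPLES = [
    ("ex:acme", RDF_TYPE, "foaf:Organization"),
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:acme", "ex:ceo", "ex:alice"),
    ("ex:alice", "foaf:firstName", Literal("Alice")),
]


def extract_indexes(triples):
    """Return (SC, SCl, OC) as counters keyed by (class, property)."""
    # First pass: collect the classes of every typed resource.
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    sc, scl, oc = Counter(), Counter(), Counter()
    # Second pass: classify every non-typing triple.
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        if isinstance(o, Literal):
            # Datatype property: record the class of the subject.
            for c in types[s]:
                scl[(c, p)] += 1
        else:
            # Object property: record subject class and object class.
            for c in types[s]:
                sc[(c, p)] += 1
            for c in types[o]:
                oc[(c, p)] += 1
    return sc, scl, oc
```

Calling `extract_indexes(TRIPLES)` files `ex:ceo` under `foaf:Organization` in SC and under `foaf:Person` in OC, while `foaf:firstName` goes to SCl because its object is a literal.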


Schema Summary generation

We use an algorithm for combining these indexes and produce a Schema Summary

Name    Values
SC      (foaf:Organization,ex:ceo,1), (foaf:Organization,ex:sector,2)
SCl     (foaf:Person,foaf:firstName,1), (foaf:Person,foaf:lastName,1), (foaf:Organization,ex:dbpedia:fax,1), (ex:Sector,dc:title,1), (foaf:Organization,ex:activity,1), (foaf:Organization,dbpedia:fax,1)
OC      (ex:Sector,ex:sector,1), (ex:Person,ex:ceo,1)
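The slides do not spell out the combination algorithm, so the following is only one plausible sketch of it: a naive join of SC and OC on the property name produces the edges between classes, and SCl supplies the datatype properties attached to each class. The index values are copied from the slide as they appear; the function name and the output shape are assumptions.

```python
from collections import defaultdict

# Index values as listed on the slide: (class, property, count).
SC = [("foaf:Organization", "ex:ceo", 1), ("foaf:Organization", "ex:sector", 2)]
SCL = [
    ("foaf:Person", "foaf:firstName", 1),
    ("foaf:Person", "foaf:lastName", 1),
    ("foaf:Organization", "ex:dbpedia:fax", 1),
    ("ex:Sector", "dc:title", 1),
    ("foaf:Organization", "ex:activity", 1),
    ("foaf:Organization", "dbpedia:fax", 1),
]
OC = [("ex:Sector", "ex:sector", 1), ("ex:Person", "ex:ceo", 1)]


def build_schema_summary(sc, scl, oc):
    """Return (nodes, edges, attributes) of a schema-summary-like graph."""
    # Datatype properties become attributes of their domain class.
    attributes = defaultdict(set)
    for cls, prop, _ in scl:
        attributes[cls].add(prop)

    # An object property links a domain class (SC) to a range class (OC).
    edges = set()
    for domain, prop, _ in sc:
        for rng, other_prop, _ in oc:
            if prop == other_prop:
                edges.add((domain, prop, rng))

    nodes = set(attributes)
    nodes |= {domain for domain, _, _ in edges}
    nodes |= {rng for _, _, rng in edges}
    return nodes, edges, dict(attributes)


nodes, edges, attributes = build_schema_summary(SC, SCL, OC)
```

A join on the property name alone over-approximates when one property has several domain and range classes, so a real implementation would need extra information to pair them correctly.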


LODeX Architecture

Two main modules

Extraction & Summarization
Index Extraction (IE)
Post Processing (PP)

Visualization & Querying
Schema Summary Visualization
Query Orchestrator


Visualization & Querying

Schema Summary Visualization

Front end of the web application, composed of three panels:

List of datasets indexed in LODeX
Schema Summary and query building panel
Refinement panel

Query Orchestrator

It manages the interaction between the user and the GUI
It contains a SPARQL compiler able to compile the visual query into a SPARQL one
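To give an idea of what compiling a visual query into SPARQL can look like, here is a small sketch. It is not the LODeX compiler: the `VisualQuery` structure, the variable naming scheme and the `http://example.org/` namespace are assumptions made for this illustration. The visual query is modelled as a selected class, a list of its datatype properties, and a list of object properties followed to a target class.

```python
from dataclasses import dataclass, field


@dataclass
class VisualQuery:
    """Hypothetical model of a selection made on the Schema Summary."""
    root_class: str
    attributes: list = field(default_factory=list)  # datatype properties
    links: list = field(default_factory=list)       # (object property, target class)


def _var(prefixed_name):
    """Derive a variable name from the local part of a prefixed name."""
    return "?" + prefixed_name.split(":")[-1]


def compile_to_sparql(query, prefixes):
    """Translate a VisualQuery into a SPARQL SELECT query string."""
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    select_vars = ["?s"]
    patterns = [f"?s a {query.root_class} ."]
    for prop in query.attributes:
        var = _var(prop)
        select_vars.append(var)
        patterns.append(f"?s {prop} {var} .")
    for prop, target in query.links:
        var = _var(prop)
        select_vars.append(var)
        patterns.append(f"?s {prop} {var} .")
        patterns.append(f"{var} a {target} .")
    lines.append("SELECT " + " ".join(select_vars))
    lines.append("WHERE {")
    lines.extend("  " + pattern for pattern in patterns)
    lines.append("}")
    return "\n".join(lines)


PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/", "ex": "http://example.org/"}
QUERY = VisualQuery("foaf:Organization", ["ex:activity"], [("ex:ceo", "foaf:Person")])
SPARQL = compile_to_sparql(QUERY, PREFIXES)
```

The sketch names variables after the local part of each property, so two properties with the same local name would collide; a real compiler would have to generate unique variable names and handle filters from the refinement panel.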


Schema Summary
Building a Visual Query


Refinement Panel
