WI 2015 - Clustering of Linked Open Data - The LODeX Tool
Big Data Integration
Exposing the underlying schema of LOD sources
Fabio Benedetti, Sonia Bergamaschi, Laura Po
Department of Engineering "Enzo Ferrari"
University of Modena & Reggio Emilia
The 2015 IEEE/WIC/ACM International Conference on Web Intelligence
DB Group @ UNIMO
Linked Open Data
In 2006, Tim Berners-Lee coined the term "Linked Data":
"The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data."
Laura Po, Exposing the underlying schema of LOD sources
Linked Open Data: 5 star open data
- publish data on the Web under an open license
- make data available as structured data
- make data available in a non-proprietary open format
- link your data to other data to provide context
- use URIs to denote things
- document your data in a top-down fashion
1 star: whatever format.
To understand whether a dataset really contains interesting information, a user has to explore it manually using SPARQL queries.
A great number of datasets are published without real documentation that could help reveal their structure.
There is no standard for documenting a dataset.
Users with no SPARQL skills are limited in exploring LOD datasets.
Exploring a dataset can be time-consuming without any knowledge of its structure.
Few sources make use of VoID descriptors.
LOD today
The LOD Cloud: more than one thousand interlinked datasets, several billion RDF triples
Each LOD source: widely varying size, from thousands to billions of triples
These numbers are still growing rapidly, encouraged by the Linking Open Data community and by open government data initiatives (e.g., DBpedia contains 3 billion RDF triples). As greater amounts of data become available through the LOD cloud, the expected consumption increases, and this encourages new data publications, establishing a virtuous cycle.
Our solution - LODeX
A tool for promoting the understanding, navigation and querying of LOD sources
Requirements
- portable to the LOD Cloud
- provide a synthetic representation of the structure of the dataset (Schema Summary, Clustered Schema Summary)
- provide visual query building functionalities, hiding the complexity of Semantic Web technologies
Example: finding out information on the World Bank
DataHub is the dataset catalog for LOD.
The World Bank is an international financial institution, part of the United Nations.
Schema Summary and Clustered Schema Summary
[Figure panels: Schema Summary; Clustered Schema Summary]
The Schema Summary (SS) exposes all the main classes and properties used within the dataset, whether or not they are taken from external vocabularies. The Clustered Schema Summary (CSS) provides a higher-level view of the classes and properties used: it exploits multiple class instantiations to generate clusters of classes and decrease the overall size of the graph.
LODeX
Schema Summary, Clustered Schema Summary
The Semantic Web provides a schema language, RDF Schema (RDFS), and an ontology definition language, OWL, which allow rich semantics to be added to a dataset.
The information contained in the intensional knowledge can be incomplete or absent.
The extensional knowledge is too wide to be explored manually.
Users need a comprehensive, high-level view of the source.
Conclusion and Future Works
A tool for exploring and querying LOD sources + navigation of large LOD sources
Try LODeX at:
http://dbgroup.unimo.it/lodex2
http://www.dbgroup.unimo.it/lodex2/testCluster
Future works
- New filtering and clustering techniques
- An interactive exploration that starts from the highest level and can be detailed down to the lowest level
- Query functionalities on the Clustered Schema Summary (mapping functionalities to convert a visual query on the CSS into a SPARQL query on the LOD endpoint)
This paper represents only the first step towards the construction of a high-level view for very large LOD sources.
Enable navigation and exploration at different levels of detail, and allow query functionalities only at the base level.
Thanks for your attention!
Come to see the poster!
Reference
F. Benedetti, S. Bergamaschi, L. Po. Exposing the underlying schema of LOD sources. WI 2015.
F. Benedetti, S. Bergamaschi, L. Po. LODeX: A tool for Visual Querying Linked Open Data. ISWC 2015 (Posters & Demonstrations Track).
F. Benedetti, S. Bergamaschi, L. Po. Visual Querying LOD sources with LODeX. K-CAP 2015.
F. Benedetti, S. Bergamaschi, L. Po. A visual summary for linked open data sources. ISWC 2014 (Posters & Demonstrations Track).
F. Benedetti, S. Bergamaschi, L. Po. Online index extraction from linked open data sources. Linked Data for Information Extraction (LD4IE) Workshop held at ISWC 2014.
Acknowledgment
Clustering
Each RDF graph is composed of a set of vertices $V$ and a set of labelled edges $E$. The vertices can be divided into three disjoint sets: the URIs $U$, the blank nodes $B$ and the literals $L$.
Two vertices connected by an edge represent a statement. Each statement is stored as a triple, where subject $\in (U \cup B)$, object $\in V$ and predicate $\in E$. We can define the whole RDF graph as a set of triples $RG$:
$RG \subseteq (U \cup B) \times E \times V$
The rdf:type property is used to state that a certain resource is an instance of a class. We define the set of classes as $Cs$:
$Cs = \{c \mid \langle i, \text{rdf:type}, c \rangle \in RG \wedge i \in (U \cup B)\}$
We call partial cluster of classes ($PC$) a set of classes that concur in the multiple instantiation of the same resource:
$PC(i) = \{c \mid \langle i, \text{rdf:type}, c \rangle \in RG \wedge i \in (U \cup B)\}$
and each $PC(i) \subseteq Cs$.
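The definitions of the class set and of the partial clusters can be illustrated with a minimal Python sketch. The toy triples, the `ex:` resources and the function names are assumptions made for illustration only; they are not taken from LODeX.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Toy RDF graph RG: a set of (subject, predicate, object) triples.
# All resources here are hypothetical examples.
RG = {
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", RDF_TYPE, "ex:Employee"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:acme", RDF_TYPE, "foaf:Organization"),
    ("ex:acme", "ex:ceo", "ex:alice"),
}


def classes(rg):
    """Cs: every object of an rdf:type statement."""
    return {o for (_, p, o) in rg if p == RDF_TYPE}


def partial_clusters(rg):
    """PC(i): for each instance i, the set of classes it instantiates."""
    pc = defaultdict(set)
    for s, p, o in rg:
        if p == RDF_TYPE:
            pc[s].add(o)
    return {i: frozenset(cs) for i, cs in pc.items()}


Cs = classes(RG)
PC = partial_clusters(RG)
```

In this toy graph `ex:alice` is an instance of two classes, so its partial cluster contains both of them, while `ex:bob` yields a singleton cluster.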
Clustering (2)
The partial clusters of classes ($PC$) are sets of classes that concur in the multiple instantiation of the same resource:
$PC(i) = \{c \mid \langle i, \text{rdf:type}, c \rangle \in RG \wedge i \in (U \cup B)\}$
By examining all the instances in an $RG$ graph, we find different $PC$. The collection of all the $PC$ that occur in an $RG$ graph is called the family of $PC$, $C$:
$C = \{PC(i) : i \in (U \cup B)\}$
$C$ contains a particular family of sets able to generate all the other sets. We call this family the family of supersets ($S$), and we define it as follows:
$S = \{ST \in C : \nexists\, PC \in C \wedge PC \supset ST\}$
For each set $st \in S$, a class $ca \in st$ must be elected to represent the entire set of classes. This class is called the candidate agent of the superset. For each superset, we choose as candidate agent the class with the highest number of instances.
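A small sketch of the superset selection and of the candidate agent election follows. It assumes that a superset is a partial cluster not strictly contained in any other partial cluster, and that ties between classes are broken alphabetically; both points, like the toy data and function names, are assumptions of this example.

```python
from collections import Counter

# Hypothetical partial clusters PC(i), keyed by instance.
pcs = {
    "ex:alice": frozenset({"foaf:Person", "ex:Employee"}),
    "ex:bob": frozenset({"foaf:Person"}),
    "ex:acme": frozenset({"foaf:Organization"}),
}


def supersets(pcs):
    """Keep the partial clusters that no other cluster strictly contains."""
    family = set(pcs.values())  # the family C of distinct PCs
    return {st for st in family if not any(st < other for other in family)}


def candidate_agent(st, pcs):
    """Elect the class of st with the highest number of instances."""
    counts = Counter(c for pc in pcs.values() for c in pc)
    # sorted() makes the tie-break deterministic (alphabetical order).
    return max(sorted(st), key=lambda c: counts[c])


S = supersets(pcs)
agents = {st: candidate_agent(st, pcs) for st in S}
```

Here `{foaf:Person}` is absorbed by `{foaf:Person, ex:Employee}`, and `foaf:Person` is elected as candidate agent of that superset because it has two instances against one for `ex:Employee`.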
Schema Summary
Indexes needed to generate a Schema Summary
These indexes belong to the extensional group of the Statistical Indexes [2]:
- SC (Subject Class) contains the pairs (p,c) where p is an object property and c is its domain class.
- SCl (Subject Class to literal) contains the pairs (p,c) where p is a datatype property and c is its domain class.
- OC (Object Class) contains the pairs (p,c) where p is an object property and c is its range class.
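As a rough illustration of the three indexes, the sketch below derives them from an in-memory set of triples. The `Literal` wrapper, the toy triples and the choice of counting one occurrence per matching triple are assumptions of this example; LODeX extracts its indexes from a SPARQL endpoint, which is not reproduced here.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

RDF_TYPE = "rdf:type"


@dataclass(frozen=True)
class Literal:
    """Hypothetical marker that distinguishes literal objects from resources."""
    value: str


def extract_indexes(rg):
    """Return (SC, SCl, OC) as counters keyed by (class, property)."""
    types = defaultdict(set)
    for s, p, o in rg:
        if p == RDF_TYPE:
            types[s].add(o)

    sc, scl, oc = Counter(), Counter(), Counter()
    for s, p, o in rg:
        if p == RDF_TYPE:
            continue
        if isinstance(o, Literal):
            # Datatype property: record the classes of the subject.
            for c in types.get(s, ()):
                scl[(c, p)] += 1
        else:
            # Object property: record domain and range classes.
            for c in types.get(s, ()):
                sc[(c, p)] += 1
            for c in types.get(o, ()):
                oc[(c, p)] += 1
    return sc, scl, oc


rg = {
    ("ex:acme", RDF_TYPE, "foaf:Organization"),
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:acme", "ex:ceo", "ex:alice"),
    ("ex:alice", "foaf:firstName", Literal("Alice")),
}
SC, SCl, OC = extract_indexes(rg)
```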
Schema Summary generation
We use an algorithm for combining these indexes to produce a Schema Summary.
Name   Values
SC     (foaf:Organization, ex:ceo, 1), (foaf:Organization, ex:sector, 2)
SCl    (foaf:Person, foaf:firstName, 1), (foaf:Person, foaf:lastName, 1), (foaf:Organization, ex:dbpedia:fax, 1), (ex:Sector, dc:title, 1), (foaf:Organization, ex:activity, 1), (foaf:Organization, dbpedia:fax, 1)
OC     (ex:Sector, ex:sector, 1), (ex:Person, ex:ceo, 1)
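The slides do not spell out the combination algorithm, so the following is only a plausible sketch: it joins SC and OC on the property name to obtain labelled edges between classes, and attaches the datatype properties of SCl to their class as attributes. The join strategy and the function name are assumptions; the input uses a subset of the example values, keyed by (class, property) with the count as value.

```python
from collections import defaultdict

# Subset of the example index values, as {(class, property): count}.
SC = {("foaf:Organization", "ex:ceo"): 1, ("foaf:Organization", "ex:sector"): 2}
SCl = {
    ("foaf:Person", "foaf:firstName"): 1,
    ("foaf:Person", "foaf:lastName"): 1,
    ("ex:Sector", "dc:title"): 1,
}
OC = {("ex:Sector", "ex:sector"): 1, ("ex:Person", "ex:ceo"): 1}


def build_schema_summary(sc, scl, oc):
    """Return (nodes, edges, attributes) of a schema graph.

    Assumption: a property links every domain class in SC to every
    range class in OC that shares the same property name.
    """
    ranges = defaultdict(set)
    for cls, prop in oc:
        ranges[prop].add(cls)

    edges = set()
    for domain, prop in sc:
        for rng in ranges.get(prop, ()):
            edges.add((domain, prop, rng))

    attributes = defaultdict(set)
    for cls, prop in scl:
        attributes[cls].add(prop)

    nodes = {c for c, _ in sc} | {c for c, _ in scl} | {c for c, _ in oc}
    return nodes, edges, dict(attributes)


nodes, edges, attributes = build_schema_summary(SC, SCl, OC)
```

Joining on the property name alone can create edges between classes that are never actually connected in the data when a property has several domains and ranges, so a real implementation would likely need a finer-grained index.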
LODeX Architecture
Two main modules:
Extraction & Summarization
- Index Extraction (IE)
- Post Processing (PP)
Visualization & Querying
- Schema Summary Visualization
- Query Orchestrator
Visualization & Querying
Schema Summary Visualization
Front end of the Web application, composed of three panels:
- List of datasets indexed in LODeX
- Schema Summary and query building panel
- Refinement panel
Query Orchestrator
- It manages the interaction between the user and the GUI
- It contains a SPARQL compiler able to compile the visual query into a SPARQL one
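To make the idea of compiling a visual query into SPARQL more concrete, here is a minimal sketch. The representation of the visual query (selected class nodes, edges between them, and requested attributes) and the function `compile_to_sparql` are hypothetical and do not reflect the actual LODeX compiler.

```python
def compile_to_sparql(prefixes, nodes, edges, attributes, limit=100):
    """Turn a hypothetical visual query into a SPARQL SELECT string.

    nodes:      {variable: class}, the classes picked on the Schema Summary
    edges:      [(subject_var, property, object_var)]
    attributes: [(variable, datatype_property, output_var)]
    """
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in sorted(prefixes.items())]
    out_vars = list(nodes) + [out for _, _, out in attributes]
    lines.append("SELECT " + " ".join(f"?{v}" for v in out_vars))
    lines.append("WHERE {")
    for var, cls in nodes.items():
        lines.append(f"  ?{var} a {cls} .")
    for s, p, o in edges:
        lines.append(f"  ?{s} {p} ?{o} .")
    for var, p, out in attributes:
        lines.append(f"  ?{var} {p} ?{out} .")
    lines.append("}")
    lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


query = compile_to_sparql(
    prefixes={"foaf": "http://xmlns.com/foaf/0.1/", "ex": "http://example.org/"},
    nodes={"org": "foaf:Organization", "ceo": "ex:Person"},
    edges=[("org", "ex:ceo", "ceo")],
    attributes=[("ceo", "foaf:firstName", "name")],
)
print(query)
```

Each node selected on the graph becomes a typed variable, each edge a triple pattern between two variables, and each attribute a triple pattern that binds an output variable.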
Schema Summary: Building a Visual Query
Refinement Panel