Online Index Extraction from Linked Open Data Sources
DESCRIPTION
I gave this presentation at the Linked Data for Information Extraction 2014 (LD4IE) workshop, held at the International Semantic Web Conference 2014. The related paper, "Online Index Extraction from Linked Open Data Sources", is available at http://ceur-ws.org/Vol-1267/LD4IE2014_Benedetti.pdf
TRANSCRIPT
DBGroup @ UNIMO
LD4IE 2014 – Riva del Garda, Italy
Online Index Extraction from Linked Open Data Sources
Dott. Fabio Benedetti
Dip. Ing. “Enzo Ferrari” – University of Modena and Reggio Emilia
Fabio Benedetti Sonia Bergamaschi Laura Po
Department of Engineering “Enzo Ferrari”
University of Modena & Reggio Emilia
• Selection of a relevant LOD source
• Statistical indexes
• Architecture Overview
• Performance Evaluation
• LODeX & Conclusions
Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. "Adoption of the Linked Data Best Practices in
Different Topical Domains." The Semantic Web–ISWC 2014. Springer International Publishing, 2014. 245-260.
* Only 570 of these datasets belong to the LOD cloud; the remaining datasets have no incoming or outgoing links to the LOD cloud.
                          2009               2014*
Domain                    Number   %         Number   %
Cross-domain              41       13.95%    41       4.04%
Geographic                31       10.54%    21       2.07%
Government                49       16.67%    183      18.05%
Life sciences             41       13.95%    83       8.19%
Media                     25       8.50%     22       2.17%
Publications              87       29.59%    96       9.47%
Social web                0        0.00%     520      51.28%
User-generated content    20       6.80%     48       4.73%
Total                     294                1014
[Figure: datasets by domain, 2009 and 2014]
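The percentage columns follow directly from the dataset counts. As a quick check, the sketch below recomputes the 2014 shares from the counts reported in the table; the dictionary and function names are illustrative only.

```python
# Dataset counts per domain in 2014, copied from the table.
COUNTS_2014 = {
    "Cross-domain": 41,
    "Geographic": 21,
    "Government": 183,
    "Life sciences": 83,
    "Media": 22,
    "Publications": 96,
    "Social web": 520,
    "User-generated content": 48,
}


def shares(counts):
    """Return each domain's share of the total as a percentage (2 decimals)."""
    total = sum(counts.values())
    return {domain: round(100 * n / total, 2) for domain, n in counts.items()}


if __name__ == "__main__":
    for domain, pct in shares(COUNTS_2014).items():
        print(f"{domain:<24}{pct:6.2f}%")
```

The same function applied to the 2009 counts reproduces the 2009 percentage column.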
1. The documentation of the dataset
– The documentation can be poor or absent
– There is no standard way to provide the documentation
– Sometimes it is provided as an RDF file in XML format
2. Search features of existing catalogs (e.g. Datahub)
– The metadata contain little information
– No information about the structure of the dataset is used by the search engine
3. The manual exploration of the dataset
– It requires a good knowledge of the SPARQL language
– It is a time-consuming task
To automatically extract a set of indexes able to describe the structure of a LOD dataset
How to describe the dataset
LOD datasets can have different purposes and structures:
• Ontology/Vocabulary (OWL & RDFS constraints)
• Open Data (e.g. generated from an existing RDBMS)
The indexes should maximize the value of the information extracted from heterogeneous datasets
Online & Automatic extraction
• It does not require any additional information from the user
• It works with SPARQL endpoints
– We have to handle the performance issues of these datasets
We can think of the entire set of RDF triples as partitioned into:
• Intensional Knowledge
• Extensional Knowledge
The Intensional knowledge
• It contains the RDFS or OWL constraints of the ontology
• It represents the T-Box component of the knowledge base
The Extensional knowledge
• It contains the real-world entities described in the dataset
• It represents the A-Box component of the knowledge base
• Its triples cover most of the dataset
Instantiated classes act as a bridge between the two types of knowledge
[Figure: example RDF graph. The classes ex:Sector and ex:Organization and the property ex:sector belong to the Intensional Knowledge; the instances sector1 (dc:name “Energy”), organization1 and organization2 belong to the Extensional Knowledge; the Instantiated Classes link the two.]
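The partition can be made concrete with a minimal sketch. It encodes the example graph as plain (subject, predicate, object) tuples and splits it with a simple heuristic: a triple is treated as intensional when its subject is declared as a class or property. The edge list is a reading of the figure (the targets of rdfs:domain and rdfs:range in particular are assumed, and labels are left out), and the heuristic is an illustration, not the extraction algorithm used by the authors.

```python
# Example graph as (subject, predicate, object) tuples. This is a reading of
# the figure; the rdfs:domain / rdfs:range targets are assumed.
TRIPLES = [
    ("ex:Sector", "rdf:type", "owl:Class"),
    ("ex:Organization", "rdf:type", "owl:Class"),
    ("ex:sector", "rdf:type", "owl:ObjectProperty"),
    ("ex:sector", "rdf:type", "rdf:Property"),
    ("ex:sector", "rdfs:domain", "ex:Organization"),
    ("ex:sector", "rdfs:range", "ex:Sector"),
    ("sector1", "rdf:type", "ex:Sector"),
    ("sector1", "dc:name", '"Energy"'),
    ("organization1", "rdf:type", "ex:Organization"),
    ("organization1", "ex:sector", "sector1"),
    ("organization2", "rdf:type", "ex:Organization"),
    ("organization2", "ex:sector", "sector1"),
]

# Types that mark a resource as a schema element (assumed, not exhaustive).
SCHEMA_TYPES = {"owl:Class", "rdfs:Class", "rdf:Property",
                "owl:ObjectProperty", "owl:DatatypeProperty"}


def partition(triples):
    """Split triples into (intensional, extensional, instantiated_classes)."""
    schema_terms = {s for s, p, o in triples
                    if p == "rdf:type" and o in SCHEMA_TYPES}
    intensional = [t for t in triples if t[0] in schema_terms]
    extensional = [t for t in triples if t[0] not in schema_terms]
    # Classes that type at least one instance bridge the two partitions.
    instantiated = {o for s, p, o in extensional if p == "rdf:type"}
    return intensional, extensional, instantiated


if __name__ == "__main__":
    intensional, extensional, instantiated = partition(TRIPLES)
    print(len(intensional), "intensional triples")
    print(len(extensional), "extensional triples")
    print("instantiated classes:", sorted(instantiated))
```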
Name  Description               Structure                    Category
t     Number of Triples         Integer                      Generic
c     Number of Classes         Integer                      Generic
I     Number of Instances       Integer                      Generic
Cl    Class List                List(name, n. instances)     Generic
Pl    Property List             List(name, n. occurrences)   Generic
IK    Intensional K. triples    List(s, p, o)                Intensional
Sc    Subject Class             List(c, p, n. occurrences)   Extensional
SCl   Subject Class to literal  List(c, p, n. occurrences)   Extensional
Oc    Object Class              List(c, p, n. occurrences)   Extensional
The Statistical Indexes are grouped in three categories:
• Generic
• Intensional
• Extensional
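As a rough illustration of how these indexes could be held in memory, the sketch below maps each row of the table to a typed field. The class and field names are hypothetical; the talk does not specify the storage layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StatisticalIndexes:
    """Container mirroring the index table; names are illustrative."""
    # Generic
    t: int = 0  # number of triples
    c: int = 0  # number of classes
    i: int = 0  # number of instances
    cl: List[Tuple[str, int]] = field(default_factory=list)  # (class, n. instances)
    pl: List[Tuple[str, int]] = field(default_factory=list)  # (property, n. occurrences)
    # Intensional
    ik: List[Tuple[str, str, str]] = field(default_factory=list)  # (s, p, o)
    # Extensional
    sc: List[Tuple[str, str, int]] = field(default_factory=list)   # (class, property, n)
    scl: List[Tuple[str, str, int]] = field(default_factory=list)  # (class, property, n)
    oc: List[Tuple[str, str, int]] = field(default_factory=list)   # (class, property, n)


# Extensional values for the talk's running example (organizations and sectors).
EXAMPLE = StatisticalIndexes(
    sc=[("ex:Organization", "ex:sector", 2)],
    scl=[("ex:Sector", "dc:name", 1)],
    oc=[("ex:Sector", "ex:sector", 1)],
)

if __name__ == "__main__":
    print(EXAMPLE)
```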
[Figure: the example graph, highlighting the triples counted by each extensional index: Subject Class, Subject Class to literal, Object Class]
     Sc - Subject Class    SCl - Subject Class to literal    Oc - Object Class
S    ex:Organization       ex:Sector                         ex:Sector
P    ex:sector             dc:name                           ex:sector
n    2                     1                                 1
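The three counts can be reproduced with a short sketch over the example's instance triples. Two assumptions are made here: the edges are a reading of the figure (both organizations point to sector1), and n counts distinct instances of the class, which is one interpretation that yields the values 2, 1 and 1 shown in the table. The `Literal` wrapper and function name are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Marks an object as a literal value rather than a resource."""
    value: str


# Instance-level triples of the example (a reading of the figure).
TRIPLES = [
    ("sector1", "rdf:type", "ex:Sector"),
    ("sector1", "dc:name", Literal("Energy")),
    ("organization1", "rdf:type", "ex:Organization"),
    ("organization1", "ex:sector", "sector1"),
    ("organization2", "rdf:type", "ex:Organization"),
    ("organization2", "ex:sector", "sector1"),
]


def extensional_indexes(triples):
    """Return (Sc, SCl, Oc) as dicts mapping (class, property) to a count.

    Assumption: the count is the number of distinct instances involved.
    """
    classes_of = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            classes_of[s].add(o)

    sc, scl, oc = defaultdict(set), defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            continue
        # Subject side: literal objects feed SCl, resource objects feed Sc.
        target = scl if isinstance(o, Literal) else sc
        for cls in classes_of.get(s, ()):
            target[(cls, p)].add(s)
        # Object side: only resources with a known class feed Oc.
        if not isinstance(o, Literal):
            for cls in classes_of.get(o, ()):
                oc[(cls, p)].add(o)

    def count(index):
        return {key: len(instances) for key, instances in index.items()}

    return count(sc), count(scl), count(oc)


if __name__ == "__main__":
    sc, scl, oc = extensional_indexes(TRIPLES)
    print("Sc :", sc)
    print("SCl:", scl)
    print("Oc :", oc)
```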
It takes as input a list of URLs of SPARQL endpoints
Its output is a set of Statistical Indexes for each endpoint
• The IE process dynamically generates the SPARQL queries used to extract the Statistical Indexes
• It works in parallel, querying different datasets
• Partial results and the Statistical Indexes are stored in a NoSQL DB
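The overall flow can be sketched as a parallel map over endpoint URLs. Everything in this sketch is an assumption made for illustration: the thread pool, the function names, and the plain dictionary that stands in for the NoSQL store. The per-endpoint extractor is a placeholder and sends no queries.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_indexes(endpoint_url):
    """Placeholder for the per-endpoint extraction (hypothetical).

    A real implementation would send SPARQL queries to endpoint_url.
    """
    return {"endpoint": endpoint_url}


def run_extraction(endpoint_urls, extractor=extract_indexes, max_workers=4, store=None):
    """Run extractor on every endpoint in parallel and collect the results.

    store stands in for the NoSQL database: a dict keyed by endpoint URL.
    """
    store = {} if store is None else store
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extractor, url): url for url in endpoint_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                store[url] = {"status": "completed", "indexes": future.result()}
            except Exception as exc:  # one failing endpoint must not stop the others
                store[url] = {"status": "failed", "error": str(exc)}
    return store


if __name__ == "__main__":
    # Hypothetical endpoint URLs, used only to show the call shape.
    urls = ["http://example.org/a/sparql", "http://example.org/b/sparql"]
    for url, record in run_extraction(urls).items():
        print(url, record["status"])
```

Recording failures per endpoint, rather than raising, keeps one unreachable dataset from aborting the whole batch.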
General Statistics Extraction
• It uses 6 different queries to extract the indexes of this group
Intensional Knowledge Extraction
• The extraction of the Intensional knowledge is performed through an iterative algorithm
• The algorithm traverses the graph starting from the instantiated classes
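One way to read this traversal is as a breadth-first expansion: start from the classes that have instances, collect the schema triples that mention them, and repeat on newly discovered schema terms. The sketch below works over an in-memory triple list instead of a SPARQL endpoint; the stop list, the example triples and the function name are assumptions, and this is not the authors' algorithm.

```python
from collections import deque

# Built-in vocabulary terms that are not expanded further (assumed stop list).
STOP_TERMS = {"owl:Class", "rdfs:Class", "rdf:Property",
              "owl:ObjectProperty", "owl:DatatypeProperty"}

# Illustrative graph: schema triples plus a few instance triples.
TRIPLES = [
    ("ex:Sector", "rdf:type", "owl:Class"),
    ("ex:Organization", "rdf:type", "owl:Class"),
    ("ex:sector", "rdf:type", "owl:ObjectProperty"),
    ("ex:sector", "rdf:type", "rdf:Property"),
    ("ex:sector", "rdfs:domain", "ex:Organization"),
    ("ex:sector", "rdfs:range", "ex:Sector"),
    ("sector1", "rdf:type", "ex:Sector"),
    ("organization1", "rdf:type", "ex:Organization"),
    ("organization1", "ex:sector", "sector1"),
    ("organization2", "rdf:type", "ex:Organization"),
    ("organization2", "ex:sector", "sector1"),
]


def intensional_triples(triples, instantiated_classes):
    """Collect schema triples reachable from the instantiated classes."""
    instances = {s for s, p, o in triples
                 if p == "rdf:type" and o in instantiated_classes}
    found, visited = [], set()
    queue = deque(sorted(instantiated_classes))
    while queue:
        term = queue.popleft()
        if term in visited:
            continue
        visited.add(term)
        for triple in triples:
            s, p, o = triple
            # Keep triples that mention the term and touch no instance.
            if term not in (s, o) or s in instances or o in instances:
                continue
            if triple not in found:
                found.append(triple)
            for neighbour in (s, o):
                if neighbour not in visited and neighbour not in STOP_TERMS:
                    queue.append(neighbour)
    return found


if __name__ == "__main__":
    for triple in intensional_triples(TRIPLES, {"ex:Sector", "ex:Organization"}):
        print(triple)
```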
Extensional Schema Extraction
• It uses different SPARQL aggregation queries to extract Sc, SCl and Oc
• It uses a technique called Pattern Strategy to complete the extraction
– The technique produces a higher number of less complex SPARQL queries
– It is used when the endpoint cannot answer an aggregation query and returns a timeout error
A complete list of the 24 query patterns is available at http://dbgroup.unimo.it/lodexQueries
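The fallback idea can be sketched as follows: try one aggregation query first and, if the endpoint times out, issue one simpler query per class instead. The SPARQL strings are illustrative and are not taken from the published list of 24 patterns. `run_query` is a hypothetical callable standing in for an HTTP client; it is assumed to return a list of dictionaries and to raise `TimeoutError` when the endpoint times out.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# One aggregation query computing the whole Sc index (illustrative SPARQL 1.1).
AGGREGATE_SC = (
    "SELECT ?c ?p (COUNT(DISTINCT ?s) AS ?n) "
    "WHERE { ?s a ?c ; ?p ?o . FILTER(isIRI(?o) && ?p != " + RDF_TYPE + ") } "
    "GROUP BY ?c ?p"
)

# Simpler queries used as a fallback: list the classes, then query each one.
LIST_CLASSES = "SELECT DISTINCT ?c WHERE { ?s a ?c }"
SC_FOR_CLASS = (
    "SELECT ?p (COUNT(DISTINCT ?s) AS ?n) "
    "WHERE {{ ?s a <{cls}> ; ?p ?o . FILTER(isIRI(?o) && ?p != " + RDF_TYPE + ") }} "
    "GROUP BY ?p"
)


def extract_sc(run_query):
    """Return Sc rows as (class, property, n), falling back on timeout."""
    try:
        return [(r["c"], r["p"], int(r["n"])) for r in run_query(AGGREGATE_SC)]
    except TimeoutError:
        rows = []
        for class_row in run_query(LIST_CLASSES):
            cls = class_row["c"]
            for r in run_query(SC_FOR_CLASS.format(cls=cls)):
                rows.append((cls, r["p"], int(r["n"])))
        return rows
```

The per-class queries trade one expensive request for many cheap ones, which is the trade-off the slide describes.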
Reachable datasets                  244
SPARQL 1.1 compatible               137
Extraction completed                107
Extraction completed without PS     33
Total triples (107 datasets)        3.45 billion
Average extraction time             6.12 min
Total time (single process)         11.15 h
Total time (9 processes)            3.35 h
The test was performed on a list of 469 datasets
• More than 90% completed the extraction in less than 500 s
• The PS technique has proved its worth
– from 33 to 107 datasets completed the extraction
• The IE process is scalable
– linear correlation between number of triples and time
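A few derived figures make these results easier to compare. The sketch below computes the share of datasets surviving each stage and the speedup from parallel execution. It assumes the times use a decimal comma (11.15 h and 3.35 h); the constant and function names are illustrative.

```python
# Figures reported on the slide.
LISTED = 469
REACHABLE = 244
SPARQL_11 = 137
COMPLETED = 107
COMPLETED_WITHOUT_PS = 33
SINGLE_PROCESS_H = 11.15  # assumed decimal hours
NINE_PROCESSES_H = 3.35   # assumed decimal hours


def percent(part, whole):
    """Share of part in whole, as a percentage with one decimal."""
    return round(100 * part / whole, 1)


def speedup(sequential, parallel):
    """Ratio between sequential and parallel wall-clock time."""
    return round(sequential / parallel, 2)


if __name__ == "__main__":
    print("reachable / listed        :", percent(REACHABLE, LISTED), "%")
    print("SPARQL 1.1 / reachable    :", percent(SPARQL_11, REACHABLE), "%")
    print("completed / SPARQL 1.1    :", percent(COMPLETED, SPARQL_11), "%")
    print("completions gained with PS:", COMPLETED - COMPLETED_WITHOUT_PS)
    print("speedup with 9 processes  :", speedup(SINGLE_PROCESS_H, NINE_PROCESSES_H))
```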
LODeX is an online tool that shows a visual Schema Summary for a LOD source
• We used the statistical indexes to generate the Schema Summary.
• Users can interact with the Schema Summary and focus on the information they are most interested in.
The tool is accessible at: www.dbgroup.unimo.it/lodex
Come to the LODeX demo at the ISWC demo session!
F. Benedetti, S. Bergamaschi, and L. Po, “A visual summary for linked open data sources”, International Semantic Web Conference (Posters & Demos), 2014.
Conclusion
• We are able to extract valuable indexes from a LOD dataset by taking advantage of the definition of Intensional and Extensional knowledge
• The extraction process has been tested on a large number of datasets, and its efficiency and effectiveness have been proven
Future Work
• To extend the VoID vocabulary with our descriptors
• We want to propose LODeX as an assistance tool for LOD portals.
• We are extending LODeX to support automatic SPARQL query generation
Thanks for your attention!