Online Index Extraction from Linked Open Data Sources

DB Group @ UNIMO

LD4IE 2014 – Riva del Garda, Italy

Dott. Fabio Benedetti, Dip. Ing. “Enzo Ferrari”, University of Modena and Reggio Emilia

Fabio Benedetti, Sonia Bergamaschi, Laura Po, Department of Engineering “Enzo Ferrari”, University of Modena & Reggio Emilia

Upload: fabio-benedetti

Post on 02-Jul-2015

Category:

Technology


DESCRIPTION

I gave this presentation at the Linked Data for Information Extraction 2014 (LD4IE) workshop, held at the International Semantic Web Conference 2014. The related paper is titled "Online Index Extraction from Linked Open Data Sources" and is available at: http://ceur-ws.org/Vol-1267/LD4IE2014_Benedetti.pdf

TRANSCRIPT

Page 1: Online Index Extraction from Linked Open Data Sources

DB Group @ UNIMO

LD4IE 2014 – Riva del Garda, Italy

Online Index Extraction from Linked Open Data Sources

Dott. Fabio Benedetti, Dip. Ing. “Enzo Ferrari”, University of Modena and Reggio Emilia

Fabio Benedetti, Sonia Bergamaschi, Laura Po

Department of Engineering “Enzo Ferrari”

University of Modena & Reggio Emilia

Page 2: Online Index Extraction from Linked Open Data Sources

• Selection of a relevant LOD source

• Statistical indexes

• Architecture Overview

• Performance Evaluation

• LODeX & Conclusions

Page 3: Online Index Extraction from Linked Open Data Sources

Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. "Adoption of the Linked Data Best Practices in Different Topical Domains." The Semantic Web–ISWC 2014. Springer International Publishing, 2014. 245-260.

Page 4: Online Index Extraction from Linked Open Data Sources

Domain                    2009               2014*
                          Number   %         Number   %
Cross-domain              41       13.95%    41       4.04%
Geographic                31       10.54%    21       2.07%
Government                49       16.67%    183      18.05%
Life sciences             41       13.95%    83       8.19%
Media                     25       8.50%     22       2.17%
Publications              87       29.59%    96       9.47%
Social web                0        0.00%     520      51.28%
User-generated content    20       6.80%     48       4.73%
Total                     294                1014

*Only 570 datasets belong to the LOD cloud; the remaining datasets do not contain ingoing/outgoing links to the LOD Cloud.

[Charts: datasets by domain, 2009 and 2014]

Page 5: Online Index Extraction from Linked Open Data Sources

1. The documentation of the dataset

– The documentation can be poor or absent

– There is no standard way to provide the documentation

– Sometimes it is provided as an RDF file in XML format

2. Search features of existing catalogs (e.g. Datahub)

– The metadata contain little information

– The search engine uses no information about the structure of the dataset

3. Manual exploration of the dataset

– It requires a good knowledge of the SPARQL language

– It is a time-consuming task

Page 6: Online Index Extraction from Linked Open Data Sources

To automatically extract a set of indexes that describe the structure of a LOD dataset

How to describe the dataset

LOD datasets can have different purposes and structures:

• Ontology/Vocabulary (OWL & RDFS constraints)

• Open Data (e.g. generated from an existing RDBMS)

The indexes should maximize the value of information extraction from heterogeneous datasets

Online & Automatic extraction

• It does not require any additional information from the user

• It works with SPARQL endpoints

– We have to handle the poor performance of these datasets

Page 7: Online Index Extraction from Linked Open Data Sources

We can think of the entire set of RDF triples as partitioned into:

• Intensional Knowledge

• Extensional Knowledge

The Intensional knowledge

• It contains the RDFS or OWL constraints of the Ontology

• It represents the T-Box components of the knowledge base

The Extensional knowledge

• It contains the real-world entities described in the dataset

• It represents the A-Box components of the knowledge base

• Its triples cover most of the dataset

Instantiated classes act as a bridge between the two types of knowledge

Page 8: Online Index Extraction from Linked Open Data Sources

[Figure: example RDF graph. Intensional Knowledge: the classes ex:Sector and ex:Organization (rdf:type owl:Class) and the property ex:sector (rdf:type owl:ObjectProperty, with rdfs:domain and rdfs:range). Extensional Knowledge: the instances sector1 (dc:name “Energy”), organization1 and organization2, linked by ex:sector. The Instantiated Classes connect the two parts through rdf:type.]
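The partition into intensional and extensional knowledge can be sketched on a toy graph like the one in this slide. This is only an illustration: the sets `SCHEMA_PREDICATES` and `SCHEMA_TYPES`, the function names, and the direction of `rdfs:domain`/`rdfs:range` are assumptions made for the example, not the rule used by the authors' extraction process.

```python
# Toy triples transcribed from the example graph (prefixed names kept as strings).
TRIPLES = [
    ("ex:Sector", "rdf:type", "owl:Class"),
    ("ex:Organization", "rdf:type", "owl:Class"),
    ("ex:sector", "rdf:type", "owl:ObjectProperty"),
    ("ex:sector", "rdfs:domain", "ex:Organization"),  # assumed direction
    ("ex:sector", "rdfs:range", "ex:Sector"),         # assumed direction
    ("sector1", "rdf:type", "ex:Sector"),
    ("sector1", "dc:name", "Energy"),
    ("organization1", "rdf:type", "ex:Organization"),
    ("organization1", "ex:sector", "sector1"),
]

# Assumption: these vocabulary terms mark a triple as schema-level (T-Box).
SCHEMA_PREDICATES = {"rdfs:domain", "rdfs:range", "rdfs:subClassOf", "rdfs:subPropertyOf"}
SCHEMA_TYPES = {"owl:Class", "rdfs:Class", "rdf:Property",
                "owl:ObjectProperty", "owl:DatatypeProperty"}


def partition(triples):
    """Split triples into (intensional, extensional) lists."""
    intensional, extensional = [], []
    for s, p, o in triples:
        is_schema = p in SCHEMA_PREDICATES or (p == "rdf:type" and o in SCHEMA_TYPES)
        (intensional if is_schema else extensional).append((s, p, o))
    return intensional, extensional


def instantiated_classes(extensional):
    """Classes that have at least one instance: the bridge between the two parts."""
    return {o for _, p, o in extensional if p == "rdf:type"}


intensional, extensional = partition(TRIPLES)
print(len(intensional), len(extensional), sorted(instantiated_classes(extensional)))
```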

Page 9: Online Index Extraction from Linked Open Data Sources

Name   Description                Structure                    Category
t      Number of Triples          Integer                      Generic
c      Number of Classes          Integer                      Generic
I      Number of Instances        Integer                      Generic
Cl     Class List                 List(name, n. instances)     Generic
Pl     Property List              List(name, n. occurrences)   Generic
IK     Intensional K. triples     List(s, p, o)                Intensional
Sc     Subject Class              List(c, p, n. occurrences)   Extensional
SCl    Subject Class to literal   List(c, p, n. occurrences)   Extensional
Oc     Object Class               List(c, p, n. occurrences)   Extensional

The Statistical Indexes are grouped in three categories:

• Generic

• Intensional

• Extensional
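One possible in-memory shape for these indexes is sketched below. The class name `StatisticalIndexes`, the lower-cased field names and the `CATEGORY` mapping are hypothetical; the slides only give the index names, their structure and their category.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StatisticalIndexes:
    # Generic
    t: int = 0                                                   # number of triples
    c: int = 0                                                   # number of classes
    i: int = 0                                                   # number of instances
    cl: List[Tuple[str, int]] = field(default_factory=list)     # (class name, n. instances)
    pl: List[Tuple[str, int]] = field(default_factory=list)     # (property name, n. occurrences)
    # Intensional
    ik: List[Tuple[str, str, str]] = field(default_factory=list)   # (s, p, o)
    # Extensional
    sc: List[Tuple[str, str, int]] = field(default_factory=list)   # (c, p, n. occurrences)
    scl: List[Tuple[str, str, int]] = field(default_factory=list)  # (c, p, n. occurrences)
    oc: List[Tuple[str, str, int]] = field(default_factory=list)   # (c, p, n. occurrences)


# Category of each index, following the table in this slide.
CATEGORY: Dict[str, str] = {
    "t": "Generic", "c": "Generic", "i": "Generic", "cl": "Generic", "pl": "Generic",
    "ik": "Intensional",
    "sc": "Extensional", "scl": "Extensional", "oc": "Extensional",
}

example = StatisticalIndexes(t=9, c=2, i=3, sc=[("ex:Organization", "ex:sector", 2)])
print(example.sc, CATEGORY["sc"])
```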

Page 10: Online Index Extraction from Linked Open Data Sources

[Figure: the example graph (ex:Sector, ex:Organization, sector1, organization1, organization2, “Energy”) annotated with the Subject Class, Subject Class to literal and Object Class patterns.]

     Sc (Subject Class)   SCl (Subject Class to literal)   Oc (Object Class)
S    ex:Organization      ex:Sector                        ex:Sector
P    ex:sector            dc:name                          ex:sector
n    2                    1                                1
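The three extensional indexes can be sketched over a small in-memory graph. Two assumptions are made here and should not be read as the authors' definition: both organizations are taken to point to sector1 through ex:sector, and n counts the distinct instances of the class that take part in the pattern, a rule chosen only because it reproduces the values 2, 1, 1 in the table. The `Literal` wrapper and the function name are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

RDF_TYPE = "rdf:type"


class Literal(NamedTuple):
    """Marks an object value as a literal rather than a resource."""
    value: str


def extensional_indexes(triples):
    """Return (Sc, SCl, Oc) as dicts mapping (class, property) -> n."""
    types = defaultdict(set)                      # instance -> its classes
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)

    sc, scl, oc = defaultdict(set), defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for c in types.get(s, ()):
            # Subject side: literal objects go to SCl, resource objects to Sc.
            (scl if isinstance(o, Literal) else sc)[(c, p)].add(s)
        if not isinstance(o, Literal):
            for c in types.get(o, ()):
                oc[(c, p)].add(o)                 # object side

    def count(index):
        return {key: len(instances) for key, instances in index.items()}

    return count(sc), count(scl), count(oc)


GRAPH = [
    ("sector1", RDF_TYPE, "ex:Sector"),
    ("organization1", RDF_TYPE, "ex:Organization"),
    ("organization2", RDF_TYPE, "ex:Organization"),
    ("organization1", "ex:sector", "sector1"),
    ("organization2", "ex:sector", "sector1"),
    ("sector1", "dc:name", Literal("Energy")),
]

sc, scl, oc = extensional_indexes(GRAPH)
print(sc, scl, oc)
```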

Page 11: Online Index Extraction from Linked Open Data Sources

It takes as input a list of URLs of SPARQL endpoints

It outputs a set of Statistical Indexes for each endpoint

• The IE process dynamically generates the SPARQL queries used to extract the Statistical Indexes

• It works in parallel, querying different datasets

• Partial results and the Statistical Indexes are stored in a NoSQL DB
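The input/output contract above can be sketched as a small driver. This is a sketch under stated assumptions: it uses threads from the Python standard library rather than the separate processes mentioned in the evaluation, a plain dictionary stands in for the NoSQL DB, and `extract_all` and `extract_one` are hypothetical names. `extract_one` is whatever function queries a single endpoint and returns its indexes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_all(endpoints, extract_one, max_workers=4):
    """Run extract_one(url) for every endpoint in parallel and collect the results."""
    store = {}  # stand-in for the NoSQL DB: endpoint URL -> stored document
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, url): url for url in endpoints}
        for future in as_completed(futures):
            url = futures[future]
            try:
                store[url] = {"status": "completed", "indexes": future.result()}
            except Exception as exc:  # e.g. unreachable endpoint or timeout
                store[url] = {"status": "failed", "error": str(exc)}
    return store


if __name__ == "__main__":
    def fake_extract(url):
        # Placeholder for the real extraction; no network access here.
        return {"t": 0, "c": 0, "i": 0}

    print(extract_all(["http://example.org/sparql"], fake_extract))
```

A failure on one endpoint is recorded and does not stop the others, which matters when many endpoints are slow or unreachable.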

Page 12: Online Index Extraction from Linked Open Data Sources

General Statistic Extraction

• It uses 6 different queries to extract the indexes of this group

Intensional Knowledge Extraction

• The extraction of the Intensional knowledge is performed through an iterative algorithm

• The algorithm traverses the graph starting from the instantiated classes

Extensional Schema Extraction

• It uses different SPARQL aggregation queries to extract SC, SCl and OC

• It uses a technique called Pattern Strategy to complete the extraction

– It produces a higher number of less complex SPARQL queries

– It is used when the endpoint cannot answer an aggregation query and throws a timeout error

A complete list of the 24 query patterns is available at http://dbgroup.unimo.it/lodexQueries
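The fallback idea behind the Pattern Strategy (one heavy aggregation query replaced by many lighter ones after a timeout) can be sketched as follows. The SPARQL strings below are illustrative guesses, not the 24 published query patterns, and `run_query` is a hypothetical callable that sends a query to an endpoint, returns rows as dictionaries, and raises `TimeoutError` on timeout.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
RunQuery = Callable[[str], List[Row]]

# One aggregation query for the whole Subject Class index (illustrative only).
AGGREGATE_SC = """
SELECT ?c ?p (COUNT(?s) AS ?n)
WHERE { ?s a ?c . ?s ?p ?o . ?o a ?oc }
GROUP BY ?c ?p
"""


def properties_query(c: str) -> str:
    """Lighter query: which properties do instances of class c use?"""
    return f"SELECT DISTINCT ?p WHERE {{ ?s a <{c}> . ?s ?p ?o . ?o a ?oc }}"


def count_query(c: str, p: str) -> str:
    """Lighter query: count for a single (class, property) pair."""
    return f"SELECT (COUNT(?s) AS ?n) WHERE {{ ?s a <{c}> . ?s <{p}> ?o . ?o a ?oc }}"


def extract_sc(run_query: RunQuery, classes: List[str]) -> List[Tuple[str, str, int]]:
    """Try the single aggregation query; on timeout, fall back to simpler queries."""
    try:
        return [(r["c"], r["p"], int(r["n"])) for r in run_query(AGGREGATE_SC)]
    except TimeoutError:
        pass  # the endpoint could not answer the aggregation in time

    result = []
    for c in classes:  # the class list is assumed to be known already (index Cl)
        for row in run_query(properties_query(c)):
            p = row["p"]
            n = int(run_query(count_query(c, p))[0]["n"])
            result.append((c, p, n))
    return result
```

The fallback issues one query per class plus one per (class, property) pair, trading a single expensive request for many cheap ones.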

Page 13: Online Index Extraction from Linked Open Data Sources

Page 14: Online Index Extraction from Linked Open Data Sources

Reachable datasets                 244
SPARQL 1.1 compatible              137
Extraction completed               107
Extraction completed without PS    33
Total triples (107 datasets)       3.45 billion
Average extraction time            6.12 min
Total time (single process)        11.15 h
Total time (9 processes)           3.35 h

The test was performed on a list of 469 datasets

• More than 90% completed the extraction in less than 500 s

• The PS technique has proved its worth

– completed extractions rose from 33 to 107

• The IE process is scalable

– linear correlation between number of triples and time
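A few ratios follow directly from the figures in this slide. The short calculation below assumes the times are decimal hours (11.15 h and 3.35 h); the variable names are made up for the example.

```python
listed, reachable, sparql11 = 469, 244, 137
completed, completed_without_ps = 107, 33
single_process_h, nine_processes_h, processes = 11.15, 3.35, 9

reachable_share = reachable / listed                 # share of listed datasets that answered
completion_rate = completed / sparql11               # completed among SPARQL 1.1 endpoints
gained_with_ps = completed - completed_without_ps    # extractions completed only thanks to PS
speedup = single_process_h / nine_processes_h        # gain from running 9 processes
efficiency = speedup / processes                     # speedup per process

print(f"reachable: {reachable_share:.0%}, completed: {completion_rate:.0%}")
print(f"extra completions with PS: {gained_with_ps}")
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

Under that assumption, nine processes give a speedup of roughly 3.3x rather than 9x.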

Page 15: Online Index Extraction from Linked Open Data Sources

LODeX is an online tool that shows a visual Schema Summary for a LOD source

• We use the statistical indexes to generate the Schema Summary.

• Users can interact with the dataset's Schema Summary and focus on the information they are most interested in.

The tool is accessible at: www.dbgroup.unimo.it/lodex

Come to the LODeX demo at the ISWC demo session!

F. Benedetti, S. Bergamaschi, and L. Po, “A visual summary for linked open data sources”, International Semantic Web Conference (Posters & Demos), 2014.

Page 16: Online Index Extraction from Linked Open Data Sources

Conclusion

• We are able to extract valuable indexes from a LOD dataset by taking advantage of the definition of Intensional and Extensional knowledge

• The extraction process has been tested on a large number of datasets, and its efficiency and effectiveness have been proven

Future Work

• To extend the VoID vocabulary with our descriptors

• We want to propose LODeX as an assistance tool for LOD portals.

• We are extending LODeX to support automatic SPARQL query generation

Page 17: Online Index Extraction from Linked Open Data Sources

Page 18: Online Index Extraction from Linked Open Data Sources

Thanks for your attention!