4V - WP3 Progress Report (TIN2013-46238)


Upload: nandana-mihindukulasooriya

Post on 12-Apr-2017


Page 1: 4V - WP3 Progress Report (TIN2013-46238)

4V: Volumen, Velocidad, Variedad y Validez en la gestión innovadora de datos

(TIN2013-46238)

Progress Report – WP3
Zaragoza, 15 June 2016

Ontology Engineering Group (OEG)
Escuela Técnica Superior de Ingenieros Informáticos
Universidad Politécnica de Madrid
Campus de Montegancedo, Boadilla del Monte, 28660, Spain

Page 2: 4V - WP3 Progress Report (TIN2013-46238)

Outline

• Loupe
  • On-going work
• Quality Assessment and Repair
  • Conciseness
  • Consistency
• Collaborations
  • A two-fold quality assurance approach for dynamic KBs: The 3cixty use case

Nandana Mihindukulasooriya, OEG

Page 3: 4V - WP3 Progress Report (TIN2013-46238)

Loupe - An Online Tool for Inspecting Datasets in the Linked Data Cloud
Demo @ ISWC 2015

Page 4: 4V - WP3 Progress Report (TIN2013-46238)

Loupe - Overview

Explore the vocabularies used and the abstract triple patterns in 2+ billion triples, including all DBpedia datasets, Wikidata, LinkedBrainz, and Bio2RDF.

Loupe helps to understand data, uncover patterns, formulate queries, and detect quality issues.

Loupe - An Online Tool for Inspecting Datasets in the Linked Data Cloud, Demo @ ISWC 2015.

Page 5: 4V - WP3 Progress Report (TIN2013-46238)


Loupe – Google Analytics

Ontology Engineering Group, Universidad Politécnica de Madrid

Page 6: 4V - WP3 Progress Report (TIN2013-46238)

Loupe – Google Analytics (II)

• Users from 84 countries
  • Spain (23.76%), US (16.69%), Germany (10.64%), UK (9.14%), Italy (4.51%)

Ontology Engineering Group

Page 7: 4V - WP3 Progress Report (TIN2013-46238)


Loupe On-going work


Page 8: 4V - WP3 Progress Report (TIN2013-46238)

Loupe – Use Case Analysis

• Dataset descriptions
  • Dataset statistics
  • Dataset profiling
• Dataset exploration
  • Class/property browsing
  • Triple pattern browsing
• Dataset discovery and recommendation
  • Keywords, vocabularies
  • SPARQL queries
  • RDF shapes
• Quality assessment
  • Consistency
  • Misused vocabularies
• Guided SPARQL query generation
  • Auto-complete based on abstract triple patterns
• Vocabulary reuse and recommendation
  • Recommendation of vocabularies based on popularity
• Ontology development feedback
  • Common properties

Page 9: 4V - WP3 Progress Report (TIN2013-46238)

Loupe – LOD Laundromat integration

• Current status of Loupe
  • 2 billion triples from 32 datasets
• LOD Laundromat
  • 32 billion triples from 650K documents
  • Cleaned for syntax errors and duplicates
  • Coverage of smaller documents
• Collaboration with VU University Amsterdam
• Steps
  • Fully automatic dataset download, SPARQL endpoint creation, indexing, clean up
  • UI changes to handle a large number of datasets
  • Vocabulary usage datasets

Page 10: 4V - WP3 Progress Report (TIN2013-46238)

Loupe Ontology – Vocabulary Usage Statistics of LOD

• Analysis of existing metrics
  • VoID
  • DCAT
  • RDFStats
  • LODStats
  • VoID-Ext
• Analysis of use case requirements
  • Statistics
  • Profiling
  • Discovery
  • Recommendation

Page 11: 4V - WP3 Progress Report (TIN2013-46238)


Loupe Ontology


Page 12: 4V - WP3 Progress Report (TIN2013-46238)

An Analysis of the Quality Issues of the Properties Available in the Spanish DBpedia
CAEPIA 2015, Albacete

Page 13: 4V - WP3 Progress Report (TIN2013-46238)

Analyzed Quality Dimensions

An Analysis of the Quality Issues of the Properties Available in the Spanish DBpedia, CAEPIA 2015.

A. Conciseness. A dataset does not contain redundant concepts with different identifiers.

B. Consistency. A dataset does not contain conflicting or contradictory data.

C. Syntactic Validity. Values belong to the legal value range for the represented domain and do not violate the syntactic rules.

D. Semantic Accuracy. Values correctly represent real-world facts.

Page 14: 4V - WP3 Progress Report (TIN2013-46238)

Conciseness

• Many redundant properties in esDBpedia
  • 97.93% are auto-generated
• Causes
  • Capitalization (857): partidosEnPrimera, partidosenprimera
  • Synonyms: causaDeMuerte, causaDeFallecimiento
  • Prepositions: causaDeFallecimiento, causaFallecimiento
  • Spelling (7,495): apeliido, apelldio, apellid
  • Singular/plural: apellido, apellidos
  • Gender: administrador, administradora
  • Accent usage (1,252): administracion, administración
  • Parsing (107): altitudMin/máx, residencia/trabajo, idioma/s
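Two of the causes listed on this slide, capitalization and accent usage, can be spotted with plain string normalization. The following is a minimal sketch, not the method used in the CAEPIA 2015 analysis: the function names `normalize` and `redundancy_candidates` are hypothetical, and the input is simply a list of property local names taken from the slide.

```python
import unicodedata
from collections import defaultdict


def normalize(name: str) -> str:
    """Case-fold a property local name and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()


def redundancy_candidates(names):
    """Group names that collapse to the same normalized key."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize(name)].add(name)
    # Only keys shared by more than one spelling are candidates.
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


properties = [
    "partidosEnPrimera", "partidosenprimera",   # capitalization
    "administracion", "administración",         # accent usage
    "apellido", "apellidos",                    # singular/plural: not caught here
]
candidates = redundancy_candidates(properties)
```

Synonyms, spelling errors, singular/plural and gender variants need fuzzier matching and are deliberately left out of this sketch.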

Page 15: 4V - WP3 Progress Report (TIN2013-46238)

Consistency

• Diverse and incorrect domain and range types
  • esdbpedia:edad has range of type dbo:Place
  • esdbpedia:lugarmuerte has range of type dbo:Person
  • esdbpedia:pais has range of type dbo:Actor
• OWL properties with IRI and literal values
  • 3,380 properties
  • Use of strings and URLs interchangeably
  • esdbpedia:lugarDeEntierro
    • "Madrid"@es
    • http://es.dbpedia.org/resource/Madrid
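The "IRI and literal values" issue can be illustrated with a small check over triples held in memory. This is a sketch under stated assumptions: the `IRI` and `Literal` wrapper types, the function `mixed_object_properties` and the two subject identifiers are hypothetical; only the property `esdbpedia:lugarDeEntierro` and its two values come from the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class IRI(NamedTuple):
    value: str


class Literal(NamedTuple):
    value: str
    lang: str = ""


def mixed_object_properties(triples):
    """Return properties used with both IRI and literal objects."""
    kinds = defaultdict(set)
    for _subject, prop, obj in triples:
        kinds[prop].add(type(obj).__name__)
    return sorted(p for p, seen in kinds.items() if seen == {"IRI", "Literal"})


triples = [
    # Subjects are placeholders, not real esDBpedia resources.
    ("ex:personA", "esdbpedia:lugarDeEntierro", Literal("Madrid", "es")),
    ("ex:personB", "esdbpedia:lugarDeEntierro",
     IRI("http://es.dbpedia.org/resource/Madrid")),
    ("ex:personA", "esdbpedia:apellido", Literal("García", "es")),
]
mixed = mixed_object_properties(triples)
```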

Page 16: 4V - WP3 Progress Report (TIN2013-46238)


Conciseness


Page 17: 4V - WP3 Progress Report (TIN2013-46238)

How to query for the birth place of a person in DBpedia?

DBpedia (lang): English
  Syntactically similar: birthplace, birthplace, placeofbirth, birthplace, birthdplace, birthPalce, birthplace, PlaceOfBirth, laceOfBirth, oplaceOfBirth, birthplace, birthplace, birthPalce, birthPlae, birthPace, birthPlaxe, birtPlace, birthPlcace, bithPlace, brithPlace, nbirthPlace, birthplace, birghPlace, birthdplace, biRthPlace, birth, placebirth, placeOfBirth, placOfBirth, birthPlaceOf, birthPlae
  Semantically similar: cityofbirth, cityofbirthPlace, cityOfBirth, birthLocation

DBpedia (lang): Spanish
  Syntactically similar: birthPlace, placeOfBirth, birthPlace, birthplace, lugarDeNacimiento, lugarNacimiento, lugarNacimiento, lugarnacimiento, lugardenacimiento, lugarNacimento, lugarNaciento
  Semantically similar: ciudaddenacimiento, ciudadDenacimiento, paisdenacimiento, paisNacimiento

DBpedia (lang): German
  Syntactically similar: geburtsort, birthplace, birthPlace, placeOfBirth, placeofbirth
  Semantically similar: geburtsland, countryofbirth

Page 18: 4V - WP3 Progress Report (TIN2013-46238)

Conciseness

• Less-concise datasets
  • Multiple identifiers with the same semantics
• Issues
  • Harder to understand data and vocabularies used
  • Harder to write queries
  • Harder to reuse
• Causes
  • Less concise mappings
    • Diverse distributed mappings created by multiple teams
    • No policies or guidance on consistent vocabulary usage
    • No tools for recommending classes / properties
  • Crowd-sourced ontologies
    • No or minimal labels / descriptions

Page 19: 4V - WP3 Progress Report (TIN2013-46238)

RDF generation process

[Figure: layered diagram of the RDF generation process]
Layers: Data sources, Transformation, Storage, Access
Components shown:
• Structured data, unstructured data
• Bulk RDF Transformation (e.g., LOD Refine, DBpedia extraction framework, ad-hoc programs)
• Query Rewriting
• RDF Mappings (e.g., R2RML, Mappings Wiki, D2R mappings, LOD Refine RDF skeletons)
• SPARQL Endpoint (e.g., Virtuoso, Fuseki)
• RDF Dumps
• Linked Data Resources (e.g., Pubby, ELDA)
• Triple Store
• Web Server
• SPARQL Clients
• Linked Data Clients

Page 20: 4V - WP3 Progress Report (TIN2013-46238)

DBpedia extraction process

[Figure: DBpedia extraction process. Labels: infobox, mappings, RDF, Triplestore, Rendering]

Page 21: 4V - WP3 Progress Report (TIN2013-46238)

Issues in DBpedia mappings

• 16 DBpedia chapters
  • Crowd-sourced mappings using the mappings wiki
• 5553 template mappings
  • Mostly using the DBpedia ontology
• 739 classes, 3049 properties
  • In-concise usage of similar properties
    • elevation & height, formationYear & foundingYear, team & club, occupation & profession, foundedBy & founder
• Plan for repair
  • Detection of inconsistent property usage
  • Feedback to the ontology team
  • Feedback and guidance to the mapping teams
  • Automatic cleaning of the mappings (in RML)

Page 22: 4V - WP3 Progress Report (TIN2013-46238)

Repairing conciseness issues in mappings

[Figure: the RDF generation process diagram from page 19, with the same layers (Data sources, Transformation, Storage, Access) and components]

Page 23: 4V - WP3 Progress Report (TIN2013-46238)

Detecting in-concise mapping based on data

dbr:Adobe_Systems dbo:formationYear "1982"^^xsd:gYear

dbr:Adobe_Systems dbo:foundingYear "1982"^^xsd:gYear

DBpedia EN

DBpedia ES

Page 24: 4V - WP3 Progress Report (TIN2013-46238)

Detection of in-concise mappings

Metric            Graph 1 (e.g., DBpedia EN)    Graph 2 (e.g., DBpedia ES)
M1(C,P1,P2)       S_C P1 ?o                     S_C P2 ?o
M2(C,P1,P2)       S_C P1 O                      S_C P2 O
M3(C,P1,P2)       S_C P1 O1                     S_C P2 O2
M4(G1,C,P1,P2)    S_C P1 ?o ; S_C P2 ?o
M5(G2,C,P1,P2)                                  S_C P1 ?o ; S_C P2 ?o

C        P1                  P2             M1    M2/M1  M3/M1  M4/M1  M5/M1
Company  foundingYear        formationYear  170   0.72   0.24   0      0.05
Person   activeYearsEndYear  year           150   0.84   0.16   0      0
Person   birthPlace          deathPlace     2845  0.59   0.43   0.53   0.31

(Slide annotation: in-concise mappings)
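The slide does not spell out how M1 to M5 are computed, so the following sketch encodes one possible reading and should be treated as an assumption: M1 counts subjects of class C that use P1 in graph 1 and P2 in graph 2; M2 counts those that share at least one object; M3 counts those with at least one differing pair of objects; M4 and M5 count M1 subjects that use both properties inside graph 1 or graph 2 respectively. The function names and the toy data are hypothetical.

```python
def objects_by_subject(graph, prop):
    """Index the objects of `prop` by subject for a set of (s, p, o) triples."""
    index = {}
    for s, p, o in graph:
        if p == prop:
            index.setdefault(s, set()).add(o)
    return index


def overlap_metrics(g1, g2, subjects, p1, p2):
    """Count M1..M5 for subjects of one class (assumed reading of the slide)."""
    g1_p1, g1_p2 = objects_by_subject(g1, p1), objects_by_subject(g1, p2)
    g2_p1, g2_p2 = objects_by_subject(g2, p1), objects_by_subject(g2, p2)
    m = {"M1": 0, "M2": 0, "M3": 0, "M4": 0, "M5": 0}
    for s in subjects:
        if s not in g1_p1 or s not in g2_p2:
            continue
        m["M1"] += 1
        left, right = g1_p1[s], g2_p2[s]
        if left & right:                                   # same object in both graphs
            m["M2"] += 1
        if any(a != b for a in left for b in right):       # some differing pair
            m["M3"] += 1
        if s in g1_p2:                                     # both properties in graph 1
            m["M4"] += 1
        if s in g2_p1:                                     # both properties in graph 2
            m["M5"] += 1
    return m


# Toy data; subjects A, B, C stand for companies.
g1 = {("A", "formationYear", "1982"), ("B", "formationYear", "1990"),
      ("C", "formationYear", "2001"), ("C", "foundingYear", "2001")}
g2 = {("A", "foundingYear", "1982"), ("B", "foundingYear", "1991"),
      ("C", "foundingYear", "2001")}
metrics = overlap_metrics(g1, g2, {"A", "B", "C"}, "formationYear", "foundingYear")
ratios = {k: v / metrics["M1"] for k, v in metrics.items() if k != "M1"}
```

Under this reading, a high M2/M1 together with low M4/M1 and M5/M1 suggests that the two properties carry the same information and are rarely used side by side, which is the profile of the first two rows in the table.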

Page 25: 4V - WP3 Progress Report (TIN2013-46238)

RDF generation process

[Figure: the RDF generation process diagram from page 19, with the same layers (Data sources, Transformation, Storage, Access) and components]

Page 26: 4V - WP3 Progress Report (TIN2013-46238)

Property Maps

Property Map Generation

• Step 1: Group properties into clusters according to their domain and range

• Step 2: Multilingual NL preprocessing

• Step 3: Aggregate properties by similarity (syntactic and semantic)
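The three steps above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: step 2 is reduced to case folding and accent stripping, step 3 uses only syntactic similarity (`difflib.SequenceMatcher` from the Python standard library) and skips semantic similarity entirely, and the 0.8 threshold and all function names are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher


def preprocess(name):
    """Step 2 stand-in: case-fold and strip accents."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def similar(a, b, threshold):
    """Step 3 stand-in: syntactic similarity only."""
    return SequenceMatcher(None, preprocess(a), preprocess(b)).ratio() >= threshold


def build_property_map(properties, threshold=0.8):
    """properties: iterable of (name, domain, range) tuples."""
    # Step 1: group by (domain, range).
    by_signature = {}
    for name, domain, range_ in properties:
        by_signature.setdefault((domain, range_), []).append(name)
    # Steps 2 and 3: greedy clustering inside each group.
    property_map = {}
    for signature, names in by_signature.items():
        clusters = []
        for name in names:
            for cluster in clusters:
                if any(similar(name, member, threshold) for member in cluster):
                    cluster.append(name)
                    break
            else:
                clusters.append([name])
        property_map[signature] = clusters
    return property_map


props = [
    ("birthPlace", "Person", "Place"),
    ("birthPalce", "Person", "Place"),
    ("birthplace", "Person", "Place"),
    ("deathPlace", "Person", "Place"),
    ("birthDate", "Person", "date"),
]
property_map = build_property_map(props)
```

Grouping by domain and range first keeps pairs such as birthPlace and birthDate apart even though their names are close.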

Page 27: 4V - WP3 Progress Report (TIN2013-46238)


Enhance SPARQL queries with property mappings
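One way to read this slide title: once a group of equivalent property identifiers is known, a query over a single property can be widened to the whole group with a SPARQL `VALUES` clause. The sketch below only builds the query string. The helper `expand_with_variants`, the example subject IRI and the choice of the `http://dbpedia.org/property/` namespace for the variants are assumptions for illustration.

```python
def expand_with_variants(subject_iri, variant_iris, var="value"):
    """Build a SELECT query that matches any of the given property IRIs."""
    values = " ".join(f"<{iri}>" for iri in variant_iris)
    return (
        f"SELECT DISTINCT ?{var} WHERE {{\n"
        f"  VALUES ?p {{ {values} }}\n"
        f"  <{subject_iri}> ?p ?{var} .\n"
        f"}}"
    )


# Assumed namespace for the raw property variants.
DBP = "http://dbpedia.org/property/"
variants = [DBP + name for name in ("birthPlace", "placeOfBirth", "cityOfBirth")]

# The subject is a placeholder, not a real DBpedia resource.
query = expand_with_variants(
    "http://dbpedia.org/resource/Example_Person", variants, var="place"
)
print(query)
```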


Page 28: 4V - WP3 Progress Report (TIN2013-46238)


Consistency


Page 29: 4V - WP3 Progress Report (TIN2013-46238)

Consistency

• A consistent dataset does not contain conflicting or contradictory data.

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbo:City a owl:Class ;
    rdfs:subClassOf
        [ a owl:Restriction ;
          owl:onProperty dbo:populationTotal ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] ,
        [ a owl:Restriction ;
          owl:onProperty dbo:mayor ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] .

dbo:country a owl:ObjectProperty ;
    rdfs:domain dbo:City ;
    rdfs:range dbo:Country .

Page 30: 4V - WP3 Progress Report (TIN2013-46238)

Consistency (II)

• Consistency issues
  • Data does not comply with the formal definitions or schema

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .

dbr:Zaragoza a dbo:City ;
    dbo:populationTotal 666058 ;                  # (1)
    dbo:populationTotal 684953 ;                  # (1)
    dbo:country dbr:Aragón ;                      # (3)
    dbo:mayor dbr:Juan_Alberto_Belloch ;          # (2)
    dbo:mayor dbr:Pedro_Santisteve_Roche .        # (2)

dbr:Aragón a dbo:AutonomousCommunity .
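Read as a closed-world validation problem, the cardinality issues in the Zaragoza example can be found by counting distinct values per subject and property. The sketch below is illustrative only: `max_cardinality_violations` is a hypothetical helper, triples are plain tuples, and the limits mirror the two `owl:maxCardinality` restrictions on `dbo:City`.

```python
from collections import defaultdict


def max_cardinality_violations(triples, limits):
    """Return {(subject, property): values} where the value count exceeds the limit."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in limits:
            values[(s, p)].add(o)
    return {
        key: sorted(objs, key=str)
        for key, objs in values.items()
        if len(objs) > limits[key[1]]
    }


triples = [
    ("dbr:Zaragoza", "rdf:type", "dbo:City"),
    ("dbr:Zaragoza", "dbo:populationTotal", 666058),
    ("dbr:Zaragoza", "dbo:populationTotal", 684953),
    ("dbr:Zaragoza", "dbo:country", "dbr:Aragón"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Juan_Alberto_Belloch"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Pedro_Santisteve_Roche"),
]
limits = {"dbo:populationTotal": 1, "dbo:mayor": 1}
violations = max_cardinality_violations(triples, limits)
```

The range problem on `dbo:country` is a different kind of check (value type rather than count) and is not covered by this sketch.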

Page 31: 4V - WP3 Progress Report (TIN2013-46238)

populationTotal - Cardinality Violation (1)

Page 32: 4V - WP3 Progress Report (TIN2013-46238)

Consistency – (Incorrect) inferences

dbr:Juan_Alberto_Belloch owl:sameAs dbr:Pedro_Santisteve_Roche .    # (2)

dbr:Aragón a dbo:Country .    # (3)

• Open World Assumption and Non-Unique Name Assumption
  • Works better for inferencing than validation
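The two inferred triples follow from how OWL treats the schema: with no unique name assumption, two values for a property restricted to at most one value are taken to denote the same individual, and an `rdfs:range` axiom types whatever appears as the object. The sketch below mimics just these two rules for object properties. It is a toy illustration with hypothetical names, not an OWL reasoner.

```python
from itertools import combinations


def infer(triples, max_one, ranges):
    """Apply two OWL-style rules: max-cardinality-1 => sameAs, range => type."""
    inferred = set()
    values = {}
    for s, p, o in triples:
        if p in ranges:
            # Range axiom: the object is inferred to be an instance of the range.
            inferred.add((o, "rdf:type", ranges[p]))
        if p in max_one:
            values.setdefault((s, p), set()).add(o)
    for objs in values.values():
        # No unique name assumption: distinct names are merged, not rejected.
        for a, b in combinations(sorted(objs), 2):
            inferred.add((a, "owl:sameAs", b))
    return inferred


triples = [
    ("dbr:Zaragoza", "dbo:country", "dbr:Aragón"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Juan_Alberto_Belloch"),
    ("dbr:Zaragoza", "dbo:mayor", "dbr:Pedro_Santisteve_Roche"),
]
inferred = infer(triples, max_one={"dbo:mayor"}, ranges={"dbo:country": "dbo:Country"})
```

Neither rule reports an error, which is why the slide notes that this style of reasoning suits inferencing better than validation.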

Page 33: 4V - WP3 Progress Report (TIN2013-46238)

Consistency – Rich Semantics

• Checking consistency with OWL.

@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbo:City a owl:Class ;
    rdfs:subClassOf
        [ a owl:Restriction ;
          owl:onProperty dbo:populationTotal ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] ,
        [ a owl:Restriction ;
          owl:onProperty dbo:mayor ;
          owl:maxCardinality "1"^^xsd:nonNegativeInteger ] .

dbo:country a owl:ObjectProperty ;
    rdfs:domain dbo:Place ;
    rdfs:range dbo:Country .

dbo:AutonomousCommunity owl:disjointWith dbo:Country .    # (3)

dbr:Juan_Alberto_Belloch owl:differentFrom dbr:Pedro_Santisteve_Roche .    # (2)

Page 34: 4V - WP3 Progress Report (TIN2013-46238)

Consistency – SHACL constraints

• Checking consistency with W3C SHACL.

@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix schema: <http://schema.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

_:cityShape a sh:Shape ;
    sh:scopeClass dbo:City ;
    sh:property [
        sh:predicate dbo:mayor ;
        sh:maxCount 1 ;
        sh:nodeKind sh:IRI ;
        sh:classIn ( dbo:Person schema:Person foaf:Person )
    ] ;
    sh:property [
        sh:predicate dbo:country ;
        sh:maxCount 1 ;
        sh:minCount 1 ;
        sh:nodeKind sh:IRI ;
        sh:classIn ( dbo:Country ) ;
        sh:stem "http://dbpedia.org/"
    ] .

Page 35: 4V - WP3 Progress Report (TIN2013-46238)

Data validation with semi-automatically generated RDF Shapes

[Figure: pipeline] Pattern Extraction, Domain Expert Review, RDF Shape Generation, Data Validation, Data Repair

Artifact shown: SHACL Shapes

Page 36: 4V - WP3 Progress Report (TIN2013-46238)

Cardinality constraints example

Property cardinalities of schema:Place class (extracted from data)

                                                   Share (%) of instances with cardinality
Property             Min  Max  P1  P99  Mean    0        1        2        3       4        5
rdf:type             1    2    1   1    1.0002  0        99.9793  0.0207   0       0        0
rdfs:label           1    6    1   6    4.2508  0        4.4048   36.6743  1.7445  0.4831   0
rdfs:seeAlso         0    4    1   2    1.5717  0.0340   42.7702  57.1905  0.0041  0.0011   0
owl:sameAs           0    6    0   0    0.0058  99.4455  0.5339   0.0146   0.0041  0.0015   0
schema.org:review    0    2    0   2    0.0329  98.3175  0.0717   1.6108   0       0        0
schema.org:url       0    40   0   10   0.5085  89.8340  1.8947   3.7013   0.3008  1.2155   0.3434
events:poster        0    23   0   1    0.0155  98.9609  0.5900   0.4237   0.0097  0.0120   0.0007
dc:publisher         0    2    0   2    1.0677  39.1777  14.8776  45.9447  0       0        0
events:businessType  0    4    0   2    1.5273  4.1889   38.9255  56.8673  0.0041  0.0142   0
schema:description   0    28   1   12   3.0573  0.0886   30.5193  32.8359  1.9605  19.1139  0.1226
geo:location         0    24   0   4    0.2040  92.7525  0.6819   3.2436   0.2634  2.9831   0.0060

Common cardinalities

Pat.  Min  Max  Description
A     0    N    No restrictions
B     0    1    Maximum 1
C     1    N    Minimum 1
D     1    1    Exactly 1

Cardinality Classifier

Extracted Patterns (schema:Place class)
rdf:type            D (Exactly 1)
rdfs:label          C (Minimum 1)
rdfs:seeAlso        C (Minimum 1)
owl:sameAs          A (No restrictions)
schema.org:review   A (No restrictions)

Expert Review

Approved Patterns (schema:Place class)
rdf:type            C (Minimum 1)
rdfs:label          C (Minimum 1)
rdfs:seeAlso        C (Minimum 1)
owl:sameAs          A (No restrictions)
schema.org:review   A (No restrictions)

Restrictions in SHACL

_:placeShape a sh:Shape ;
    sh:scopeClass schema:Place ;
    sh:property [ sh:predicate rdf:type ; sh:minCount 1 ] ;
    sh:property [ sh:predicate rdfs:label ; sh:minCount 1 ] ;
    sh:property [ sh:predicate rdfs:seeAlso ; sh:minCount 1 ] .
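The slide does not state the rule used by the cardinality classifier. The sketch below shows one simple heuristic that happens to reproduce the five extracted patterns listed for schema:Place: take the lower bound from the 1st percentile (P1) and set an upper bound of 1 only when the 99th percentile (P99) is exactly 1. Both the rule and the names `PATTERNS` and `classify` are assumptions, not the project's classifier.

```python
# Pattern letters from the "Common cardinalities" table; None stands for N.
PATTERNS = {
    (0, None): "A",  # No restrictions
    (0, 1): "B",     # Maximum 1
    (1, None): "C",  # Minimum 1
    (1, 1): "D",     # Exactly 1
}


def classify(p1, p99):
    """Map the P1/P99 cardinality percentiles of a property to a pattern letter."""
    lower = 1 if p1 >= 1 else 0
    upper = 1 if p99 == 1 else None
    return PATTERNS[(lower, upper)]


# (P1, P99) pairs copied from the schema:Place table.
observed = {
    "rdf:type": (1, 1),
    "rdfs:label": (1, 6),
    "rdfs:seeAlso": (1, 2),
    "owl:sameAs": (0, 0),
    "schema.org:review": (0, 2),
}
extracted = {prop: classify(p1, p99) for prop, (p1, p99) in observed.items()}
```

Using percentiles instead of the raw minimum and maximum keeps a handful of outliers (for example the 0.0207% of places with two rdf:type values) from loosening the pattern; the expert review step can then override the result, as it does for rdf:type.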

Page 37: 4V - WP3 Progress Report (TIN2013-46238)

W3C SHACL restrictions

• Value type constraints
  • sh:class, sh:classIn, sh:datatype, sh:datatypeIn, sh:nodeKind
• Cardinality constraints
  • sh:minCount, sh:maxCount
• Value range constraints
  • sh:minInclusive, sh:minExclusive, sh:maxInclusive, sh:maxExclusive
• String based constraints
  • sh:minLength, sh:maxLength, sh:pattern, sh:stem, sh:uniqueLang
• Property pair constraints
  • sh:equals, sh:disjoint, sh:lessThan, sh:lessThanOrEquals

Page 38: 4V - WP3 Progress Report (TIN2013-46238)


A Two-Fold Quality Assurance Approach for Dynamic Knowledge Bases: The 3cixty Use Case


Page 39: 4V - WP3 Progress Report (TIN2013-46238)


Continuous Integration is essential


Page 40: 4V - WP3 Progress Report (TIN2013-46238)


Exploratory testing with Loupe


Page 41: 4V - WP3 Progress Report (TIN2013-46238)

Automated testing with SPARQL Interceptor

• A set of user-defined SPARQL queries (as unit tests)
• Knowledge-base specific

[Figure: inputs to the Test SPARQL Queries] System Requirements; Schema Constraints; Conventions and other restrictions; Inputs from Exploratory Testing
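The idea of SPARQL queries acting as unit tests can be sketched without any particular tool. The code below is not SPARQL Interceptor's API, which the slides do not show: `SparqlTest`, `run_tests` and `fake_ask` are hypothetical names, and the endpoint is replaced by a stub so the example runs offline. Each test pairs an ASK query with the boolean answer expected from a healthy knowledge base.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SparqlTest:
    name: str
    ask_query: str
    expected: bool


def run_tests(tests: List[SparqlTest], ask: Callable[[str], bool]) -> List[str]:
    """Run every ASK query through `ask` and return the names of failing tests."""
    return [t.name for t in tests if ask(t.ask_query) != t.expected]


tests = [
    SparqlTest(
        name="city has at most one mayor",
        ask_query=(
            "ASK { ?city a dbo:City ; dbo:mayor ?m1, ?m2 . "
            "FILTER(?m1 != ?m2) }"
        ),
        expected=False,
    ),
    SparqlTest(
        name="knowledge base contains cities",
        ask_query="ASK { ?city a dbo:City }",
        expected=True,
    ),
]

# Stub endpoint: pretends the first query finds a violation.
canned_answers = {tests[0].ask_query: True, tests[1].ask_query: True}


def fake_ask(query: str) -> bool:
    return canned_answers[query]


failures = run_tests(tests, fake_ask)
```

In a continuous integration setting, `fake_ask` would be swapped for a function that sends the query to the real SPARQL endpoint, and a non-empty `failures` list would fail the build.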

Page 42: 4V - WP3 Progress Report (TIN2013-46238)


SPARQL Interceptor


Designed and implemented by Localidata.