Graph databases & data integration The case of RDF By Dimitris Kontokostas AKSW/KILT - Leipzig DBpedia Association Thessaloniki Java Meetup / 09.05.2016

Upload: dimitris-kontokostas

Post on 15-Jan-2017

Page 1: Graph databases & data integration - the case of RDF

Graph databases & data integration

The case of RDF

By Dimitris Kontokostas
AKSW/KILT - Leipzig
DBpedia Association

Thessaloniki Java Meetup / 09.05.2016

Page 2: Graph databases & data integration - the case of RDF

Thessaloniki Java meetup - 09.05.2016

About me

● I live in Veria
● I am an ex-ICT teacher
● Since 2003 I have been working mainly on R&D projects
  ○ + some web development
● Since 2012 I have been doing a PhD & working in the AKSW group in Leipzig
  ○ Focusing on semantic web technologies (RDF, SPARQL, and many other scary terms)
  ○ aka Knowledge Engineer
● I am an open source enthusiast (DBpedia, RDFUnit)
● Recently became a W3C specification editor for SHACL
● Walked across many langs but ended up in Scala, Java, & Bash
  ○ With bash / CLI as a first choice ;)

Page 3: Graph databases & data integration - the case of RDF

Before we start… who knows?

LOD Cloud

Linked Data

Page 4: Graph databases & data integration - the case of RDF

Agenda*

● Graphs
● RDF Graphs
● Data integration
● Who uses RDF
● Quick overview of:
  ○ DBpedia
  ○ SPARQL
  ○ RelFinder
  ○ Schema.org & actions
  ○ JSON-LD
  ○ Entity disambiguation
  ○ Data Quality

(*) focusing mostly on getting familiar with basic terms and concepts
(**) Apologies in advance for mixing Greek with English

Page 5: Graph databases & data integration - the case of RDF

Page 6: Graph databases & data integration - the case of RDF

The four V’s heatmap for Graph Databases

Study in 2013 found:
● many organizations find the “variety” dimension a greater challenge than volume or velocity.

Graph DBs to the rescue:
● combine multiple sources with different structures
● while retaining the flexibility to add new ones without adapting schemas
● query combined data, or multiple sources at once
● detect patterns in the data

(*) See also this

Page 7: Graph databases & data integration - the case of RDF

© Image by Max De Margi

Page 8: Graph databases & data integration - the case of RDF

Graphs

● A graph is a way of specifying relationships among a collection of items
● Items
  ○ Nodes - Alice, Bob, …
  ○ Edges
    ■ undirected - knows, …
    ■ directed - follows, …
  ○ Values - weights, distances, scores, 0-5 scale, …
  ○ Attributes - name, time, ...

Page 9: Graph databases & data integration - the case of RDF

Graph Data Models

Property graphs

● Industry standards
  ○ Neo4j, Titan, Apache TinkerPop, ...
  ○ App-specific ways of querying, exporting, importing, etc.
  ○ Optimized for specific operations and in many cases faster

RDF Graphs

● W3C standards
  ○ Like XML / HTML: define once, run everywhere™
  ○ Standardised way of querying, exporting, importing

Page 10: Graph databases & data integration - the case of RDF

Property Graphs

● Each node has a
  ○ unique identifier.
  ○ set of outgoing edges.
  ○ set of incoming edges.
  ○ collection of key-value properties.
● Each edge
  ○ is directed.
  ○ has a unique identifier.
  ○ has a label that denotes the type of relationship between its source and target nodes.
  ○ has a collection of key-value properties.
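The node and edge structure listed above can be sketched in a few lines of Python. This is only an illustration of the data model, not how Neo4j or TinkerPop store things; the class names, the connect helper and the sample values (n1, e1, since=2016) are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality: nodes and edges reference each other
class Node:
    id: str
    properties: dict = field(default_factory=dict)
    outgoing: list = field(default_factory=list, repr=False)
    incoming: list = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Edge:
    id: str
    label: str      # type of relationship between source and target
    source: Node
    target: Node
    properties: dict = field(default_factory=dict)


def connect(edge_id, label, source, target, **properties):
    """Create a directed edge and register it on both of its end nodes."""
    edge = Edge(edge_id, label, source, target, dict(properties))
    source.outgoing.append(edge)
    target.incoming.append(edge)
    return edge


alice = Node("n1", {"name": "Alice"})
bob = Node("n2", {"name": "Bob"})
follows = connect("e1", "follows", alice, bob, since=2016)
```

Note that both nodes and edges carry their own key-value properties, which is the main structural difference from the RDF triples discussed next.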

Page 11: Graph databases & data integration - the case of RDF

RDF - Resource Description Framework

● An RDF Graph is a set of RDF Triples
● An RDF triple consists of (only) three components:
  ○ the subject (is an IRI)
  ○ the predicate (is an IRI)
  ○ the object (can be an IRI or Literal)
  ○ (subjects and objects can also be blank nodes but let’s leave it for now)

[Figure: example RDF graph. Nodes: http://dbpedia.org/resource/Java, http://dbpedia.org/resource/C++, http://dbpedia.org/resource/C# and the literal “1.8.0_60”. Edges: dbo:latestReleaseVersion and dbo:influencedBy (twice). The parts of a triple are labelled Subject, Predicate, Object.]
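As a rough sketch of "a graph is a set of triples", the snippet below models triples as plain Python tuples. The Triple class and the objects helper are illustrative names, literals are simply written in quotes, and the direction of the dbo:influencedBy edge is an assumption made for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # an IRI
    predicate: str  # an IRI
    obj: str        # an IRI, or a literal written in quotes


DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

# An RDF graph is a *set* of triples: no order, no duplicates.
graph = {
    Triple(DBR + "Java", DBO + "latestReleaseVersion", '"1.8.0_60"'),
    # Edge direction assumed here for illustration only.
    Triple(DBR + "Java", DBO + "influencedBy", DBR + "C++"),
}

# Adding an identical triple again does not change the graph.
graph.add(Triple(DBR + "Java", DBO + "influencedBy", DBR + "C++"))


def objects(graph, subject, predicate):
    """Return every object attached to the given subject and predicate."""
    return {t.obj for t in graph if t.subject == subject and t.predicate == predicate}
```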

Page 12: Graph databases & data integration - the case of RDF

RDF is an abstract data model

Turtle
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix ex: <http://example.com/> .
ex:Dimitris a dbo:Person .

N-Triples
<http://example.com/Dimitris> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .

JSON-LD
{
  "@id": "http://example.com/Dimitris",
  "@type": "http://dbpedia.org/ontology/Person"
}

RDF/XML
<rdf:Description rdf:about="http://example.com/Dimitris">
  <rdf:type rdf:resource="http://dbpedia.org/ontology/Person"/>
</rdf:Description>

RDFa (embedded in HTML)
<div xmlns="http://www.w3.org/1999/xhtml"
     prefix="rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# dbo: http://dbpedia.org/ontology/ rdfs: http://www.w3.org/2000/01/rdf-schema#">
  <div typeof="dbo:Person" about="http://example.com/Dimitris"></div>
</div>

Page 13: Graph databases & data integration - the case of RDF

RDF & Graphs (Separate)

File1.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.com/> .
ex:Dimitris foaf:knows ex:Petros .

File2.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.com/> .
ex:Dimitris a foaf:Person .
ex:Petros a foaf:Person .

File3.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix ex: <http://example.com/> .
ex:Dimitris foaf:interest dbpedia:RDF .
ex:Petros foaf:interest dbpedia:Apache_Cassandra .

Page 14: Graph databases & data integration - the case of RDF

RDF & Graphs (merge)

File_all.ttl
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.com/> .
ex:Dimitris foaf:knows ex:Petros .

ex:Dimitris a foaf:Person .
ex:Petros a foaf:Person .

@prefix dbpedia: <http://dbpedia.org/resource/> .

ex:Dimitris foaf:interest dbpedia:RDF .
ex:Petros foaf:interest dbpedia:Apache_Cassandra .
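Because a graph is a set of triples, merging files like the ones above amounts to a set union (blank nodes need extra care, which this sketch ignores). The following minimal Python illustration uses tuples for triples; the helper interests_of_acquaintances is a made-up name showing a question that only the merged graph can answer.

```python
EX = "http://example.com/"
FOAF = "http://xmlns.com/foaf/0.1/"
DBR = "http://dbpedia.org/resource/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# One set of (subject, predicate, object) tuples per source file.
file1 = {(EX + "Dimitris", FOAF + "knows", EX + "Petros")}
file2 = {
    (EX + "Dimitris", RDF_TYPE, FOAF + "Person"),
    (EX + "Petros", RDF_TYPE, FOAF + "Person"),
}
file3 = {
    (EX + "Dimitris", FOAF + "interest", DBR + "RDF"),
    (EX + "Petros", FOAF + "interest", DBR + "Apache_Cassandra"),
}

# Merging graphs without blank nodes is plain set union.
merged = file1 | file2 | file3


def interests_of_acquaintances(graph, person):
    """Interests of everyone the given person knows (needs file1 and file3 together)."""
    known = {o for s, p, o in graph if s == person and p == FOAF + "knows"}
    return {o for s, p, o in graph if s in known and p == FOAF + "interest"}
```

Queried on file1 or file3 alone the helper finds nothing; on the merged graph it can follow foaf:knows from one file into foaf:interest from another, because both files use the same IRI for ex:Petros.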

Page 15: Graph databases & data integration - the case of RDF

RDF & Graphs (dataset / multi-graph) TriG (.trig) files

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix ex: <http://example.com/> .

<http://example.com/relations-graph> {
  ex:Dimitris foaf:knows ex:Petros .
}

<http://example.com/types-graph> {
  ex:Dimitris a foaf:Person .
  ex:Petros a foaf:Person .
}

<http://example.com/interests-graph> {
  ex:Dimitris foaf:interest dbpedia:Apache_Cassandra-free-placeholder .
}
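A dataset keeps the graphs apart by giving each one a name, so every triple effectively becomes a quad (subject, predicate, object, graph). A rough Python sketch, assuming a plain dict from graph name to a set of triples; quads and graphs_mentioning are illustrative helper names, not part of any RDF library.

```python
EX = "http://example.com/"
FOAF = "http://xmlns.com/foaf/0.1/"
DBR = "http://dbpedia.org/resource/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Graph name -> set of (subject, predicate, object) triples.
dataset = {
    EX + "relations-graph": {(EX + "Dimitris", FOAF + "knows", EX + "Petros")},
    EX + "types-graph": {
        (EX + "Dimitris", RDF_TYPE, FOAF + "Person"),
        (EX + "Petros", RDF_TYPE, FOAF + "Person"),
    },
    EX + "interests-graph": {
        (EX + "Dimitris", FOAF + "interest", DBR + "RDF"),
        (EX + "Petros", FOAF + "interest", DBR + "Apache_Cassandra"),
    },
}


def quads(dataset):
    """Flatten the dataset into (subject, predicate, object, graph) quads."""
    for name, graph in dataset.items():
        for s, p, o in graph:
            yield (s, p, o, name)


def graphs_mentioning(dataset, resource):
    """Names of the graphs in which the resource appears as subject or object."""
    return {g for s, p, o, g in quads(dataset) if resource in (s, o)}
```

Keeping the graph name around makes it possible to tell where a statement came from, which a plain merge throws away.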

Page 16: Graph databases & data integration - the case of RDF

RDF & Linked Data

● Using HTTP(S)-based IRIs we get the Web of Data
  ○ See the TED talk by Tim Berners-Lee (creator of the WWW)
● Every RDF resource becomes like a REST GET API that returns all the RDF triples it is associated with
  ○ content negotiation for RDF (machine) or HTML (human)
  ○ Follow-your-nose pattern

[Figure: the earlier example graph (http://dbpedia.org/resource/Java, http://dbpedia.org/resource/C++, http://dbpedia.org/resource/C#, dbo:latestReleaseVersion “1.8.0_60”, dbo:influencedBy) extended with http://aksw.org/DimitrisKontokostas and http://www.geonames.org/733905/. Additional edge labels: ex:learns, dbo:birthPlace, geo:lat (40.52437), geo:long (22.20242).]
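Content negotiation boils down to sending an Accept header with an ordinary HTTP GET. Below is a minimal sketch using only the Python standard library; rdf_request and fetch_rdf are illustrative names, and it assumes the server can return Turtle (text/turtle) for the requested IRI.

```python
import urllib.request


def rdf_request(iri, media_type="text/turtle"):
    """Build a GET request that asks the server for RDF instead of HTML."""
    return urllib.request.Request(iri, headers={"Accept": media_type})


def fetch_rdf(iri, media_type="text/turtle", timeout=10):
    """Dereference the IRI; urllib follows any redirects automatically."""
    with urllib.request.urlopen(rdf_request(iri, media_type), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request needs no network access; fetch_rdf would actually send it.
request = rdf_request("http://dbpedia.org/resource/Java")
```

Following your nose then means taking any IRI found in the returned triples and dereferencing it the same way.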

Page 17: Graph databases & data integration - the case of RDF

LOD CLOUD

>1K Datasets
>50B Triples
>100M links

Page 18: Graph databases & data integration - the case of RDF

Vocabularies & Semantics

● Vocabularies/Ontologies define classes and predicates (properties) in RDF
  ○ ex:Dimitris a dbo:Person
  ○ ex:Dimitris dbo:birthDate “1981-06-06”^^xsd:date
● Existing vocabularies capture many use cases
  ○ DBpedia ontology (general purpose)
  ○ Schema.org (general purpose / newer, backed by Google, Yahoo, Bing & Yandex)
  ○ FOAF (Friend of a Friend)
  ○ Geo (geographical)
  ○ PROV-O (data provenance)
  ○ SKOS (classifications)
  ○ Org (organization structure)
  ○ … http://lov.okfn.org lists more than 400 vocabularies

Page 19: Graph databases & data integration - the case of RDF

Vocabularies & Semantics

● Classes and predicates (properties) have definitions (semantics)
● ex:Dimitris a dbo:Person
  ○ dbo:Person belongs to a class hierarchy
● ex:Dimitris dbo:birthDate “1981-06-06”^^xsd:date
  ○ dbo:birthDate expects a dbo:Person as subject
  ○ dbo:birthDate expects an xsd:date as object
● Reusing existing vocabularies (classes & properties) with defined semantics is a good practice
  ○ Get part of the data modeling for free
  ○ Using common terms can help integrate data more easily
  ○ Validation (or inference) for free
    ■ ex:Thessaloniki dbo:birthDate “1981-06-06”^^xsd:date (is Thessaloniki a Person?)
    ■ ex:Dimitris dbo:birthDate ex:Thessaloniki (ex:Thessaloniki is not an xsd:date)
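The two checks above can be sketched as a tiny validator. This is a deliberately simplified illustration: the VOCAB dictionary is a hand-written stand-in for the real ontology, the class hierarchy is ignored, prefixed names are kept as plain strings, and typing ex:Thessaloniki as dbo:City is an assumption for the example.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    lexical: str
    datatype: str


RDF_TYPE = "rdf:type"

# Hypothetical, simplified vocabulary description: property -> expected subject/object.
VOCAB = {"dbo:birthDate": {"domain": "dbo:Person", "range": "xsd:date"}}


def violations(triples):
    """Report subjects and objects that do not match what the vocabulary expects."""
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    problems = []
    for s, p, o in triples:
        rule = VOCAB.get(p)
        if rule is None:
            continue
        if rule["domain"] not in types.get(s, set()):
            problems.append(f"{s} is not declared as {rule['domain']}")
        if not (isinstance(o, Literal) and o.datatype == rule["range"]):
            problems.append(f"object of {s} {p} is not an {rule['range']}")
    return problems


data = [
    ("ex:Dimitris", RDF_TYPE, "dbo:Person"),
    ("ex:Thessaloniki", RDF_TYPE, "dbo:City"),
    ("ex:Dimitris", "dbo:birthDate", Literal("1981-06-06", "xsd:date")),
    ("ex:Thessaloniki", "dbo:birthDate", Literal("1981-06-06", "xsd:date")),
    ("ex:Dimitris", "dbo:birthDate", "ex:Thessaloniki"),
]
report = violations(data)
```

An inference engine would read the same vocabulary the other way round and conclude that ex:Thessaloniki is a dbo:Person; which reading you want depends on the application.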

Page 20: Graph databases & data integration - the case of RDF

Data integration with RDF

● Very simple graph data model
● Convert your data to RDF and model against common vocabularies
  ○ Design applications against vocabularies
  ○ Integrate multiple different sources
● Local identifiers are a common integration problem
● Link to data authorities
  ○ ex:Dimitris dbo:birthPlace geonames:733905 (instead of the local ex:Veria)
  ○ (or) ex:Veria owl:sameAs geonames:733905
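One way to picture the owl:sameAs option: once two sources link their local identifiers to the same authority IRI, their data can be rewritten to meet on that IRI. The sketch below is deliberately naive (it follows links in one direction only, whereas owl:sameAs is symmetric and transitive), and source_b with its other:town-42 identifier is hypothetical.

```python
SAME_AS = "owl:sameAs"

source_a = {
    ("ex:Dimitris", "dbo:birthPlace", "ex:Veria"),
    ("ex:Veria", SAME_AS, "geonames:733905"),
}
# Hypothetical second source with its own local identifier for the same town.
source_b = {
    ("other:town-42", "rdfs:label", "Veria"),
    ("other:town-42", SAME_AS, "geonames:733905"),
}


def canonicalize(triples):
    """Replace local identifiers with the authority IRI they are linked to."""
    links = {s: o for s, p, o in triples if p == SAME_AS}
    return {
        (links.get(s, s), p, links.get(o, o))
        for s, p, o in triples
        if p != SAME_AS
    }


integrated = canonicalize(source_a | source_b)
```

After the rewrite, the birthplace from one source and the label from the other hang off the same node.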

Page 21: Graph databases & data integration - the case of RDF

Pay as you go Data Integration

● RDF views on top of RDBMS (e.g. MySQL) with R2RML (W3C spec)
  ○ Mapping files define how SQL queries / tables translate to RDF
  ○ Queryable through a virtual SPARQL endpoint that translates SPARQL to SQL
● Convert XML/JSON/CSV/… to RDF with RML.io using mapping files
● Find links to external databases with Limes & Silk
  ○ e.g.: ex:Veria owl:sameAs geonames:733905
● You can get some benefit with low effort
● The more time you invest the better the results
● (Common practice) work on secondary RDF views of your data
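The idea behind a mapping file can be shown without any tooling: a template builds the subject IRI from a row, and each column is tied to a predicate. The sketch below is not R2RML or RML syntax; the rows, the mapping dictionary layout and the rows_to_triples helper are all hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical rows, as they might come back from "SELECT id, name FROM person".
rows = [
    {"id": 1, "name": "Dimitris"},
    {"id": 2, "name": "Petros"},
]

# Hypothetical mapping: how one row becomes a subject, a class and properties.
mapping = {
    "subject_template": "http://example.com/person/{id}",
    "class": "http://xmlns.com/foaf/0.1/Person",
    "columns": {"name": "http://xmlns.com/foaf/0.1/name"},
}


def rows_to_triples(rows, mapping):
    """Translate table rows into (subject, predicate, object) triples."""
    for row in rows:
        subject = mapping["subject_template"].format(**row)
        yield (subject, RDF_TYPE, mapping["class"])
        for column, predicate in mapping["columns"].items():
            if row.get(column) is not None:  # empty (NULL) columns produce no triple
                yield (subject, predicate, str(row[column]))


triples = list(rows_to_triples(rows, mapping))
```

The source table stays untouched, which is what makes the approach pay-as-you-go: the mapping can start with one table and a couple of columns and grow later.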

Page 23: Graph databases & data integration - the case of RDF

Some More Statistics

● Based on the Common Crawl of Nov 2015
● 30% of HTML pages (541M / 1.77B pages) contained structured data.
● This 30% originates from 2.72M different pay-level-domains out of the 14.41 million pay-level-domains covered by the crawl (19%).
  ○ 521K websites use RDFa
  ○ 1.1 million use Microdata
  ○ 586K have embedded JSON-LD (mostly for search actions)
● Altogether, the extracted data sets consist of 24.38 billion RDF quads.

http://webdatacommons.org/structureddata/2015-11/stats/stats.html#results-2015-1

Page 25: Graph databases & data integration - the case of RDF

SPARQL

“Which films starred John Cleese without any other members of Monty Python?”

SPARQL Examples by Markus Ackermann & Markus Freudenberg
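To give a feel for what a SPARQL engine does with such a question, here is a toy matcher over an in-memory set of triples: a basic graph pattern is a list of triple patterns with ?variables, and a filter throws away unwanted solutions. This is not SPARQL syntax and not the queries from the talk; the ex: films, people and properties are invented toy data.

```python
def match(pattern, triple, binding):
    """Try to extend a variable binding so that the pattern equals the triple."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None
    return binding


def solve(patterns, graph, binding=None):
    """Yield every binding that satisfies all triple patterns (a basic graph pattern)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        extended = match(first, triple, binding)
        if extended is not None:
            yield from solve(rest, graph, extended)


# Invented toy data: FilmA also stars another Monty Python member, FilmB does not.
graph = {
    ("ex:FilmA", "ex:starring", "ex:JohnCleese"),
    ("ex:FilmA", "ex:starring", "ex:OtherPython"),
    ("ex:FilmB", "ex:starring", "ex:JohnCleese"),
    ("ex:FilmB", "ex:starring", "ex:SomeoneElse"),
    ("ex:JohnCleese", "ex:memberOf", "ex:MontyPython"),
    ("ex:OtherPython", "ex:memberOf", "ex:MontyPython"),
}


def cleese_only_films(graph):
    """Films starring John Cleese with no other Monty Python member in the cast."""
    result = set()
    for film in solve([("?film", "ex:starring", "ex:JohnCleese")], graph):
        other_pythons = [
            b for b in solve(
                [("?film", "ex:starring", "?other"),
                 ("?other", "ex:memberOf", "ex:MontyPython")],
                graph, film)
            if b["?other"] != "ex:JohnCleese"  # the filter step
        ]
        if not other_pythons:
            result.add(film["?film"])
    return result
```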

Page 26: Graph databases & data integration - the case of RDF

Page 27: Graph databases & data integration - the case of RDF

Basic Graph Pattern

Page 28: Graph databases & data integration - the case of RDF

Page 29: Graph databases & data integration - the case of RDF

Graph Group Pattern

Page 30: Graph databases & data integration - the case of RDF

Page 31: Graph databases & data integration - the case of RDF

Filtering Unwanted Results

Page 32: Graph databases & data integration - the case of RDF

Page 33: Graph databases & data integration - the case of RDF

RelFinder demo (flash)

http://www.visualdataweb.org/relfinder/demo.swf?obj1=Sm9obiBDbGVlc2V8aHR0cDovL2RicGVkaWEub3JnL3Jlc291cmNlL0pvaG5fQ2xlZXNl&obj2=VGVycnkgR2lsbGlhbXxodHRwOi8vZGJwZWRpYS5vcmcvcmVzb3VyY2UvVGVycnlfR2lsbGlhbQ==&obj3=TW9udHkgUHl0aG9ufGh0dHA6Ly9kYnBlZGlhLm9yZy9yZXNvdXJjZS9Nb250eV9QeXRob24=&name=REJwZWRpYSAobWlycm9yKSAoZnJvbSBVUkwgcGFyYW1ldGVycyk=&abbreviation=ZGJwMTQ2MjczNzExMTgxMQ==&description=TGlua2VkIERhdGEgdmVyc2lvbiBvZiBXaWtpcGVkaWEu&endpointURI=aHR0cDovL2RicGVkaWEub3JnL3NwYXJxbA==&dontAppendSPARQL=ZmFsc2U=&defaultGraphURI=aHR0cDovL2RicGVkaWEub3Jn&isVirtuoso=dHJ1ZQ==&useProxy=dHJ1ZQ==&method=UE9TVA==&autocompleteLanguage=ZW4=&autocompleteURIs=aHR0cDovL3d3dy53My5vcmcvMjAwMC8wMS9yZGYtc2NoZW1hI2xhYmVs&ignoredProperties=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3dpa2lQYWdlV2lraUxpbmssaHR0cDovL2RicGVkaWEub3JnL3Byb3BlcnR5L3dpa2lQYWdlVXNlc1RlbXBsYXRlLGh0dHA6Ly9kYnBlZGlhLm9yZy9wcm9wZXJ0eS93aWtpbGluayxodHRwOi8vZGJwZWRpYS5vcmcvcHJvcGVydHkvd29yZG5ldF90eXBlLGh0dHA6Ly9wdXJsLm9yZy9kYy90ZXJtcy9zdWJqZWN0LGh0dHA6Ly93d3cudzMub3JnLzE5OTkvMDIvMjItcmRmLXN5bnRheC1ucyN0eXBlLGh0dHA6Ly93d3cudzMub3JnLzIwMDIvMDcvb3dsI3NhbWVBcyxodHRwOi8vd3d3LnczLm9yZy8yMDA0LzAyL3Nrb3MvY29yZSNzdWJqZWN0&abstractURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L2Fic3RyYWN0&imageURIs=aHR0cDovL2RicGVkaWEub3JnL29udG9sb2d5L3RodW1ibmFpbCxodHRwOi8veG1sbnMuY29tL2ZvYWYvMC4xL2RlcGljdGlvbg==&linkURIs=aHR0cDovL3B1cmwub3JnL29udG9sb2d5L21vL3dpa2lwZWRpYSxodHRwOi8veG1sbnMuY29tL2ZvYWYvMC4xL2hvbWVwYWdlLGh0dHA6Ly94bWxucy5jb20vZm9hZi8wLjEvcGFnZQ==&maxRelationLegth=Mg==
Page 34: Graph databases & data integration - the case of RDF

Schema.org

● Vocabulary backed by the major search engines (Google, Yahoo, Bing & Yandex)
● RDF data model
  ○ Normative format is JSON-LD
  ○ RDF is not actively mentioned (to not scare people away)
  ○ Allows use as general structured data (e.g. microdata)
● Enriches a lot of (at least) Google’s applications
  ○ Search (try e.g. recipes)
  ○ Gmail (travel, events, actions,...)
  ○ Google Now
  ○ Google Knowledge Graph
  ○ ...

Page 35: Graph databases & data integration - the case of RDF

Schema.org actions

Page 36: Graph databases & data integration - the case of RDF

JSON-LD

● Like normal JSON but better ;)

Page 37: Graph databases & data integration - the case of RDF

JSON-LD

● Like normal JSON but better ;)
● @context makes the difference
● Append your own context
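What @context adds can be sketched in a few lines: it maps the short keys of an ordinary JSON document to full IRIs. The helper below only renames top-level keys, which is a small fraction of what a real JSON-LD processor does during expansion (values, nesting and remote contexts are ignored); expand_keys and the sample document are illustrative.

```python
import json

# An ordinary JSON document plus a @context that says what the keys mean.
document = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": "http://xmlns.com/foaf/0.1/knows"
  },
  "@id": "http://example.com/Dimitris",
  "name": "Dimitris",
  "knows": "http://example.com/Petros"
}
""")


def expand_keys(document):
    """Replace short keys with the IRIs given in @context (top level only)."""
    context = document.get("@context", {})
    return {
        context.get(key, key): value
        for key, value in document.items()
        if key != "@context"
    }


expanded = expand_keys(document)
```

An application that ignores @context still sees plain JSON with name and knows keys, which is why existing JSON APIs can adopt it by appending a context.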

Page 38: Graph databases & data integration - the case of RDF

JSON-LD

Page 39: Graph databases & data integration - the case of RDF

JSON-LD

Page 40: Graph databases & data integration - the case of RDF

JSON-LD

Page 41: Graph databases & data integration - the case of RDF

JSON-LD links

● Previous examples

● JSON-LD specification & playground

● Hypermedia self-described APIs with Hydra

Page 42: Graph databases & data integration - the case of RDF

Entity disambiguation

aka NERD (Named Entity Resolution & Disambiguation)

● George Bush is sitting in front of the White House
  ○ George: some George?
  ○ Bush: a small plant
  ○ George Bush: former president of USA
  ○ White: Colour
  ○ House: a house
  ○ White House:

● http://dbpedia-spotlight.github.io/demo/

Page 43: Graph databases & data integration - the case of RDF

Data Quality

● As mentioned earlier, we can (re)use the vocabulary semantics for automatic data validation
● RDFUnit - https://github.com/AKSW/RDFUnit
  ○ Automatically generates data unit tests based on the vocabularies your data uses
  ○ Custom JUnit runner
● SHACL - http://w3c.github.io/data-shapes/shacl/
  ○ Language to define advanced data constraints on RDF Graphs
  ○ W3C Recommendation (in progress)

Page 44: Graph databases & data integration - the case of RDF

ALIGNED project

● Aligning software & data engineering
● Tools & techniques for agility in changes in code / data
● http://aligned-project.eu
● Offers free consultancy on ALIGNED tools
  ○ See website for more info

Page 45: Graph databases & data integration - the case of RDF

Wrapping up / Key points

● Data variety is a common problem
● Integrating data can be a pain :)
● Graph databases can help, RDF can sometimes be more appropriate
● Pay as you go data integration
  ○ Map your data to RDF
  ○ Keep RDF as a copy of your source data
● RDF helps you develop reusable applications against schemas
● Schema.org
  ○ For website markups
  ○ For defining actions
● JSON-LD (embedded mappings)
● RDF for text annotations
● There is very good tool support for RDF in Java

Page 46: Graph databases & data integration - the case of RDF

Links

● http://json-ld.org/
● http://wiki.dbpedia.org
● http://dbpedia-spotlight.github.io/demo/
● http://schema.org
● http://aksw.org - Many interesting tools
● http://wikidata.org
● Apache Jena - RDF Java library
● Virtuoso - Open Source RDF & RDBMS DB

Page 47: Graph databases & data integration - the case of RDF

Thank you!

Questions?

Slides available at slideshare.net/jimkont