Distributed Query Processing for Federated RDF Data Management Olaf Görlitz 07.11.2014

Page 1: Distributed Query Processing for Federated RDF Data Management

Distributed Query Processing for Federated RDF Data Management

Olaf Görlitz

07.11.2014

Page 2: Distributed Query Processing for Federated RDF Data Management

Olaf Görlitz: Distributed Query Processing for Federated RDF Data Management

07.11.2014, Slide 2

The Linked Open Data Cloud

Use as one large database!

Page 3: Distributed Query Processing for Federated RDF Data Management

Life Science Scenario

Find drugs for nutritional supplementation

SELECT ?drug ?id ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}

Page 4: Distributed Query Processing for Federated RDF Data Management

Linked Data Querying Paradigms

Data Warehouse

Link Traversal

Federation

Page 5: Distributed Query Processing for Federated RDF Data Management

Linked Data Querying Paradigms

Requirements | Data Warehouse | Link Traversal | Federation

Query Expressiveness

Schema Mapping

Data Freshness

Result Completeness

Scalability

Flexibility

Availability

Performance

Page 6: Distributed Query Processing for Federated RDF Data Management

Contributions

● Large Scale Information Retrieval: PINTS (Peer-to-Peer Statistics Management)
● RDF Federation & Query Optimization: SPLENDID (Distributed SPARQL Query Processing)
● Benchmarking RDF Federation Systems: SPLODGE (Linked Data Query Generation)

Görlitz, Staab: SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. COLD'11

Görlitz, Thimm, Staab: SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. ISWC'12

Görlitz, Sizov, Staab: PINTS: Peer-to-Peer Infrastructure for Tagging Systems. IPTPS'08

Page 7: Distributed Query Processing for Federated RDF Data Management

SPLENDID Federation

Federated Databases | Federated RDF
● Relational Schema | ● Implicit Schema, Ontologies
● Specific Data Wrappers | ● SPARQL endpoints
● Rich Data Statistics | ● Limited Statistics (voiD)

Execute complex SPARQL queries over federated RDF data sources

Page 8: Distributed Query Processing for Federated RDF Data Management

SPLENDID Federation

SPARQL Query → Source Selection → Query Optimization → Query Execution

SELECT ?drug ?id ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug purl:title ?title .
}

Join operators: ⋈ ?drug, ⋈ ?id, ⋈ ?keggDrug, ⋈ ?keggDrug

Triple patterns:
?drug drugbank:drugCategory category:micronutrient
?drug drugbank:casRegistryNumber ?id
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
?keggDrug purl:title ?title

Page 9: Distributed Query Processing for Federated RDF Data Management

Source Selection Objectives

SPARQL Query → Source Selection → Query Optimization → Query Execution

Determine all relevant data sources

DARQ
● Explicit 'capabilities'
● Query restrictions (bound predicates)

FedX
● ASK queries + caching; many (initial) requests
● Sub-query aggregation

SPLENDID
● VoiD descriptions + ASK queries
● Sub-query aggregation
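
The SPLENDID column above (voiD descriptions plus ASK queries) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not SPLENDID's actual code: `DATA`, `VOID_PREDICATES`, `ask`, and `select_sources` are hypothetical names, and the in-memory triples stand in for remote SPARQL endpoints and their voiD descriptions.

```python
from typing import Optional

Triple = tuple[str, str, str]
Pattern = tuple[Optional[str], Optional[str], Optional[str]]  # None = variable

# Hypothetical in-memory stand-in for remote SPARQL endpoints (made-up triples).
DATA: dict[str, set[Triple]] = {
    "DrugBank": {("db:D1", "drugbank:casRegistryNumber", "50-81-7")},
    "KEGG": {("kegg:K1", "rdf:type", "kegg:Drug"),
             ("kegg:K1", "bio2rdf:xRef", "50-81-7")},
    "ChEBI": {("chebi:C1", "rdf:type", "chebi:Compound")},
}

# Hypothetical voiD-style index: the predicates that occur in each source.
VOID_PREDICATES = {src: {p for _, p, _ in triples} for src, triples in DATA.items()}


def ask(source: str, pattern: Pattern) -> bool:
    """Simulate a SPARQL ASK request: does any triple match the pattern?"""
    return any(all(q is None or q == t for q, t in zip(pattern, triple))
               for triple in DATA[source])


def select_sources(pattern: Pattern) -> list[str]:
    s, p, o = pattern
    # Step 1: prune by predicate using the statistics (no request needed).
    if p is None:
        candidates = sorted(DATA)
    else:
        candidates = sorted(src for src, preds in VOID_PREDICATES.items()
                            if p in preds)
    # Step 2: refine with ASK only when a bound subject or object may prune more.
    if s is not None or o is not None:
        candidates = [src for src in candidates if ask(src, pattern)]
    return candidates
```

In this toy setup, the pattern `?keggDrug rdf:type kegg:Drug` first matches every source that uses `rdf:type` and is then narrowed to KEGG by the simulated ASK requests.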

Page 10: Distributed Query Processing for Federated RDF Data Management

[Figure: data sources, each labeled with a voiD description]

Source Selection Example

SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}

→ KEGG, DBpedia, ChEBI → KEGG

→ DrugBank

SPARQL ASK

→ DrugBank, ChEBI

→ KEGG

Page 11: Distributed Query Processing for Federated RDF Data Management

Source Selection Result

[Figure: join tree over the five triple patterns of the example query, with joins on ?drug, ?id, and ?keggDrug (twice)]

Page 12: Distributed Query Processing for Federated RDF Data Management

Query Optimization

SPARQL Query → Source Selection → Query Optimization → Query Execution

Find best (fastest) query execution plan

DARQ
● Dynamic Programming
● Custom Statistics
● Only bound predicates
● Bind Join

FedX
● Join Order Heuristics
● No Statistics
● Join Chains
● Bind Join

SPLENDID
● Dynamic Programming
● Extended voiD statistics
● Bind + Hash Join

Page 13: Distributed Query Processing for Federated RDF Data Management

Dynamic Programming

● iterate over all possible execution plans
● compare cost (execution time)

Join implementations: Bind Join, Hash Join

[Figure: join tree over the five triple patterns of the example query, with joins on ?drug, ?id, and ?keggDrug (twice)]

Cost Model
● $\mathrm{cost}_{\text{send-query}}$
● $\mathrm{cost}_{\text{receive-tuple}}$
● $\mathrm{card}(R(q_i))$
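
The enumeration described on this slide can be sketched as dynamic programming over subsets of triple patterns. This is only an illustration under simplifying assumptions, not the SPLENDID optimizer: `best_plan` is a hypothetical function, the constants are made-up cost values, and a single uniform join selectivity replaces real per-join estimates (so cross products are not penalized).

```python
from itertools import combinations

COST_QUERY = 100.0  # assumed cost of sending one (sub-)query
COST_TUPLE = 1.0    # assumed cost of receiving one result tuple
JOIN_SEL = 0.01     # assumed uniform join selectivity, illustration only


def best_plan(cards):
    """Return (cost, cardinality, plan) of the cheapest plan found.

    cards maps a pattern name to its estimated cardinality card(R(q_i)).
    """
    names = sorted(cards)
    # best[S] = (cost, estimated cardinality, plan tree) for pattern set S.
    # A single pattern costs one request plus the transfer of its tuples.
    best = {frozenset([n]): (COST_QUERY + cards[n] * COST_TUPLE, cards[n], n)
            for n in names}
    for size in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, size)):
            candidates = []
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    l_cost, l_card, l_plan = best[left]
                    r_cost, r_card, r_plan = best[right]
                    card = l_card * r_card * JOIN_SEL
                    # Hash join: fetch both inputs completely, join locally.
                    candidates.append(
                        (l_cost + r_cost, card, ("hash", l_plan, r_plan)))
                    if len(right) == 1:
                        # Bind join: one request per left tuple,
                        # receive only the matching right tuples.
                        bind = l_cost + l_card * COST_QUERY + card * COST_TUPLE
                        candidates.append(
                            (bind, card, ("bind", l_plan, r_plan)))
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(names)]
```

With these assumed constants, a very selective pattern joined with a very large one favors a bind join, while two small inputs favor a hash join. The subset enumeration is exponential in the number of patterns, which is acceptable only for the small basic graph patterns shown here.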

Page 14: Distributed Query Processing for Federated RDF Data Management

Cardinality Estimation

[Figure: join tree over the five triple patterns of the example query, with joins on ?drug, ?id, and ?keggDrug (twice)]

Page 15: Distributed Query Processing for Federated RDF Data Management

Cardinality Estimation (Triple Pattern)

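$$\mathrm{card}_d(s,p,o) = |d| \cdot \mathrm{sel}_d(s) \cdot \mathrm{sel}_d(p) \cdot \mathrm{sel}_d(o), \quad d \in D$$

Assuming independence of s, p, o

\begin{align*}
\mathrm{card}_d(?,?,?) &= \mathrm{voiD}_d{\to}|d| \\
\mathrm{card}_d(?,p,?) &= \mathrm{voiD}_d{\to}p \\
\mathrm{card}_d(s,?,?) &= \frac{\mathrm{voiD}_d{\to}|d|}{\mathrm{voiD}_d{\to}|s|} \\
\mathrm{card}_d(?,?,o) &= \frac{\mathrm{voiD}_d{\to}|d|}{\mathrm{voiD}_d{\to}|o|} \\
\mathrm{card}_d(s,p,?) &= \frac{\mathrm{voiD}_d{\to}p}{\mathrm{voiD}_d{\to}|s_p|} \\
\mathrm{card}_d(?,p,o) &= \frac{\mathrm{voiD}_d{\to}p}{\mathrm{voiD}_d{\to}|o_p|} \\
\mathrm{card}_d(s,?,o) &= 1 \\
\mathrm{card}_d(s,p,o) &= 1 \\
\mathrm{card}_d(?,\text{rdf:type},T) &= \mathrm{voiD}_d{\to}T
\end{align*}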
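
The case table above maps directly onto a lookup function. The following is a minimal sketch, assuming a hypothetical dictionary layout for the voiD statistics; only the three counts marked "from the slides" come from the voiD example in this deck, every other number is made up for illustration.

```python
# Hypothetical voiD-style statistics for one dataset d.
STATS = {
    "triples": 732744,                        # |d|, from the slides
    "subjects": 50000,                        # distinct subjects |s| (assumed)
    "objects": 100000,                        # distinct objects |o| (assumed)
    "types": {"chebi:Compound": 50477},       # from the slides
    "predicates": {
        "bio:formula": {"triples": 39555,     # from the slides
                        "subjects": 13185,    # |s_p| (assumed)
                        "objects": 7911},     # |o_p| (assumed)
    },
}


def card(stats, s=None, p=None, o=None):
    """Estimate the result size of a triple pattern; None marks a variable."""
    if s is not None and o is not None:
        return 1                                        # (s,?,o) and (s,p,o)
    if p is None:
        if s is not None:
            return stats["triples"] / stats["subjects"]  # (s,?,?)
        if o is not None:
            return stats["triples"] / stats["objects"]   # (?,?,o)
        return stats["triples"]                          # (?,?,?)
    if p == "rdf:type" and o is not None:
        return stats["types"].get(o, 0)                  # (?,rdf:type,T)
    pred = stats["predicates"].get(p)
    if pred is None:
        return 0                      # predicate not listed for this source
    if s is not None:
        return pred["triples"] / pred["subjects"]        # (s,p,?)
    if o is not None:
        return pred["triples"] / pred["objects"]         # (?,p,o)
    return pred["triples"]                               # (?,p,?)
```

For example, `card(STATS, p="bio:formula")` returns the predicate count directly, while a bound subject divides that count by the assumed number of distinct subjects for the predicate.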

Page 16: Distributed Query Processing for Federated RDF Data Management

Cardinality Estimation (Basic Graph Pattern)

Star Pattern (figure): ?keggDrug with edges rdf:type → kegg:Drug, purl:title → ?title, bio2rdf:xRef → rn:R01786

Path Pattern (figure): drugbank:Drug ← rdf:type ← ?drug → owl:sameAs → ?keggDrug → rdf:type → kegg:Drug

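$$\mathrm{card}^{*}_{d}(P_1 \bowtie P_2 \bowtie P_3) = \min\bigl(\mathrm{card}_d(P_1), \mathrm{card}_d(P_2)\bigr) \cdot \frac{\mathrm{voiD}_d{\to}p_3}{\mathrm{voiD}_d{\to}|s_{p_3}|}$$

$$\widetilde{\mathrm{card}}_{d,d'}(P_1 \bowtie P_2) = \mathrm{card}_d(P_1) \cdot \mathrm{card}_{d'}(P_2) \cdot \mathrm{sel}_{d,d'}(P_1 \bowtie P_2)$$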
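
The two estimates on this slide can be written as small functions. This is a sketch only: the function and parameter names are hypothetical, the inputs are assumed to come from triple-pattern estimates and voiD predicate statistics, and the join selectivity for the two-source case is taken as given.

```python
def card_star(card_p1, card_p2, p3_triples, p3_distinct_subjects):
    """First formula: min of two pattern estimates, scaled by the average
    number of triples per subject for the third pattern's predicate."""
    return min(card_p1, card_p2) * p3_triples / p3_distinct_subjects


def card_path(card_p1, card_p2, join_selectivity):
    """Second formula: product of both pattern estimates (from sources d
    and d') times the selectivity of the join between them."""
    return card_p1 * card_p2 * join_selectivity


# Made-up numbers for illustration only.
star_estimate = card_star(100, 40, p3_triples=300, p3_distinct_subjects=150)
path_estimate = card_path(100, 40, join_selectivity=0.5)
```

The first estimate is bounded by the smaller input because all patterns share the same subject, whereas the second can grow with the product of both inputs and depends entirely on the quality of the selectivity value.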

Page 17: Distributed Query Processing for Federated RDF Data Management

Query Optimization

SPARQL Query → Source Selection → Query Optimization → Query Execution

Join operators: ⋈ ?drug, ⋈B(?id), ⋈ ?keggDrug, ⋈H(?keggDrug) (B = Bind Join, H = Hash Join)

Triple patterns:
?drug drugbank:drugCategory category:micronutrient
?drug drugbank:casRegistryNumber ?id
?keggDrug rdf:type kegg:Drug
?keggDrug bio2rdf:xRef ?id
?keggDrug purl:title ?title

Page 18: Distributed Query Processing for Federated RDF Data Management

Evaluation Methodology

Compare with state-of-the-art federation systems

– Use multiple linked datasets

– With representative characteristics

– Execute 'typical' SPARQL queries

– In a reproducible benchmark setup

FedBench

Page 19: Distributed Query Processing for Federated RDF Data Management

Evaluation Results

Page 20: Distributed Query Processing for Federated RDF Data Management

Conclusion

● Federation for Linked Open Data
– Database + Semantic Web technology
– Efficient Distributed Query Processing
– Extension of voiD statistics
● Query generation for Federation Benchmarks
● Efficient statistics management in P2P networks

Page 21: Distributed Query Processing for Federated RDF Data Management

Thank You

Page 22: Distributed Query Processing for Federated RDF Data Management

VoiD Descriptions/Statistics

General Information

Basic statistics: triples = 732744

Type statistics: chebi:Compound = 50477

Predicate statistics: bio:formula = 39555

Page 23: Distributed Query Processing for Federated RDF Data Management

VoiD statistics extension

Page 24: Distributed Query Processing for Federated RDF Data Management

State of the Art

| DARQ | AliBaba | FedX | SPLENDID
Statistics | ServiceDesc | – | – | VoiD
Source Selection | Statistics (predicates) | All sources | ASK queries | Statistics + ASK queries
Query Optimization | DynProg | Heuristics | Heuristics | DynProg
Query Execution | Bind join | Bind join | Bound Join + parallelization | Bind Join + Hash Join

Page 25: Distributed Query Processing for Federated RDF Data Management

SPARQL limitations

● Query protocol
● Only SPARQL endpoints
● Endpoint limitations

– SPARQL version

– Result size

– Data rate

– Availability

Page 26: Distributed Query Processing for Federated RDF Data Management

Join Implementation

Bind Join: R1 ⋈B R2
Hash Join: R1 ⋈H R2

Example relations:

?id | ?y
1 | 42
2 | 13
3 | 20
4 | 50
5 | 3

?id | ?x
1 | 'A'
1 | 'G'
4 | 'A'
7 | 'A'
7 | 'C'
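
The difference between the two join implementations can be sketched on the example relations from this slide. This is a minimal illustration: which table is R1 and which is R2 is an assumption, and `remote_r2`, `bind_join`, and `hash_join` are hypothetical helpers in which a local list lookup stands in for a sub-query sent to a remote endpoint.

```python
from collections import defaultdict

R1 = [(1, 42), (2, 13), (3, 20), (4, 50), (5, 3)]        # (?id, ?y), assumed R1
R2 = [(1, "A"), (1, "G"), (4, "A"), (7, "A"), (7, "C")]  # (?id, ?x), assumed R2


def remote_r2(bound_id):
    """Hypothetical stand-in for a sub-query to R2's source with ?id bound."""
    return [row for row in R2 if row[0] == bound_id]


def bind_join(left, query_right):
    """Bind join: one request per left tuple, only matching tuples come back."""
    result, requests = [], 0
    for id_, y in left:
        requests += 1
        for _, x in query_right(id_):
            result.append((id_, y, x))
    return result, requests


def hash_join(left, right):
    """Hash join: both inputs are fetched completely, then joined locally."""
    table = defaultdict(list)
    for id_, y in left:
        table[id_].append(y)
    return [(id_, y, x) for id_, x in right for y in table.get(id_, [])]


bind_result, bind_requests = bind_join(R1, remote_r2)
hash_result = hash_join(R1, R2)
```

Both variants produce the same three result tuples here. The bind join needs five requests but transfers only the three matching R2 tuples, while the hash join needs one request per input and transfers all ten tuples.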

Page 27: Distributed Query Processing for Federated RDF Data Management

Join Cost Model

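Bind Join: $R(q_1) \bowtie_B R(q_2')$
Hash Join: $R(q_1) \bowtie_H R(q_2)$

$$\mathrm{cost}_{\bowtie_B}(q_1, q_2) = |R(q_1)| \cdot \mathrm{cost}_{tuple} + |R(q_1)| \cdot \mathrm{cost}_{query} + |R(q_2')| \cdot \mathrm{cost}_{tuple}$$

$$\mathrm{cost}_{\bowtie_H}(q_1, q_2) = |R(q_1)| \cdot \mathrm{cost}_{tuple} + |R(q_2)| \cdot \mathrm{cost}_{tuple} + 2 \cdot \mathrm{cost}_{query}$$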
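
A small worked example may help to compare the two cost formulas. The sketch below assumes made-up values for the per-tuple and per-query costs; the function names are hypothetical, and the cardinalities in the first example are taken from the two five-row example relations on the Join Implementation slide, where three tuples of the second relation find a join partner.

```python
def cost_bind_join(card_q1, card_q2_bound, cost_tuple, cost_query):
    """|R(q1)|*cost_tuple + |R(q1)|*cost_query + |R(q2')|*cost_tuple"""
    return (card_q1 * cost_tuple
            + card_q1 * cost_query
            + card_q2_bound * cost_tuple)


def cost_hash_join(card_q1, card_q2, cost_tuple, cost_query):
    """|R(q1)|*cost_tuple + |R(q2)|*cost_tuple + 2*cost_query"""
    return card_q1 * cost_tuple + card_q2 * cost_tuple + 2 * cost_query


def cheaper_join(card_q1, card_q2, card_q2_bound, cost_tuple, cost_query):
    """Pick the join implementation with the lower estimated cost."""
    bind = cost_bind_join(card_q1, card_q2_bound, cost_tuple, cost_query)
    hash_ = cost_hash_join(card_q1, card_q2, cost_tuple, cost_query)
    return ("bind", bind) if bind < hash_ else ("hash", hash_)


# Assumed costs: receiving a tuple costs 1, sending a query costs 10.
small_inputs = cheaper_join(5, 5, 3, cost_tuple=1, cost_query=10)
selective_left = cheaper_join(2, 100000, 4, cost_tuple=1, cost_query=10)
```

Under these assumed costs the hash join wins for two small inputs (30 versus 58), and the bind join wins when a tiny left input is joined with a very large right input (26 versus 100022).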

Page 28: Distributed Query Processing for Federated RDF Data Management

SPARQL Semi Join

Page 29: Distributed Query Processing for Federated RDF Data Management

SPLENDID Architecture

Page 30: Distributed Query Processing for Federated RDF Data Management

FedBench Datasets

● Cross Domain

● Life Science

● Linked Data

Page 31: Distributed Query Processing for Federated RDF Data Management

Data Source Selection: Requests

Page 32: Distributed Query Processing for Federated RDF Data Management

Conclusion

Linked Open Data voiD

Web-scale Query Processing

SPLENDID