federated query processing universidad simón bolívar...

Dagstuhl Seminar “Federated Semantic Data Management”, 26-30 June 2017

Federated Query Processing over RDF Graphs

Maria-Esther VidalFraunhofer IAIS, Germany

Universidad Simón Bolívar, Venezuela

1

Motivating Example

Vessels Vessel Observations

Ports

Vessels that have arrived at port of Funchal on 2017-06-21

2

Motivating ExampleDistributed Database

Vessels

Vessel Observations

Ports

mmsi country type

v1 ES tanker

v2 PT tanker

v3 US shipmmsi lat long date

v1 -26.3 56.9 2017-06-21

v1 30.8 70.8 2017-06-23

v2 -26.3 56.9 2017-06-23

v3 -26.3 56.9 2017-06-21

port lat long name

p1 -26.3 56.9 Funchal

p2 30.8 70.8 Lisbon

p3 40.9 89.8 Vigo

Interrelated databases distributed over a computer networkComponents are designed and tuned to work together.

3

Dimensions of Distributed Database Systems

Vessels

Vessel Observations

Ports

Greece

Canada World-Port Index USA

Distribution: Physical distribution of the data over multiple sites.

4


Vessels

Vessel Observations

Ports

Maritime Traffic

Evergreen Marine World-Port Index

Autonomy: Distribution of the control; degree to which individual systems can operate independently.

5


Vessels

Vessel Observations

Ports

Maritime Traffic


CSV

JSON

RDF Heterogeneity: Different forms ranging from hardware, differences in network protocols, data models, query languages.

6

Dimensions of Distributed Database Systems [1]Distribution

AutonomyHeterogeneity

7[1] Tamer Ozsu and Patrick Valduriez. Principles of Distributed Database Systems (Third Edition). Springer, 2011.

Agenda1) Distributed Data Management Systems

a) Client-Server Systemsb) Peer-to-Peer Systemsc) Federated Data Management Systems

2) Challenges in Federated Data Management3) Distributed SPARQL Systems 4) Conclusions

8

Distributed Data Management Systems

9

Client-Server SystemsDistribution


Client-Server Systems: ● Clients run user

applications and interfaces.

● Servers run data management tasks, e.g., query processing and storage.

10

Client-Server Systems

Server Engine

Vessels

Vessel Observations

Ports

Client

Queries Query Answers

11

Peer-to-Peer SystemsDistribution


Peer-to-Peer Systems: ● Massive distribution● May used different data

models.● Each system manages a

different dataset. ● Peers can

communicate.

12

Peer-to-Peer Systems

Vessels

Vessel Observations

Ports

Maritime Traffic


Peer Peer

Peer

Peers communicate.

13

Federated Data Management (FDM) SystemsDistribution


FDM Systems: ● Fully autonomous and

have no concept of cooperation.

● May use different data models.

● Each system manages a different database.

14

Federated Data Management (FDM) Systems

Vessels

Vessel Observations

Ports

Maritime Traffic


Federated Engine

CSV RDF

JSON

15

Challenges in Federated Data Management

16

Challenges for Query Processing in FDM

17

Given a query Q in a formal language, i.e., SPARQL● Identify the relevant data sources for Q (Source Selection)● Decompose Q into subqueries on relevant data sources (Query Decomposition)● Plan evaluation of subqueries against relevant data sources (Query Planning)● Merge data collected from relevant data sources (Query Execution)

Challenges for Query Processing in FDMVessels that have arrived at port of Funchal on 2017-06-21 SELECT ?mmsi ?name ?flagWHERE { ?vessel bdo:name ?name. ?vessel bdo:mmsi ?mmsi. ?vessel bdo:flag ?flag. ?observ bdo:lat ?lat. ?observ bdo:long ?long. ?observ bdo:date "2017-06-21" . ?observ bdo:mmsi ?mmsi. ?port bdo:lat ?lat. ?port bdo:long ?long. ?port bdo:name “Funchal” }

18

Example

SELECT ?mmsi ?name ?flagWHERE { ?vessel bdo:name ?name. ?vessel bdo:mmsi ?mmsi. ?vessel bdo:flag ?flag. ?observ bdo:lat ?lat. ?observ bdo:long ?long. ?observ bdo:date "2017-06-21" . ?observ bdo:mmsi ?mmsi. ?port bdo:lat ?lat. ?port bdo:long ?long. ?port bdo:name “Funchal” }

19

SELECT ?mmsi ?name ?flagWHERE { SERVICE C1 { ?vessel bdo:name ?name. ?vessel bdo:mmsi ?mmsi. ?vessel bdo:flag ?flag.} SERVICE C2 { ?observ bdo:lat ?lat. ?observ bdo:long ?long. ?observ bdo:date "2017-06-21" . ?observ bdo:mmsi ?mmsi. } SERVICE C3 { ?port bdo:lat ?lat. ?port bdo:long ?long. ?port bdo:name “Funchal” }}


20


21


22



23


24



?vessel bdo:name ?name. ?vessel bdo:mmsi ?mmsi.?vessel bdo:flag ?flag.

?observ bdo:lat ?lat.?observ bdo:long ?long.?observ bdo:date "2017-06-21" .?observ bdo:mmsi ?mmsi.

?port bdo:lat ?lat.?port bdo:long ?long.?port bdo:name “Funchal”

25


26


Challenges for Query Processing in FDM?vessel bdo:name ?name. ?vessel bdo:mmsi ?mmsi.?vessel bdo:flag ?flag.



CSV

RDF

JSON

MongoDB

MongoDB

Virtuoso

27

SELECT name, mmsi, flagFROM VESSELS

SELECT lat, long, mmsiFROM OBSERVATIONSWHERE date=”2017-06-21”

SELECT ?lat, ?long, ?mmsiWHERE {?port bdo:lat ?lat.?port bdo:long ?long.?port bdo:name “Funchal”}


?vessel bdo:name ?name. ?vessel bdo:mmsi ?mmsi.?vessel bdo:flag ?flag.



28

MappingintoSPARQL answers

MappingintoSPARQL answers

Operators consume tuples from children and pass output tuples to their parents.

Complexity of FDM Query Processing Tasks

NP-Hard Finding Optimal Plan for Distributed Data Sources [1]

Evaluation of SPARQL Queries is NP-Complete [2]

NP-Hard Finding Optimal Query Decomposition over Distributed RDF Data Sources [3]

29

[1] Tamer Ozsu and Patrick Valduriez. Principles of Distributed Database Systems (Third Edition). Springer, 2011.[2] Jorge Pérez, Marcelo Arenas, Claudio Gutierrez: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3): 16:1-16:45 (2009)[3] Maria-Esther Vidal, Simón Castillo, Maribel Acosta, Gabriela Montoya, Guillermo Palma: On the Selection of SPARQL Endpoints to Efficiently Execute Federated SPARQL Queries. Trans. Large-Scale Data- and Knowledge-Centered Systems 25: 109-149 (2016)

Traditional Federated Query Processing

30

Source Selection and Query Decomposition

Query Optimizer

Query Execution Engine

Statistics Generator

CatalogTraditional Optimize-Then-ExecuteArchitecture

Query

Ideal Federated Management Systems

● Systems able to change their behavior by learning behavior of data providers.

● Receive information from the environment.● Use up-to-date information to change their behavior.● Keep iterating over time to adapt their behavior based

on the environment conditions.

31

Adaptive Query Processing [4]

Adaptivity is needed:▪ Misestimated or missing statistics.▪ Unexpected correlations.▪ Unpredictable costs.▪ Dynamically changing data, workload, and source availability.▪ Changes at rates at which tuples arrive from sources

• Initial Delays.• Slow Delivery.• Bursty Arrivals.

32[4 ] Amol Deshpande, Zachary G. Ives, Vijayshankar Raman: Adaptive Query Processing. Foundations and Trends in Databases, (2007)

Adaptive Federated Query Processing

33

Source Selection and Query Decomposition

Query Optimizer

Query Execution Engine

Statistics Generator

Catalog

Query

Re-optimize

Re-optimize the original plan on-the-fly according to the source conditions.

Levels of Adaptation

34

AdaptationAdaptation Levels

Source Selection

Query Execution

Adaptation Levels

35

Source Selection: searching strategies to select the sources for answering a query according to the real-time source conditions:

● Schema changes● Source availability ● Data distribution changes

Adaptation Levels

36

Query Execution: strategies to adapt query processing to the environment conditions. Adaptation can be at different granularities:

● Fine-grained adaptation of small processes, e.g., ○ Per tuple

● Coarse-grained is attempted for large processes, e.g.,○ Several queries are executed under certain

assumptions about the conditions of the environment

Adaptive Query Processing Strategies

Adaptive Join Processing (Intra-Operator):● Adaptivity performed at tuple level. ● Query operators are able to adapt to the environment

changes, even in the context of a fixed query plan.Non-Pipelined Query Execution (Inter-Operator):● Sub-queries are re-scheduled based on uncertainties:

○ Execution cost of remote access, ○ Size of intermediate results, and ○ Unexpected delays.

37

Adaptive Query Processing Strategies

Adaptive Join Processing (Intra-Operator):● Adaptivity performed at tuple level. ● Query operators are able to adapt to the environment

changes, even in the context of a fixed query plan.Non-Pipelined Query Execution (Inter-Operator):● Sub-queries are re-scheduled based on uncertainties:

○ Execution cost of remote access, ○ Size of intermediate results, and ○ Unexpected delays.

38

Distributed SPARQL Systems

39

SPARQL Web-based Interfaces

40

SPARQL Endpoints: RESTful services that accept queries over HTTP written in the SPARQL query language and respecting the SPARQL protocol.

SELECT ?long WHERE { ?port <http://bigdataocean.eu/name> <http://bigdataocean.eu/Funchal> ?port <http://bigdataocean.eu/long> ?long.}

http://bigdataocean.eu/sparql?query=SELECT%20%3Flong%20WHERE%20%7B%20%3Fport%20%3Chttp%3A%2F%2Fbigdataocean.eu%2Fproperty%2Fname%3E%20%3Chttp%3A%2F%2Fbigdataocean.eu%2Fresource%2FFunchal%3E.%20%3Fport%20%3Chttp%3A%2F%2Fbigdataocean.eu%2Fproperty%2Flong%3E%20%3Flong%20%7D

http://bigdataocean.eu/sparql?query=

http://bigdataocean.eu/sparql?query=

SPARQL 1.1 Federated Extension

41

SERVICE <http://bigdataocean.de/sparql> {SELECT ?long WHERE { ?port <http://bigdataocean.eu/name> <http://bigdataocean.eu/Funchal> ?port <http://bigdataocean.eu/long> ?long.}}

Q1

Q0

The result of evaluating Q0 corresponds to the result of evaluating Q1 in the RDF dataset accessible through the endpoint <http://bigdataocean.de/sparql>

Carlos Buil Aranda, Marcelo Arenas, Óscar Corcho, Axel Polleres: Federating queries in SPARQL 1.1: Syntax, semantics and evaluation. J. Web Sem. 18(1): 1-17 (2013)

http://bigdataocean.de

SPARQL 1.1 Federation Extension

42Carlos Buil Aranda, Marcelo Arenas, Óscar Corcho, Axel Polleres: Federating queries in SPARQL 1.1: Syntax, semantics and evaluation. J. Web Sem. 18(1): 1-17 (2013)

SPARQL Web-based Interfaces

43

Triple Pattern Fragment:

Web-based access interface that allows for evaluating a triple pattern over an RDF dataset.

“Accessing a TPF interface is equivalent to accessing a SPARQL endpoint with queries that contain a single triple pattern only” [5]

[5] Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck, Pieter Colpaert: Triple Pattern Fragments: A low-cost knowledge graph interface for the Web. J. Web Sem.( 2016)

Client-Server SPARQL Query Engines [5]

44

Server of Triple Pattern Fragments (TPFs)

Clients of Triple Pattern Fragments (TPFs)

TPF: Web-based interface to execute SPARQL triple patterns ?vessel bdo:mmsi ?mmsi

TPF Clients: Execute SPARQL queries over a TPF Server

SPARQL Query Q

[5] Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck, Pieter Colpaert: Triple Pattern Fragments: A low-cost knowledge graph interface for the Web. J. Web Sem.( 2016)

Adaptive Client for Triple Pattern Fragments

45

SPARQL Query Q

Query Optimizer

Adaptive Engine

Routing Policies

Input

Optimized plan

Results for QOutput

TPF Server

nLDE [6]: ● Optimization techniques

tailored for TPFs● Autonomous Eddies● Routing policies for

SPARQL

[6] Maribel Acosta, Maria-Esther Vidal: Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data. International Semantic Web Conference (2015)

Peer-to-Peer SPARQL Query Engines

46

SPARQL Query Q

Avalanche SPARQL Endpoint

Endpoint Directoryor Search Engine

Request Statistics

Distributed join

Avalanche [7]: ● Endpoints may

dereference data from other endpoints.

● On-line statistical information about relevant data sources.

● Queries are executed in a concurrent and distributed fashion.

[7] Cosmin Basca, Abraham Bernstein: Querying a messy web of data with Avalanche. J. Web Sem. (2014)

Federated SPARQL Query Engines

47

SPARQL endpoints: Web-access interfaces that allow for querying RDF data following SPARQL protocol.


48

Distribution



49

Federated SPARQL Query Engine

ANAPSID[8]

SPLENDID [9]

Distribution


[10]

[16]


50

Federated SPARQL Query Engine

Distribution


LILAC[11] FEDRA[12]Fed-DESATUR[3]

MULDER[14]

Extensions

DAW[13]HIBISCUS[15]

ANAPSID[8]

SPLENDID [9]

[10]

[16]

Adaptivity During Source Selection

51

Fine-Grained Adaptivity

ANAPSID SPLENDID

Coarse-Grained Adaptivity

No Adaptivity

Fed-DESATURMULDERDAW

HIBISCUS

LILAC FEDRA

Adaptivity During Source Selection

52


Coarse-Grained Adaptivity

No Adaptivity

MULDER

Adaptivity During Source Selection-FedX


53

For triple pattern, (s p o): A query is sent to the SPARQL endpoints for identifying relevant sources



54

Exclusive groups:Subqueries are executed by only one source

sq1sq2

sq3

sq4

sq5


55

sq1

sq2

sq3

sq4 sq5

Exclusive groups are executed first in the plan

Adaptivity During Source Selection-Mulder

56

Sources are described in terms of Classes and properties.

Adaptivity During Source Selection-Mulder

57

Sources are described in terms of Classes and properties.

Source descriptions are exploited during query decomposition.


?port bdo:lat ?lat .?port bdo:long ?long . ?port bdo:name "Funchal"

?vessel bdo:name ?name .?vessel bdo:mmsi ?mmsi .?vessel bdo:flag ?flag .

Query Planning-Mulder

?observation bdo:lat ?lat .?observation bdo:long ?long .?observation bdo:date "2017-06-21" .?observation bdo:mmsi ?vessel .

Empirical Evaluation

59

FedBench Benchmark:● Ten RDF datasets (314K to 44M)

25 Queries:● Cross Domain (CD)● Linked Data (LD)● Life Science (LS)

Ten Complex (C) queries

Completeness versus Execution Time


60






61





Adaptivity is a costly task

Adaptivity During Query Execution

62


ANAPSIDSPLENDID

No Adaptivity

Adaptivity During Query Execution

63


ANAPSIDSPLENDID

No Adaptivity

Fed-DESATURMULDER

DAW HIBISCUS

LILAC FEDRA

Intra-Operator Adaptivity

Intra-Operator Strategies:

● SPARQL operators able to detect when sources become blocked or data traffic is bursty

● Opportunistically produce results as quickly as data arrives from the sources

● Results are produced incrementally

64

JOIN

Inter-Operator AdaptivityInter-Operator Strategies:

● Produce an answer as soon as it is computed

● Can keep producing intermediate results even when data a source becomes blocked

65

JOIN JOIN

JOIN

Inter-Operator AdaptivityInter-Operator Strategies:

● Produce an answer as soon as it is computed

● Can keep producing intermediate results even when data a source becomes blocked

66

JOIN

JOIN

JOIN


67






68






69






70






71





Adaptivity is a costly task

Adaptive Query Processing

72

No Inter-Operator Strategy

Inter-Operator Strategy

Delays simulated with a Gamma distribution ( =1, =0.3)

Twenty-five queries against DBpedia; basic graph patterns with up to 15 triple patterns. Adaptive query processing strategies are able to speed up query execution in presence of delays

[6] M. Acosta, M.E. Vidal: Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data. ISWC 2015

Data Replication-Aware

73

Replication-Aware

ANAPSID

SPLENDID

No Replication

Fed-DESATURMULDER HIBISCUS

LILAC

FEDRA

DAW

Experimental Study

74

Each dataset is accessible through 10 endpoints. Each endpoint has data for 100 random queries.The same fragment is replicated in at least 3 endpointsWatDiv datasets: 50,000 queriesRest of the datasets: 10,000 queries

Diseasome: 72,445 SWDF: 198,797DBpedia Geo: 1,900,004LinkedMDB: 3,579,610WatDiv1: 104,532WatDiv100: 10,934,518

[11] Gabriela Montoya, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal: Decomposing federated queries in presence of replicated fragments. J. Web Sem. (2017)

Experimental Study

75

Each dataset is accessible through 10 endpoints. Each endpoint has data for 100 random queries.The same fragment is replicated in at least 3 endpointsWatDiv datasets: 50,000 queriesRest of the datasets: 10,000 queries

Diseasome: 72,445 SWDF: 198,797DBpedia Geo: 1,900,004LinkedMDB: 3,579,610WatDiv1: 104,532WatDiv100: 10,934,518

Experimental Study

76

Consider replication allows for producing complete answers while the execution time is reduced

Open Issues

77

Distribution


Distribution


Diverse types of heterogeneity:Data models, query capabilities,availability.

Open Issues

● Formal Definition

● Access Control● Security ● Privacy

78

● Data Evolution● Synchronization ● Big Data

Conclusions

79

References[1] Tamer Ozsu and Patrick Valduriez. Principles of Distributed Database Systems (Third Edition). Springer (2011)[2] Jorge Pérez, Marcelo Arenas, Claudio Gutierrez: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3): 16:1-16:45 (2009)[3] Maria-Esther Vidal, Simón Castillo, Maribel Acosta, Gabriela Montoya, Guillermo Palma: On the Selection of SPARQL Endpoints to Efficiently Execute Federated SPARQL Queries. Trans. Large-Scale Data- and Knowledge-Centered Systems 25: 109-149 (2016)[4 ] Amol Deshpande, Zachary G. Ives, Vijayshankar Raman: Adaptive Query Processing. Foundations and Trends in Databases, (2007)[5] Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck, Pieter Colpaert: Triple Pattern Fragments: A low-cost knowledge graph interface for the Web. J. Web Sem.( 2016)[6] Maribel Acosta, Maria-Esther Vidal: Networks of Linked Data Eddies: An Adaptive Web Query Processing Engine for RDF Data. International Semantic Web Conference (2015)

80

References[7] Cosmin Basca, Abraham Bernstein: Querying a messy web of data with Avalanche. J. Web Sem. (2014)[8] Maribel Acosta, Maria-Esther Vidal, Tomas Lampo, Julio Castillo, Edna Ruckhaus: ANAPSID: An Adaptive Query Processing Engine for SPARQL Endpoints. International Semantic Web Conference (2011)[9] Olaf Görlitz, Steffen Staab: SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. COLD (2011)[10] Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, Michael Schmidt: FedX: Optimization Techniques for Federated Query Processing on Linked Data. International Semantic Web Conference (2011)[11] Gabriela Montoya, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal: Decomposing federated queries in presence of replicated fragments. J. Web Sem. (2017)[12] Gabriela Montoya, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal: Federated SPARQL Queries Processing with Replicated Fragments. International Semantic Web Conference (2015)

81

References[13] Muhammad Saleem, Axel-Cyrille Ngonga Ngomo, Josiane Xavier Parreira, Helena F. Deus, Manfred Hauswirth: DAW: Duplicate-AWare Federated Query Processing over the Web of Data. International Semantic Web Conference (2013)[14] Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami, Maria-Esther Vidal, Sören Auer: MULDER: Querying the Linked Data Web by Bridging RDF Molecule Templates. International Conference on Database and Expert Systems Applications (2017)[15] Muhammad Saleem, Axel-Cyrille Ngonga Ngomo: HiBISCuS: Hypergraph-Based Source Selection for SPARQL Endpoint Federation. Extended Semantic Web Conference (2014)[16] SemaGrow: Optimizing federated SPARQL queries Angelos Charalambidis, Antonis Troumpoukis and Stasinos Konstantopoulos In Proceedings of the 11th International Conference on Semantic Systems (SEMANTiCS 2015)

82

federated query processing universidad simón bolívar...

Documents