Exploiting the query structure for efficient join ordering in SPARQL queries

Luiz Henrique Zambom Santana

Vinicius da Silveira Segalin

Agenda

•Paper and authors

•Background

•Problem and solution

•Example

•Algorithms

•Analysis

•Conclusions

Paper and authors

Gubichev, Andrey, and Thomas Neumann. "Exploiting the query structure for efficient join ordering in SPARQL queries." EDBT. 2014.

Extending Database Technology – Qualis A2/H-index 52

Background

• SPARQL
  • W3C standard
  • Semantic Web
  • Inspired by SQL

•Query structure

• Join ordering (similar to choosing the multiplication order in a matrix chain product)

Problem

• The join ordering problem is a fundamental challenge that has to be solved by any query optimizer

• Depending on the order of the joins, the computation time differs

• SQL solutions are not immediately capable of handling large SPARQL queries. The paper introduces a new join ordering algorithm that performs a SPARQL-tailored query simplification
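As a small illustration of why the join order matters, the sketch below enumerates every left-deep order of three scans and sums the intermediate result sizes. The cardinalities, the selectivities and the cost function `plan_cost` are assumptions made up for this example; they are not taken from the paper.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical result sizes of three triple-pattern scans (not from the paper).
card = {"A": 1000, "B": 10, "C": 100}

# Hypothetical join selectivities; 1 means the two scans share no variable.
sel = {
    frozenset("AB"): Fraction(1, 100),
    frozenset("BC"): Fraction(1, 10),
    frozenset("AC"): Fraction(1),
}

def plan_cost(order):
    """Sum of the intermediate result sizes of a left-deep plan."""
    joined = [order[0]]
    size = Fraction(card[order[0]])
    cost = Fraction(0)
    for rel in order[1:]:
        size *= card[rel]
        # Apply the selectivity against every scan already in the plan.
        for other in joined:
            size *= sel[frozenset((other, rel))]
        joined.append(rel)
        cost += size
    return cost

costs = {order: plan_cost(order) for order in permutations(card)}
best = min(costs, key=costs.get)
worst = max(costs, key=costs.get)
print(best, costs[best])
print(worst, costs[worst])
```

With these made-up numbers, the orders that start with the cross product of A and C cost 101000, while every other order costs 1100, even though all plans return the same final result.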

Problem

• Cardinality estimation is an essential part of any cost-based query optimizer

• Two different approaches:
  • RDF-3X: query compilation time (dominated by finding the optimal join order) is one order of magnitude higher than the actual execution time
  • Virtuoso 7: greedy algorithm for compilation leads to a slow run time (sub-optimal order)

Solution

• Best of both worlds:
  • A heuristic that spends a reasonable amount of time optimizing the query, and yet gets a decent join order
  • The paper presents a SPARQL-tailored query simplification procedure that decomposes the query’s join graph into star-shaped subqueries and chain-shaped subqueries

Challenges

• RDF can be very verbose
  • TPC-H Query 2 written in SPARQL contains joins between 26 index scans (as opposed to joins between 5 tables in the SQL formulation)

• Number of plans:
  • $5! = 120$ plans in SQL vs $26! \approx 4 \times 10^{26}$ in SPARQL

• Lack of schema
  • Foreign keys become structural correlations
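The plan counts on this slide can be checked directly. The sketch below counts only orderings of the inputs (n!), as the slide does; the variable names are chosen for this example.

```python
import math

# Number of join orderings when each ordering of n inputs counts as one plan.
sql_plans = math.factorial(5)      # 5 tables in the SQL formulation
sparql_plans = math.factorial(26)  # 26 index scans in the SPARQL formulation

print(sql_plans)
print(f"{sparql_plans:.2e}")
```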

Solution

• The characteristic set of a subject s defines the properties (attributes) of an entity, and thus its class (type), in the sense that subjects with the same characteristic set tend to be similar

• Hierarchical Characterization:
  1. $H_0$ is the set of all characteristic sets of $R$
  2. $H_i = \{\, \arg\min_{C \subset S \,\wedge\, |C| = |S| - 1} \mathrm{cost}(C) \mid S \in H_{i-1} \,\}$, that is, $H_i$ consists of the subsets $C$ of sets from $H_{i-1}$ that minimize $\mathrm{cost}(C)$
  3. $\forall S \in H_k: |S| = 2$
  4. Every $S \in H_{i-1}$ stores a pointer to its cheapest subset $C \in H_i$
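The two definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the toy triples are invented, and `cost` is an assumed stand-in (the number of subjects that carry all predicates of a set) for whatever cost function the optimizer really uses. Ties are broken alphabetically only to keep the result deterministic.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object); the data is made up for illustration.
triples = [
    ("s1", "hasName", "n1"), ("s1", "bornIn", "c1"),
    ("s1", "livedIn", "c2"), ("s1", "created", "w1"),
    ("s2", "hasName", "n2"), ("s2", "bornIn", "c1"), ("s2", "livedIn", "c1"),
    ("s3", "hasName", "n3"), ("s3", "bornIn", "c2"),
]

def characteristic_sets(triples):
    """Map every subject to the set of predicates it occurs with."""
    preds = defaultdict(set)
    for s, p, _ in triples:
        preds[s].add(p)
    return {s: frozenset(ps) for s, ps in preds.items()}

char_sets = characteristic_sets(triples)

def cost(pred_set):
    """Assumed cost: number of subjects that have every predicate in pred_set."""
    return sum(1 for cs in char_sets.values() if pred_set <= cs)

def hierarchical_characterisation(sets):
    """Link every set with more than two predicates to its cheapest subset."""
    cheapest = {}
    level = {s for s in sets if len(s) > 2}  # sets of H_0 that can still shrink
    while level:
        next_level = set()
        for s in level:
            # All subsets C of S with |C| = |S| - 1.
            candidates = [s - {p} for p in s]
            best = min(candidates, key=lambda c: (cost(c), sorted(c)))
            cheapest[s] = best  # the "pointer" from S to its cheapest subset
            if len(best) > 2:
                next_level.add(best)
        level = next_level
    return cheapest

cheapest = hierarchical_characterisation(set(char_sets.values()))
for s, c in cheapest.items():
    print(sorted(s), "->", sorted(c))
```

Each level removes exactly one predicate, so the chain of pointers from any set ends at a set of size 2, as condition 3 requires.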

Algorithm 1 (part. 1)

• Line 2: S = [{created, bornIn, livedIn, hasName}, {bornIn, livedIn, hasName}, ...]

• Line 8: initializes the Banker's iteration, i.e. enumerates the possible predicate sets from the smallest to the biggest

Algorithm 1 (part. 2)

• Line 12: guarantees that $S_2$ is smaller than $S_1$

• Lines 15-16: find the subsets that have smaller cost

• Cost
  • The Banker’s iteration potentially enumerates all the subsets of all predicates in the dataset, but in practice it stops relatively early, since it is always bounded by the largest set in Sets
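The enumeration order described above (smallest sets first, bounded by the largest set in Sets) can be sketched with `itertools.combinations`. The function name `bankers_subsets` and the example sets are assumptions for this illustration; the sketch shows only the enumeration order, not Algorithm 1 itself.

```python
from itertools import combinations

def bankers_subsets(predicates, max_size):
    """Yield non-empty subsets in order of increasing size, up to max_size."""
    ordered = sorted(predicates)
    for size in range(1, min(max_size, len(ordered)) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)

# Example sets taken from the slide's "Line 2" illustration.
sets = [
    frozenset({"created", "bornIn", "livedIn", "hasName"}),
    frozenset({"bornIn", "livedIn", "hasName"}),
]
all_predicates = set().union(*sets)
largest = max(len(s) for s in sets)  # the bound on the enumeration

subsets = list(bankers_subsets(all_predicates, largest))
print(len(subsets))
```

Because no characteristic set is larger than `largest`, bigger subsets never need to be generated, which is why the iteration stops early on datasets with many predicates.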

Algorithm 2 (part. 1/2)

• Objective: finding the optimal join order in (sub)queries of the form: select * where { ?s p1 ?o1 . ... ?s pk ?ok }

• Idea: extract the part of the Hierarchical Characterisation of the dataset starting with the set S

• Input: star-shaped graph

• Output: order of the joins

• Lines 1-9:
  • While |S| > 2, look up the cheapest subset of S in the Hierarchical Characterisation and push the remaining (most expensive) predicate to the front of O


Algorithm 2 (part. 2/2)

• The first part finds the optimal order for star-shaped queries in time linear in the size of the graph

• However, it does not find the optimal solution if the query has constants: select * where { ?s p1 “Berlin” . ... ?s pk ?ok }

• Then:
  • Lines 12-14: only one of the bound objects is in a triple with a key predicate, i.e. the entire star query is a lookup of properties of a specific entity
  • Lines 15-16: otherwise (many objects are keys), keep pushing down the constants in the join tree and stop when the cost of the corresponding index scan is bigger than the cost of the join on that level of the tree
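The first part of Algorithm 2 (the case without constants) can be sketched as a walk along the "cheapest subset" pointers. The `cheapest` table below is hand-written and hypothetical; in the paper it would come from the precomputed Hierarchical Characterisation. The handling of bound objects (Lines 12-16) is not covered by this sketch.

```python
def star_join_order(predicates, cheapest):
    """Order the predicates of a star query by following cheapest-subset pointers.

    The predicate dropped at each step is pushed to the front of the tail,
    so the most expensive predicate ends up last in the returned order.
    """
    s = frozenset(predicates)
    tail = []
    while len(s) > 2:
        subset = cheapest[s]        # pointer stored in the hierarchy
        (left_out,) = s - subset    # exactly one predicate is removed per level
        tail.insert(0, left_out)
        s = subset
    # The remaining pair is joined first; its internal order is arbitrary here.
    return sorted(s) + tail

# Hypothetical pointers for one characteristic set (values are assumptions).
full = frozenset({"created", "bornIn", "livedIn", "hasName"})
three = frozenset({"created", "bornIn", "hasName"})
pair = frozenset({"created", "bornIn"})
cheapest = {full: three, three: pair}

order = star_join_order(full, cheapest)
print(order)
```

The loop runs once per predicate of the star, which matches the slide's remark that the order is found in time linear in the size of the graph.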


Algorithm 3 (part. 1/4)

• Objective: ordering joins in general SPARQL queries, e.g.:
  (s1, hasName, "Marie Curie"),
  (s1, bornIn, s2),
  (s2, label, "Warsaw"),
  (s2, locatedIn, "Poland")

• Problem: s2 links a person to a city, corresponding to a "foreign key", but RDF does not require any schema. Knowledge of such dependencies is extremely useful for the query optimizer: without it, the optimizer has to assume independence between two entities linked via the bornIn predicate, thus almost inevitably underestimating the selectivity of the join of the corresponding triple patterns

• Thus, it uses Characteristic Pairs (Paar Charakteristisch) in order to discover this kind of relation, where: $P_C(S_C(s), S_C(o)) = \{\, (S_C(s), S_C(o), p) \mid S_C(o) \neq \emptyset \,\wedge\, \exists p : (s, p, o) \in R \,\}$

• The CP is an in-memory structure. In theory, $n$ distinct characteristic sets can yield up to $n^2$ characteristic pairs, but in real datasets only a few pairs appear frequently enough to be stored. For example, in the YAGO-Facts dataset, of the 250000 existing pairs only 5292 appear more than 100 times. The frequent characteristic pairs for this dataset therefore consume less than 16 KB.
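The definition of characteristic pairs can be sketched as a counting pass over the triples. The function name `characteristic_pairs` and the `min_count` threshold parameter are assumptions for this example; the triples are the ones from the slide's Marie Curie query, treated here as data.

```python
from collections import Counter, defaultdict

def characteristic_pairs(triples, min_count=1):
    """Count (S_C(s), S_C(o), p) for every triple whose object is also a subject."""
    preds = defaultdict(set)
    for s, p, _ in triples:
        preds[s].add(p)
    char_set = {s: frozenset(ps) for s, ps in preds.items()}

    counts = Counter()
    for s, p, o in triples:
        # S_C(o) is non-empty only when o itself occurs as a subject.
        if o in char_set:
            counts[(char_set[s], char_set[o], p)] += 1
    # Keep only the pairs that are frequent enough to be worth storing.
    return {pair: n for pair, n in counts.items() if n >= min_count}

triples = [
    ("s1", "hasName", "Marie Curie"),
    ("s1", "bornIn", "s2"),
    ("s2", "label", "Warsaw"),
    ("s2", "locatedIn", "Poland"),
]
pairs = characteristic_pairs(triples)
for (subj_set, obj_set, pred), n in pairs.items():
    print(sorted(subj_set), pred, sorted(obj_set), n)
```

Raising `min_count` mimics the pruning described above, where only pairs that appear often enough are kept in memory.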


Algorithm 3 (part. 2/4)

• Idea: decompose the query into star-shaped subqueries connected by chains, and collapse the subqueries into meta-nodes

• Input: SPARQL graph

• Output: join ordering for this graph

• Lines 11-24: start by clustering the query into disjoint star-shaped subqueries around subjects
  • Line 13: order the triple patterns in the query by subject
  • Line 15: group triple patterns with identical subjects, since they potentially form star-shaped subqueries
  • Lines 20-23: find stars around objects
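The subject-centred clustering step (order by subject, then group identical subjects) can be sketched with `sorted` and `itertools.groupby`. The function name `subject_stars` is an assumption, and the object-centred stars of Lines 20-23 are left out of this sketch.

```python
from itertools import groupby
from operator import itemgetter

def subject_stars(patterns):
    """Group triple patterns that share a subject into candidate star subqueries."""
    ordered = sorted(patterns, key=itemgetter(0))  # order the patterns by subject
    return {subj: list(group) for subj, group in groupby(ordered, key=itemgetter(0))}

# The example query from the slides, written as (subject, predicate, object).
patterns = [
    ("?s1", "hasName", "Marie Curie"),
    ("?s2", "label", "Warsaw"),
    ("?s1", "bornIn", "?s2"),
    ("?s2", "locatedIn", "Poland"),
]
stars = subject_stars(patterns)
for subj, star in stars.items():
    print(subj, [p for _, p, _ in star])
```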


Algorithm 3 (part. 3/4)

• Lines 4-5: for every star, the algorithm adds a new meta-node to the query graph and removes the intra-star edges

• Lines 6-7: the plan for the star subquery is computed using the Hierarchical Characterisation (Algorithm 2) and added to the DP table along with the meta-node

• Line 8: After all the star subqueries have been optimized, we add the edges between meta-nodes to the query graph, if the original graph has edges between the corresponding star sub-queries

Algorithm 3 (part. 4/4)

• Line 10: the selectivities associated with these edges are computed using the Characteristic Pairs synopsis, and the regular Dynamic Programming algorithm starts working on this simplified graph

• In the following figure, simplifying the graph from 8 nodes to 3 nodes gives a reduction from 8! = 40320 plans to 3! = 6 plans

• This algorithm is also linear in the size of the input graph
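The collapse into meta-nodes and the resulting reduction of the search space can be sketched as follows. The `stars` dictionary is a hand-written, hypothetical clustering of the example query, and `meta_edges` is a name chosen for this illustration; selectivity estimation and the DP step itself are not shown.

```python
import math

# Hypothetical result of the clustering step: one star per subject variable.
stars = {
    "?s1": [("?s1", "hasName", "Marie Curie"), ("?s1", "bornIn", "?s2")],
    "?s2": [("?s2", "label", "Warsaw"), ("?s2", "locatedIn", "Poland")],
}

def meta_edges(stars):
    """Connect two meta-nodes when a pattern of one star points at the other's subject."""
    edges = set()
    for subj, patterns in stars.items():
        for _, pred, obj in patterns:
            if obj in stars and obj != subj:
                edges.add((subj, obj, pred))
    return edges

edges = meta_edges(stars)
print(edges)

# Search space before and after the simplification in the slide's example.
print(math.factorial(8), "->", math.factorial(3))
```

The dynamic programming step then only has to order the meta-nodes, since the order inside each star has already been fixed.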

Analysis

Conclusions

• The problem is very similar to the matrix chain product ordering problem

• The query simplification technique reduces the search space size by simplifying the query graph before the DP algorithm starts

• The timing analysis shows how important the complexity study is

• There is no complexity analysis, although the paper mentions DP and greedy algorithms throughout

• The tests did not turn the cache off

• The paper does not cover OPTIONAL clauses of SPARQL, which are equivalent to left outer joins and cannot be freely reordered with other joins
