Exploiting the query structure for efficient join ordering in SPARQL queries
Luiz Henrique Zambom Santana
Vinicius da Silveira Segalin
Agenda
•Paper and authors
•Background
•Problem and solution
•Example
•Algorithms
•Analysis
•Conclusions
Paper and authors
Gubichev, Andrey, and Thomas Neumann. "Exploiting the query structure for efficient join ordering in SPARQL queries." EDBT. 2014.
Extending Database Technology – Qualis A2/H-index 52
Background
• SPARQL
  • W3C standard
  • Semantic Web
  • Inspired by SQL
• Query structure
• Join ordering (similar to matrix product)
Problem
• The join ordering problem is a fundamental challenge that has to be solved by any query optimizer
• Depending on the order of the joins, the computation time differs
• SQL solutions are not immediately capable of handling large SPARQL queries. The paper introduces a new join ordering algorithm that performs a SPARQL-tailored query simplification
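To make the effect of join order concrete, here is a minimal Python sketch. The data, the `hash_join` helper and the variable names are invented for illustration and are not taken from the paper. Both orders return the same answer, but the size of the intermediate result differs by two orders of magnitude.

```python
def hash_join(left, right, var):
    """Join two lists of variable bindings (dicts) on the shared variable `var`."""
    index = {}
    for row in right:
        index.setdefault(row[var], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[var], [])]

# Toy data (invented): 1000 people, 100 cities, person i is born in city i % 100.
has_name = [{"p": i, "name": f"person{i}"} for i in range(1000)]
born_in = [{"p": i, "c": i % 100} for i in range(1000)]
warsaw = [{"c": 7}]  # bindings of a selective pattern such as (?c label "Warsaw")

# Order A: join the two large patterns first, filter by city last.
inter_a = hash_join(has_name, born_in, "p")
result_a = hash_join(inter_a, warsaw, "c")

# Order B: start from the selective pattern.
inter_b = hash_join(warsaw, born_in, "c")
result_b = hash_join(inter_b, has_name, "p")

print(len(inter_a), len(inter_b))  # 1000 10
```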
Problem
• Cardinality estimation is an essential part of any cost-based query optimizer
• Two different approaches:
  • RDF-3X: query compilation time (dominated by finding the optimal join order) is one order of magnitude higher than the actual execution time
  • Virtuoso 7: greedy algorithm for compilation leads to a slow run time (sub-optimal order)
Solution
• Best of both worlds:
  • A heuristic that spends a reasonable amount of time optimizing the query, and yet gets a decent join order
  • The paper presents a SPARQL-tailored query simplification procedure that decomposes the query’s join graph into star-shaped subqueries and chain-shaped subqueries
Challenges
• RDF can be very verbose
  • TPC-H Query 2 written in SPARQL contains joins between 26 index scans (as opposed to joins between 5 tables in the SQL formulation)
• Number of plans:
  • 5! = 120 plans in SQL vs 26! ≈ 4 × 10^26 in SPARQL
• Lack of schema
  • Foreign keys become structural correlations
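The plan counts above can be checked with a couple of lines of Python. This counts join orders as permutations of the inputs, which is how the slide counts them; the variable names are only illustrative.

```python
import math

sql_plans = math.factorial(5)      # 5 tables in the SQL formulation
sparql_plans = math.factorial(26)  # 26 index scans in the SPARQL formulation

print(sql_plans)              # 120
print(f"{sparql_plans:.2e}")  # 4.03e+26
```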
Solution
• The characteristic set of a subject s is the set of properties (attributes) of that entity; it defines the entity's class (type), in the sense that subjects with the same characteristic set tend to be similar
• Hierarchical Characterization:
  • 1. H_0 is the set of all characteristic sets of R
  • 2. H_i = {argmin_{C ⊂ S ∧ |C| = |S|−1} cost(C) | ∀ S ∈ H_{i−1}}, that is, H_i consists of the subsets C of sets from H_{i−1} that minimize cost(C)
  • 3. ∀ S ∈ H_k: |S| = 2
  • 4. Every S ∈ H_{i−1} stores a pointer to its cheapest subset C ∈ H_i
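The definition above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, and the cost function used here (the number of subjects whose characteristic set contains C) is only a stand-in for the paper's cost model.

```python
from collections import defaultdict
from itertools import combinations

def characteristic_sets(triples):
    """Map each distinct characteristic set to the number of subjects having it."""
    preds = defaultdict(set)
    for s, p, o in triples:
        preds[s].add(p)
    counts = defaultdict(int)
    for ps in preds.values():
        counts[frozenset(ps)] += 1
    return dict(counts)

def make_cost(char_sets):
    """Stand-in cost (an assumption): subjects whose set contains all of c."""
    def cost(c):
        return sum(n for s, n in char_sets.items() if c <= s)
    return cost

def build_hierarchy(char_sets, cost):
    """Return the pointers S -> cheapest subset C with |C| = |S| - 1."""
    pointers = {}
    level = set(char_sets)  # H_0
    while level:
        next_level = set()  # H_i, built from H_{i-1}
        for s in level:
            if len(s) <= 2:
                continue  # sets of size 2 are the bottom of the hierarchy
            subsets = [frozenset(c) for c in combinations(sorted(s), len(s) - 1)]
            cheapest = min(subsets, key=cost)
            pointers[s] = cheapest
            next_level.add(cheapest)
        level = next_level
    return pointers

# Toy dataset (invented): four subjects with overlapping predicates.
triples = [
    ("a", "created", 1), ("a", "bornIn", 1), ("a", "livedIn", 1), ("a", "hasName", 1),
    ("b", "bornIn", 1), ("b", "livedIn", 1), ("b", "hasName", 1),
    ("c", "bornIn", 1), ("c", "livedIn", 1), ("c", "hasName", 1),
    ("d", "bornIn", 1), ("d", "hasName", 1),
]
sets = characteristic_sets(triples)
pointers = build_hierarchy(sets, make_cost(sets))
```

With this toy cost, the set {bornIn, livedIn, hasName} points to {bornIn, livedIn}, because fewer subjects carry both of those predicates than carry {bornIn, hasName}.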
Algorithm 1 (part. 1)
• Line 2: S = [{created, bornIn, livedIn, hasName}, {bornIn, livedIn, hasName}, ...]
• Line 8: initializes the Banker's iteration, i.e. from the smallest to the biggest possible set of predicates
Algorithm 1 (part. 2)
• Line 12: guarantees that S_2 is smaller than S_1
• Lines 15-16: find the subsets that have a smaller cost
• Cost
  • Banker’s iteration potentially enumerates all the subsets of all predicates in the dataset, but in reality it stops relatively early, since it is always bounded by the largest set in Sets
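A Banker's-style enumeration visits subsets in order of increasing size. The sketch below shows that idea with `itertools.combinations`; the function name and the `max_size` parameter are hypothetical, and the order of subsets within one size may differ from the paper's algorithm.

```python
from itertools import combinations

def bankers_subsets(items, max_size):
    """Yield non-empty subsets of `items` from smallest to largest, up to `max_size`."""
    items = sorted(items)
    for size in range(1, min(max_size, len(items)) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

# Bounding by max_size mirrors the early stop: no subset larger than the
# largest characteristic set ever needs to be enumerated.
subsets = list(bankers_subsets({"bornIn", "hasName", "livedIn"}, max_size=2))
sizes = [len(s) for s in subsets]
print(sizes)  # [1, 1, 1, 2, 2, 2]
```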
Algorithm 2 (part. 1/2)
• Objective: finding the optimal join order in (sub)queries of the form:
  select * where {?s p1 ?o1 . ... ?s pk ?ok}
• Idea: extract the part of the Hierarchical Characterisation of the dataset starting with the set S
• Input: star-shaped graph
• Output: order of the joins
• Lines 1-9:
  • While the size of S > 2, find the most expensive subset and push it to the front of O
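One possible reading of this loop, sketched in Python: follow the stored pointer from S to its cheapest subset, push the predicate that was left out to the front of O, and repeat until two predicates remain. The function name and the hand-written `cheapest_subset` pointers are assumptions for illustration, and the exact rule in the paper's Algorithm 2 may differ.

```python
def order_star(predicates, cheapest_subset):
    """Order the predicates of a star query by walking the hierarchy pointers."""
    s = frozenset(predicates)
    order = []
    while len(s) > 2:
        c = cheapest_subset[s]   # pointer stored by the Hierarchical Characterisation
        (dropped,) = s - c       # the single predicate not in the cheapest subset
        order.insert(0, dropped) # push to the front of O
        s = c
    return sorted(s) + order     # the cheapest pair is joined first

# Hypothetical pointers for one star (not computed from real data).
S4 = frozenset({"created", "bornIn", "livedIn", "hasName"})
S3 = frozenset({"created", "bornIn", "livedIn"})
S2 = frozenset({"created", "livedIn"})
pointers = {S4: S3, S3: S2}

print(order_star(S4, pointers))  # ['created', 'livedIn', 'bornIn', 'hasName']
```

The loop performs one pointer lookup per removed predicate, so it never enumerates alternative join orders.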
Algorithm 2 (part. 2/2)
• The first part finds the optimal order for star-shaped queries in time linear in the graph size
• However, it does not find the optimal solution if the query has constants:
  select * where {?s p1 "Berlin" . ... ?s pk ?ok}
• Then:
  • Lines 12-14: only one of the bound objects is in the triple with the key predicate, i.e. the entire star query is a lookup of the properties of a specific entity
  • Lines 15-16: otherwise (many objects are key), keep pushing the constants down the join tree and stop when the cost of the corresponding index scan is bigger than the cost of the join on that level of the tree
Algorithm 3 (part. 1/4)
• Objective: ordering joins in general SPARQL queries, e.g.:
  (s1, hasName, "Marie Curie"),
  (s1, bornIn, s2),
  (s2, label, "Warsaw"),
  (s2, locatedIn, "Poland")
• Problem: s2 links a person to a city, corresponding to a "foreign key", but RDF does not require any schema. Knowledge of such dependencies is extremely useful for the query optimizer: without it, the optimizer has to assume independence between the two entities linked via the bornIn predicate, thus almost inevitably underestimating the selectivity of the join of the corresponding triple patterns
• Thus, it uses Characteristic Pairs (P_C) to discover this kind of relation, where:
  P_C(S_c(s), S_c(o)) = {(S_c(s), S_c(o), p) | S_c(o) ≠ ∅ ∧ ∃p : (s, p, o) ∈ R}
• The characteristic pairs are kept in an in-memory structure. In theory, n distinct characteristic sets can yield up to n^2 characteristic pairs, but in real datasets only a few pairs appear frequently enough to be stored. For example, in the YAGO-Facts dataset, of the 250000 existing pairs only 5292 appear more than 100 times. This way, the frequent characteristic pairs for this dataset consume less than 16 KB
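The definition of characteristic pairs can be illustrated with a short Python sketch that counts how often each pair occurs and keeps only the frequent ones. The function name, the `min_count` parameter and the toy triples are assumptions made for this example, not part of the paper.

```python
from collections import Counter, defaultdict

def characteristic_pairs(triples, min_count=1):
    """Count (S_c(s), S_c(o), p) for every triple whose object is also a subject."""
    preds = defaultdict(set)
    for s, p, o in triples:
        preds[s].add(p)
    char_set = {s: frozenset(ps) for s, ps in preds.items()}

    pairs = Counter()
    for s, p, o in triples:
        if o in char_set:  # S_c(o) is non-empty only if o occurs as a subject
            pairs[(char_set[s], char_set[o], p)] += 1
    # Keep only the pairs that occur often enough to be worth storing.
    return {k: n for k, n in pairs.items() if n >= min_count}

# Toy data (invented): two people born in the same city.
triples = [
    ("person:1", "hasName", "Name One"), ("person:1", "bornIn", "city:1"),
    ("person:2", "hasName", "Name Two"), ("person:2", "bornIn", "city:1"),
    ("city:1", "label", "Some City"), ("city:1", "locatedIn", "Some Country"),
]
person = frozenset({"hasName", "bornIn"})
city = frozenset({"label", "locatedIn"})
pairs = characteristic_pairs(triples)
print(pairs[(person, city, "bornIn")])  # 2
```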
Algorithm 3 (part. 2/4)
• Idea: decompose the query into star-shaped subqueries connected by chains, and collapse the subqueries into meta-nodes
• Input: SPARQL graph
• Output: join ordering for this graph
• Lines 11-24: start by clustering the query into disjoint star-shaped subqueries around subjects
  • Line 13: order the triple patterns in the query by subject
  • Line 15: group triple patterns with identical subjects, since they potentially form star-shaped subqueries
  • Lines 20-23: find stars around objects
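The subject-clustering step (sort by subject, then group identical subjects) can be sketched in Python as below. This is an illustration only: the function names are hypothetical, the stars around objects are not handled, and the edges are derived in the simplest possible way, by checking whether the object of a pattern is the subject of another star.

```python
from itertools import groupby

def subject_stars(patterns):
    """Group triple patterns (s, p, o) into candidate stars, one per subject."""
    by_subject = lambda t: t[0]
    ordered = sorted(patterns, key=by_subject)  # order by subject
    return {s: list(g) for s, g in groupby(ordered, key=by_subject)}  # group

def star_edges(stars):
    """Edges between meta-nodes: an object of one star is the subject of another."""
    return {(s, o) for s, pats in stars.items() for _, _, o in pats
            if o in stars and o != s}

# The example query from the slides, with variables written as ?s1 and ?s2.
query = [
    ("?s1", "hasName", '"Marie Curie"'),
    ("?s1", "bornIn", "?s2"),
    ("?s2", "label", '"Warsaw"'),
    ("?s2", "locatedIn", '"Poland"'),
]
stars = subject_stars(query)
edges = star_edges(stars)
print(sorted(stars), edges)  # ['?s1', '?s2'] {('?s1', '?s2')}
```

Here four triple patterns collapse into two meta-nodes joined by a single edge, which is the smaller graph handed to the dynamic programming step.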
Algorithm 3 (part. 3/4)
• Lines 4-5: for every star it adds the new meta-node to the query graph and removes the intra-star edges
• Lines 6-7: the plan for the star subquery is computed using the Hierarchical Characterisation (Algorithm 2) and added to the DP table along with the meta-node
• Line 8: After all the star subqueries have been optimized, we add the edges between meta-nodes to the query graph, if the original graph has edges between the corresponding star sub-queries
Algorithm 3 (part. 4/4)
• Line 10: selectivities associated with these edges are computed using the Characteristic Pairs synopsis, and the regular Dynamic Programming algorithm starts working on this simplified graph
• In the following figure, simplifying the graph from 8 nodes to 3 nodes gives a reduction from 8! = 40320 plans to 3! = 6 plans
• This algorithm is also linear in the size of the input graph
Analysis
Conclusions
• The problem is very similar to the matrix product
• The query simplification technique reduces the search space size by making some simplifications before the DP algorithm starts
• The time analysis shows how important the complexity study is
• There is no complexity analysis, though the paper mentions DP and greedy algorithms throughout
• The tests did not turn the cache off
• The paper does not cover the OPTIONAL clauses of SPARQL, which are equivalent to left outer joins and cannot be freely reordered with other joins