rcq-ga: rdf chain query optimization using genetic algorithms bnaic 2009 alexander hogenboom, viorel...

16
RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus School of Economics Erasmus University Rotterdam {hogenboom, milea, frasincar, kaymak}@ese.eur.nl October 30, 2009

Post on 19-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms

BNAIC 2009

Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay KaymakErasmus School of EconomicsErasmus University Rotterdam

{hogenboom, milea, frasincar, kaymak}@ese.eur.nl

October 30, 2009

Page 2: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

Introduction

• The application of Semantic Web technologies in an Electronic Commerce environment implies a need for good support tools

• Fast query engines are needed for efficient querying of large amounts of data, usually represented using the Resource Description Framework (RDF)

• Problem: optimizing query paths (the order in which different parts of a query are evaluated)

• Two-phase optimization (2PO) has already been proposed (Stuckenschmidt et al. 2005) in a Semantic Web context, but a genetic algorithm (GA) appears to be a feasible alternative

BNAIC 2009

2

Page 3: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

RDF and Query Paths (1)

• RDF model is a collection of facts declared using RDF• Facts are triples in the form of a node-arc-node link

consisting of a subject, a predicate, and an object• RDF sources can be queried using SPARQL• We consider a subset of SPARQL queries: chain

queries, where a query path is followed by performing joins between its subpaths of length 1

1. PREFIX c: <http://www.daml.org/2001/09/countries/fips#>2. PREFIX o: <http://www.daml.org/2003/09/factbook/factbook-ont#>3. SELECT ?partner4. WHERE { c:SouthAfrica o:importPartner ?impPartner .5. ?impPartner o:country ?partner .6. ?partner o:border ?border .7. ?border o:country ?neighbour .8. ?neighbour o:internationalDispute ?dispute .9. }

3

BNAIC 2009

Page 4: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

RDF and Query Paths (2)

BNAIC 2009

4

Bushy query tree Right-deep query tree

Page 5: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

RDF Query Path Optimization (1)

• Challenge: determine the right order in which the joins should be computed, hereby optimizing the overall response time

• Consider a solution space with query paths• Solutions are associated with data transmission and

processing costs• Data processing costs are the sum of all join costs,

which are influenced by the cardinalities of each operand and the join method used

• Neighbouring solutions in solution space can be identified using transformation rules introduced by Ioannidis and Kang (1990)

BNAIC 2009

5

Page 6: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

RDF Query Path Optimization (2)

• Stuckenschmidt et al. (2005) propose to use 2PO for RDF chain query optimization:– Using Iterative Improvement (II), local optima are found by

walking through solution space (from random starting points), while only taking steps yielding improvement in solution quality

– The best local optimum thus found is used as starting point for Simulated Annealing (SA); a walk through solution space is performed, where moves not yielding improvement are accepted with a declining probability

• We propose to optimize RDF chain queries using a GA, RCQ-GA

BNAIC 2009

6

Page 7: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

RDF Query Path Optimization (3)

• In a GA, a population of chromosomes (solutions) is exposed to evolution: selection, crossovers, and mutations

• A GA generally is aware of good solutions faster than 2PO, but tends to spend a lot of time optimizing these already good results before it terminates

• We adopt the BushyGenetic (BG) algorithm proposed by Steinbrunn et al. (1997) for traditional query path optimization, but stimulate quicker convergence through elitist selection, fitness-based selection, a decreased population size, and tighter stopping conditions

BNAIC 2009

7

Page 8: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

RDF Query Path Optimization (4)

• Solutions are encoded using an efficient ordinal number encoding scheme, facilitating easy crossover and mutation operations

• The algorithm iteratively joins two concepts in an ordered list of concepts

• Result is saved on position of first appearing concept• Example:

– (c1, c2, c3, c4): join 3 and 4– (c1, c2, c3c4): join 1 and 2– (c1c2, c3c4): join 1 and 2– (c1c2c3c4)

• Encoded chromosome: ((3,4),(1,2),(1,2))

BNAIC 2009

8

Page 9: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

RDF Query Path Optimization (5)

BNAIC 2009

9

Page 10: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

Performance (1)

• We benchmark execution times and solution quality of BG and our adaptation to RDF query environments, RCQ-GA, against those of 2PO

• The effects of a time limit (1 second) on 2PO and RCQ-GA are also assessed

• The entire solution space is considered (i.e., bushy query trees are valid options)

• Each algorithm is tested on chain queries varying in length from 2 to 20 predicates

• Each experiment is iterated 100 times

BNAIC 2009

10

Page 11: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

Performance (2)

BNAIC 2009

11

Relative deviation of average execution times from 2PO average

Page 12: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

Performance (3)

BNAIC 2009

12

Relative deviation of average solution costs from 2PO average

Page 13: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

Performance (4)

BNAIC 2009

13

Relative deviation of coefficients of variation of solution costs from 2PO average

Page 14: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

Conclusions

• In optimizing the query path for chain queries in a single-source RDF query execution environment, the performance of a GA compared to 2PO is positively correlated with the complexity of the solution space and the restrictiveness of the environment

• An appropriately configured GA can outperform 2PO in solution quality, execution time needed, and consistency of solution quality

BNAIC 2009

14

Page 15: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

Future Work

• Optimize parameters (e.g., using meta-algorithms)• Evaluate performance in a distributed setting• Experiment with other algorithms, such as ant colony

optimization or particle swarm optimization

BNAIC 2009

15

Page 16: RCQ-GA: RDF Chain Query Optimization using Genetic Algorithms BNAIC 2009 Alexander Hogenboom, Viorel Milea, Flavius Frasincar, and Uzay Kaymak Erasmus

Questions?

• Feel free to contact:Alexander HogenboomErasmus School of EconomicsErasmus University RotterdamP.O. Box 1738, 3000 DR, The [email protected]

BNAIC 2009

16