impala as triplestore

Download Impala as Triplestore

Post on 05-Jul-2015

138 views

Category:

Software

1 download

Embed Size (px)

DESCRIPTION

Can we use impala as triplestore? This doucment shortly describes sempala which was published on ISWC 2014.

TRANSCRIPT

  • 1. Cloudera ImpalaCan we use impala as triple store?Jinuk, Jung

2. Impala?Impala Overview What is impala? In-memory, distributed SQL Query Engine No Map/Reduce Distributed on HDFS data nodes Why impala? Interactive data analysis Low-latency response Deploy on existing Hadoop Clusters 3. SQL SupportImpala Overview Subset of HiveQLselect aggregationprojection subqueriesunion order by(w/ limit)insert join 4. Limitation DDL statements(Delete, Update, Rollback) are not supported The only supported DML statements(Insert, Create, Alter) Constraints(Primary/Foreign key relationship) and Trigger are not available It create too much overhead for large scale environments. Joins are limited to in-memory joins(v1.2.3) Joins requires both tables which should be joined to fit into the memory of the cluster node In v2.0, supports On-Disk Joins Fault-tolerance is not supported Join OptimizationImpala Overview 5. SempalaSempala Overview ISWC 2014 published Impala as Triple Store Consist of RDF Loader / Query Compiler 6. SempalaBenchmark Environment System Spec. Intel Xeon E5-2420 1.9Ghz 6core 32GB RAM 4TB Disk Space Connected via gigabit network 10 nodes Software Env. Ubuntu 12.04 LTS Impala v 1.2.3 (Current version is 2.0) 7. Sempala LUBM Performance TestLUBM loading times and dataset sizesLUBM Runtime 8. Sempala LUBM Performance Test(contd) 9. SempalaLUBM QueriesUniv# 5 500 1000 1500 2000 2500 3000Q1 5.78 6.97 7.09 7.38 7.61 8.21 8.79Q1 (simple, fastest, 4 results)PREFIX rdf: PREFIX lubm: SELECT ?X WHERE {?X rdf:type lubm:GraduateStudent .?X lubm:takesCourse .}Converted SQLCreate Table result as (SELECT DISTINCT BGP1_0.X AS "X" FROM(SELECT DISTINCT ID AS "X" FROM propertytableWHERE ID is not nullAND rdf_type is not nullAND rdf_type = lubm:GraduateStudent AND lubm_takesCourse is not nullAND lubm_takesCourse = ) BGP1_0) ; 10. SempalaLUBM QueriesUniv# 5 500 1000 1500 2000 2500 3000Q1 5.86 16.22 28.28 36.29 47.32 55.83 66.86Q6 (simplest, slowest, +20million results)PREFIX rdf: PREFIX lubm: SELECT ?X WHERE { ?X rdf:type lubm:Student .}Converted SQLCREATE TABLE result AS(SELECT DISTINCT BGP1_0.x AS "X"FROM (SELECT DISTINCT id AS "X" FROM property_tableWHERE id IS NOT NULLAND rdf_type IS NOT NULLAND rdf_type = lubm:Student) BGP1_0) 11. SempalaLUBM QueriesUniv# 5 500 1000 1500 2000 2500 3000Q1 6.78 10.81 14.20 17.55 21.49 25.24 28.55Q9 (complicate, slower, +100k results)PREFIX rdf: PREFIX lubm: SELECT ?X ?Y ?Z WHERE{?X rdf:type lubm:Student .?Y rdf:type lubm:Faculty .?Z rdf:type lubm:Course .?X lubm:advisor ?Y .?Y lubm:teacherOf ?Z .?X lubm:takesCourse ?Z}CREATE TABLE result AS(SELECT DISTINCT BGP1.y AS "Y",BGP1.x AS "X",BGP1.z AS "Z"FROM (SELECT DISTINCT BGP1_0.y AS "Y",BGP1_0.x AS "X",BGP1_0.z AS "Z"FROM (SELECT DISTINCT lubm_advisorFROM property_table WHERE id IS NOT NULLAND rdf_type IS NOT NULLAND rdf_type = lubm:StudentAND lubm_advisor IS NOT NULLAND lubm_takescourse IS NOT NULL) BGP1_0JOIN (SELECT DISTINCT id AS "Y", lubm_teacherofAS "Z"FROM property_table WHERE id IS NOT NULLAND rdf_type IS NOT NULLAND rdf_type = lubm:FacultyAND lubm_teacherof IS NOT NULL) BGP1_1ON( BGP1_0.y = BGP1_1.yAND BGP1_0.z = BGP1_1.z ) JOIN (SELECT DISTINCT id AS"Z"FROM property_table WHERE id IS NOT NULLAND rdf_type IS NOT NULLAND rdf_type = lubm:Course) BGP1_2 ON( BGP1_1.z = BGP1_2.z)) BGP1) 12. Sempala BSBM Performance TestBSBM loading times and dataset sizesBSBM Runtime 13. Sempala BSBM Performance Test(contd) 14. ConclusionSuitable solution for our project?Pros Supports a join statement- Possible to convert almost SPARQL statement (+ additional stmt)- We dont need to care the query plan for graph joinCon Impala needs high-performance nodes- In-memory type architecture Performance problem- Fastest response time was 8.79s in LUBM 3000- Improvement = Query Tuning? 15. Reference Neu, A. R. (2014). Distributed Evaluation of SPARQL queries with Impala and MapReduce. Schtzle, A., Przyjaciel-Zablocki, M., Neu, A., & Lausen, G. (2014).Sempala: Interactive SPARQL Query Processing on Hadoop. ISWC 2014 Cloudera Impala Overview (http://goo.gl/cDUzUe) Performance Considerations for join queries (http://goo.gl/Z8ssW6)