impala as triplestore

Cloudera Impala Can we use impala as triple store? Jinuk, Jung

Upload: jinuk-jung

Post on 05-Jul-2015


DESCRIPTION

Can we use Impala as a triple store? This document briefly describes Sempala, which was published at ISWC 2014.

TRANSCRIPT

Page 1: Impala as Triplestore

Cloudera Impala

Can we use impala as triple store?

Jinuk, Jung

Page 2: Impala as Triplestore

Impala?

Impala Overview

• What is Impala?
- In-memory, distributed SQL query engine
- No MapReduce
- Distributed on HDFS data nodes

• Why Impala?
- Interactive data analysis
- Low-latency response
- Deploys on existing Hadoop clusters

Page 3: Impala as Triplestore

SQL Support

Impala Overview

• Subset of HiveQL
- select, projection, union, insert
- aggregation, subqueries, order by (with limit), join

Page 4: Impala as Triplestore

Limitations

Impala Overview

• DELETE, UPDATE, and ROLLBACK statements are not supported
- The only supported statements that change data or schema are INSERT (DML) and CREATE / ALTER (DDL)

• Constraints (primary/foreign key relationships) and triggers are not available
- They would create too much overhead in large-scale environments

• Joins are limited to in-memory joins (v1.2.3)
- A join requires both tables being joined to fit into the memory of the cluster nodes
- v2.0 supports on-disk joins

• Fault tolerance is not supported

• Join optimization

Page 5: Impala as Triplestore

Sempala

Sempala Overview

• Published at ISWC 2014

• Uses Impala as a triple store

• Consists of an RDF loader and a query compiler
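The converted SQL on the later slides queries a single wide property table with an id column and one column per predicate. The following is a minimal sketch of what an RDF loader could do to produce such a table. The function names, the column-naming rule, and the handling of multi-valued properties (one row per combination of values) are assumptions made for illustration, not Sempala's actual loader.

```python
from collections import defaultdict
from itertools import product


def column_name(predicate):
    """Turn a prefixed predicate such as 'lubm:takesCourse' into a column name."""
    return predicate.replace(":", "_")


def build_property_table(triples):
    """Pivot (subject, predicate, object) triples into wide rows keyed by subject."""
    values = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        values[s][column_name(p)].append(o)

    columns = sorted({c for props in values.values() for c in props})
    rows = []
    for s in sorted(values):
        # A missing property becomes None (NULL). A multi-valued property
        # expands into one row per combination of values (an assumption).
        per_column = [values[s].get(c) or [None] for c in columns]
        for combo in product(*per_column):
            rows.append((s,) + combo)
    return ["id"] + columns, rows


# Hypothetical sample triples, shortened for readability.
triples = [
    ("<Student1>", "rdf:type", "lubm:GraduateStudent"),
    ("<Student1>", "lubm:takesCourse", "<Course0>"),
    ("<Student1>", "lubm:takesCourse", "<Course1>"),
    ("<Prof1>", "rdf:type", "lubm:Faculty"),
]
header, rows = build_property_table(triples)
print(header)
for row in rows:
    print(row)
```

With this layout, all triple patterns that share a subject can be answered from one row scan, which is why the converted queries need a join only when a pattern links two different subjects.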

Page 6: Impala as Triplestore

Sempala

Benchmark Environment

• System spec.
- Intel Xeon E5-2420, 1.9 GHz, 6 cores
- 32 GB RAM
- 4 TB disk space
- Connected via gigabit network
- 10 nodes

• Software env.
- Ubuntu 12.04 LTS
- Impala v1.2.3 (current version is 2.0)

Page 7: Impala as Triplestore

Sempala LUBM Performance Test

LUBM loading times and dataset sizes

LUBM Runtime

Page 8: Impala as Triplestore

Sempala LUBM Performance Test(cont’d)

Page 9: Impala as Triplestore

Sempala

LUBM Queries

Q1 (simple, fastest, 4 results)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX lubm: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>

SELECT ?X WHERE {
  ?X rdf:type lubm:GraduateStudent .
  ?X lubm:takesCourse <http://www.Department0.University0.edu/GraduateCourse0> .
}

Univ# 5 500 1000 1500 2000 2500 3000

Q1 5.78 6.97 7.09 7.38 7.61 8.21 8.79

Converted SQL

CREATE TABLE result AS
(SELECT DISTINCT BGP1_0.X AS "X"
 FROM (SELECT DISTINCT ID AS "X"
       FROM propertytable
       WHERE ID IS NOT NULL
         AND rdf_type IS NOT NULL
         AND rdf_type = 'lubm:GraduateStudent'
         AND lubm_takesCourse IS NOT NULL
         AND lubm_takesCourse = '<http://www.Department0.University0.edu/GraduateCourse0>') BGP1_0);
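To make the Q1 translation concrete, here is a small sketch that runs the inner SELECT of the converted SQL against a toy property table. It uses Python's built-in sqlite3 module purely as a stand-in for Impala, and the four sample rows and their shortened identifiers are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # sqlite3 stands in for Impala here
conn.execute(
    "CREATE TABLE propertytable (id TEXT, rdf_type TEXT, lubm_takesCourse TEXT)"
)

course0 = "<http://www.Department0.University0.edu/GraduateCourse0>"
# Hypothetical rows: only <StudentA> satisfies both triple patterns of Q1.
conn.executemany(
    "INSERT INTO propertytable VALUES (?, ?, ?)",
    [
        ("<StudentA>", "lubm:GraduateStudent", course0),
        ("<StudentB>", "lubm:GraduateStudent", "<OtherCourse>"),
        ("<StudentC>", "lubm:UndergraduateStudent", course0),
        ("<ProfA>", "lubm:FullProfessor", None),
    ],
)

# Same shape as the converted SQL, without the CREATE TABLE wrapper.
q1_sql = """
SELECT DISTINCT BGP1_0.X AS "X"
FROM (SELECT DISTINCT id AS "X"
      FROM propertytable
      WHERE id IS NOT NULL
        AND rdf_type IS NOT NULL
        AND rdf_type = 'lubm:GraduateStudent'
        AND lubm_takesCourse IS NOT NULL
        AND lubm_takesCourse = ?) BGP1_0
"""
q1_result = [row[0] for row in conn.execute(q1_sql, (course0,))]
print(q1_result)
```

Both triple patterns share the subject ?X, so the whole query becomes a filter over one table scan with no join.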

Page 10: Impala as Triplestore

Sempala

LUBM Queries

Q6 (simplest, slowest, 20+ million results)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX lubm: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>

SELECT ?X WHERE {
  ?X rdf:type lubm:Student .
}

Univ# 5 500 1000 1500 2000 2500 3000

Q6 5.86 16.22 28.28 36.29 47.32 55.83 66.86

Converted SQL

CREATE TABLE result AS
(SELECT DISTINCT BGP1_0.x AS "X"
 FROM (SELECT DISTINCT id AS "X"
       FROM property_table
       WHERE id IS NOT NULL
         AND rdf_type IS NOT NULL
         AND rdf_type = 'lubm:Student') BGP1_0)

Page 11: Impala as Triplestore

Sempala

LUBM Queries

Q9 (complex, slower, 100k+ results)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX lubm: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>

SELECT ?X ?Y ?Z WHERE {
  ?X rdf:type lubm:Student .
  ?Y rdf:type lubm:Faculty .
  ?Z rdf:type lubm:Course .
  ?X lubm:advisor ?Y .
  ?Y lubm:teacherOf ?Z .
  ?X lubm:takesCourse ?Z
}

Univ# 5 500 1000 1500 2000 2500 3000

Q9 6.78 10.81 14.20 17.55 21.49 25.24 28.55

CREATE TABLE result AS
(SELECT DISTINCT BGP1.y AS "Y", BGP1.x AS "X", BGP1.z AS "Z"
 FROM (SELECT DISTINCT BGP1_0.y AS "Y", BGP1_0.x AS "X", BGP1_0.z AS "Z"
       FROM (SELECT DISTINCT id AS "X", lubm_advisor AS "Y", lubm_takescourse AS "Z"
             FROM property_table
             WHERE id IS NOT NULL
               AND rdf_type IS NOT NULL
               AND rdf_type = 'lubm:Student'
               AND lubm_advisor IS NOT NULL
               AND lubm_takescourse IS NOT NULL) BGP1_0
       JOIN (SELECT DISTINCT id AS "Y", lubm_teacherof AS "Z"
             FROM property_table
             WHERE id IS NOT NULL
               AND rdf_type IS NOT NULL
               AND rdf_type = 'lubm:Faculty'
               AND lubm_teacherof IS NOT NULL) BGP1_1
         ON (BGP1_0.y = BGP1_1.y AND BGP1_0.z = BGP1_1.z)
       JOIN (SELECT DISTINCT id AS "Z"
             FROM property_table
             WHERE id IS NOT NULL
               AND rdf_type IS NOT NULL
               AND rdf_type = 'lubm:Course') BGP1_2
         ON (BGP1_1.z = BGP1_2.z)) BGP1)
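The Q9 translation follows a visible pattern: triple patterns with the same subject are merged into one subquery over the property table, and the subqueries are joined on the variables they share. The sketch below reproduces that pattern in Python. It is an illustration under stated assumptions, not Sempala's query compiler: `compile_bgp` and its helpers are hypothetical names, every subject is assumed to be a variable, constants are inlined without escaping, and sqlite3 stands in for Impala to check the generated SQL on invented sample rows.

```python
import sqlite3


def col(predicate):
    """Map a prefixed predicate such as 'lubm:advisor' to a column name."""
    return predicate.replace(":", "_")


def compile_bgp(patterns, table="property_table"):
    """Translate triple patterns into SQL: one subquery per subject variable."""
    groups = {}  # subject -> [(predicate, object)], in first-seen order
    for s, p, o in patterns:
        groups.setdefault(s, []).append((p, o))

    subqueries = []  # (alias, exposed variable names, SQL text)
    for i, (subject, pairs) in enumerate(groups.items()):
        select = {subject[1:]: "id"}  # variable name -> source column
        where = ["id IS NOT NULL"]
        for p, o in pairs:
            where.append(f"{col(p)} IS NOT NULL")
            if not o.startswith("?"):
                where.append(f"{col(p)} = '{o}'")  # constant object (unescaped)
            elif o[1:] in select:
                where.append(f"{col(p)} = {select[o[1:]]}")  # repeated variable
            else:
                select[o[1:]] = col(p)
        cols = ", ".join(f"{c} AS {v}" for v, c in select.items())
        sql = f"(SELECT DISTINCT {cols} FROM {table} WHERE {' AND '.join(where)})"
        subqueries.append((f"BGP1_{i}", list(select), sql))

    alias0, vars0, sql0 = subqueries[0]
    seen = {v: alias0 for v in vars0}  # variable -> first alias exposing it
    from_clause = f"{sql0} {alias0}"
    for alias, names, sql in subqueries[1:]:
        shared = [v for v in names if v in seen]
        if shared:
            on = " AND ".join(f"{seen[v]}.{v} = {alias}.{v}" for v in shared)
            from_clause += f" JOIN {sql} {alias} ON ({on})"
        else:
            from_clause += f" CROSS JOIN {sql} {alias}"
        for v in names:
            seen.setdefault(v, alias)
    projection = ", ".join(f"{a}.{v} AS {v}" for v, a in seen.items())
    return f"SELECT DISTINCT {projection} FROM {from_clause}"


Q9 = [
    ("?X", "rdf:type", "lubm:Student"),
    ("?Y", "rdf:type", "lubm:Faculty"),
    ("?Z", "rdf:type", "lubm:Course"),
    ("?X", "lubm:advisor", "?Y"),
    ("?Y", "lubm:teacherOf", "?Z"),
    ("?X", "lubm:takesCourse", "?Z"),
]
q9_sql = compile_bgp(Q9)

conn = sqlite3.connect(":memory:")  # stand-in for Impala
conn.execute(
    "CREATE TABLE property_table (id TEXT, rdf_type TEXT, lubm_advisor TEXT,"
    " lubm_takesCourse TEXT, lubm_teacherOf TEXT)"
)
conn.executemany(
    "INSERT INTO property_table VALUES (?, ?, ?, ?, ?)",
    [
        ("<S1>", "lubm:Student", "<P1>", "<C1>", None),  # takes advisor's course
        ("<S2>", "lubm:Student", "<P1>", "<C2>", None),  # advisor does not teach C2
        ("<P1>", "lubm:Faculty", None, None, "<C1>"),
        ("<C1>", "lubm:Course", None, None, None),
        ("<C2>", "lubm:Course", None, None, None),
    ],
)
q9_rows = conn.execute(q9_sql).fetchall()
print(q9_rows)
```

Grouping by subject keeps the join count at the number of distinct subjects minus one (two joins for Q9) rather than one join per triple pattern, which matters when joins must fit in memory.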

Page 12: Impala as Triplestore

Sempala BSBM Performance Test

BSBM loading times and dataset sizes

BSBM Runtime

Page 13: Impala as Triplestore

Sempala BSBM Performance Test(cont’d)

Page 14: Impala as Triplestore

Suitable solution for our project?

Conclusion

Pros
• Supports join statements
- Almost all SPARQL statements can be converted (with additional statements)
- We don't need to care about the query plan for graph joins

Cons
• Impala needs high-performance nodes
- In-memory architecture
• Performance problem
- The fastest response time was 8.79 s on LUBM 3000
- Improvement through query tuning?
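The performance concern can be put in numbers using the runtimes listed on the LUBM query slides. The short sketch below compares how much each query slows down as the dataset grows from 5 to 3000 universities. The values are copied from those slides and read as seconds; the two helper functions are illustrative names.

```python
universities = [5, 500, 1000, 1500, 2000, 2500, 3000]

# Runtimes in seconds, as listed on the Q1, Q6, and Q9 slides.
runtimes = {
    "Q1": [5.78, 6.97, 7.09, 7.38, 7.61, 8.21, 8.79],
    "Q6": [5.86, 16.22, 28.28, 36.29, 47.32, 55.83, 66.86],
    "Q9": [6.78, 10.81, 14.20, 17.55, 21.49, 25.24, 28.55],
}


def growth(times):
    """Runtime at the largest dataset divided by runtime at the smallest."""
    return times[-1] / times[0]


def seconds_per_1000_universities(times):
    """Average extra runtime for every additional 1000 universities."""
    span = universities[-1] - universities[0]
    return (times[-1] - times[0]) / span * 1000


data_growth = universities[-1] / universities[0]
print(f"dataset grows {data_growth:.0f}x")
for name, times in runtimes.items():
    print(
        f"{name}: runtime grows {growth(times):.2f}x, "
        f"about {seconds_per_1000_universities(times):.2f} s per 1000 universities"
    )
```

All three queries already take roughly 6 to 7 seconds on the smallest dataset, which may point to a fixed per-query overhead rather than to the data volume. Whether query tuning can reduce that floor is the open question raised above.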

Page 15: Impala as Triplestore

Reference

• Neu, A. R. (2014). Distributed Evaluation of SPARQL queries with Impala and MapReduce.

• Schätzle, A., Przyjaciel-Zablocki, M., Neu, A., & Lausen, G. (2014). Sempala: Interactive SPARQL Query Processing on Hadoop. ISWC 2014

• Cloudera Impala Overview (http://goo.gl/cDUzUe)

• Performance Considerations for join queries (http://goo.gl/Z8ssW6)