Ema Orhian @emaorhian Jaws - Data Warehouse with Spark SQL

Upload: spark-summit

Post on 14-Jan-2017

Category: Data & Analytics


TRANSCRIPT

Page 1: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Ema Orhian @emaorhian

Jaws - Data Warehouse with Spark SQL

Page 2: Jaws - Data Warehouse with Spark SQL by Ema Orhian

About me

• Big Data analytics / Machine Learning
• 4+ years exp with Hadoop ecosystem
• 2 years exp with Spark

http://bigdataresearch.io/

• Co-founder of Big Data Research Group
• Provides open source solutions around Big Data analytics

http://atigeo.com/

Page 3: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Agenda

• jaws-spark-sql-rest (Jaws) intro
• Main features
• Architecture
• Scaling
• Resource manager
• Working with Tachyon
• Working with Parquet files
• Configure Spark Sql context
• Demo

Page 4: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Shared Spark Sql Context

Concurrent queries run

Query history

Page results

Query editor

Page 5: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Jaws

• Highly scalable and resilient data warehouse explorer

• RESTful alternative to Spark SQL JDBC, and more …

• Support for Spark 0.9.1/Shark through Spark 1.5

• Support for Hive/MR

https://github.com/atigeo/jaws-spark-sql-rest

Page 6: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Main features

• Submit queries concurrently and asynchronously

• Provides persisted logs, query history, results with paging

• Pluggable persistent layer (Cassandra/HDFS)

• Supports load balancing with query cancelation

• Provides a metadata browser

• In-memory Parquet warehouse with Tachyon

• Configuration file to fine tune Spark context

• Pluggable UI
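
The first two features above (asynchronous, concurrent submission plus query history and paged results) can be sketched in a few lines. This is only an illustration under stated assumptions: `QueryService`, its method names, and the status strings are hypothetical and are not the Jaws REST API, and the `run_query` callable stands in for whatever actually executes the SQL.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class QueryService:
    """Toy in-memory stand-in for an asynchronous query service (hypothetical)."""

    def __init__(self, run_query, max_workers=4):
        self._run_query = run_query      # callable: SQL text -> list of rows
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}               # query id -> Future
        self._history = []               # (query id, sql) in submission order

    def submit(self, sql):
        """Start the query in the background and return its id immediately."""
        query_id = uuid.uuid4().hex
        self._futures[query_id] = self._pool.submit(self._run_query, sql)
        self._history.append((query_id, sql))
        return query_id

    def status(self, query_id):
        """Report whether the query is still running, failed, or finished."""
        future = self._futures[query_id]
        if not future.done():
            return "IN_PROGRESS"
        return "FAILED" if future.exception() else "DONE"

    def results(self, query_id, offset=0, limit=100):
        """Return one page of rows; blocks until the query has finished."""
        rows = self._futures[query_id].result()
        return rows[offset:offset + limit]

    def history(self):
        """Return a copy of the submitted queries, oldest first."""
        return list(self._history)
```

A caller would `submit` several statements without waiting, poll `status` with the returned ids, and then fetch rows page by page with `results(query_id, offset, limit)`.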

Page 7: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Jaws architecture

Page 8: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Scaling

• Standalone mode

• Mesos
  ‣ Fine grained mode
  ‣ Coarse grained mode

• YARN

Page 9: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Canceling a query

Page 10: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Canceling a query
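
The two slides above are diagrams, so the transcript does not say how cancellation is implemented. As a rough sketch of the general idea only, a running query can be tracked by id and checked against a cancel flag between units of work. `CancellableQuery` and its state names are hypothetical. In Spark itself, the public hooks for this are `SparkContext.setJobGroup` and `SparkContext.cancelJobGroup`; whether Jaws relies on them is an assumption, not something the slides state.

```python
import threading


class CancellableQuery:
    """Runs work in small steps and checks a cancel flag between steps."""

    def __init__(self, query_id, steps):
        self.query_id = query_id
        self._steps = steps                  # list of zero-argument callables
        self._cancelled = threading.Event()  # set from any thread to request a stop
        self.completed_steps = 0
        self.state = "PENDING"

    def cancel(self):
        """Request cancellation; takes effect before the next step starts."""
        self._cancelled.set()

    def run(self):
        self.state = "RUNNING"
        for step in self._steps:
            if self._cancelled.is_set():
                self.state = "CANCELLED"
                return
            step()
            self.completed_steps += 1
        self.state = "DONE"
```

The flag is cooperative: a step that has already started runs to completion, and only the remaining steps are skipped.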

Page 11: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Results persistence

• Queries with limited number of results:
  ‣ Cassandra
  ‣ HDFS

• Queries with unlimited number of results:
  ‣ HDFS
  ‣ Tachyon
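
The split above amounts to a small routing rule. A minimal sketch, assuming the store is chosen per query: the function name, the lowercase store labels, and the error handling are illustrative and not taken from Jaws.

```python
# Stores the slide lists for each kind of query.
LIMITED_STORES = ("cassandra", "hdfs")
UNLIMITED_STORES = ("hdfs", "tachyon")


def choose_result_store(limited, preferred):
    """Return `preferred` if it is a valid store for this kind of query.

    limited   -- True when the query returns a bounded number of rows
    preferred -- store requested by the caller, e.g. "hdfs"
    """
    allowed = LIMITED_STORES if limited else UNLIMITED_STORES
    if preferred not in allowed:
        raise ValueError(
            "%r cannot hold %s results; choose one of %s"
            % (preferred, "limited" if limited else "unlimited", ", ".join(allowed))
        )
    return preferred
```

HDFS is the only store valid in both cases, so it is the safe default when the result size is not known in advance.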

Page 12: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Working with Tachyon

• Persists unlimited results in Tachyon
• Registers tables over Parquet files from Tachyon

Tachyon benefits:
★ in-memory storage system
★ shares data between applications at memory speed

Page 13: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Working with Parquet files

• Register tables on top of Parquet files

Parquet
★ columnar format
★ nested data structures
★ supports schema evolution
★ efficient compression

• Files stored on HDFS or Tachyon
• MetaInfo about table stored in Cassandra (feature before Spark 1.3)
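
For readers who have not registered a table over Parquet files before, here is roughly what that step looks like with the PySpark DataFrame API. This is a sketch of the Spark side only, not of Jaws' own REST call; the helper name `register_parquet_table` is hypothetical. It takes the context as a parameter so it works with either a `SQLContext` (Spark 1.4+) or a `SparkSession`.

```python
def register_parquet_table(sql_context, table_name, path):
    """Expose the Parquet files under `path` as a SQL table named `table_name`.

    `sql_context` is expected to be a pyspark SQLContext or SparkSession;
    `path` may be any URI the cluster can read (for example an HDFS path).
    """
    df = sql_context.read.parquet(path)
    if hasattr(df, "createOrReplaceTempView"):   # Spark 2.0 and later
        df.createOrReplaceTempView(table_name)
    else:                                        # Spark 1.x DataFrame API
        df.registerTempTable(table_name)
    return df
```

After the call, `SELECT ... FROM <table_name>` can be issued against the same context. A temporary table registered this way lives only as long as that context.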

Page 14: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Configuring Jaws

• Cassandra

• HDFS

• Spray

• Application

• Spark

sparkConfiguration {
    spark-master="spark://devbox.local:7077"    # or "mesos://devbox.local:5050" or yarn-client
    spark-mesos-coarse=false                    # or true
    spark-cores-max=100
    spark-executor-instances=10
}
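
The keys in this block line up with standard Spark properties (`spark.master`, `spark.mesos.coarse`, `spark.cores.max`, `spark.executor.instances`) once the dashes are read as dots. The sketch below shows that translation. The dash-to-dot rule is an assumption inferred from the key names on the slide, and `to_spark_properties` is a hypothetical helper, not part of Jaws.

```python
def to_spark_properties(jaws_spark_config):
    """Translate dash-separated config keys into dotted Spark property names.

    Values are rendered as strings, with booleans lowercased, which is how
    Spark properties are usually written ("true" / "false").
    """
    properties = {}
    for key, value in jaws_spark_config.items():
        if isinstance(value, bool):
            rendered = str(value).lower()
        else:
            rendered = str(value)
        properties[key.replace("-", ".")] = rendered
    return properties


# Values copied from the slide; yarn-client is one of the three master options.
example = to_spark_properties({
    "spark-master": "yarn-client",
    "spark-mesos-coarse": False,
    "spark-cores-max": 100,
    "spark-executor-instances": 10,
})
```

The resulting dictionary could then be applied to a `SparkConf` one pair at a time with `SparkConf.set(key, value)`.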

Page 15: Jaws - Data Warehouse with Spark SQL by Ema Orhian

Demo