Page 1: Impala and BigQuery

Impala and BigQuery

By David Gruzman

BigDataCraft.com

Page 2: Impala and BigQuery

Impala and BigQuery

by David Gruzman

► BigQuery is Google's database service based on Dremel. BigQuery is hosted by Google.

► Impala is an open source database inspired by the Dremel paper. Impala is part of the Cloudera Hadoop distribution.

Page 3: Impala and BigQuery

Today's agenda

► Overview of Dremel as a technology

► Overview of BigQuery

► A few words about Impala

► DG Mediamind use case

► Deeper insights into Impala

► Conclusions

► Q&A

Page 4: Impala and BigQuery

Why Dremel?

► Google was the first to have MapReduce.

► Google was the first to face MapReduce's main problem: latency. The problem also propagated to the engines built on top of MapReduce.

► It is logical that Google was the first to address it by developing a real-time query capability for big data.

Page 5: Impala and BigQuery

How Dremel is used at Google

► Dremel is not a replacement for MapReduce or Tenzing; it complements them. (Tenzing is Google's Hive.)

► An analyst can run many fast queries using Dremel.

► After getting a good idea of what is needed, run slow MapReduce (or SQL based on MapReduce) to get precise results.

Page 6: Impala and BigQuery

Why Dremel is unique

► Dremel, with BigQuery built on top of it, is probably the only interactive big data query engine today.

► That is, it is the only engine capable of producing results over terabytes of data in seconds!

► The main idea (my guess) is to harness a huge cluster of machines for a single query.

Page 7: Impala and BigQuery

Dremel as a technology

► Novel hierarchical columnar format.

► LLVM-based code generation.

► Distributed aggregation tree.

► In-situ data processing (inside the storage).

Page 8: Impala and BigQuery

Dremel : Aggregation tree
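
The idea behind an aggregation tree can be sketched in a few lines of Python. This is a minimal illustration, not Dremel's implementation: the function names (`leaf_aggregate`, `merge`, `run_tree`), the `fanout` parameter and the sample shards are all assumptions made for the example. Each leaf computes a partial `COUNT(*) ... GROUP BY` over its own shard, and every level above only merges the small partial results of its children.

```python
from collections import Counter


def leaf_aggregate(keys):
    """Leaf server: partial COUNT(*) GROUP BY key over its local shard."""
    partial = Counter()
    for key in keys:
        partial[key] += 1
    return partial


def merge(partials):
    """Intermediate or root server: combine the partial counts of its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return total


def run_tree(shards, fanout=2):
    """Merge leaf results level by level until a single root result remains."""
    level = [leaf_aggregate(shard) for shard in shards]
    if not level:
        return Counter()
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


# Hypothetical data: four shards, each holding the group-by key of its rows.
shards = [["us", "de"], ["us"], ["fr", "us"], ["de"]]
result = run_tree(shards)
print(dict(result))
```

The point of the tree shape is that no single node ever has to receive all leaf results at once; each node sees at most `fanout` partial results.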

Page 9: Impala and BigQuery

Dremel : Nested columnar format

Page 10: Impala and BigQuery

BigQuery

► Service built by Google on top of the Dremel engine

► The only query engine as a service (known to me) that works with big data.

► Query time does not depend on data size

Page 11: Impala and BigQuery

BigQuery main capabilities

► Aggregations

► Join of a big table to a small table.

► Join of two big tables (recently added)

► Hierarchical data format. It makes pre-aggregations cheaper.

Page 12: Impala and BigQuery

Main limitations

► Small result size

► Intermediate results must not exceed memory size.

► No “external tables”

Page 13: Impala and BigQuery

Why BigQuery is not popular

Page 14: Impala and BigQuery

So, why is BigQuery not popular?

► Data is not created in the Google cloud. It is hard and impractical to move big data. It is heavy, after all.

► Google tends to change its APIs. BigQuery has also changed over the last few years. It is hard to build a business on that.

► Many companies in Internet-related businesses are wary of sharing data with Google.

► It is expensive. At $35 per TB, bills can reach thousands of dollars per day.

Page 15: Impala and BigQuery

Dremel

Page 16: Impala and BigQuery

At the same time, it is technically good

► I got references from a company doing serious testing

► Martin Fowler's company also tested it and gave very good feedback.

Page 17: Impala and BigQuery

Question to all of you

Why did your organization decide not to use Google's BigQuery?

Page 18: Impala and BigQuery

Where we can find Impala

Page 19: Impala and BigQuery

Impala

Page 20: Impala and BigQuery

What is Impala

► Massively parallel processing (MPP) database engine, developed by Cloudera.

► Integrated into the Hadoop stack at the same level as MapReduce, and not above it (as Hive and Pig are)

[Diagram: HDFS at the bottom; MapReduce and Impala side by side on top of HDFS; Hive and Pig on top of MapReduce]

Page 21: Impala and BigQuery

Why Impala

► Data has gravity

► Today a lot of data lives in HDFS

► It is not practical to move big data

► It is practical to bring the engine to the data

► At the same time, MapReduce is not a must

► Impala processes data in the Hadoop cluster without using MapReduce

Page 22: Impala and BigQuery

MapReduce bypass

► Several other modern database engines have also realized the opportunity to bypass MapReduce and work directly with HDFS.

► They take various approaches.

Page 23: Impala and BigQuery

MapReduce Bypass

► Existing MPP databases, like Greenplum, store their external tables in HDFS

Page 24: Impala and BigQuery

MapReduce bypass

► JethroData stores data in its own format on HDFS and also works with it without the MR layer.

► Its proprietary format enables full indexing of the data together with columnar efficiency. For highly selective queries this approach has serious advantages.

Page 25: Impala and BigQuery

Use case from DG

I think this will be a typical case in the future.

► DG is using Hadoop and Hive

► They are evaluating Impala to do part of the work more efficiently.

► After their case presentation we will come back to discuss insights into Impala

Page 26: Impala and BigQuery

Again: Impala has a different place than Pig and Hive

[Diagram: HDFS at the bottom; MapReduce and Impala side by side on top of HDFS; Hive and Pig on top of MapReduce]

Page 27: Impala and BigQuery

Impala architecture

Page 28: Impala and BigQuery

Impala – Dremel traces

► LLVM code generation

► It is really fast

► C++ as the implementation language (not Java...)

► Simple query engine. It actually does the things which can be done in memory.

► Broadcast join algorithm is implemented
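
A broadcast join, as named in the last bullet, can be sketched as follows. This is a single-process illustration under stated assumptions, not Impala's C++ implementation: the function name `broadcast_join`, the list-of-dicts row format and the sample tables are hypothetical. The small table is turned into a hash table that every node would receive in full, while each node probes it with only its local partition of the big table.

```python
from collections import defaultdict


def broadcast_join(big_partitions, small_table, key):
    """Inner join: hash the small table once, probe it from every big-table partition."""
    # Build side: this hash table is what would be broadcast to every node.
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)

    joined = []
    for partition in big_partitions:  # each partition stands in for one node's local data
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined


# Hypothetical tables: a small dimension table and a big table split into two partitions.
countries = [
    {"code": "us", "name": "United States"},
    {"code": "de", "name": "Germany"},
]
partitions = [
    [{"code": "us", "clicks": 3}],
    [{"code": "de", "clicks": 5}, {"code": "fr", "clicks": 1}],
]
joined = broadcast_join(partitions, countries, "code")
print(joined)
```

The big table never moves between nodes, which is why this algorithm fits the "big table to small table" case and requires the small table to fit in memory.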

Page 29: Impala and BigQuery

LLVM code generation

► Assume you want to write custom code for a specific query. It will be super efficient.

► Code generation automates this process for each query.

► We actually need to super-optimize the inner loop that does the filtering (WHERE) and the GROUP BY.

► LLVM enables us to compile into native code in a fraction of a second.

► LLVM enables us to use new CPU capabilities like SSE in a portable way.
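
The principle of per-query code generation can be shown without LLVM. The sketch below is an analogy in plain Python, not what Impala does internally: it generates the source of a function specialized for one query (the filter and the column names are baked into the inner loop) and compiles it at run time with the built-in `compile` and `exec`. The helper `compile_query`, its mini-syntax for the filter expression and the sample rows are assumptions made for the example.

```python
def compile_query(filter_expr, group_col, sum_col):
    """Generate and compile a function specialized for one query.

    filter_expr is a Python expression over the variable `row`
    (a hypothetical mini-syntax for this sketch; it must come from trusted input).
    """
    source = f"""
def run(rows):
    totals = {{}}
    for row in rows:
        if {filter_expr}:
            key = row[{group_col!r}]
            totals[key] = totals.get(key, 0) + row[{sum_col!r}]
    return totals
"""
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["run"]


# Hypothetical data and query:
# SELECT country, SUM(clicks) ... WHERE year = 2013 GROUP BY country
rows = [
    {"country": "us", "clicks": 3, "year": 2013},
    {"country": "de", "clicks": 5, "year": 2013},
    {"country": "us", "clicks": 4, "year": 2013},
    {"country": "us", "clicks": 9, "year": 2012},
]
query = compile_query("row['year'] == 2013", "country", "clicks")
totals = query(rows)
print(totals)
```

A generic interpreter would re-examine the query plan for every row; the generated function has nothing left to interpret. An LLVM-based engine applies the same idea but emits native machine code instead of Python source.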

Page 30: Impala and BigQuery

Why is code generation interesting?

► If you develop your own engine, or some piece of code responsible for processing serious data volumes, code generation may give you an order-of-magnitude boost.

► I have had cases where the use of such technology was game changing

Page 31: Impala and BigQuery

Impala – Hive Traces

► While Dremel converts data into its own format, Impala supports multiple formats. It is a kind of schema on read.

► Impala shares the metastore with Hive, which makes adoption very simple

► Internally, Impala has a well-defined way to add new formats

Page 32: Impala and BigQuery

Impala – unique things

► Impala “format adapters”, called scanners, have predicate pushdown capability.

► Probably the only open source MPP engine

► Today we have no other means to run hundreds of CPU cores in one query efficiently without an expensive license.

► Hive gives us the same, but not efficiently.
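
The scanner-level predicate pushdown mentioned on this slide can be sketched as follows. This is a toy illustration, not Impala's scanner interface (which is written in C++): the function `scan_csv`, its parameters and the sample data are hypothetical. The point is that the filter is evaluated inside the format reader, so rejected rows are never materialized for the rest of the engine.

```python
import csv
import io


def scan_csv(text, columns, predicate=None):
    """Hypothetical scanner: parse CSV text, apply the pushed-down predicate,
    and yield only the requested columns of the rows that pass it."""
    reader = csv.DictReader(io.StringIO(text))
    for record in reader:
        if predicate is not None and not predicate(record):
            continue  # rejected inside the scanner, never handed to the engine
        yield {column: record[column] for column in columns}


# Hypothetical data and query: SELECT clicks ... WHERE country = 'us'
data = "country,clicks\nus,3\nde,5\nus,4\n"
rows = list(scan_csv(data, ["clicks"], lambda r: r["country"] == "us"))
print(rows)
```

With a columnar or indexed format a scanner can go further and skip whole blocks whose contents cannot match the predicate; the CSV case only saves the work after parsing.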

Page 33: Impala and BigQuery

Impala vs MPP

► It usually takes many years to create an MPP database.

► There are serious simplifications:

  ► The data is read only

  ► There is actually no DBMS, only a query engine.

  ► No serious resource management, only measurement (all over the code).

Page 34: Impala and BigQuery

Impala – Hive killer?

► Not so quickly.

► Hive does things Impala cannot do yet, like joins between several big tables.

► Hive has convenient Java UDFs, while Impala does not

► Impala does not have intra-query fault tolerance.

► At the same time, MapReduce is not a good framework for a database engine

Page 35: Impala and BigQuery

Impala – Data Formats

► There are scanners for the following types:

  ► RCFile

  ► Parquet (columnar format inspired by Dremel)

  ► CSV

  ► Avro

  ► SequenceFile

Page 36: Impala and BigQuery

Impala – future

► Will get closer to other MPP engines

► Support for more formats

► More advanced scheduling and resource management

Page 37: Impala and BigQuery

Basic benchmark

► TPC-H, Q1, SF=10

► 4 EC2 large instances

► 4 seconds, while Hive takes about 1 minute.

► This number means a GROUP BY speed of about 235 MB/sec per core.

Page 38: Impala and BigQuery

Impala price per GB

► 1 large instance costs $0.24 per hour

► The cluster costs $0.96 per hour.

► Cost of 1 second: $0.96 / 3600

► Such a cluster processes 1.75 GB per second

► So the cost of processing 1 TB is about $0.15

► That is roughly 230 times cheaper than BigQuery ($35 per TB)
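
The arithmetic on this slide can be reproduced step by step. The prices and the throughput are the figures quoted in the deck; the only assumption added here is a decimal terabyte (1 TB = 1000 GB), and the variable names are chosen for the example.

```python
# Figures quoted in the slides.
INSTANCE_PRICE_PER_HOUR = 0.24   # USD, one EC2 large instance
INSTANCES = 4                    # cluster size used in the benchmark
THROUGHPUT_GB_PER_SEC = 1.75     # measured cluster throughput
BIGQUERY_PRICE_PER_TB = 35.0     # USD, BigQuery price quoted earlier in the deck

GB_PER_TB = 1000                 # assumption: decimal terabyte

cluster_price_per_sec = INSTANCE_PRICE_PER_HOUR * INSTANCES / 3600
seconds_per_tb = GB_PER_TB / THROUGHPUT_GB_PER_SEC
impala_cost_per_tb = cluster_price_per_sec * seconds_per_tb
ratio = BIGQUERY_PRICE_PER_TB / impala_cost_per_tb

print(f"cluster price per second: ${cluster_price_per_sec:.6f}")
print(f"seconds to process 1 TB:  {seconds_per_tb:.0f}")
print(f"cost per TB on Impala:    ${impala_cost_per_tb:.3f}")
print(f"BigQuery / Impala ratio:  {ratio:.0f}x")
```

Under these inputs the cost per TB comes out at about $0.15 and the ratio at roughly 230, a difference of more than two orders of magnitude. The comparison covers compute time only and assumes the cluster is busy for every second it is paid for.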

Page 39: Impala and BigQuery

Performance - summary

► It is fast when data reduction is big

► It is fast when data is hot.

► It should benefit from fast storage / SSD. My measurements show GROUP BY processing at about 200 MB/sec per core

► Always at least 10 times faster than Hive

Page 40: Impala and BigQuery

What about clouds?

Page 41: Impala and BigQuery

Impala in cloud is not elastic

► To be elastic, we need to create the cluster when we need it.

► Even if we accept hourly resolution, storage will be a problem

► S3 will not give us hundreds of MB per second per instance

► Data stored in the local file system is transient

Page 42: Impala and BigQuery

Impala - conclusions

► It is the first time I remember that we can get our hands on a free MPP database.

► There is no risk in trying it side by side with Hive

► It is possible to offload part of the work to Impala and do the rest with Hive

► It is part of the Cloudera Hadoop distribution and is easily installed by Cloudera Manager

Page 43: Impala and BigQuery

Materials used

► Benchmarks

http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-20121208-15536323

https://amplab.cs.berkeley.edu/benchmark/

► Architecture

http://www.slideshare.net/scottleber/impala-19176906

https://cloud.google.com/files/BigQueryTechnicalWP.pdf

► POC

http://martinfowler.com/articles/bigQueryPOC.html

Page 44: Impala and BigQuery

Materials used - comparisons

► To Hive: http://www.quora.com/Cloudera/Does-Cloudera-Impala-have-any-drawbacks-when-compared-with-Hive

► To Vertica: http://www.quora.com/Cloudera-Impala/How-does-Cloudera-Impala-compare-to-Vertica

► To Dremel: http://www.quora.com/Cloudera-Impala/How-does-Clouderas-Impala-compare-to-Googles-Dremel

Page 45: Impala and BigQuery

Thank you!!!

► Special thanks to Faina Kamenetsky, who helped set up the clusters on Amazon.

Page 46: Impala and BigQuery

BigDataCraft.com

► We are a boutique consulting company

► Our services are:

  ► On-paper POC

  ► On-hardware POC

  ► Architecture / design reviews

  ► Custom integrations and bug fixing

Page 47: Impala and BigQuery

Impala - Flow

