Real-time Big Data Analytics Engine using Impala

Real-time Big Data Analytics Engine using Impala Jason Shih Etu 28 Sept, HIT 2013


DESCRIPTION

Cloudera Impala is an open-source engine, released under the Apache License, that enables real-time, interactive analytical SQL queries on data stored in HBase or HDFS. The work was inspired by Google's Dremel paper, which is also the basis for Google BigQuery. Impala provides access to the same unified storage platform through its own distributed query engine and does not use MapReduce. It also uses the same metadata, SQL syntax (HiveQL-like), ODBC driver, and user interface (Hue Beeswax) as Hive. Beyond the traditional Hadoop approach, which aims to provide a low-cost solution for resilient, batch-oriented distributed data processing, more and more effort in the Big Data world is going into finding the right solution for ad hoc, fast queries and real-time processing of large datasets. In this presentation, we explore how to run interactive queries in Impala, the advantages of the approach, its architecture, and how it optimizes data systems, including a practical performance analysis.

TRANSCRIPT

Page 1: Real-time Big Data Analytics Engine using Impala

Real-time Big Data Analytics Engine using Impala

Jason Shih Etu 28 Sept, HIT 2013

Page 2: Real-time Big Data Analytics Engine using Impala

Outline

•  Motivation & Users’ Perspective
•  Impala Architecture and Data Analytics Stack Overview
•  Performance Benchmark
•  Use Cases (Demo)

HIT 2013 2

Page 3: Real-time Big Data Analytics Engine using Impala

Motivation & Users’ Perspective

•  Leverage the existing Hadoop deployment
    •  Reuse Hive metadata, metastore, DDL & JDBC/ODBC drivers
    •  File formats widely supported in Hadoop
    •  Read performance: disk awareness and short-circuit reads
•  MPP SQL query engine (over Hadoop)
    •  Billions to trillions of records at interactive speeds
    •  Both analytical & transactional
    •  General purpose & ad hoc
•  MapReduce (MR)
    •  High latency; ruled out for interactive workloads
    •  Disk-based intermediate outputs
    •  Execution strategies lack optimization based on data statistics
    •  Task and scheduling overhead
        –  Task launch delay of 5~10 sec (a predefined delay caused by the periodic heartbeat for newly scheduled tasks)

Page 4: Real-time Big Data Analytics Engine using Impala

Motivation & Users’ Perspective (cont’d)

•  High performance
    •  In-memory query engine
    •  C++ instead of Java
    •  Runtime code generation
    •  Completely new execution engine (cf. the MR framework)
    •  Data locality and short-circuit reads
        –  HDFS-2246: avoid HDFS API overhead
        –  HDFS-347: Making Short-Circuit Local Reads Secure
    •  Intermediate data never hits disk
    •  Data is streamed to the client

Page 5: Real-time Big Data Analytics Engine using Impala

Motivation & Users’ Perspective (cont’d)

•  MPP-RDB paradigm
•  HDFS:
    •  Scalability & availability
    •  Price/performance & commodity hardware
•  MPP DW appliances:
    •  Exadata, Vertica, HANA, Aster (SQL-MapReduce), HAWQ (Pivotal HD), Dremel, etc.
    •  Pros:
        –  Very mature & highly optimized engines
    •  Cons:
        –  Generally not fault tolerant
            For long-running queries as the cluster scales up
            Lack of rich analytics (machine learning)

Page 6: Real-time Big Data Analytics Engine using Impala

•  Impala
    •  Real-time queries in Apache Hadoop, sitting atop HDFS
    •  ~2010-2012, 7 FTE (Marcel Kornacker)
    •  Completely open source, ASLv2
    •  GA: connectors for BI and DW generally available

Google F1 - The Fault-Tolerant Distributed RDBMS, May 2012

Ref: http://www.wired.com/wiredenterprise/2012/10/cloudera-impala-hadoop/

Page 7: Real-time Big Data Analytics Engine using Impala

Impala Overview: SQL Support

•  Functionality highlights:
    •  SQL-92 features minus correlated subqueries
    •  SELECT, INSERT INTO ... SELECT ..., INSERT INTO … VALUES(…)
    •  ORDER BY requires LIMIT
    •  Flexible file format: RCFile
•  Unsupported/limitations:
    •  The WITH clause does not support recursive queries
    •  Only hash joins
        –  Joined tables have to fit in the aggregate memory of all executing nodes
    •  No "beyond SQL" features
        –  Buckets, samples, transforms, arrays, structs, maps, XPath and JSON
    •  UDF support
        –  Impala 1.2: supports Hive UDFs (existing JARs without recompiling)
        –  Impala native UDFs/UDAs, with UDFs/UDAs registered in the metadata catalog
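The "only hash joins" limitation explains why joined tables must fit in memory: one side of the join is loaded into an in-memory hash table before the other side is scanned. The following is a minimal sketch of a generic build/probe hash join in Python. It is not Impala's C++ implementation, and the function name, column names, and sample rows are invented for illustration.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Generic in-memory equi-join: build a hash table on one input, probe with the other."""
    # Build phase: the whole build side is held in memory, which is why
    # a hash-join-only engine needs the joined table to fit in RAM.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: stream the other input and emit matching pairs.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical sample data, for illustration only.
domains = [{"domain_id": 1, "domain": "example.com"},
           {"domain_id": 2, "domain": "example.org"}]
queries = [{"domain_id": 1, "hits": 10},
           {"domain_id": 1, "hits": 5},
           {"domain_id": 3, "hits": 7}]

result = hash_join(domains, queries, "domain_id", "domain_id")
for row in result:
    print(row)
```

Only the probe side is streamed; memory use grows with the size of the build side, so the smaller input is normally chosen for it.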

Page 8: Real-time Big Data Analytics Engine using Impala

Impala SQL: create table


Ref: SQL Language Element: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_sql.html

Page 9: Real-time Big Data Analytics Engine using Impala

Architecture Overview

•  Two daemons:
    •  impalad:
        –  Runs on all HDFS DataNodes (DNs)
        –  Functions as the distributed query engine
        –  Handles client and internal requests (query execution)
        –  Designs the execution plan for queries and processes queries on the DNs
        –  Thrift services for these two roles
    •  statestored:
        –  Cluster metadata, name service & metadata distribution (cf. Hive metastore: RDB metadata)
        –  Metadata is updated when impalad processes are added or removed
        –  Daemons cache metadata (INVALIDATE METADATA or REFRESH)
        –  Exports a Thrift service
        –  Sends periodic heartbeats, checks for live backends and pushes new data
        –  Failure of the statestore won't affect query execution, except for stale DN state

Page 10: Real-time Big Data Analytics Engine using Impala

Architecture Overview: Impala daemons

•  impalad:
    •  Impala 1.1 integrates Sentry as a fine-grained authorization framework
    •  Daemon startup arguments (default):
        –  impalad -log_dir=/opt/impala/var/log/impala -state_store_port=24000 -state_store_host=impala-server -be_port=22000
    •  Enabling security:
        –  Relies on the existing Kerberos subsystem for authentication
        –  -use_statestore -kerberos_reinit_interval=60 -principal=impala/impalad-[email protected] -keytab_file=impala.keytab
    •  Authorization:
        –  -authorization_policy_file argument, fed with an .ini-format file
        –  Divided into [groups] & [roles] (optional: [databases] & [users])
        –  [users] overrides the OS-level mapping of users to groups
        –  E.g.: [policy file example not captured in the transcript]
•  statestored:
    •  Daemon startup:
        –  statestored -log_dir=/opt/impala/var/log/impala -state_store_port=24000
    •  Enabling Kerberos:
        –  -kerberos_reinit_interval=60 -principal=impala/[email protected] -keytab_file=impala.keytab
    •  Available flags:
        –  http://statestored-server:25010/varz

Page 11: Real-time Big Data Analytics Engine using Impala

Architecture Overview (cont’d)

•  Query execution phases
    •  Planner, coordinator, executor
    •  Queries arrive via JDBC/ODBC, Thrift API/CLI, or Hue/Beeswax
    •  Planner turns the request into collections of plan fragments
    •  Coordinator initiates execution on the impalad(s) local to the data

Page 12: Real-time Big Data Analytics Engine using Impala

Architecture Overview: Query Execution

•  Plan fragments are created upon a request from a JDBC/ODBC or Thrift client
•  The coordinator initiates execution on the impalad instances
•  Intermediate results are streamed between impalad instances
•  Results are streamed back to the client
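The coordinator/executor flow can be pictured as a scatter/gather over the nodes that hold the data. The sketch below is a toy Python model of that idea for a `SUM ... GROUP BY` style query. It is not Impala's planner or wire protocol, and the function names and sample rows are assumptions made for illustration.

```python
from collections import Counter

def execute_fragment(local_rows):
    """Executor side: pre-aggregate the (key, value) rows stored on one node."""
    partial = Counter()
    for key, value in local_rows:
        partial[key] += value
    return partial

def coordinate(node_data):
    """Coordinator side: run one fragment per node, then merge the partial results."""
    final = Counter()
    for local_rows in node_data:  # sequential here; parallel in a real cluster
        final.update(execute_fragment(local_rows))
    return dict(final)

# Hypothetical data layout: one list of rows per node.
node_data = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("c", 5)],
]
totals = coordinate(node_data)
print(totals)
```

Each node ships only its partial aggregate to the coordinator, which is the point of running fragments local to the data.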

Page 13: Real-time Big Data Analytics Engine using Impala

Architecture Overview: Query Plan

•  Plan nodes & operators:
    •  Depth-first execution tree
    •  Scan, HashJoin, HashAggr, Union, TopN, Exchange
•  Two-phase process:
    •  Single-node plan (left-deep tree)
    •  Plan fragments: partitioning of the operator tree
        –  Fragment: distributed atomic executable unit (plan nodes)
•  Distributed plans:
    •  Query operators are fully distributed
    •  Max. scan locality & min. data movement
•  Parallel joins:
    •  Order: FROM clause
    •  Broadcast join & partitioned join
    •  Future roadmap: cost-based optimization based on column stats & the cost of data transfers

Page 14: Real-time Big Data Analytics Engine using Impala

Architecture Overview: Query Plan (cont’d)

Page 15: Real-time Big Data Analytics Engine using Impala

Logging and Profile

•  Impala logs:
    •  Logging level controlled by:
        –  GLOG_v environment variable (“GLOG”):
            Level 1 (default): connection logging and execution profile
            Level 2: logs each RPC initiated and execution progress info
            Level 3: everything above, plus logging of every row read
        –  -logbuflevel daemon startup flag
    •  Examine:
        –  $IMPALA_HOME/var/log/impala/{impalad,statestore}.{INFO,WARNING,ERROR}
        –  Consolidated: impala-server.log & impala-state-store.log
        –  http://impalad-server:25000/logs
    •  Content:
        –  Startup options: CPU, available spindles, flags, version and machine info
        –  Query profile: composition, degree of data locality, throughput statistics and response time
        –  Auditing log featured in release 1.1.1
•  Extensive analytics data for query execution:
    •  Query profiles are stored in zlib-compressed format:
        –  $IMPALA_HOME/var/log/impala/profiles
        –  http://impalad-server:25000/queries

Page 16: Real-time Big Data Analytics Engine using Impala

Performance Tip

•  Partitioning
    •  For large tables that are always or almost always queried with conditions on the partitioning columns
•  JOIN
    •  Broadcast join by default
    •  Partitioned join
        –  Suitable for large tables of roughly equal size
        –  Subsets of rows can be processed in parallel by sending a portion of each table to other nodes
    •  Join the biggest table first
    •  Join the table with the most selective filter
•  INSERT
    •  Not suitable for loading large quantities of data into HDFS-based tables, due to the lack of parallelized operations
    •  Stage temporary files in an ETL pipeline and upload them to HDFS (then REFRESH)
•  Resource usage:
    •  impalad startup flag: “-mem_limit”
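To make the difference between the two join strategies concrete, here is a small Python sketch, written under the assumption of a simple hash-partitioning scheme. It is an illustration of the general technique rather than Impala's implementation, and all function names and sample rows are hypothetical.

```python
def partition_by_key(rows, key, num_nodes):
    """Assign each row to a node by hashing its join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts

def local_join(left, right, key):
    """Join the two shares that landed on one node (simple hash join)."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in index.get(r[key], [])]

def partitioned_join(left, right, key, num_nodes=3):
    """Both tables are split by join key; each node joins only its own share."""
    left_parts = partition_by_key(left, key, num_nodes)
    right_parts = partition_by_key(right, key, num_nodes)
    out = []
    # Each iteration stands in for work a separate node would do in parallel.
    for l_part, r_part in zip(left_parts, right_parts):
        out.extend(local_join(l_part, r_part, key))
    return out

def broadcast_join(small, large_parts, key):
    """Every node gets a full copy of the small table and joins it with its local share."""
    out = []
    for part in large_parts:
        out.extend(local_join(small, part, key))
    return out

# Hypothetical sample tables.
left = [{"k": i, "l": i * 10} for i in range(6)]
right = [{"k": i % 4, "r": i} for i in range(8)]

by_partition = partitioned_join(left, right, "k")
by_broadcast = broadcast_join(left, [right[:4], right[4:]], "k")
```

Both strategies return the same rows. They differ in what crosses the network: a broadcast join copies the whole small table to every node, while a partitioned join sends each row of both tables to exactly one node.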

Page 17: Real-time Big Data Analytics Engine using Impala

Troubleshooting Hint

•  Queries are slow?
    •  Test: “select count(*) from table”
    •  A non-zero “Total remote scan volume” in the impalad log indicates that either some DNs are not running impalad, or an impalad instance failed to contact one or more other impalad instances
    •  Missing impalad instances on DNs
        –  Live backends: http://statestore:25010/metrics
    •  Data locality and native checksumming (>= CDH 4.2)
        –  Enable properties: “dfs.client.read.shortcircuit” & “dfs.client.read.shortcircuit.skip.checksum”
        –  Rebuild/reinstall the Hadoop native library “libhadoop.so” if needed
        –  Errors:
            Unknown disk id. This will negatively affect performance. Check your hdfs settings to enable block location metadata
            Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Page 18: Real-time Big Data Analytics Engine using Impala

Troubleshooting Hint (cont’d)

•  Queries getting slower?
    •  impalad paging after the memory limit is exceeded
    •  E.g.: mem-limit.h:86] Query: 0:0Exceeded limit: limit=26996031488 consumption=26996148624
•  Incorrect results?
    •  Invalid metadata (GA: REFRESH, post-GA: INVALIDATE METADATA)
•  Invalid query?
    •  Cross-check the query in Hive
    •  Useful debugging info is available in the Impala service logs
    •  Invalid/unsupported statements:
        –  http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref.html#langref
•  Auth errors:
    •  Server logging:
        –  Minor code may provide more information (Cannot contact any KDC for realm), or
        –  Kerberos: GSSAPI Error: Unspecified GSS failure
    •  Client: “Error connecting: <class 'thrift.transport.TTransport.TTransportException'>, TSocket read 0 bytes”
    •  Ensure:
        –  A valid Kerberos ticket lifetime at the client
        –  The “-s” service principal and the “-k” flag are specified for a kerberized impalad connection
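When chasing the "queries getting slower" case, it can help to pull the numbers out of the memory-limit message quoted on this slide. The sketch below is a small Python helper written against that single sample line. The regular expression is an assumption based on this one message, and other Impala versions may log a different format.

```python
import re

# Log line quoted on the slide above.
LOG_LINE = ("mem-limit.h:86] Query: 0:0Exceeded limit: "
            "limit=26996031488 consumption=26996148624")

# Pattern derived from the sample line only (an assumption, not a documented format).
PATTERN = re.compile(r"Exceeded limit: limit=(\d+) consumption=(\d+)")

def parse_mem_limit(line):
    """Return (limit_bytes, consumption_bytes, overshoot_bytes), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    limit, consumption = int(m.group(1)), int(m.group(2))
    return limit, consumption, consumption - limit

limit, consumption, overshoot = parse_mem_limit(LOG_LINE)
print(f"limit = {limit / 2**30:.2f} GiB, overshoot = {overshoot} bytes")
```

A small overshoot against a limit of this size suggests the query ran right up to the configured cap rather than far past it.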

Page 19: Real-time Big Data Analytics Engine using Impala

Limitation and Wish List

•  Limitations:
    •  Subqueries referenced in the SELECT
    •  Optional WITH clause before the INSERT
        –  Recursive queries in the WITH clause
    •  Inconsistent VIEW
        –  Parentheses in WHERE clauses
•  Wish list:
    •  SQL modeling tool
    •  Fault-tolerant queries
    •  Memory management (caching Parquet tables) & usage estimation
        –  Aggregation on groups of columns (> 30 etc.)

Page 20: Real-time Big Data Analytics Engine using Impala

Impala: Now & Future Roadmap

•  Now (1.1.x/1.0)
    •  OS support:
        –  RHEL/CentOS 5.7, Ubuntu, Debian, SLES, and Oracle Linux
    •  Connectors: JDBC/ODBC drivers
    •  DDL support & SQL performance optimization
    •  Fast & memory-efficient joins & aggregations
    •  File formats: Parquet, Avro & LZO-compressed
•  Future (1.2) – late 2013
    •  UDFs and extensibility
    •  Automatic metadata refresh
    •  In-memory HDFS caching
    •  Cost-based join order optimization
    •  Preview of YARN-integrated resource manager
•  2.0 roadmap – first third of 2014
    •  SQL 2003-compliant analytic window functions
    •  Additional authentication mechanisms
    •  UDTFs (user-defined table functions)
    •  Intra-node parallelized aggregations and joins
    •  Nested data
    •  YARN-integrated resource manager
    •  Additional data types, including Date and Decimal

Page 21: Real-time Big Data Analytics Engine using Impala

More Information & Related Works

•  “Dremel: Interactive Analysis of Web-Scale Datasets”, Sergey Melnik et al., Google
•  “Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real”, http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
•  “Impala unlocks Interactive BI on Hadoop with MicroStrategy”, Justin Erickson & Jochen Demuth, Cloudera
•  “Cloudera Impala Performance Evaluation”, Yukinori SUDA
•  “HANA vs Impala, on AWS Cloud”, Aron MacDonald
•  “Spark and Shark: High-speed In-memory Analytics over Hadoop Data”, Reynold Xin, AMPLab
•  Stinger Initiative: http://hortonworks.com/blog/100x-faster-hive/
•  Apache Drill: http://incubator.apache.org/drill/

Page 22: Real-time Big Data Analytics Engine using Impala

Performance Evaluation

[Bar chart: speed in km/h (axis 0 to 100) for Shark, Impala, PIG and Elephant]

Ref: Wiki & http://www.speedofanimals.com

Page 23: Real-time Big Data Analytics Engine using Impala

Breakdown of DNS Anomaly Analytics

Two DN + Master
-  Dual DC E5620 2.40GHz
-  MEM 32GB ea.
-  4 spindles, 2T ea.

[Chart: HDFS (GB) and Query Resp. (sec)]

Page 24: Real-time Big Data Analytics Engine using Impala

Data Volume and Ingest

                     1D      1W      1M       2M
Data (Raw) (GB)      5.1     35      140      280
Data (HDFS) (GB)     3.8     25.9    103.6    207.2
Blocks (HDFS)        31      211     844      1598
MEvt                 42      291     1,166    2,209
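Two derived figures make the table easier to read: how much smaller the data is on HDFS than in raw form, and how much data each HDFS block holds on average. The Python sketch below computes both from the values in the table. It assumes 1 GB = 1024 MB, and the function and field names are chosen here for illustration.

```python
# Figures copied from the table above (columns: 1D, 1W, 1M, 2M).
periods = ["1D", "1W", "1M", "2M"]
raw_gb = [5.1, 35, 140, 280]
hdfs_gb = [3.8, 25.9, 103.6, 207.2]
blocks = [31, 211, 844, 1598]

def derived_metrics():
    """Per period: HDFS-to-raw size ratio and average MB stored per HDFS block."""
    rows = []
    for period, raw, hdfs, blk in zip(periods, raw_gb, hdfs_gb, blocks):
        rows.append({
            "period": period,
            "hdfs_to_raw": hdfs / raw,
            "mb_per_block": hdfs * 1024 / blk,  # assumes 1 GB = 1024 MB
        })
    return rows

for row in derived_metrics():
    print(f'{row["period"]}: ratio={row["hdfs_to_raw"]:.2f}, '
          f'MB/block={row["mb_per_block"]:.1f}')
```

The ratio stays close to constant across the four periods, which is consistent with the slide's later remark that processing time scales linearly with data volume.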

Page 25: Real-time Big Data Analytics Engine using Impala

PIG vs. Impala

•  Domain-level computation in preprocessing streaming
•  DN sort throughput: ~120MB/s & SIP/Qry: ~50MB/s
•  Processing time scales linearly with data volume

[Chart: Query Resp. (sec); Impala: 71s, 7 times faster]

Page 26: Real-time Big Data Analytics Engine using Impala

Observation & Estimation

•  Speed-up: 4.5~7 times
•  DL Calc.: 57~70% memory usage
•  Data ingest
    –  Est. ~3TB takes ~55K sec
        •  Plus pre-processing time
    –  Throughput constrained by the GbE link (inbound/outbound)
    –  Avg. throughput ~80MB/s
        •  Non-encrypted file transfer
•  RTQ: ~15K sec to process 3TB
    –  cf. 115K sec based on MR
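The estimates on this slide can be cross-checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope calculation in Python that uses only the approximate figures quoted above. It assumes decimal units (1 TB = 1,000,000 MB), and the variable names are invented here.

```python
# Approximate figures quoted on the slide.
DATA_MB = 3_000_000        # ~3 TB, assuming 1 TB = 1,000,000 MB
INGEST_SEC = 55_000        # estimated ingest time
LINK_MB_PER_SEC = 80       # average transfer throughput
IMPALA_SEC = 15_000        # RTQ processing time for 3 TB
MR_SEC = 115_000           # MapReduce-based processing time

# Rate implied by the 55K sec ingest estimate.
effective_ingest_rate = DATA_MB / INGEST_SEC
# Time the transfer alone would take at the quoted average throughput.
pure_transfer_sec = DATA_MB / LINK_MB_PER_SEC
# Ratio of the two processing times.
speed_up = MR_SEC / IMPALA_SEC

print(f"implied ingest rate: {effective_ingest_rate:.1f} MB/s")
print(f"transfer time at 80 MB/s: {pure_transfer_sec:.0f} s")
print(f"MR / Impala processing time: {speed_up:.1f}x")
```

The ratio of the two processing times lands near the upper end of the 4.5~7 times range quoted at the top of the slide.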

Page 27: Real-time Big Data Analytics Engine using Impala

Query Throughput & Latency

•  Queries
    •  20 from TPC-DS
    •  3 categories
        –  Interactive: 1 month
        –  Reports: several months
        –  Deep analytics: all data
•  Fact table:
    •  1TB Snappy-compressed sequence files / 5 years
•  Resource level:
    •  20 nodes, 24 cores/node
•  Speed-up:
    •  Interactive: 25~68
    •  Reports: 6~56
    •  Deep analytics: 6~55

Ref: “Impala: A Modern, Open-Source SQL Engine for Hadoop”, Marcel Kornacker, Cloudera

Page 28: Real-time Big Data Analytics Engine using Impala

Impala vs. Stinger

•  Stinger
    •  Optimized execution plans
    •  Tez framework optimizes execution
    •  Columnar file format

Ref: Cloudera Impala Overview, Scott Leberknight, Cloudera.

Page 29: Real-time Big Data Analytics Engine using Impala

Impala Use Cases

•  Offloads the DW for ad hoc query environments, ETL and archiving
•  Interactive BI/analytics on large volumes of data
•  Real-time response for unstructured data analysis

Page 30: Real-time Big Data Analytics Engine using Impala

Impala and HIVE

•  Impala:
    •  Native MPP query engine for low runtime overhead & interactive SQL
    •  No fault tolerance
    •  GA: UDF supported
•  Hive:
    •  MapReduce as the execution engine
    •  Fault tolerant, leveraging the MR framework
    •  High runtime overhead (extensive layering)
    •  UDF
•  Common for clients:
    •  SQL syntax
        –  Highly compatible with HiveQL
    •  ODBC/JDBC drivers
    •  Metadata (table definitions)
    •  Hue

Page 31: Real-time Big Data Analytics Engine using Impala

Data Warehouse Offload

Ref: Hadoop and the Data Warehouse: When to Use Which, Teradata

Page 32: Real-time Big Data Analytics Engine using Impala

Query Run Times

•  Table with 60M records

Ref: HANA vs Impala, on AWS Cloud

Page 33: Real-time Big Data Analytics Engine using Impala

TPC-H Query Run Times

•  Lineitem table, 60M rows

Ref: HANA vs Impala, on AWS Cloud

Page 34: Real-time Big Data Analytics Engine using Impala

•  On-demand customer segmentation based on various demographic and mobile behavior attributes
•  On-demand customer profiling through fast screening & ranking of critical attributes

With the power of distributed in-memory computation on Hadoop, Impala enables market analysts to conduct various interactive analytics, such as OLAP, statistical correlation, and data mining, on big data.

Page 35: Real-time Big Data Analytics Engine using Impala

Correlated attribute analysis of the "target group"

[Charts comparing two groups across several attributes; most category labels were lost in extraction. Recoverable social-network shares: Facebook 43%, Twitter 31%, Google+ 19%, LinkedIn 7% for one group, and Facebook 44%, Twitter 30%, Google+ 17%, LinkedIn 9% for the other.]

Page 36: Real-time Big Data Analytics Engine using Impala

DEMO

•  CREATE TABLE, LOAD DATA from HDFS

    DROP TABLE IF EXISTS demo;
    CREATE EXTERNAL TABLE demo (
      a string,
      b int,
      c int
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/etu/demo';

•  PIG & Impala:
    •  SUM
    •  SUM with GROUP BY

Page 37: Real-time Big Data Analytics Engine using Impala

DEMO (cont’d)

•  SUM in PIG:

    a = load 'demo/demo_data.csv' using PigStorage(',') as (col1:chararray, col2:int, col3:int);
    -- add a constant column so that all rows fall into a single group
    b = foreach a generate col1, col2, col3, 1 as col4;
    d = group b by col4;
    -- note: this sums the constant col4 (a row count); SUM(b.col3) would match the Impala query
    d1 = foreach d generate SUM(b.col4);
    store d1 into 'demo/count2' using PigStorage(',');

•  SUM in Impala:

    SELECT sum(demo.c) FROM demo;

Page 38: Real-time Big Data Analytics Engine using Impala

DEMO (cont’d)

•  SUM with GROUP BY in PIG:

    a = load 'demo/demo_data.csv' using PigStorage(',') as (col1:chararray, col2:int, col3:int);
    b = foreach a generate col1, col2, col3, 1 as col4;
    c = group b by col1;
    c1 = foreach c generate group, SUM(b.col2);
    store c1 into 'demo/count1' using PigStorage(',');

•  SUM with GROUP BY in Impala:

    SELECT demo.a AS tag, sum(demo.b) AS val FROM demo GROUP BY demo.a;
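For readers without a cluster at hand, the two Impala statements from the demo can be tried on a handful of rows with Python's built-in `sqlite3` module. SQLite stands in for Impala here purely as an assumption for illustration: the sample rows are made up, the table is a plain in-memory table rather than an external HDFS-backed one, and an `ORDER BY` is added only to make the output order deterministic.

```python
import sqlite3

# Hypothetical sample rows in the (a string, b int, c int) layout of the demo table.
rows = [("web", 1, 10), ("dns", 2, 20), ("web", 3, 30), ("dns", 4, 40)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (a TEXT, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", rows)

# SUM: same statement as the Impala side of the demo.
total = conn.execute("SELECT sum(demo.c) FROM demo").fetchone()[0]

# SUM with GROUP BY: same statement, plus ORDER BY for a stable result order.
by_tag = conn.execute(
    "SELECT demo.a AS tag, sum(demo.b) AS val "
    "FROM demo GROUP BY demo.a ORDER BY tag"
).fetchall()
conn.close()

print(total)
print(by_tag)
```

The point of the comparison on the slides is that each of these one-line SQL statements replaces a five-statement Pig script.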

Page 39: Real-time Big Data Analytics Engine using Impala

DEMO (cont’d)

•  Speed-up:

[Chart: Query Resp. (sec); speed-ups of x60 and x18]

Two DNs, same spec as for the DNS log analytics: Dual DC E5620, MEM 32GB ea. ~100 times faster as the cluster scales.

Page 40: Real-time Big Data Analytics Engine using Impala


Question? [email protected]

Slideshare

www.slideshare.net/hlshih/hit2013-impala-0925etu

Acknowledgement Dr. CM Fan, MFactory, SYSTEX

Page 41: Real-time Big Data Analytics Engine using Impala

Contact

www.etusolution.com
[email protected]

Taipei, Taiwan
318, Rueiguang Rd., Taipei 114, Taiwan
T: +886 2 7720 1888  F: +886 2 8798 6069

Beijing, China
Room B-26, Landgent Center, No. 24, East Third Ring Middle Rd., Beijing, China 100022
T: +86 10 8441 7988  F: +86 10 8441 7227