Deep dive into enterprise data lake through Impala

Evans Ye, 2014.7.28
Confidential | Copyright 2013 TrendMicro Inc.

Upload: evans-ye

Post on 19-Aug-2014

Category: Engineering

DESCRIPTION

Impala, a near real-time query solution for Hadoop

TRANSCRIPT

Page 1: Deep dive into enterprise data lake through Impala

Deep dive into enterprise data lake through Impala

Evans Ye
2014.7.28


Page 2: Deep dive into enterprise data lake through Impala

Agenda

• Introduction to Impala
• Impala Architecture
• Query Execution
• Getting Started
• Parquet File Format
• ACL via Sentry

Page 3: Deep dive into enterprise data lake through Impala

Introduction to Impala

Page 4: Deep dive into enterprise data lake through Impala

Impala

• General-purpose SQL query engine
• Supports queries that take from milliseconds to hours (near real-time)
• Supports most of the Hadoop file formats
  – Text, SequenceFile, Avro, RCFile, Parquet
• Suitable for data scientists and business analysts

Page 5: Deep dive into enterprise data lake through Impala

Why do we need it?

Page 6: Deep dive into enterprise data lake through Impala


SPEED

Page 7: Deep dive into enterprise data lake through Impala

Current ad hoc query solution - Pig

• Do an hourly count on the Akamai log
  – A = load '/test_data/2014/07/20/00' using PigStorage();
    B = foreach (group A all) generate COUNT_STAR(A);
    dump B;
  – … 0% complete
    100% complete
    (194202349)
• Elapsed time: 4mins, 28sec

Page 8: Deep dive into enterprise data lake through Impala

Using Impala

• No memory cache
  – > select count(*) from test_data where day=20140720 and hour=0;
  – 194202349
• With OS cache
• Do a further query:
  – > select count(*) from test_data where day=20140720 and hour=00 and c='US';
  – 41118019
• Timings shown on the slide, in order: 96.46s, 9.07s, 6.57s
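
To put the Pig and Impala timings side by side, a small calculation helps. The sketch below assumes that 96.46s belongs to the run without memory cache and 9.07s to the run with OS cache; that mapping is inferred from the order of the numbers on the slide and is not stated in the source. The third timing is left out because it measures a different query.

```python
# Timings copied from the slides. The labels on the Impala runs are an
# assumption based on the order in which the numbers appear.
PIG_SECONDS = 4 * 60 + 28  # "4mins, 28sec"

IMPALA_SECONDS = {
    "no memory cache": 96.46,
    "with OS cache": 9.07,
}


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    for label, seconds in IMPALA_SECONDS.items():
        ratio = speedup(PIG_SECONDS, seconds)
        print(f"Impala ({label}): {ratio:.1f}x faster than Pig")
```

Under that assumption, the same count is roughly 2.8 times faster on a cold run and roughly 29.5 times faster once the OS cache is warm.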

Page 9: Deep dive into enterprise data lake through Impala


Page 10: Deep dive into enterprise data lake through Impala

Status quo

• Developed by Cloudera
• Open source under the Apache License
• 1.0 available in 05/2013
• Current version is 1.4
• Connect via ODBC/JDBC/Hue/impala-shell
• Authenticate via Kerberos or LDAP
• Fine-grained ACL via Apache Sentry

Page 11: Deep dive into enterprise data lake through Impala

Benefits

• High performance
  – C++
  – direct access to data (no MapReduce)
  – in-memory query execution
• Flexibility
  – query across existing data (no duplication)
  – supports multiple Hadoop file formats
• Scalable
  – scale out by adding nodes

Page 12: Deep dive into enterprise data lake through Impala

Impala Architecture

Page 13: Deep dive into enterprise data lake through Impala

Impala Architecture

[Diagram: two master nodes running NN, JT, HM (one active, one standby); four worker nodes, each running Datanode, Tasktracker, Regionserver and impala daemon; plus State store, Catalog and Hive Metastore.]

Page 14: Deep dive into enterprise data lake through Impala

Components

• Impala daemon
  – collocated with datanodes
  – handles all Impala internal requests related to query execution
  – a user can submit a request to the impala daemon running on any node, and that node serves as the coordinator node
• State store daemon
  – communicates with the impala daemons to confirm which nodes are healthy and can accept new work
• Catalog daemon
  – broadcasts metadata changes from Impala SQL statements to all the impala daemons

Page 15: Deep dive into enterprise data lake through Impala

Fault tolerance

• No fault tolerance for impala daemons
  – if a node fails, the query fails
• State store offline
  – query execution still functions normally
  – cannot update metadata (create, alter…)
  – if another impala daemon goes down, the entire cluster cannot execute any query
• Catalog offline
  – cannot update metadata
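
The state-store behavior listed above can be illustrated with a toy model. This is not Impala code: the class, its fields and the "stale membership" explanation are assumptions made for illustration, namely that each coordinator schedules work on every daemon it believes is alive, and only a running state store can correct that belief.

```python
class ToyCluster:
    """Toy model of the failure cases on the slide (hypothetical, not Impala)."""

    def __init__(self, daemons):
        self.alive = set(daemons)       # daemons that are really running
        self.membership = set(daemons)  # daemons the coordinators believe are alive
        self.statestore_up = True

    def fail_daemon(self, name):
        """Stop a daemon; the membership view is refreshed only via the state store."""
        self.alive.discard(name)
        if self.statestore_up:
            self.membership = set(self.alive)

    def run_query(self):
        """A new query succeeds only if every daemon it is scheduled on is alive."""
        if not self.membership:
            return False
        return self.membership <= self.alive


if __name__ == "__main__":
    cluster = ToyCluster(["node1", "node2", "node3"])
    cluster.fail_daemon("node1")   # state store up: membership is corrected
    print(cluster.run_query())     # new queries avoid node1
    cluster.statestore_up = False
    print(cluster.run_query())     # still fine while nothing else changes
    cluster.fail_daemon("node2")   # nobody can announce this failure
    print(cluster.run_query())     # queries are still scheduled on node2
```

The model only covers queries started after a failure; a query already running on a node that dies fails in either case, as the first bullet says.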

Page 16: Deep dive into enterprise data lake through Impala

Query Execution

Page 17: Deep dive into enterprise data lake through Impala


Page 18: Deep dive into enterprise data lake through Impala


Page 19: Deep dive into enterprise data lake through Impala


Page 20: Deep dive into enterprise data lake through Impala


Page 21: Deep dive into enterprise data lake through Impala

Getting Started

Page 22: Deep dive into enterprise data lake through Impala

Impala-shell (secure cluster)

• $ yum install impala-shell
• $ kinit -kt evans_ye.keytab evans_ye
• $ impala-shell --kerberos \
      --impalad IMPALA_HOST
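
For scripted use, the same connection can be driven from a program. The sketch below only builds the argument list and executes nothing; the helper name is hypothetical, and it assumes a valid Kerberos ticket has already been obtained with kinit. The `--query` option makes impala-shell run one statement and exit instead of opening a prompt.

```python
import shlex


def impala_shell_command(host, query=None, kerberos=True):
    """Build the argv list for impala-shell (hypothetical helper; runs nothing)."""
    argv = ["impala-shell"]
    if kerberos:
        argv.append("--kerberos")
    argv += ["--impalad", host]
    if query is not None:
        # Run a single statement non-interactively.
        argv += ["--query", query]
    return argv


if __name__ == "__main__":
    cmd = impala_shell_command("IMPALA_HOST", "select count(*) from t1")
    # Print a shell-quoted version for inspection.
    print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run` as-is, which avoids shell quoting problems in the query text.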

Page 23: Deep dive into enterprise data lake through Impala

Create and insert

• > create table t1 (col1 string, col2 int);
• > insert into t1 values ('foo', 10);
  – only supports writing to TEXT and PARQUET tables
  – every insert creates one tiny HDFS file
  – by default, the file is stored under /user/hive/warehouse/t1/
  – use it for setting up small dimension tables, for experiments, or with HBase tables
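
Because every insert statement creates its own small file, one way to limit the file count is to send several rows in a single `insert ... values` statement. The sketch below builds such a statement; the helper names are hypothetical, the escaping is illustrative rather than hardened, and the expectation of one file per statement instead of one per row is an assumption worth checking on a real cluster.

```python
def sql_literal(value):
    """Render a Python str or int as a SQL literal (illustrative only)."""
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        # Backslash-escape backslashes and single quotes inside the string.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def batched_insert(table, rows):
    """Build one INSERT ... VALUES statement that covers all rows."""
    if not rows:
        raise ValueError("rows must not be empty")
    tuples = ", ".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return f"insert into {table} values {tuples};"


if __name__ == "__main__":
    print(batched_insert("t1", [("foo", 10), ("bar", 20)]))
```

For anything beyond a small dimension table, loading files into HDFS and exposing them through an external table, as the next slides show, remains the better route.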

Page 24: Deep dive into enterprise data lake through Impala

Create external table to read existing files

• > create external table t2 (col1 string, col2 int)
    row format delimited fields terminated by '\t'
    location '/user/evans_ye/test_data';
  – location must be a directory (for example, a Pig output directory)
  – files to read:
    • read: /user/evans_ye/test_data/part-r-00000
    • not read: /user/evans_ye/test_data/_logs/history
    • not read: /user/evans_ye/test_data/20140701/00/part-r-00000
    • not recursive
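
The non-recursive rule can be stated as a one-line predicate: a file is read only if its parent directory is exactly the table location. The sketch below models just that rule from the slide with a hypothetical helper; it does not talk to HDFS and ignores any other filtering Impala may apply.

```python
import posixpath


def files_read_by_table(location, hdfs_paths):
    """Keep only the files that sit directly inside the table location."""
    location = location.rstrip("/")
    return [p for p in hdfs_paths if posixpath.dirname(p) == location]


if __name__ == "__main__":
    candidates = [
        "/user/evans_ye/test_data/part-r-00000",
        "/user/evans_ye/test_data/_logs/history",
        "/user/evans_ye/test_data/20140701/00/part-r-00000",
    ]
    for path in files_read_by_table("/user/evans_ye/test_data", candidates):
        print(path)
```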

Page 25: Deep dive into enterprise data lake through Impala

Not recursive?

• Then how do we add external data with a folder structure like this?
  – /user/evans_ye/test_data/20140701/00
    /user/evans_ye/test_data/20140701/01
    …
    /user/evans_ye/test_data/20140701/02
    …
    /user/evans_ye/test_data/20140702/00

Page 26: Deep dive into enterprise data lake through Impala

Partitioning

Page 27: Deep dive into enterprise data lake through Impala

Create the table with partitions

• > create external table t3 (col1 string, col2 int)
    partitioned by (`date` int, hour tinyint)
    row format delimited fields terminated by '\t';
• No need to specify the location on create

Page 28: Deep dive into enterprise data lake through Impala

Add partitions into the table

• > alter table t3 add partition (`date`=20130330, hour=0)
    location '/user/evans_ye/test_data/20130330/00';
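
With one directory per hour, typing the `alter table` statements by hand gets tedious, so they are usually generated. The sketch below is one way to do that, assuming a `<base>/<YYYYMMDD>/<HH>` layout and that every hourly directory in the range actually exists; the function name and its parameters are hypothetical.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_dir, first_day, last_day):
    """Yield one ALTER TABLE ... ADD PARTITION per (date, hour) directory."""
    day = first_day
    while day <= last_day:
        yyyymmdd = day.strftime("%Y%m%d")
        for hour in range(24):
            # Directory names use a zero-padded hour, e.g. .../20140701/00
            location = f"{base_dir}/{yyyymmdd}/{hour:02d}"
            yield (
                f"alter table {table} add partition "
                f"(`date`={yyyymmdd}, hour={hour}) location '{location}';"
            )
        day += timedelta(days=1)


if __name__ == "__main__":
    statements = add_partition_statements(
        "t3", "/user/evans_ye/test_data", date(2014, 7, 1), date(2014, 7, 2)
    )
    for statement in statements:
        print(statement)
```

The output can be saved to a file and passed to impala-shell, or filtered first against a directory listing so that missing hours are skipped.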

Page 29: Deep dive into enterprise data lake through Impala

Partition number

• Thousands of partitions per table
  – OK
• Tens of thousands of partitions per table
  – Not OK
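
A quick calculation shows how fast an hourly (date, hour) scheme approaches that limit. The sketch assumes one partition per hour with no gaps; the function name is hypothetical.

```python
HOURS_PER_DAY = 24


def partition_count(days):
    """Number of (date, hour) partitions after `days` full days of data."""
    return days * HOURS_PER_DAY


if __name__ == "__main__":
    for days in (30, 365, 3 * 365):
        print(f"{days} days -> {partition_count(days)} partitions")
```

One year of hourly partitions is 8,760, still in the "thousands" range, while three years is 26,280, which falls into the "tens of thousands" range the slide warns against.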

Page 30: Deep dive into enterprise data lake through Impala

Compute table statistics

• > compute stats t3;
• > show table stats t3;
• Helps Impala optimize join queries by choosing between a broadcast join and a partitioned join
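
Why table sizes matter for that choice can be shown with a simplified cost comparison. This is a toy model, not Impala's planner: it assumes the only cost is bytes sent over the network, that a broadcast join copies the smaller table to every node, and that a partitioned join shuffles both tables once.

```python
def choose_join_strategy(left_bytes, right_bytes, num_nodes):
    """Pick the strategy that moves fewer bytes over the network (toy model)."""
    smaller = min(left_bytes, right_bytes)
    # Broadcast: every node receives a full copy of the smaller table.
    broadcast_cost = smaller * num_nodes
    # Partitioned: both tables are hash-partitioned and shuffled once.
    partitioned_cost = left_bytes + right_bytes
    return "broadcast" if broadcast_cost <= partitioned_cost else "partitioned"


if __name__ == "__main__":
    # Large fact table joined with a small dimension table on 10 nodes.
    print(choose_join_strategy(1_000_000_000, 1_000_000, 10))
    # Two tables of similar size on 10 nodes.
    print(choose_join_strategy(1_000_000_000, 800_000_000, 10))
```

Without statistics the sizes in this comparison are unknown, which is why running `compute stats` before join-heavy queries is recommended.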

Page 31: Deep dive into enterprise data lake through Impala

Parquet File Format

Page 32: Deep dive into enterprise data lake through Impala

Parquet

• Apache Incubator project
• Column-oriented file format
  – compression is better since all the values in a column are the same type
  – encoding is better since values are often the same and repeated
  – SQL projection can avoid unnecessary reads and decoding of columns
• Supported by Pig, Impala, Hive, MR and Cascading
• Impala uses Snappy with Parquet by default
• Impala + Parquet = Google Dremel
  – Dremel doesn't support joins
  – Impala doesn't support nested data structures (yet)
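
The encoding point is easier to see with a small example. The sketch below applies plain run-length encoding to the columns of a made-up table; it illustrates why storing a column's values together exposes repetition and is not Parquet's actual encoder. The sample rows are invented.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Invented sample rows: (country, HTTP status).
rows = [("US", 200), ("US", 200), ("US", 404), ("TW", 200), ("TW", 200)]

# Column-oriented layout: all values of one column are stored together.
country_column = [country for country, _ in rows]
status_column = [status for _, status in rows]

if __name__ == "__main__":
    print(run_length_encode(country_column))
    print(run_length_encode(status_column))
```

Five country values shrink to two pairs. In a row-oriented layout the same values are interleaved with the other columns, so the runs never form. A query that selects only `country` also never has to touch `status_column`, which is the projection benefit from the list above.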

Page 33: Deep dive into enterprise data lake through Impala

Transform a text-file table into Parquet format

• > create table t4 like t3 stored as parquet;
• > insert overwrite t4 partition (`date`, hour) select * from t3;

Page 34: Deep dive into enterprise data lake through Impala

Using Parquet in Pig

• $ yum install parquet
• $ pig
• > A = load '/user/hive/warehouse/t4' using parquet.pig.ParquetLoader as (x: chararray, y: int);
• > store A into '/user/evans_ye/parquet_out' using parquet.pig.ParquetStorer;

Page 35: Deep dive into enterprise data lake through Impala

ACL via Sentry

Page 36: Deep dive into enterprise data lake through Impala

Sentry

• Apache Incubator project
• Provides fine-grained, role-based authorization
• Currently integrates with Hive and Impala
• Requires strong authentication such as Kerberos

Page 37: Deep dive into enterprise data lake through Impala

Enable Sentry for Impala

• Turn on Sentry authorization for Impala
  – add two lines to the impala daemon's configuration file (/etc/default/impala)
  – auth-policy.ini: the Sentry policy file
  – server1: a symbolic name used in the policy file
    • all impala daemons must specify the same server name

Page 38: Deep dive into enterprise data lake through Impala

Sentry policy file example

• roles:
  – on server1, spn_user_role has permission to read (SELECT) all tables in the spn database
• groups:
  – the evans_ye group has the role spn_user_role
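
The policy content itself is not in the transcript, so the sketch below works from the description above. The privilege string follows the `server=...->db=...->table=...->action=...` form used in Sentry policy files, but the exact text is a reconstruction, and the matcher is a toy that handles only table-level requests with `*` wildcards; it is not Sentry's evaluation logic.

```python
def parse_privilege(text):
    """Split 'k1=v1->k2=v2' into a dict such as {'server': 'server1', ...}."""
    return dict(part.split("=", 1) for part in text.split("->"))


def allows(privilege, server, db, table, action):
    """Check a table-level request against one parsed privilege (toy matcher)."""
    request = {"server": server, "db": db, "table": table, "action": action}
    return all(
        privilege.get(key) in (value, "*") for key, value in request.items()
    )


# Hypothetical policy content, reconstructed from the slide's description.
ROLES = {"spn_user_role": "server=server1->db=spn->table=*->action=SELECT"}
GROUPS = {"evans_ye": ["spn_user_role"]}


def group_can(group, server, db, table, action):
    """Return True if any role assigned to the group allows the request."""
    return any(
        allows(parse_privilege(ROLES[role]), server, db, table, action)
        for role in GROUPS.get(group, [])
    )


if __name__ == "__main__":
    print(group_can("evans_ye", "server1", "spn", "some_table", "SELECT"))
    print(group_can("evans_ye", "server1", "spn", "some_table", "INSERT"))
```

Privileges are attached to roles and roles to groups, so granting a user access means putting the user in the right group rather than editing the role.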

Page 39: Deep dive into enterprise data lake through Impala

Sentry policy file example

• roles:
  – evans_data has permission to access /user/evans_ye
    • allows you to add data under /user/evans_ye as partitions
  – foo_db_role can do anything in the foo database
    • create, alter, insert, drop

Page 40: Deep dive into enterprise data lake through Impala

Impala 2014 Roadmap

• 1.4 (now available)
  – order by without limit
• 2.0
  – nested data types (structs, arrays, maps)
  – disk-based joins and aggregations

Page 41: Deep dive into enterprise data lake through Impala

Q&A