bigdatain real-time -...
TRANSCRIPT
![Page 1: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/1.jpg)
BigData in Real-time
TCloud Computing 天云趋势
Impala Introduction
2012/12/13Beijing
Apache Asia Road Show
![Page 2: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/2.jpg)
Background (Disclaimer)
• Impala is NOT an Apache Software Foundation project yet
• Impala uses ASLv2
• The speaker (me) is NOT associated with Cloudera
![Page 3: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/3.jpg)
WHYImpala
![Page 4: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/4.jpg)
The Need For Speed
• What’s wrong with MapReduce?
– Batch oriented. Good at complex jobs
– But slow at startup & shuffle
– Programmer friendly
• How about Hive?
– SQL friendly. Still slow as …
• How about HBase?
– Slow data import
– No SQL
![Page 5: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/5.jpg)
The leads from the leader
• Google BigQuery, the service
– Based on Google Dremel, the paper
• SQL-like interface
• Interactive analysis of PBs data
• Query Execution Tree
– Tasks to sub-tasks, instead of identical distributed tasks
• Columnar storage based on nested ProtoBuffer data
– Faster traversing
• Amazon RedShift is another story…
![Page 6: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/6.jpg)
Open source alternatives?
• Apache Drill
– No substantial progress
– Mailing list msg # droped 80% from Sep to Nov
• Berkeley Shark/Spark
– Shared memory based, good at iteration tasks
– Different component stack
• Cloudera Impala
![Page 7: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/7.jpg)
Positioning
• Compared with MR, it’s all about trade-off
– Complexity or responsiveness
– General purpose or ad-hoc
• MPP-RDB paradigms on top of commodity DFS
– On par performance in some cases
– Extremely cheap
– Linear scalability
![Page 8: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/8.jpg)
WHATis
Impala
![Page 9: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/9.jpg)
Features
• Distributed SQL on raw HDFS files– Select, where, aggregation, join, – Insert into/overwrite– Text and Sequence files
• Hive compatible “meta store” and interface
– Reuse Hive’s metadata schema, DDL and JDBC/ODBC driver
• Up to 90x times faster, compared with Hive– Purely I/O bound scenario, 3-4X– With joins, 7-45X– With memory cached, 20-90X
![Page 10: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/10.jpg)
Status
• Announced at Oct/2012
– Now 0.3 at Dec/5
– Has been private beta for half-year
– Currently in public beta
– Target GA @ 2013 Q1
• Entirely developed by Cloudera (by now)
– In the past 2 years, 7 full time engineers
• Completely open source, ASLv2
![Page 11: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/11.jpg)
Example$ hive
hive> CREATE TABLE sales (id STRING, item STRING, price int);
hive> load data local inpath ‘/store/sales.txt’ into table sales;
$ impala-shell –impalad=172.16.204.4:21000
[172.16.204.4:21000]> show tables
sales
[172.16.204.4:21000]> select * from sales where price > 100 limit 3
13221 COOKIE 138
38384 DIPER 287
85845 TV 737
![Page 12: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/12.jpg)
Workflow Overview
HDFS DataNode
Impalad
HDFS DataNode
Impalad
HDFS DataNode
Impalad
HDFS DataNode
Impalad
CLIClient
ODBCDriver
Meta dataMySQL
Impala State Store
1-on-1 communication
Simplified multi-way communication
HueBeeswax
HDFS NameNode
![Page 13: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/13.jpg)
HOWthe Impalaspeed up
![Page 14: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/14.jpg)
‘MPP’ SQL
• PlanNode– Node of the Depth-First execution plan tree– Various types
• HDFS_SCAN_NODE, HBASE_SCAN_NODE• HASH_JOIN_NODE• AGGREGATION_NODE, SORT_NODE• EXCHANGE_NODE
• Fragment– Atomic executable unit, could be distributed– Contains one or more PlanNodes
• Depends on the data distribution and the SQL statement
![Page 15: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/15.jpg)
SQL breakdown sample
• There’s a saying that young single males don’t use
coupon or discount as much as others, is it true?
• We can compare the list price and sales price
• Items are a little bit expensive
• Buyers are young, single, male
• Live in major city
![Page 16: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/16.jpg)
SQL breakdown sample - tables
store_salesss_item_id
ss_customer_idss_sales_price
……customer
c_idc_gender
c_marital_statusc_city……
itemi_item_id
i_list_price……
![Page 17: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/17.jpg)
SQL breakdown sample – SQL statement
select i_item_id, i_list_price, avg(ss_sales_price) agg1
FROM store_sales
JOIN item on (store_sales.ss_item_id = item.i_item_id)JOIN customer on (store_sales.ss_customer_id = customer.c_id)
where
i_list_price > 1000 andc_gender = 'M' andc_marital_status = 'S‘ and
c_city in ('Beijing','Shanghai','Guangzhou')
group by i_item_id,
order by i_list_price
limit 1000
![Page 18: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/18.jpg)
Execution plan tree
Scansales_store
Scancustomer
Scanitem
Join onitem_id
Join oncustomer_id
Group by &Aggregation
Sort & Limit
![Page 19: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/19.jpg)
Execution plan tree – Distributed!
Scansales_store
Scancustomer
Scanitem
Join onitem_id
Join oncustomer_id
Group by &Local Agg
Total Aggregation
Sort / Limit
Distributable
Indistributable
![Page 20: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/20.jpg)
From execution plan to fragment
Scansales_store
Scancustomer
Scanitem
Join onitem_id
Join oncustomer_id
Group by &Aggregation
Total Aggregation
Sort / Limit
![Page 21: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/21.jpg)
From execution plan to fragment
Scansales_store
Scancustomer
Scanitem
Join onitem_id
Join oncustomer_id
Group by &Aggregation
Total Aggregation
Sort / Limit
![Page 22: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/22.jpg)
From execution plan to fragment
Scansales_store
Scancustomer
Scanitem
Join onitem_id
Join oncustomer_id
Group by &Aggregation
Total Aggregation
Sort / Limit
![Page 23: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/23.jpg)
THAT’S IT?
![Page 24: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/24.jpg)
There’s more…
• Written in C++– Only used jFlex/CUP to parse the SQL statement
• Local compilation of fragments– LLVM is used
• Disk awareness– Not just host awareness– dfs.datanode.hdfs-blocks-metadata.enabled– “40% faster”
• Direct read– Not via HDFS NameNode then DataNode then …– dfs.client.read.shortcircuit, dfs.client.read.shortcircuit.skip.checksum
![Page 25: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/25.jpg)
GOODENOUGH?
![Page 26: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/26.jpg)
TODOs
• No Data Definition Language yet
• No User Defined Function yet
• No fault tolerance yet
• Avro, RCFile, LZO, Trevni support is on the way– Impala + Trevni will introduce another performance boost
• With more SQL functions compared with BigQuery
• In memory Join only– Will be fixed in GA
• Partition before join, reduce traffic
• Support Hive partitions, but not buckets yet
![Page 27: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/27.jpg)
INSIDE
![Page 28: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/28.jpg)
Exec Engine
ImpaladPython CLI
JDBC/ODBC
Hue BeeswaxFrontend
Coordinator
Exec Engine
ThriftMeta data
MySQL
Another Impalad
22000
PlannerjFlex/CUP
22000Backend, ImpalaInternalService
25000
JNI
21000 (Frontend, ImpalaService)
2400023000Register
&Subscribe
State Store
HDFS
HBase
![Page 29: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/29.jpg)
All C++ code, both be and fe
LLVM IR generation
PlanNode implementation
Math, date, string, time, agg, etc
Coordinator, Executor, etc
main(), thrift server
State store
.thrift, very useful
Frontend, Planner
Python CLI shell
![Page 30: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/30.jpg)
趋势科技如何使用Hadoop
• 天云趋势由天云科技和趋势科技共同投资成立于2010年3月
• 趋势科技是Hadoop的重度使用者:
– 2006年开始使用, 用于处理Web Reputation和Email Reputation等
– 五个数据中心, 近1000个节点
– 日均处理3.6T日志数据
• 含HoneyPot收到的垃圾邮件但不含爬取的网页等
– 亚洲最早, 也是最大的代码贡献者
– HBase 0.92新功能的主要开发者 (coprocessors, security)
![Page 31: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/31.jpg)
天云趋势的BigData业务
– TCloud BigData Platform, 集部署、监控、配置优化、数据处理以及扩
展SDK库于一体的集群管理和开发平台
– 丰富的垂直行业开发经验,包括电信、电力、安全等
– 为多家大型传统软件开发厂商提供咨询服务
– Hadoop管理员、开发者培训,也可为企业内部进行定制化培训
http://www.tcloudcomputing.com.cn
![Page 32: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1](https://reader030.vdocuments.mx/reader030/viewer/2022040214/5ec566ba03829f7f455dd7bd/html5/thumbnails/32.jpg)
@少年振南