OLAP at Toutiao

杨朝中

Post on 07-Jun-2020

Challenge

• 14,000+ Hive tables

• 20+ billion rows for the top 10 daily partitions (300+ billion rows for the top 1)

• Data security

• A wide variety of complex HQL queries

• 50+ HS2 instances running on Marathon with Consul (troubleshooting)

Case study

• Operator precedence: AND binds tighter than OR

• A AND B OR C is not equivalent to A AND (B OR C)

• PartitionPruner#compactExpr
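
The precedence rule above can be checked mechanically. This is a minimal sketch in Python, whose `and`/`or` follow the same relative precedence as AND/OR in SQL. It does not reproduce the logic of PartitionPruner#compactExpr, and the function name is hypothetical.

```python
from itertools import product


def differing_assignments():
    """Return every (a, b, c) for which the two groupings disagree."""
    diffs = []
    for a, b, c in product([False, True], repeat=3):
        implicit = a and b or c       # parsed as (a and b) or c
        regrouped = a and (b or c)    # the grouping that changes the meaning
        if implicit != regrouped:
            diffs.append((a, b, c))
    return diffs


if __name__ == "__main__":
    for a, b, c in differing_assignments():
        print(f"A={a}, B={b}, C={c}")
```

The two expressions disagree whenever A is false and C is true, so regrouping a partition filter this way would drop partitions that the original predicate selects.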

Open Source Projects

• Apache Hive (HMS + HS2)

• Apache Spark (SQL)

• Presto

• Apache Kylin (Data Cube)

• Apache Sentry (Authorization)

Architecture Overview

[Figure: architecture diagram. Components shown, row by row as extracted:]

• Sentry, HMS, HDFS, YARN

• HS2, Spark SQL

• QueryEditor, Priest, TEA

• Presto, Kylin, QAP

• Other Tools

What we did?

• Introduced Presto

• Rewrite HQL to ANSI SQL

• Deployed on YARN with Apache Slider

• HiveMPP
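
One small example of what rewriting HQL to ANSI SQL can involve: Hive quotes identifiers with backticks, while ANSI SQL (and Presto) use double quotes. The sketch below handles only that single difference and assumes string literals are single-quoted without backslash escapes. The slides do not describe the actual rewriter, so the function name and scope are hypothetical.

```python
def backticks_to_ansi_quotes(hql):
    """Replace backtick identifier quotes with double quotes.

    Assumption: string literals use single quotes and contain no
    backslash-escaped quotes. Backticks inside literals are kept.
    """
    out = []
    in_string = False
    for ch in hql:
        if ch == "'":
            in_string = not in_string   # entering or leaving a string literal
            out.append(ch)
        elif ch == "`" and not in_string:
            out.append('"')             # identifier quote: Hive -> ANSI
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(backticks_to_ansi_quotes("SELECT `date` FROM `t` WHERE x = 'a`b'"))
```

A production rewriter would work on the parsed syntax tree rather than on characters, since functions, type names, and literal syntax also differ between the two dialects.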

What we did?

• Query Analysis Platform (QAP)

• Cache HMS and HDFS RPC results to speed up HQL semantic analysis.

• Extract query cost features (cardinality estimation).

• Predict the elapsed time of every HQL query (decision tree regressor).
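
To make the last bullet concrete, here is a minimal pure-Python sketch of a regression tree that predicts elapsed time from query cost features. The feature choice (estimated rows scanned in billions, number of joins), the training values, and all function names are made up for illustration; the slides do not say which features or library QAP uses.

```python
from statistics import mean


def sum_squared_error(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(rows, targets, max_depth=2):
    """Fit a regression tree by greedily minimizing squared error."""
    if max_depth == 0 or len(set(targets)) == 1:
        return {"value": mean(targets)}
    best = None
    for feature in range(len(rows[0])):
        levels = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between distinct feature values,
        # so both sides of every candidate split are non-empty.
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2
            left = [y for r, y in zip(rows, targets) if r[feature] <= threshold]
            right = [y for r, y in zip(rows, targets) if r[feature] > threshold]
            cost = sum_squared_error(left) + sum_squared_error(right)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    if best is None:  # no feature has two distinct values
        return {"value": mean(targets)}
    _, feature, threshold = best
    left = [(r, y) for r, y in zip(rows, targets) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, targets) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": fit_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        "right": fit_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's mean target."""
    while "value" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["value"]


if __name__ == "__main__":
    # Synthetic examples: (estimated rows scanned in billions, join count)
    rows = [(1.0, 0), (2.0, 0), (100.0, 1), (200.0, 1)]
    seconds = [10.0, 12.0, 300.0, 340.0]
    tree = fit_tree(rows, seconds, max_depth=1)
    print(predict(tree, (150.0, 1)))
```

A prediction like this could feed the "identify and refuse bad query" idea by flagging queries whose predicted elapsed time exceeds a chosen limit.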

What we did?

• HMS & HS2 as a Service

• We have 50+ HS2 instances

• Emit metrics for every HS2 RPC call.
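
Per-call metrics of this kind are commonly collected by wrapping each RPC handler. The sketch below shows the pattern in Python with an in-memory counter. HS2 itself is a Java Thrift service, so this is an illustration of the idea only; `RPC_METRICS`, `emit_rpc_metrics`, and the stand-in handler are hypothetical names.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory sink; a real deployment would forward these
# values to a metrics system instead.
RPC_METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def emit_rpc_metrics(rpc_name):
    """Decorator that records call count, error count, and latency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                RPC_METRICS[rpc_name]["errors"] += 1
                raise
            finally:
                stats = RPC_METRICS[rpc_name]
                stats["calls"] += 1
                stats["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@emit_rpc_metrics("ExecuteStatement")
def execute_statement(sql):
    # Stand-in for a real RPC handler.
    if not sql.strip():
        raise ValueError("empty statement")
    return "OK"
```

With 50+ instances, tagging each metric with the instance identity as well as the RPC name would let a slow or failing instance be singled out.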

[Figure: metrics for every HS2 RPC call]

What we did?

• Introduced Apache Sentry

• Integrated with our people system.

• Works in HS2/HMS/CLI as an authorization plugin.

• Hooks: preAnalyze and postAnalyze
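
The two hook points bracket semantic analysis: before it only the query text is available, and after it the tables and actions the query touches are known, which is when privileges can be checked. The Python sketch below models that pattern only. It is not Hive's or Sentry's API, and the class, method signatures, and grant store are hypothetical.

```python
class AuthorizationError(Exception):
    """Raised when a user lacks a privilege the query needs."""


class AuthorizationHook:
    def __init__(self, grants):
        # grants: {user: {(table, action), ...}}, a hypothetical privilege store
        self.grants = grants

    def pre_analyze(self, user, query):
        # Before semantic analysis only the raw text is known,
        # so only cheap checks are possible here.
        if not query.strip():
            raise ValueError("empty query")
        return query

    def post_analyze(self, user, accessed):
        # accessed: set of (table, action) pairs resolved by semantic analysis
        missing = sorted(accessed - self.grants.get(user, set()))
        if missing:
            raise AuthorizationError(f"{user} lacks privileges: {missing}")


if __name__ == "__main__":
    hook = AuthorizationHook({"alice": {("db.events", "SELECT")}})
    hook.pre_analyze("alice", "SELECT * FROM db.events")
    hook.post_analyze("alice", {("db.events", "SELECT")})  # passes silently
```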

What we did?

• Categorize query error logs

• Client or server?

• Parse / validate / execute?
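
A rule-based classifier is one simple way to sort error lines along these two axes. The sketch below keys on exception names that appear in Hive error output (ParseException, SemanticException, Execution Error). The mapping of each phase to "client" or "server" is an assumption for illustration, as are the rule table and function name.

```python
import re

# (pattern, phase, side). Attributing parse and validation failures to the
# client and execution failures to the server is an assumption, not a rule
# taken from the slides.
RULES = [
    (re.compile(r"ParseException"), "parse", "client"),
    (re.compile(r"SemanticException"), "validate", "client"),
    (re.compile(r"Execution Error"), "execute", "server"),
]


def categorize(error_line):
    """Return (phase, side) for one error log line."""
    for pattern, phase, side in RULES:
        if pattern.search(error_line):
            return phase, side
    return "unknown", "unknown"


if __name__ == "__main__":
    samples = [
        "FAILED: ParseException line 1:7 cannot recognize input",
        "FAILED: SemanticException Table not found 'events'",
        "FAILED: Execution Error, return code 2",
    ]
    for line in samples:
        print(categorize(line), line)
```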

What we did?

• Introduced Apache Kylin to speed up multi-dimensional analytics.

• Improved the cuboid spanning algorithm

• CuboidJob is triggered by a Chronos task

• Automatically resume CuboidJob

• popcnt & lzcnt (CPU bit-counting instructions)
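
Bit counting matters here because a cuboid can be identified by a bitmask with one bit per dimension. The Python sketch below shows population count, leading-zero count, and a naive way to enumerate child cuboids by dropping one dimension at a time. It is not the improved spanning algorithm mentioned above, which the slides do not describe, and the function names are hypothetical.

```python
def popcount(mask):
    """Number of set bits, i.e. the number of dimensions in the cuboid."""
    return bin(mask).count("1")


def leading_zeros_64(mask):
    """Leading zero count of a non-negative value in a 64-bit word."""
    return 64 - mask.bit_length()


def child_cuboids(cuboid_id):
    """Cuboids obtained by removing exactly one dimension (naive spanning)."""
    children = []
    bit = 1 << (cuboid_id.bit_length() - 1) if cuboid_id else 0
    while bit:
        if cuboid_id & bit:
            child = cuboid_id & ~bit
            if child:  # skip the empty cuboid
                children.append(child)
        bit >>= 1
    return children


if __name__ == "__main__":
    base = 0b111  # three dimensions
    print(popcount(base), leading_zeros_64(base), child_cuboids(base))
```

On hardware, popcnt and lzcnt compute the first two functions in a single instruction each, which helps when they run once per cuboid or per row.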

Future work

• Integrate Hive & Spark SQL

• Identify and reject bad queries

• Automatic suggestions for HQL

• DevOps improvements

We are hiring!

https://job.toutiao.com