introduction to apache kudu

57
1 © Cloudera, Inc. All rights reserved. Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data Jeff Holoman Sr. Systems Engineer

Upload: jeff-holoman

Post on 16-Apr-2017

1.942 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Introduction to Apache Kudu

1© Cloudera, Inc. All rights reserved.

Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast DataJeff HolomanSr. Systems Engineer

Page 2: Introduction to Apache Kudu

2© Cloudera, Inc. All rights reserved.

Agenda

What is Kudu? (Motivations & Design Goals)Use Cases

Overview of Design & Internals Simple Benchmark

Status & Getting Started

Page 3: Introduction to Apache Kudu

3© Cloudera, Inc. All rights reserved.

What is Kudu?

Page 4: Introduction to Apache Kudu

4© Cloudera, Inc. All rights reserved.

But First….

Page 5: Introduction to Apache Kudu

5© Cloudera, Inc. All rights reserved.

• Efficiently scanning large amounts of data

• Accumulating data with high throughput• Multiple SQL Options• All processing engines

• Single Row Access is problematic• Mutation is problematic• “Fast Data” access is problematic

Excels at…However…

GFS paper published in 2003!

Page 6: Introduction to Apache Kudu

6© Cloudera, Inc. All rights reserved.

• Efficiently finding and writing individual rows

• Accumulating data with high throughput

• Scans are problematic• High cardinality access is problematic• SQL support is so/so due to the above

Excels at…However…

Big Table Paper published in 2006!

Page 7: Introduction to Apache Kudu

7© Cloudera, Inc. All rights reserved.

Page 8: Introduction to Apache Kudu

8© Cloudera, Inc. All rights reserved.

In 2006…

DID NOT EXIST!

Page 9: Introduction to Apache Kudu

9© Cloudera, Inc. All rights reserved.

5 10 13 15 15$0

$20

$40

$60

$80

$100

$120

$140

$160

$180

$200

RAM / GB

RAM / GB

Page 10: Introduction to Apache Kudu

10© Cloudera, Inc. All rights reserved.

Page 11: Introduction to Apache Kudu

11© Cloudera, Inc. All rights reserved.

Today…Changing Hardware Landscape• Spinning disk -> solid state storage• NAND flash: Up to 450k read 250k write iops, about 2GB/sec read and 1.5GB/sec write throughput,

at a price of less than $3/GB and dropping• 3D XPoint memory (1000x faster than NAND, cheaper than RAM)

• RAM is cheaper and more abundant• 64->128->256GB over last few years

Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.

Takeaway 2: Column stores are feasible for random access.

Page 12: Introduction to Apache Kudu

12© Cloudera, Inc. All rights reserved.

Current Storage Landscape in Hadoop

Gaps exist when these properties are needed simultaneously

The Hadoop Storage “Gap”

Page 13: Introduction to Apache Kudu

13© Cloudera, Inc. All rights reserved.

The Kudu Elevator PitchStorage for Fast Analytics on Fast Data

• New updating column store for Hadoop• Simplifies the architecture for building

analytic applications on changing data• Designed for fast analytic performance• Natively integrated with Hadoop

• Apache-licensed open source (with pending ASF Incubator proposal)

• Beta now available

FILESYSTEMHDFS

NoSQLHBASE

INGEST – SQOOP, FLUME, KAFKA

DATA INTEGRATION & STORAGE

SECURITY – SENTRY

RESOURCE MANAGEMENT – YARN

UNIFIED DATA SERVICES

BATCH STREAM SQL SEARCH MODEL ONLINE

DATA ENGINEERING DATA DISCOVERY & ANALYTICS DATA APPS

SPARK, HIVE, PIG

SPARK IMPALA SOLR SPARK HBASE

RELATIONALKUDU

Page 14: Introduction to Apache Kudu

14© Cloudera, Inc. All rights reserved.

• High throughput for big scans• Low-latency for random accesses• High CPU performance to better take

advantage of RAM and Flash• Single-column scan rate 10-100x faster than HBase

• High IO efficiency• True column store with type-specific encodings• Efficient analytics when only certain columns are

accessed• Expressive and evolvable data model• Architecture that supports multi-data center

operation

Kudu Design Goals

Page 15: Introduction to Apache Kudu

15© Cloudera, Inc. All rights reserved.

Page 16: Introduction to Apache Kudu

16© Cloudera, Inc. All rights reserved.

Using Kudu

• Table has a SQL-like schema• Finite number of columns (unlike HBase/Cassandra)• Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY,

TIMESTAMP• Some subset of columns makes up a possibly-composite primary key• Fast ALTER TABLE

• Java and C++ “NoSQL” style APIs• Insert(), Update(), Delete(), Scan()

• Integrations with MapReduce, Spark, and Impala• more to come!

16

Page 17: Introduction to Apache Kudu

17© Cloudera, Inc. All rights reserved.

What Kudu is *NOT*

•Not a SQL interface itself • It’s just the storage layer – “Bring Your Own SQL” (eg Impala or Spark)

•Not an application that runs on HDFS• It’s an alternative, native Hadoop storage engine• Colocation with HDFS is expected

•Not a replacement for HDFS or HBase• Select the right storage for the right use case• Cloudera will support and invest in all three

Page 18: Introduction to Apache Kudu

18© Cloudera, Inc. All rights reserved.

Use Cases for Kudu

Page 19: Introduction to Apache Kudu

19© Cloudera, Inc. All rights reserved.

Kudu Use Cases

Kudu is best for use cases requiring a simultaneous combination ofsequential and random reads and writes, e.g.:

● Time Series○ Examples: Stream market data; fraud detection & prevention; risk monitoring○ Workload: Insert, updates, scans, lookups

● Machine Data Analytics○ Examples: Network threat detection○ Workload: Inserts, scans, lookups

● Online Reporting○ Examples: ODS○ Workload: Inserts, updates, scans, lookups

Page 20: Introduction to Apache Kudu

20© Cloudera, Inc. All rights reserved.

Industry Examples

• Streaming market data

• Real-time fraud detection & prevention

• Risk monitoring

• Real-time offers• Location-based

targeting

• Geospatial monitoring

• Risk and threat detection (real time)

Financial Services Retail Public Sector

Page 21: Introduction to Apache Kudu

21© Cloudera, Inc. All rights reserved.

Real-Time Analytics in Hadoop TodayFraud Detection in the Real World = Storage Complexity

Considerations:● How do I handle failure

during this process?

● How often do I reorganize data streaming in into a format appropriate for reporting?

● When reporting, how do I see data that has not yet been reorganized?

● How do I ensure that important jobs aren’t interrupted by maintenance?

New Partition

Most Recent Partition

Historic Data

HBase

Parquet File

Have we accumulated enough data?

Reorganize HBase file

into Parquet

• Wait for running operations to complete • Define new Impala partition referencing

the newly written Parquet file

Incoming Data (Messaging

System)

Reporting Request

Impala on HDFS

Page 22: Introduction to Apache Kudu

22© Cloudera, Inc. All rights reserved.

Real-Time Analytics in Hadoop with KuduSimpler Architecture, Superior Performance over Hybrid Approaches

Impala on KuduIncoming Data

(Messaging System)

Reporting Request

Page 23: Introduction to Apache Kudu

23© Cloudera, Inc. All rights reserved.

Design & Internals

Page 24: Introduction to Apache Kudu

24© Cloudera, Inc. All rights reserved.

Kudu Basic Design

• Typed storage•Basic Construct: Tables • Tables broken down into Tablets (roughly equivalent to partitions)

•Maintains consistency through a Paxos-like quorum model (Raft)•Architecture supports geographically disparate, active/active systems

Page 25: Introduction to Apache Kudu

25© Cloudera, Inc. All rights reserved.

Columnar Data Store

A B CA1 B1 C1A2 B2 C2A3 B3 C3

A1 B1 C1 A2 B2 C2 A3 B3 C3

A1 A2 A3 B1 B2 B3 C1 C2 C3

Row-Based Storage

Columnar Storage

Page 26: Introduction to Apache Kudu

26© Cloudera, Inc. All rights reserved.

Tables and Tablets

• Table is horizontally partitioned into tablets• Range or hash partitioning• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS

• Each tablet has N replicas (3 or 5), with Raft consensus• Allow read from any replica, plus leader-driven writes with low MTTR

• Tablet servers host tablets• Store data on local disks (no HDFS)

26

Page 27: Introduction to Apache Kudu

27© Cloudera, Inc. All rights reserved.

Tables and Tablets(2)

• CREATE TABLE customers (• first_name STRING NOT NULL,• last_name STRING NOT NULL,• order_count INT32,• PRIMARY KEY (last_name, first_name),• )• Specifying the split rows as (("b", ""), ("c", ""), ("d", ""), .., ("z", "")) (25 split rows

total) will result in the creation of 26 tablets, with each tablet containing a range of customer surnames all beginning with a given letter. This is an effective partition schema for a workload where customers are inserted and updated uniformly by last name, and scans are typically performed over a range of surnames.

Page 28: Introduction to Apache Kudu

28© Cloudera, Inc. All rights reserved.

Client

Meta Cache

Page 29: Introduction to Apache Kudu

29© Cloudera, Inc. All rights reserved.

Client

Hey Master! Where is the row for ‘[email protected]’ in table “T”?Meta Cache

Page 30: Introduction to Apache Kudu

30© Cloudera, Inc. All rights reserved.

Client

Hey Master! Where is the row for ‘[email protected]’ in table “T”?

It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …

Meta Cache

Page 31: Introduction to Apache Kudu

31© Cloudera, Inc. All rights reserved.

Client

Hey Master! Where is the row for ‘[email protected]’ in table “T”?

It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …

Meta CacheT1: …T2: …T3: …

Page 32: Introduction to Apache Kudu

32© Cloudera, Inc. All rights reserved.

Client

Hey Master! Where is the row for ‘[email protected]’ in table “T”?

It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …

UPDATE [email protected] SET …

Meta CacheT1: …T2: …T3: …

Page 33: Introduction to Apache Kudu

33© Cloudera, Inc. All rights reserved.

Tablet Design

• Inserts buffered in an in-memory store (like HBase’s memstore)• Flushed to disk• Columnar layout, similar to Apache Parquet

• Updates use MVCC (updates tagged with timestamp, not in-place)• Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans

• Near-optimal read path for “current time” scans• No per row branches, fast vectorized decoding and predicate evaluation

• Performance worsens based on number of recent updates

33

Page 34: Introduction to Apache Kudu

34© Cloudera, Inc. All rights reserved.

Metadata

• Replicated master*• Acts as a tablet directory (“META” table)• Acts as a catalog (table schemas, etc)• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)

• Caches all metadata in RAM for high performance• 80-node load test, GetTableLocations RPC perf:• 99th percentile: 68us, 99.99th percentile: 657us • <2% peak CPU usage

• Client configured with master addresses• Asks master for tablet locations as needed and caches them

34

Page 35: Introduction to Apache Kudu

35© Cloudera, Inc. All rights reserved.

Kudu Trade-Offs

•Random updates will be slower•HBase model allows random updates without incurring a disk seek• Kudu requires a key lookup before update, Bloom lookup before insert

• Single-row reads may be slower• Columnar design is optimized for scans• Future: may introduce “column groups” for applications where single-row

access is more important

Page 36: Introduction to Apache Kudu

36© Cloudera, Inc. All rights reserved.

Fault tolerance

• Transient FOLLOWER failure:• Leader can still achieve majority• Restart follower TS within 5 min and it will rejoin transparently

• Transient LEADER failure:• Followers expect to hear a heartbeat from their leader every 1.5 seconds• 3 missed heartbeats: leader election!• New LEADER is elected from remaining nodes within a few seconds

• Restart within 5 min and it rejoins as a FOLLOWER• N replicas handle (N-1)/2 failures

36

Page 37: Introduction to Apache Kudu

37© Cloudera, Inc. All rights reserved.

Fault tolerance (2)

• Permanent failure:• Leader notices that a follower has been dead for 5 minutes• Evicts that follower• Master selects a new replica• Leader copies the data over to the new one, which joins as a new FOLLOWER

37

Page 38: Introduction to Apache Kudu

38© Cloudera, Inc. All rights reserved.

Benchmarks

Page 39: Introduction to Apache Kudu

39© Cloudera, Inc. All rights reserved.

TPC-H (Analytics benchmark)

• 75TS + 1 master cluster• 12 (spinning) disk each, enough RAM to fit dataset• Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4• TPC-H Scale Factor 100 (100GB)

• Example query:• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer, orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA' AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY n_name ORDER BY revenue desc;

39

Page 40: Introduction to Apache Kudu

40© Cloudera, Inc. All rights reserved.

- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)

Page 41: Introduction to Apache Kudu

41© Cloudera, Inc. All rights reserved.

What about Apache Phoenix?• 10 node cluster (9 worker, 1 master)• HBase 1.0, Phoenix 4.3• TPC-H LINEITEM table only (6B rows)

41

Load TPCH Q1 COUNT(*)COUNT(*)WHERE…

single-rowlookup

0.01

0.1

1

10

100

1000

100002152

21976

131

0.04

1918

13.2

1.7

0.7 0.15

155

9.3

1.4 1.5 1.37

PhoenixKuduParquet

Tim

e (s

ec)

Page 42: Introduction to Apache Kudu

42© Cloudera, Inc. All rights reserved.

What about NoSQL-style random access? (YCSB)

• YCSB 0.5.0-snapshot• 10 node cluster

(9 worker, 1 master)• HBase 1.0• 100M rows, 10M ops

42

Page 43: Introduction to Apache Kudu

43© Cloudera, Inc. All rights reserved.

Xiaomi Use Case

• Gather important RPC tracing events from mobile app and backend service

• Service monitoring & troubleshooting tool

• High write throughput

• >5 Billion records/day and growing

• Query latest data and quick response

• Identify and resolve issues quickly

• Can search for individual records

• Easy for troubleshooting

Page 44: Introduction to Apache Kudu

44© Cloudera, Inc. All rights reserved.

Big Data Analytics PipelineBefore Kudu

• Long pipelinehigh latency(1 hour ~ 1 day), data conversion pains

• No orderingLog arrival(storage) order not exactly logical ordere.g. read 2-3 days of log for data in 1 day

Page 45: Introduction to Apache Kudu

45© Cloudera, Inc. All rights reserved.

Big Data Analysis PipelineSimplified With Kudu

• ETL Pipeline(0~10s latency)Apps that need to prevent backpressure or require ETL

• Direct Pipeline(no latency)Apps that don’t require ETL and no backpressure issues

OLAP scanSide table lookupResult store

Page 46: Introduction to Apache Kudu

46© Cloudera, Inc. All rights reserved.

Use Case 1: Benchmark

Environment

• 71 Node cluster• Hardware• CPU: E5-2620 2.1GHz * 24 core Memory: 64GB • Network: 1Gb Disk: 12 HDD

• Software• Hadoop2.6/Impala 2.1/Kudu

Data

• 1 day of server side tracing data• ~2.6 Billion rows• ~270 bytes/row• 17 columns, 5 key columns

Page 47: Introduction to Apache Kudu

47© Cloudera, Inc. All rights reserved.

Use Case 1: Benchmark Results

Q1 Q2 Q3 Q4 Q5 Q6

1.4 2.0 2.3 3.1 1.3 0.9 1.3

2.8 4.0

5.7 7.5

16.7 kuduparquet

Total Time(s) Throughput(Total) Throughput(per node)

Kudu 961.1 2.8M record/s 39.5k record/s

Parquet 114.6 23.5M record/s 331k records/s

Bulk load using impala (INSERT INTO):

Query latency:

* HDFS parquet file replication = 3 , kudu table replication = 3* Each query run 5 times then take average

Page 48: Introduction to Apache Kudu

48© Cloudera, Inc. All rights reserved.

Status & Getting Started

Page 49: Introduction to Apache Kudu

49© Cloudera, Inc. All rights reserved.

With Kudu:• Ingest and serve data simultaneously

• Support analytic and real-time operations on the same data set

• Make existing storage architectures simpler, and enable new architectures that previously weren’t possible

Basic Features:• High availability, no single point of failure

• Consistency by consensus, options for “tunable consistency”

• Horizontally scalable

• Efficient use of modern storage and processors

Basic Kudu value proposition

Page 50: Introduction to Apache Kudu

50© Cloudera, Inc. All rights reserved.

Current Status

✔ Completed all components core to the architecture

✔ Java and C++ API

✔ Impala, MapReduce, and Spark integration

✔ Support for SSDs and spinning disk

✔ Fault recovery

✔ Public beta available

Page 51: Introduction to Apache Kudu

51© Cloudera, Inc. All rights reserved.

Getting Started

Users:

Install the Beta or try a VM:getkudu.io

Get help:[email protected]

Read the white paper:getkudu.io/kudu.pdf

Developers:

Contribute:github.com/cloudera/kudu (commits)

gerrit.cloudera.org (reviews)issues.cloudera.org (JIRAs going back to 2013)

Join the Dev list:[email protected]

Contributions/participation are welcome and encouraged!

Page 52: Introduction to Apache Kudu

52© Cloudera, Inc. All rights reserved.

Questions?

Page 53: Introduction to Apache Kudu

53© Cloudera, Inc. All rights reserved.

Appendix

Page 54: Introduction to Apache Kudu

54© Cloudera, Inc. All rights reserved.

Fault tolerance

• Transient FOLLOWER failure:• Leader can still achieve majority• Restart follower TS within 5 min and it will rejoin transparently

• Transient LEADER failure:• Followers expect to hear a heartbeat from their leader every 1.5 seconds• 3 missed heartbeats: leader election!• New LEADER is elected from remaining nodes within a few seconds

• Restart within 5 min and it rejoins as a FOLLOWER• N replicas handle (N-1)/2 failures

54

Page 55: Introduction to Apache Kudu

55© Cloudera, Inc. All rights reserved.

Fault tolerance (2)

• Permanent failure:• Leader notices that a follower has been dead for 5 minutes• Evicts that follower• Master selects a new replica• Leader copies the data over to the new one, which joins as a new FOLLOWER

55

Page 56: Introduction to Apache Kudu

56© Cloudera, Inc. All rights reserved.

LSM vs Kudu

• LSM – Log Structured Merge (Cassandra, HBase, etc)• Inserts and updates all go to an in-memory map (MemStore) and later flush to

on-disk files (HFile/SSTable)• Reads perform an on-the-fly merge of all on-disk HFiles

• Kudu• Shares some traits (memstores, compactions)• More complex.• Slower writes in exchange for faster reads (especially scans)

56

Page 57: Introduction to Apache Kudu

57© Cloudera, Inc. All rights reserved.

Kudu storage – Compaction policy

• Solves an optimization problem (knapsack problem)• Minimize “height” of rowsets for the average key lookup• Bound on number of seeks for write or random-read

• Restrict total IO of any compaction to a budget (128MB)• No long compactions, ever• No “minor” vs “major” distinction• Always be compacting or flushing• Low IO priority maintenance threads

57