Kafka Evaluation for Data Pipeline and ETL
Shafaq Abdullah @ GREE Date: 10/21/2013
High-Level Architecture
Performance
Scalability
Fault-Tolerance/Error Recovery
Operations and Monitoring
Summary
Outline
Kafka: a distributed publish/subscribe messaging system
In production at LinkedIn, Foursquare, Twitter, and GREE (almost)
Usage:
- Log aggregation (user activity streams)
- Real-time events
- Monitoring
- Queuing
Intro
Point-to-Point Data Pipelines
[Diagram: point-to-point pipelines directly connecting Game Server logs, User Tracking, Security, Search, Social Graph, and the Rules/Recommendation Engine to Hadoop, Vertica, Ops, and the Data Warehouse]
Message Queue - Central Data Channel
[Diagram: the same systems (Game Server logs, User Tracking, Security, Search, Social Graph, Rules/Recommendation Engine, Hadoop, Vertica, Ops, Data Warehouse) connected through a single central message queue instead of point-to-point links]
Topic: a string identifying a message stream
Partition: a logical division of a topic within a broker, holding the log written by producers (e.g. topic kafkaTopic → partitions kafkaTopic1, kafkaTopic2)
Replica set: the replicas of a partition across brokers, with one leader (takes writes) and followers that replicate it; messages are assigned to partitions by hashing the key modulo the partition count
Kafka Jargon
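To make the topic/partition mapping concrete, here is a minimal producer sketch against the Kafka 0.8 Java API (broker list, topic name, and key are placeholder values). With a keyed message, the default partitioner assigns the message to a partition essentially as hash(key) modulo the number of partitions, so all messages with the same key land in the same partition.

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; point this at the real broker cluster.
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer =
            new Producer<String, String>(new ProducerConfig(props));

        // The key ("user-42") determines the partition: the default
        // partitioner uses hash(key) % numPartitions, so one user's
        // events stay ordered within a single partition.
        producer.send(new KeyedMessage<String, String>(
            "kafkaTopic", "user-42", "login event payload"));

        producer.close();
    }
}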
Log - Message Queue
AWS m1.large instance: dual-core Intel Xeon CPU
7.5 GB RAM
2 x 420 GB hard disk
Hardware Spec of Broker Cluster
Performance Results
Producer threads | Transactions | Processing time (s) | Throughput (transactions/sec)
1  | 488396  | 64.129  | 7616
2  | 1195748 | 110.868 | 10785
4  | 1874713 | 140.375 | 13355
10 | 4726941 | 338.094 | 13981
17 | 7987317 | 568.028 | 14061
Latency vs Durability
Ack setting | Time to publish (ms) | Tradeoff
No ack       | 0.7 | Greater data loss
Wait for ack | 1.5 | Less data loss
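The two rows correspond to the producer acknowledgement setting. A minimal sketch using the 0.8 producer configuration (broker address and topic are placeholders): request.required.acks = 0 is the "No ack" row, and 1 is the "Wait for ack" row.

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class AckTradeoffSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");  // placeholder
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // "No ack": fire-and-forget, lowest publish latency,
        // greatest exposure to data loss if the broker dies.
        props.put("request.required.acks", "0");
        // "Wait for ack": the partition leader acknowledges the write
        // before send() returns.
        // props.put("request.required.acks", "1");

        Producer<String, String> producer =
            new Producer<String, String>(new ProducerConfig(props));
        producer.send(new KeyedMessage<String, String>("kafkaTopic", "payload"));
        producer.close();
    }
}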
1. Create a Kafka-Replica Sink
2. Feed data to the Copier via the Kafka-Sink (see the consumer sketch below)
3. Benchmark the Kafka-Copier-Vertica pipeline
4. Improve/Refactor for Performance
Integration Plan
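As a rough illustration of steps 1 and 2, the sketch below uses the Kafka 0.8 high-level consumer API to pull messages and hand them to the Copier. The ZooKeeper address, consumer group id, topic name, and the handOffToCopier helper are hypothetical stand-ins for the internal Copier/Vertica components.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class KafkaCopierSinkSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");  // placeholder
        props.put("group.id", "copier");             // placeholder group id
        ConsumerConnector consumer =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // One stream for the topic; the iterator blocks until messages arrive.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            consumer.createMessageStreams(Collections.singletonMap("kafkaTopic", 1));
        KafkaStream<byte[], byte[]> stream = streams.get("kafkaTopic").get(0);

        for (MessageAndMetadata<byte[], byte[]> record : stream) {
            // Hypothetical handoff to the internal Copier step, which
            // batches rows and loads them into Vertica.
            handOffToCopier(record.message());
        }
    }

    static void handOffToCopier(byte[] payload) {
        // Batch and forward to the Vertica loader (internal component).
    }
}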
C - Consistency (producer sync mode)
A - Availability (replication)
P - Partition tolerance (cluster on the same network, so no network delay/partitions assumed)
Strongly consistent and highly available
CAP theorem: CA over P
Conventional quorum replication: 2f+1 replicas → tolerates f failures (e.g. ZooKeeper)
Kafka (ISR) replication: f+1 replicas → tolerates f failures
(e.g. tolerating f = 2 failures takes 5 replicas under majority quorum, but only 3 under Kafka's ISR scheme)
Failure Recovery
Leader:
➢ Message is propagated to the followers
➢ Commit offset is checkpointed to disk
Follower failure and recovery:
➢ Kicked out of the ISR
➢ After restart, truncates its log to the last committed offset
➢ Catches up with the leader → rejoins the ISR
Error Handling: Follower Failure
➢ Embedded controller detects the leader failure via ZooKeeper
➢ New leader elected from the ISR
➢ Committed messages are not lost
Error Handling: Leader Failure
- Horizontally scalable: add partitions/brokers to the cluster as higher throughput is needed (~10 MB/s per server)
- Load balancing for producers and consumers
Scalability
• Number of messages the consumer lags behind the producer
• Max lag in messages between the follower and leader replicas
• Unclean leader election rate
• Whether the controller is active on the broker
Monitoring via JMX
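These metrics can be polled from the brokers' JMX endpoints. Below is a minimal sketch using the standard javax.management API; the JMX service URL and the exact MBean/attribute names (shown here for the "is the controller active on this broker" check) are assumptions to verify against the broker version in use.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class KafkaJmxSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; brokers must be started with remote JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Assumed MBean name for the active-controller gauge; the lag and
            // unclean-election metrics are read the same way with their names.
            ObjectName controllerGauge = new ObjectName(
                "kafka.controller:type=KafkaController,name=ActiveControllerCount");
            Object value = conn.getAttribute(controllerGauge, "Value");
            System.out.println("ActiveControllerCount = " + value);
        } finally {
            connector.close();
        }
    }
}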
3-node cluster: ~25 MB/s throughput, < 20 ms end-to-end latency,
with replication factor 3 and 6 consumer groups
Monthly operations: $500 + $250 (3-node ZooKeeper), plus ~1 man-month of engineering effort
Operation Costs
- No callback in the producer's send() (illustrated below)
- Balancing is not fully automatic: a rebalancing script has to be run manually (the topic balancing layer works via ZK)
Nothing is perfect
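To illustrate the first limitation: the 0.8 producer's send() returns void, so there is no per-message delivery callback; failures only surface as an exception from a synchronous send. A rough sketch (broker address and topic are placeholders):

import java.util.Properties;
import kafka.common.FailedToSendMessageException;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class NoCallbackSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");  // placeholder
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("producer.type", "sync");          // synchronous send
        props.put("request.required.acks", "1");

        Producer<String, String> producer =
            new Producer<String, String>(new ProducerConfig(props));
        try {
            // send() returns void: there is no callback confirming delivery,
            // only an exception if the synchronous send ultimately fails.
            producer.send(new KeyedMessage<String, String>("kafkaTopic", "payload"));
        } catch (FailedToSendMessageException e) {
            // Handle or log the message that could not be delivered.
        } finally {
            producer.close();
        }
    }
}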
• Scales horizontally at ~10 MB/s per server with < 10 ms latency
• Costs < $1,000/month plus 1 man-month of engineering effort
• No impact on the current pipeline
• Kafka 0.8 final release due within days; to be used in production
Summary