Netflix Keystone - How Netflix Handles Data Streams of Up to 11M Events/Sec
TRANSCRIPT
A NETFLIX ORIGINAL SERVICE
Peter Bakas | @peter_bakas
@ Netflix : Cloud Platform Engineering - Real Time Data Infrastructure
@ Ooyala : Analytics, Discovery, Platform Engineering & Infrastructure
@ Yahoo : Display Advertising, Behavioral Targeting, Payments
@ PayPal : Site Engineering and Architecture
@ Play : Advisor to Startups (Data, Security, Containers)
Who is this guy?
Common data pipeline to collect, transport, aggregate, process and visualize events
Why are we here?
● Architectural design and principles for Keystone
● Technologies that Keystone is leveraging
● Best practices
What should I expect?
Let’s get down to business
Netflix is a logging company
that occasionally streams movies
600+ billion events ingested per day
11 million events per second (24 GB per second) at peak
Hundreds of event types
Over 1.3 petabytes per day
Numbers Galore!
But wait, there’s more
1+ trillion events processed every day
1 trillion events ingested per day during holiday season
Numbers Galore - Part Deux
How did we get here?
Chukwa
Chukwa/Suro + Real-Time Branch
Keystone
[Architecture diagram: Event Producer → Fronting Kafka → Samza Router → EMR / Consumer Kafka / Stream Consumers; Control Plane; HTTP proxy]
Kafka Primer
Kafka is a distributed, partitioned, replicated commit log service.
Kafka Terminology
● Producer
● Consumer
● Topic
● Partition
● Broker
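To make these terms concrete, here is a minimal sketch using the Apache Kafka Java client (broker address and topic name are placeholders): a producer publishes records to a topic, the broker stores them in partitions, and a consumer reads them back.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaTermsDemo {
    public static void main(String[] args) {
        // Producer: publishes records to a topic; the broker places them into partitions.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // address of a broker
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("demo-topic", "device-42", "playback_start"));
        }

        // Consumer: subscribes to the topic and reads records partition by partition.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
            }
        }
    }
}
```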
Netflix Kafka Producer
● Best effort delivery
● Prefer dropping messages over disrupting the producer app
● Wraps Apache Kafka Producer
● Integration with Netflix ecosystem: Eureka, Atlas, etc.
Producer Impact
● A Kafka outage does not disrupt existing instances from serving their purpose
● A Kafka outage should never prevent new instances from starting up
● After the Kafka cluster is restored, event producing should resume automatically
Prefer Drop over Block
● Drop when the buffer is full
● Handle potential blocking of the first metadata request
● acks=1 (vs. 2); see the config sketch below
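A minimal config sketch of that philosophy, assuming the current Apache Kafka Java producer (the exact knobs on the 0.8-era client Netflix wrapped differ): bound how long send() may block, keep acks at 1, and drop the event in the application when the buffer is exhausted.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DropOverBlockProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("acks", "1");                 // leader ack only, rather than waiting on more replicas
        props.put("buffer.memory", "33554432"); // 32 MB in-memory buffer
        props.put("max.block.ms", "500");       // bound blocking on metadata / full buffer instead of stalling the app

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] event = "{\"type\":\"playback_start\"}".getBytes();
            try {
                // Fire-and-forget; a Kafka problem should never disrupt the producing application.
                producer.send(new ProducerRecord<>("keystone-events", event));
            } catch (Exception e) {
                // Best-effort delivery: count the drop and move on.
                System.err.println("dropped event: " + e.getMessage());
            }
        }
    }
}
```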
Sticky Partitioner
● Batching is important to reduce CPU and network I/O on brokers
● Stick to one partition for a while when producing non-keyed messages (see the sketch below)
● “linger.ms” works well with the sticky partitioner
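A sketch of the idea as a custom org.apache.kafka.clients.producer.Partitioner, not Netflix's actual implementation; the one-second stickiness window is an arbitrary assumption. Keyless messages stay on one partition long enough for batches (and linger.ms) to do their job.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class StickyPartitioner implements Partitioner {
    private static final long STICK_MS = 1_000;   // assumed: stay on a partition for ~1s
    private volatile int currentPartition = -1;
    private volatile long lastSwitch = 0;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        if (keyBytes != null) {
            // Keyed messages keep the usual hash-based placement.
            return Utils.toPositive(Utils.murmur2(keyBytes)) % partitions.size();
        }
        long now = System.currentTimeMillis();
        if (currentPartition < 0 || now - lastSwitch > STICK_MS) {
            // Pick a new partition and stick to it, so batching stays effective.
            currentPartition = ThreadLocalRandom.current().nextInt(partitions.size());
            lastSwitch = now;
        }
        return currentPartition;
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```

It would be registered through the producer's partitioner.class property; Kafka 2.4+ later added built-in sticky behavior to the default partitioner.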
Producing events to Keystone
● Using the Netflix Platform logging API
○ LogManager.logEvent(Annotatable): majority of the cases
○ KeyValueSerializer with ILog#log(String)
● REST endpoint that proxies Platform logging (see the sketch below)
○ ksproxy
○ Prana sidecar
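For the REST path, a hypothetical sketch using Java's HttpClient; the endpoint path, port, and payload shape shown here are assumptions, not the documented ksproxy/Prana API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KeystoneProxyExample {
    public static void main(String[] args) throws Exception {
        String event = "{\"eventType\":\"playback_start\",\"deviceId\":\"abc123\"}";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                // Assumed local sidecar address; the real ksproxy/Prana route is not shown in the talk.
                .uri(URI.create("http://localhost:8080/ksproxy/events"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(event))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("status=" + response.statusCode());
    }
}
```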
Injected Event Metadata
● GUID
● Timestamp
● Host
● App
Keystone Extensible Wire Protocol
● Invisible to sources & sinks
● Backwards and forwards compatible
● Supports JSON; Avro on the horizon
● Efficient: 10 bytes of overhead per message
○ message size: hundreds of bytes to 10MB
Keystone Extensible Wire Protocol
● Packaged as a jar
● Why? To evolve independently:
○ event metadata & traceability metadata
○ event payload serialization
Max message size: 10MB
● Keystone drops any message > 10MB
○ Immutable event payload
Fronting Kafka Clusters
[Architecture diagram repeated: Fronting Kafka within the Keystone pipeline]
Fronting Kafka Clusters
● Normal-priority (majority)
● High-priority (streaming activities, etc.)
Fronting Kafka Instances
● 3 ASGs per cluster, 1 ASG per zone
● 3000 d2.xl AWS instances across 3 regions for regular & failover traffic
Partition Assignment
● All replica assignments are zone aware (see the sketch below)
○ Improved availability
○ Reduced cost of maintenance
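A toy sketch of what zone awareness means for replica placement: each partition's replicas are spread across distinct zones, so losing one zone never removes every copy. Broker IDs and zones are made up; this is not Netflix's actual assignment tooling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ZoneAwareAssignment {
    /** Assign replicationFactor replicas per partition, one broker per zone, rotating the leading zone. */
    static Map<Integer, List<String>> assign(Map<String, List<String>> brokersByZone,
                                             int partitions, int replicationFactor) {
        List<String> zones = new ArrayList<>(brokersByZone.keySet());
        Map<Integer, List<String>> assignment = new LinkedHashMap<>();
        for (int p = 0; p < partitions; p++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                String zone = zones.get((p + r) % zones.size());   // rotate which zone leads each partition
                List<String> brokers = brokersByZone.get(zone);
                replicas.add(brokers.get(p % brokers.size()));     // pick a broker within that zone
            }
            assignment.put(p, replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, List<String>> brokersByZone = new LinkedHashMap<>();
        brokersByZone.put("us-east-1a", List.of("b1", "b2"));
        brokersByZone.put("us-east-1b", List.of("b3", "b4"));
        brokersByZone.put("us-east-1c", List.of("b5", "b6"));
        assign(brokersByZone, 6, 3).forEach((p, r) -> System.out.println("partition " + p + " -> " + r));
    }
}
```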
Kafka Fault Tolerance
● Instance failure
○ With a replication factor of N, no data loss is guaranteed for up to N-1 failures
○ With zone-aware replica assignment, no data loss even with multiple instance failures in the same zone
● Sink failure
○ No data loss during the retention period
● Replication is the key (see the config sketch below)
○ Data loss can happen if the leader dies while both the follower AND the consumer cannot catch up
○ Usually indicated by the UncleanLeaderElection metric
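A sketch of topic settings that line up with these guarantees, using Kafka's AdminClient (newer than the brokers described in the talk, so purely illustrative): replication factor 3 and unclean leader election disabled, since an unclean election is exactly the loss case the UncleanLeaderElection metric flags.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class FaultTolerantTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Replication factor 3: tolerate up to 2 broker failures without data loss,
            // assuming replicas land in different zones (see the zone-aware sketch above).
            NewTopic topic = new NewTopic("keystone-events", 24, (short) 3)
                    .configs(Map.of(
                            "unclean.leader.election.enable", "false", // never elect an out-of-sync replica
                            "retention.ms", String.valueOf(24L * 60 * 60 * 1000))); // sinks get 24h to catch up
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```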
Kafka Auditor as a Service
● Broker monitoring
● Consumer monitoring
● Heartbeat & continuous message latency (see the sketch below)
● On-demand Broker performance testing
● Built as a service deployable on single or multiple instances
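A sketch of the heartbeat idea, as an assumption about the mechanism rather than the auditor's real code: produce a timestamped heartbeat into a monitored topic, consume it back, and report the end-to-end latency.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HeartbeatAuditor {
    public static void main(String[] args) {
        String topic = "keystone-heartbeat"; // assumed auditor topic name

        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "auditor");
        cp.put("auto.offset.reset", "latest");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(List.of(topic));
            consumer.poll(Duration.ofSeconds(2)); // join the group and get an assignment before sending

            // Heartbeat payload carries the send timestamp.
            producer.send(new ProducerRecord<>(topic, String.valueOf(System.currentTimeMillis())));
            producer.flush();

            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(5))) {
                long latencyMs = System.currentTimeMillis() - Long.parseLong(r.value());
                System.out.println("heartbeat latency ms = " + latencyMs); // feed into Atlas, alerts, etc.
            }
        }
    }
}
```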
Current Issues
● With the d2.xl there is a trade-off between cost and performance
● Performance deteriorates as the number of partitions increases
● Replication lag during peak traffic
Routing Service
[Architecture diagram repeated: the routing service (Samza Router) within the Keystone pipeline]
Routing Infrastructure
[Routing infrastructure diagram: checkpointing cluster, 0.9.1, Router Job Manager (control plane), Zookeeper for instance ID assignment, router jobs on ksnode EC2 instances in an ASG; reconciled every minute]
Routing Layer
● Total of 13,000 containers on 1,300 AWS c3.4xlarge instances
○ S3 sink: ~7,000 containers
○ Consumer Kafka sink: ~4,500 containers
○ Elasticsearch sink: ~1,500 containers
Routing Layer
● Total of ~1,400 streams across all regions
○ ~1000 S3 streams
○ ~250 Consumer Kafka streams
○ ~150 Elasticsearch streams
Router Job Details
● One job per sink and Kafka source topic
○ Separate jobs for the S3, Elasticsearch & Kafka sinks
○ Provides better isolation & better QoS
● Message requests to sinks are processed in batches
● Offsets are checkpointed after the batch request succeeds (see the sketch below)
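The real router is a Samza job, but the checkpoint-after-batch contract can be sketched with a plain Kafka consumer (topic, group, and sink call are placeholders): poll a batch, write it to the sink, and only commit offsets once the write succeeds, so a failed upload is retried instead of lost.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RouterJobSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "router-s3-sink");   // one job per sink and source topic
        props.put("enable.auto.commit", "false");  // checkpoint manually, only after the sink accepts the batch
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("keystone-events"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                List<byte[]> batch = new ArrayList<>();
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    batch.add(r.value());
                }
                try {
                    uploadToSink(batch);    // e.g. a batched S3 PUT; placeholder here
                    consumer.commitSync();  // offsets checkpointed only after the batch succeeded
                } catch (Exception e) {
                    // No commit: rewind to the start of the batch so it is retried
                    // (duplicates possible, loss avoided).
                    for (TopicPartition tp : records.partitions()) {
                        consumer.seek(tp, records.records(tp).get(0).offset());
                    }
                }
            }
        }
    }

    private static void uploadToSink(List<byte[]> batch) {
        // Placeholder for the real sink write (S3 / Elasticsearch / consumer Kafka).
    }
}
```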
Processing Semantics
Data Loss & Duplicates
Backpressure
Producer ⇐ Kafka Cluster ⇐ Samza job router ⇐ Sink
● Keystone: at-least-once delivery
Data Loss - Producer
● buffer full
● network error
● partition leader change
● partition migration
Data Loss - Kafka
● Losing all Kafka replicas of the data
○ Safeguards:
■ AZ isolation / broker replacement automation
■ Alerts and monitoring
● Unclean partition leader election
○ acks = 1 could cause loss
Data Loss - Router
● Checkpointed offset is lost AND the router was down for the whole retention period
● Messages not processed before the retention period expires (8h / 24h)
● Unclean leader election causes the offset to go backwards
● Safeguard
○ Alerts when lag > 0.1% of traffic for 10 minutes
● Only a concern if unable to launch router instances
Duplicates: Router to Sink
● Duplicates possible
○ messages reprocessed - retry after batch S3 upload failure
○ Loss of checkpointed offset (message processed marker)
○ Event GUID helps dedup
Measuring Duplicates
● Diff of producer sent count vs. Kafka messages received
● Router checkpointed offset monitored over time
Note: the GUID can be used to dedup at the sink (see the sketch below)
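A minimal sketch of sink-side dedup keyed on the injected GUID; the bounded LRU cache and its size are assumptions, and a real sink would size the window to the worst-case reprocessing span.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Drops events whose GUID was already seen recently (at-least-once upstream => possible duplicates). */
public class GuidDeduper {
    private static final int MAX_TRACKED = 1_000_000; // assumed cache size; tune to the retry window

    // LRU set of recently seen GUIDs.
    private final Map<String, Boolean> seen = new LinkedHashMap<>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
            return size() > MAX_TRACKED;
        }
    };

    /** Returns true if the event should be written to the sink, false if it is a duplicate. */
    public synchronized boolean firstTime(String guid) {
        return seen.put(guid, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        GuidDeduper dedup = new GuidDeduper();
        System.out.println(dedup.firstTime("evt-123")); // true: write to sink
        System.out.println(dedup.firstTime("evt-123")); // false: duplicate, skip
    }
}
```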
End to End Metrics
● Producer to router to sink average latencies
○ Batch processing [S3 sink]: ~3 sec
○ Stream processing [Consumer Kafka sink]: ~1 sec
○ Log analysis [Elasticsearch]: ~400 seconds (with back pressure)
End to End Metrics
● End-to-end latencies
○ S3:
■ 50th percentile under 1 sec
■ 80th percentile under 8 seconds
○ Consumer Kafka:
■ 50th percentile under 800 ms
■ 80th percentile under 4 seconds
○ Elasticsearch:
■ 50th percentile under 13 seconds
■ 80th percentile under 53 seconds
Alerts
● Producer drop rate over 1%
● Consumer lag > 0.1%
● Next offset after the checkpointed offset not found
● Consumer stuck at the partition level
Keystone Dashboard
And Then???
There’s more in the pipeline...
● Self-service tools
● Better management of scaling Kafka
● More capable control plane
● JSON support exists; Avro support on the horizon
● Multi-tenant Messaging as a Service - MaaS
● Multi-tenant Stream Processing as a Service - SPaaS
Questions?