Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec


TRANSCRIPT

Page 1: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

A NETFLIX ORIGINAL SERVICE

Page 2: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Peter Bakas | @peter_bakas

@ Netflix : Cloud Platform Engineering - Real Time Data Infrastructure

@ Ooyala : Analytics, Discovery, Platform Engineering & Infrastructure

@ Yahoo : Display Advertising, Behavioral Targeting, Payments

@ PayPal : Site Engineering and Architecture

@ Play : Advisor to Startups (Data, Security, Containers)

Who is this guy?

Page 3: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

common data pipeline to collect, transport, aggregate, process and visualize events

Why are we here?

Page 4: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

● Architectural design and principles for Keystone

● Technologies that Keystone is leveraging

● Best practices

What should I expect?

Page 5: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Let’s get down to business

Page 6: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Netflix is a logging company

Page 7: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

that occasionally streams movies

Page 8: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

600+ billion events ingested per day

11 million events (24 GB per second) peak

Hundreds of event types

Over 1.3 Petabyte / day

Numbers Galore!

Page 9: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

But wait, there’s more

Page 10: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

1+ trillion events processed every day

1 trillion events ingested per day during holiday season

Numbers Galore - Part Deux

Page 11: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

How did we get here?

Page 12: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Chukwa

Page 13: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Chukwa/Suro + Real-Time Branch

Page 14: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Keystone

[Architecture diagram: Event Producer → Fronting Kafka → Samza Router → Consumer Kafka / EMR / Stream Consumers; events enter directly or via an HTTP proxy, and the pipeline is managed by the Control Plane]

Page 15: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Kafka Primer

Kafka is a distributed, partitioned, replicated commit log service.

Page 16: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Kafka Terminology

● Producer

● Consumer

● Topic

● Partition

● Broker
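For readers new to Kafka, here is a minimal sketch of how these pieces fit together, using the plain Apache Kafka client API (not Netflix code; the broker address and topic name are placeholders):

```java
// Minimal sketch: a producer appends to a topic, a consumer reads the partitioned,
// replicated commit log back from a broker.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class KafkaPrimer {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");   // broker(s)
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Producer: appends a record to one partition of the "events" topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("events", "{\"type\":\"play\"}"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "primer-group");               // consumer group
        c.put("auto.offset.reset", "earliest");          // read from the start of the log
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Consumer: reads records and sees which partition/offset each one came from.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
            }
        }
    }
}
```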

Page 17: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Page 18: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Netflix Kafka Producer

● Best effort delivery

● Prefer dropping messages to disrupting the producer app

● Wraps Apache Kafka Producer

● Integration with Netflix ecosystem: Eureka, Atlas, etc.

Page 19: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Producer Impact

● A Kafka outage does not disrupt existing instances from serving their purpose

● A Kafka outage should never prevent new instances from starting up

● After the Kafka cluster is restored, event production should resume automatically

Page 20: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Prefer Drop over Block

● Drop when buffer is full

● Handle potential blocking of the first metadata request

● ack=1 (vs 2)
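A hedged sketch of what a "drop rather than block" wrapper can look like with the plain Apache Kafka producer; this is not the Netflix wrapper, and the config values (max.block.ms, linger.ms) are illustrative assumptions:

```java
// Best-effort producer sketch: acks=1, a small max.block.ms so a full buffer or a slow
// first metadata fetch fails fast, and a callback that counts drops instead of
// propagating errors back to the producing application.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;

public class BestEffortProducer {
    private final KafkaProducer<byte[], byte[]> producer;
    private final AtomicLong dropped = new AtomicLong();

    public BestEffortProducer(String brokers) {
        Properties p = new Properties();
        p.put("bootstrap.servers", brokers);
        p.put("acks", "1");              // leader-only ack, as on the slide
        p.put("max.block.ms", "50");     // never let a full buffer or metadata fetch block the caller for long
        p.put("linger.ms", "100");       // batch for broker efficiency
        p.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producer = new KafkaProducer<>(p);
    }

    /** Fire-and-forget: failures are counted, never thrown back to the producing app. */
    public void send(String topic, byte[] payload) {
        try {
            producer.send(new ProducerRecord<>(topic, payload), (meta, err) -> {
                if (err != null) dropped.incrementAndGet();   // e.g. buffer full, network error
            });
        } catch (Exception e) {
            dropped.incrementAndGet();                        // e.g. blocked on the first metadata request
        }
    }

    public long droppedCount() { return dropped.get(); }
}
```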

Page 21: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Sticky Partitioner

● Batching is important to reduce CPU and network I/O on brokers

● Stick to one partition for a while when producing non-keyed messages

● “linger.ms” works well with sticky partitioner
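A sketch of the sticky-partitioning idea using Kafka's pluggable Partitioner interface (newer Kafka clients ship a built-in sticky partitioner; this version, with its 1-second stick interval, is only illustrative):

```java
// For non-keyed messages, keep writing to one randomly chosen partition for a fixed
// interval so producer batches fill up, then switch to another partition.
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class SimpleStickyPartitioner implements Partitioner {
    private static final long STICK_MS = 1_000;   // how long to stay on one partition (assumed)
    private volatile int current = -1;
    private volatile long switchedAt = 0;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes != null) {
            // Keyed messages keep deterministic, hash-based placement.
            return Math.floorMod(java.util.Arrays.hashCode(keyBytes), numPartitions);
        }
        long now = System.currentTimeMillis();
        if (current < 0 || now - switchedAt > STICK_MS) {
            current = ThreadLocalRandom.current().nextInt(numPartitions);
            switchedAt = now;
        }
        return current;
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```

Such a class would be wired in through the producer's partitioner.class configuration, and pairs naturally with linger.ms as the slide notes.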

Page 22: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Producing events to Keystone

● Using the Netflix Platform logging API

○ LogManager.logEvent(Annotatable): majority of the cases

○ KeyValueSerializer with ILog#log(String)

● REST endpoint that proxies Platform logging

○ ksproxy

○ Prana sidecar
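For the proxy path, a hedged illustration of what producing through a local sidecar might look like; the endpoint path and port below are hypothetical placeholders, not the real ksproxy API:

```java
// Hypothetical sketch: POST a JSON event to a local ksproxy/Prana sidecar that forwards
// it into the platform logging pipeline.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KsProxyClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String event = "{\"type\":\"playback_start\",\"deviceType\":\"ios\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8078/ksproxy/events"))   // hypothetical sidecar endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(event))
                .build();

        HttpResponse<String> resp = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("proxy responded: " + resp.statusCode());
    }
}
```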

Page 23: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Injected Event Metadata

● GUID

● Timestamp

● Host

● App
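A minimal sketch of this kind of metadata injection; the JSON field names are assumptions, not the actual Keystone wire format:

```java
// Stamp each event with a GUID, timestamp, host, and app name before it is produced.
import java.net.InetAddress;
import java.util.UUID;

public class EventEnvelope {
    public static String wrap(String app, String payloadJson) throws Exception {
        String guid = UUID.randomUUID().toString();              // GUID: later usable for dedup at the sink
        long ts = System.currentTimeMillis();                    // Timestamp
        String host = InetAddress.getLocalHost().getHostName();  // Host

        return String.format(
            "{\"guid\":\"%s\",\"ts\":%d,\"host\":\"%s\",\"app\":\"%s\",\"payload\":%s}",
            guid, ts, host, app, payloadJson);
    }
}
```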

Page 24: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Keystone Extensible Wire Protocol

● Invisible to sources & sinks

● Backwards and forwards compatibility

● Supports JSON; Avro on the horizon

● Efficient: 10 bytes of overhead per message

○ message size: hundreds of bytes to 10MB

Page 25: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Keystone Extensible Wire Protocol

● Packaged as a jar

● Why? Evolve Independently

○ event metadata & traceability metadata

○ event payload serialization

Page 26: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Max message size: 10MB

● Keystone drops any message larger than 10MB

○ Event payload is immutable, so oversized messages cannot be trimmed
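A trivial sketch of the size guard implied above, assuming a producer-side check on the serialized event:

```java
// Since the event payload is immutable and cannot be truncated, anything over 10MB is
// rejected before it reaches Kafka.
public class MessageSizeGuard {
    private static final int MAX_BYTES = 10 * 1024 * 1024;   // 10MB limit from the slide

    /** Returns true if the message may be produced, false if it must be dropped. */
    public static boolean withinLimit(byte[] serializedEvent) {
        return serializedEvent != null && serializedEvent.length <= MAX_BYTES;
    }
}
```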

Page 27: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Fronting Kafka Clusters

Page 28: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Keystone

[Architecture diagram repeated: Event Producer → Fronting Kafka → Samza Router → Consumer Kafka / EMR / Stream Consumers, managed by the Control Plane]

Page 29: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Fronting Kafka Clusters

● Normal-priority (majority)

● High-priority (streaming activities etc.)

Page 30: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Fronting Kafka Instances

● 3 ASGs per cluster, 1 ASG per zone

● 3000 d2.xl AWS instances across 3 regions for regular & failover traffic

Page 31: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Partition Assignment

● All replica assignments are zone aware

○ Improved availability

○ Reduced cost of maintenance
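A hedged sketch of zone-aware replica assignment using the modern Kafka AdminClient (the talk-era clusters predate this API); the broker-to-zone mapping below is made up for illustration:

```java
// Pin each partition's replicas to brokers known to sit in different availability zones.
// Hypothetical layout: brokers 1,4 in zone A, 2,5 in zone B, 3,6 in zone C.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ZoneAwareTopic {
    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");

        Map<Integer, List<Integer>> assignment = Map.of(
            0, List.of(1, 2, 3),   // partition 0: one replica per zone
            1, List.of(2, 3, 4),
            2, List.of(3, 1, 5));

        try (Admin admin = Admin.create(p)) {
            admin.createTopics(List.of(new NewTopic("events", assignment))).all().get();
        }
    }
}
```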

Page 32: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Kafka Fault Tolerance

● Instance failure

○ With a replication factor of N, no data loss is guaranteed with up to N-1 failures

○ With zone-aware replica assignment, no data loss is guaranteed with multiple instance failures in the same zone

● Sink failure

○ No data loss during the retention period

● Replication is the key

○ Data loss can happen if the leader dies while the follower AND consumer cannot catch up

○ Usually indicated by the UncleanLeaderElection metric

Page 33: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Kafka Auditor as a Service

● Broker monitoring

● Consumer monitoring

● Heart-beat & Continuous message latency

● On-demand Broker performance testing

● Built as a service deployable on single or multiple instances
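A minimal sketch of the heartbeat idea behind continuous latency measurement (not the Netflix auditor): produce a timestamped message and measure how long it takes to be consumed back:

```java
// Periodically produce a heartbeat carrying its send time to the audited cluster, then
// consume it and report produce-to-consume latency as a continuous signal.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class HeartbeatAuditor {
    public static void main(String[] args) {
        Properties common = new Properties();
        common.put("bootstrap.servers", "localhost:9092");
        common.put("group.id", "auditor");
        common.put("auto.offset.reset", "earliest");
        common.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        common.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        common.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        common.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(common);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(common)) {
            consumer.subscribe(List.of("auditor-heartbeat"));

            // Emit a heartbeat carrying the send time.
            producer.send(new ProducerRecord<>("auditor-heartbeat",
                    Long.toString(System.currentTimeMillis())));
            producer.flush();

            // Measure produce-to-consume latency for each heartbeat seen.
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(5))) {
                long latencyMs = System.currentTimeMillis() - Long.parseLong(r.value());
                System.out.println("heartbeat latency ms: " + latencyMs);
            }
        }
    }
}
```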

Page 34: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Current Issues

● Using d2.xl instances is a trade-off between cost and performance

● Performance deteriorates as the number of partitions increases

● Replication lag during peak traffic

Page 35: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Routing Service

Page 36: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Keystone

[Architecture diagram repeated: Event Producer → Fronting Kafka → Samza Router → Consumer Kafka / EMR / Stream Consumers, managed by the Control Plane]

Page 37: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Page 38: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Broker

Page 39: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Routing Infrastructure

[Diagram: routing infrastructure — Samza 0.9.1 routing jobs plus a Checkpointing Cluster]

Page 40: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

[Diagram: the Router Job Manager (Control Plane) launches routing jobs onto ksnode EC2 instances in an ASG; Zookeeper handles instance-id assignment; jobs record offsets in the Checkpointing Cluster; state is reconciled every minute]

Page 41: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Routing Layer

● Total of 13,000 containers on 1,300 AWS C3-4XL instances

○ S3 sink: ~7,000 containers

○ Consumer Kafka sink: ~4,500 containers

○ Elasticsearch sink: ~1,500 containers

Page 42: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Routing Layer

● Total of ~1,400 streams across all regions

○ ~1,000 S3 streams

○ ~250 Consumer Kafka streams

○ ~150 Elasticsearch streams

Page 43: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Router Job Details

● One job per sink and Kafka source topic

○ Separate jobs for the S3, Elasticsearch & Kafka sinks

○ Provides better isolation & better QoS

● Message requests to sinks are processed in batches

● Offsets are checkpointed only after the batch request succeeds (a sketch follows)
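A hedged sketch (plain Kafka consumer, not the Samza router) of that checkpointing rule: offsets are committed only once the sink write succeeds, which is what makes delivery at-least-once:

```java
// Poll a batch, write it to the sink, and commit offsets only on success. A failed batch
// is never checkpointed, so it is reprocessed after a restart (duplicates possible, no loss).
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class BatchingRouter {
    interface Sink { void writeBatch(List<String> events) throws Exception; }

    public static void run(Sink sink) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("group.id", "router-s3");
        p.put("enable.auto.commit", "false");   // checkpoint manually, never before the sink write
        p.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        p.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                List<String> batch = new ArrayList<>();
                for (ConsumerRecord<String, String> r : records) batch.add(r.value());

                try {
                    sink.writeBatch(batch);        // e.g. one S3 object per batch
                    consumer.commitSync();         // checkpoint offsets only after the write succeeds
                } catch (Exception e) {
                    // Don't commit. Exit and let the job restart from the last checkpoint,
                    // so the failed batch is reprocessed (at-least-once => duplicates possible).
                    break;
                }
            }
        }
    }
}
```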

Page 44: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Processing Semantics: Data Loss & Duplicates

Page 45: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Backpressure

Producer ⇐ Kafka Cluster ⇐ Samza job router ⇐ Sink

● Keystone provides at-least-once delivery

Page 46: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Data Loss - Producer

● buffer full

● network error

● partition leader change

● partition migration

Page 47: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Data Loss - Kafka

● Losing all Kafka replicas of the data

○ Safeguards:

■ AZ isolation / broker replacement automation

■ Alerts and monitoring

● Unclean partition leader election

○ ack = 1 could cause loss

Page 48: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Data Loss - Router

● Checkpointed offset is lost AND the router stays down for the full retention period

● Messages not processed within the retention period (8h / 24h)

● Unclean leader election causes offsets to go backwards

● Safeguard

○ Alerts when lag > 0.1% of traffic for 10 minutes

● A concern only if router instances cannot be launched

Page 49: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Duplicates: Router → Sink

● Duplicates possible

○ messages reprocessed - retry after batch S3 upload failure

○ Loss of checkpointed offset (message processed marker)

○ Event GUID helps dedup

Page 50: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Measuring Duplicates

● Diff the producer sent count against the Kafka messages received

● Monitor the router's checkpointed offset over time

Note: the GUID can be used to dedup at the sink (see the sketch below)
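A minimal sketch of GUID-based dedup at a sink, assuming a bounded in-memory window of recently seen GUIDs (a real sink would size or persist this window to match its duplicate horizon):

```java
// Remember recently seen event GUIDs in a bounded LRU set and drop any event whose GUID
// has already been written within the window.
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class GuidDeduper {
    private final Set<String> seen;

    public GuidDeduper(int maxEntries) {
        this.seen = Collections.newSetFromMap(new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;   // evict the least recently seen GUID
            }
        });
    }

    /** Returns true the first time a GUID is seen; false for duplicates within the window. */
    public synchronized boolean firstTime(String guid) {
        return seen.add(guid);
    }
}
```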

Page 51: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Page 52: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

End to End Metrics

● Producer-to-Router-to-Sink average latencies

○ Batch processing [S3 sink]: ~3 sec

○ Stream processing [Consumer Kafka sink]: ~1 sec

○ Log analysis [Elasticsearch]: ~400 seconds (with back pressure)

Page 53: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

End to End Metrics

● End-to-end latencies

○ S3:

■ 50th percentile under 1 sec

■ 80th percentile under 8 seconds

○ Consumer Kafka:

■ 50th percentile under 800 ms

■ 80th percentile under 4 seconds

○ Elasticsearch:

■ 50th percentile under 13 seconds

■ 80th percentile under 53 seconds

Page 54: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Alerts

● Producer drop rate over 1%

● Consumer lag > 0.1%

● Next offset after the checkpointed offset not found

● Consumer stuck at the partition level

Page 55: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Keystone Dashboard

Page 56: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Keystone Dashboard

Page 57: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Keystone Dashboard

Page 58: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

Keystone Dashboard

Page 59: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

And Then???

Page 60: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

There’s more in the pipeline...

● Self-service tools

● Better management of scaling Kafka

● More capable control plane

● JSON support exists; Avro support is on the horizon

● Multi-tenant Messaging as a Service - MaaS

● Multi-tenant Stream Processing as a Service - SPaaS

Page 61: Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec

???s