Netflix Keystone - How Netflix Handles Data Streams of Up to 11M Events/Sec
TRANSCRIPT
A NETFLIX ORIGINAL SERVICE
Peter Bakas | @peter_bakas
@ Netflix : Cloud Platform Engineering - Real Time Data Infrastructure
@ Ooyala : Analytics, Discovery, Platform Engineering & Infrastructure
@ Yahoo : Display Advertising, Behavioral Targeting, Payments
@ PayPal : Site Engineering and Architecture
@ Play : Advisor to Startups (Data, Security, Containers)
Who is this guy?
Common data pipeline to collect, transport, aggregate, process and visualize events
Why are we here?
● Architectural design and principles for Keystone
● Technologies that Keystone is leveraging
● Best practices
What should I expect?
Let’s get down to business
Netflix is a logging company
that occasionally streams movies
600+ billion events ingested per day
11 million events per second (24 GB per second) at peak
Hundreds of event types
Over 1.3 petabytes per day
Numbers Galore!
But wait, there’s more
1+ trillion events processed every day
1 trillion events ingested per day during holiday season
Numbers Galore - Part Deux
How did we get here?
Chukwa
Chukwa/Suro + Real-Time Branch
Keystone
[Architecture diagram: Event Producer → Fronting Kafka → Samza Router → EMR / Consumer Kafka / Stream Consumers; Control Plane; HTTP proxy]
Kafka Primer
Kafka is a distributed, partitioned, replicated commit log service.
Kafka Terminology
● Producer
● Consumer
● Topic
● Partition
● Broker
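To make these terms concrete, here is a minimal sketch using the Apache Kafka Java client (broker address and topic name are placeholders): a producer publishes records to a topic, the broker stores them in partitions, and a consumer reads them back.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaTermsDemo {
    public static void main(String[] args) {
        // Producer: publishes records to a topic; the broker places them into partitions.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // address of a broker
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("demo-topic", "device-42", "playback_start"));
        }

        // Consumer: subscribes to the topic and reads records partition by partition.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
            }
        }
    }
}
```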
Netflix Kafka Producer
● Best effort delivery
● Prefer dropping messages over disrupting the producer app
● Wraps Apache Kafka Producer
● Integration with Netflix ecosystem: Eureka, Atlas, etc.
Producer Impact
● A Kafka outage does not disrupt existing instances from serving their purpose
● A Kafka outage should never prevent new instances from starting up
● After the Kafka cluster is restored, event producing should resume automatically
Prefer Drop over Block
● Drop when the buffer is full
● Handle potential blocking of the first metadata request
● acks=1 (vs. 2); see the config sketch below
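A minimal config sketch of that philosophy, assuming the current Apache Kafka Java producer (the exact knobs on the 0.8-era client Netflix wrapped differ): bound how long send() may block, keep acks at 1, and drop the event in the application when the buffer is exhausted.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DropOverBlockProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("acks", "1");                 // leader ack only, rather than waiting on more replicas
        props.put("buffer.memory", "33554432"); // 32 MB in-memory buffer
        props.put("max.block.ms", "500");       // bound blocking on metadata / full buffer instead of stalling the app

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            byte[] event = "{\"type\":\"playback_start\"}".getBytes();
            try {
                // Fire-and-forget; a Kafka problem should never disrupt the producing application.
                producer.send(new ProducerRecord<>("keystone-events", event));
            } catch (Exception e) {
                // Best-effort delivery: count the drop and move on.
                System.err.println("dropped event: " + e.getMessage());
            }
        }
    }
}
```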
Sticky Partitioner
● Batching is important to reduce CPU and network I/O on brokers
● Stick to one partition for a while when producing non-keyed messages (see the sketch below)
● “linger.ms” works well with the sticky partitioner
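A sketch of the idea as a custom org.apache.kafka.clients.producer.Partitioner, not Netflix's actual implementation; the one-second stickiness window is an arbitrary assumption. Keyless messages stay on one partition long enough for batches (and linger.ms) to do their job.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class StickyPartitioner implements Partitioner {
    private static final long STICK_MS = 1_000;   // assumed: stay on a partition for ~1s
    private volatile int currentPartition = -1;
    private volatile long lastSwitch = 0;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        if (keyBytes != null) {
            // Keyed messages keep the usual hash-based placement.
            return Utils.toPositive(Utils.murmur2(keyBytes)) % partitions.size();
        }
        long now = System.currentTimeMillis();
        if (currentPartition < 0 || now - lastSwitch > STICK_MS) {
            // Pick a new partition and stick to it, so batching stays effective.
            currentPartition = ThreadLocalRandom.current().nextInt(partitions.size());
            lastSwitch = now;
        }
        return currentPartition;
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```

It would be registered through the producer's partitioner.class property; Kafka 2.4+ later added built-in sticky behavior to the default partitioner.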
Producing events to Keystone
● Using the Netflix Platform logging API
○ LogManager.logEvent(Annotatable): majority of the cases
○ KeyValueSerializer with ILog#log(String)
● REST endpoint that proxies Platform logging (see the sketch below)
○ ksproxy
○ Prana sidecar
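For the REST path, a hypothetical sketch using Java's HttpClient; the endpoint path, port, and payload shape shown here are assumptions, not the documented ksproxy/Prana API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KeystoneProxyExample {
    public static void main(String[] args) throws Exception {
        String event = "{\"eventType\":\"playback_start\",\"deviceId\":\"abc123\"}";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                // Assumed local sidecar address; the real ksproxy/Prana route is not shown in the talk.
                .uri(URI.create("http://localhost:8080/ksproxy/events"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(event))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("status=" + response.statusCode());
    }
}
```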
Injected Event Metadata
● GUID
● Timestamp
● Host
● App
Keystone Extensible Wire Protocol
● Invisible to sources & sinks
● Backwards and forwards compatible
● Supports JSON; Avro on the horizon
● Efficient: 10 bytes of overhead per message
○ message size: hundreds of bytes to 10MB
Keystone Extensible Wire Protocol
● Packaged as a jar
● Why? To evolve independently:
○ event metadata & traceability metadata
○ event payload serialization
Max message size: 10MB
● Keystone drops any message > 10MB
○ Immutable event payload
Fronting Kafka Clusters
[Architecture diagram repeated: Fronting Kafka within the Keystone pipeline]
Fronting Kafka Clusters
● Normal-priority (majority)
● High-priority (streaming activities, etc.)
Fronting Kafka Instances
● 3 ASGs per cluster, 1 ASG per zone
● 3000 d2.xl AWS instances across 3 regions for regular & failover traffic
Partition Assignment
● All replica assignments are zone aware (see the sketch below)
○ Improved availability
○ Reduced cost of maintenance
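A toy sketch of what zone awareness means for replica placement: each partition's replicas are spread across distinct zones, so losing one zone never removes every copy. Broker IDs and zones are made up; this is not Netflix's actual assignment tooling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ZoneAwareAssignment {
    /** Assign replicationFactor replicas per partition, one broker per zone, rotating the leading zone. */
    static Map<Integer, List<String>> assign(Map<String, List<String>> brokersByZone,
                                             int partitions, int replicationFactor) {
        List<String> zones = new ArrayList<>(brokersByZone.keySet());
        Map<Integer, List<String>> assignment = new LinkedHashMap<>();
        for (int p = 0; p < partitions; p++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                String zone = zones.get((p + r) % zones.size());   // rotate which zone leads each partition
                List<String> brokers = brokersByZone.get(zone);
                replicas.add(brokers.get(p % brokers.size()));     // pick a broker within that zone
            }
            assignment.put(p, replicas);
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, List<String>> brokersByZone = new LinkedHashMap<>();
        brokersByZone.put("us-east-1a", List.of("b1", "b2"));
        brokersByZone.put("us-east-1b", List.of("b3", "b4"));
        brokersByZone.put("us-east-1c", List.of("b5", "b6"));
        assign(brokersByZone, 6, 3).forEach((p, r) -> System.out.println("partition " + p + " -> " + r));
    }
}
```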
Kafka Fault Tolerance
● Instance failure
○ With a replication factor of N, no data loss is guaranteed for up to N-1 failures
○ With zone-aware replica assignment, no data loss even with multiple instance failures in the same zone
● Sink failure
○ No data loss during the retention period
● Replication is the key (see the config sketch below)
○ Data loss can happen if the leader dies while both the follower AND the consumer cannot catch up
○ Usually indicated by the UncleanLeaderElection metric
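A sketch of topic settings that line up with these guarantees, using Kafka's AdminClient (newer than the brokers described in the talk, so purely illustrative): replication factor 3 and unclean leader election disabled, since an unclean election is exactly the loss case the UncleanLeaderElection metric flags.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class FaultTolerantTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Replication factor 3: tolerate up to 2 broker failures without data loss,
            // assuming replicas land in different zones (see the zone-aware sketch above).
            NewTopic topic = new NewTopic("keystone-events", 24, (short) 3)
                    .configs(Map.of(
                            "unclean.leader.election.enable", "false", // never elect an out-of-sync replica
                            "retention.ms", String.valueOf(24L * 60 * 60 * 1000))); // sinks get 24h to catch up
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```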
Kafka Auditor as a Service
● Broker monitoring
● Consumer monitoring
● Heartbeat & continuous message latency (see the sketch below)
● On-demand Broker performance testing
● Built as a service deployable on single or multiple instances
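A sketch of the heartbeat idea, as an assumption about the mechanism rather than the auditor's real code: produce a timestamped heartbeat into a monitored topic, consume it back, and report the end-to-end latency.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HeartbeatAuditor {
    public static void main(String[] args) {
        String topic = "keystone-heartbeat"; // assumed auditor topic name

        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "auditor");
        cp.put("auto.offset.reset", "latest");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(List.of(topic));
            consumer.poll(Duration.ofSeconds(2)); // join the group and get an assignment before sending

            // Heartbeat payload carries the send timestamp.
            producer.send(new ProducerRecord<>(topic, String.valueOf(System.currentTimeMillis())));
            producer.flush();

            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(5))) {
                long latencyMs = System.currentTimeMillis() - Long.parseLong(r.value());
                System.out.println("heartbeat latency ms = " + latencyMs); // feed into Atlas, alerts, etc.
            }
        }
    }
}
```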
Current Issues
● With the d2.xl there is a trade-off between cost and performance
● Performance deteriorates as the number of partitions increases
● Replication lag during peak traffic
Routing Service
[Architecture diagram repeated: the routing service (Samza Router) within the Keystone pipeline]
Routing Infrastructure
[Routing infrastructure diagram: checkpointing cluster, 0.9.1, Router Job Manager (control plane), Zookeeper for instance ID assignment, router jobs on ksnode EC2 instances in an ASG; reconciled every minute]
Routing Layer
● Total of 13,000 containers on 1,300 AWS c3.4xlarge instances
○ S3 sink: ~7,000 containers
○ Consumer Kafka sink: ~4,500 containers
○ Elasticsearch sink: ~1,500 containers
Routing Layer
● Total of ~1,400 streams across all regions
○ ~1000 S3 streams
○ ~250 Consumer Kafka streams
○ ~150 Elasticsearch streams
Router Job Details
● One job per sink and Kafka source topic
○ Separate jobs for the S3, Elasticsearch & Kafka sinks
○ Provides better isolation & better QoS
● Message requests to sinks are processed in batches
● Offsets are checkpointed after the batch request succeeds (see the sketch below)
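The real router is a Samza job, but the checkpoint-after-batch contract can be sketched with a plain Kafka consumer (topic, group, and sink call are placeholders): poll a batch, write it to the sink, and only commit offsets once the write succeeds, so a failed upload is retried instead of lost.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RouterJobSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "router-s3-sink");   // one job per sink and source topic
        props.put("enable.auto.commit", "false");  // checkpoint manually, only after the sink accepts the batch
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("keystone-events"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;

                List<byte[]> batch = new ArrayList<>();
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    batch.add(r.value());
                }
                try {
                    uploadToSink(batch);    // e.g. a batched S3 PUT; placeholder here
                    consumer.commitSync();  // offsets checkpointed only after the batch succeeded
                } catch (Exception e) {
                    // No commit: rewind to the start of the batch so it is retried
                    // (duplicates possible, loss avoided).
                    for (TopicPartition tp : records.partitions()) {
                        consumer.seek(tp, records.records(tp).get(0).offset());
                    }
                }
            }
        }
    }

    private static void uploadToSink(List<byte[]> batch) {
        // Placeholder for the real sink write (S3 / Elasticsearch / consumer Kafka).
    }
}
```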
Processing Semantics
Data Loss & Duplicates
Backpressure
Producer ⇐ Kafka Cluster ⇐ Samza job router ⇐ Sink
● Keystone: at-least-once delivery
Data Loss - Producer
● buffer full
● network error
● partition leader change
● partition migration
Data Loss - Kafka
● Losing all Kafka replicas of the data
○ Safeguards:
■ AZ isolation / broker replacement automation
■ Alerts and monitoring
● Unclean partition leader election
○ acks = 1 could cause loss
Data Loss - Router
● Checkpointed offset is lost AND the router was down for the whole retention period
● Messages not processed before the retention period expires (8h / 24h)
● Unclean leader election causes the offset to go backwards
● Safeguard
○ Alerts when lag > 0.1% of traffic for 10 minutes
● Only a concern if unable to launch router instances
Duplicates: Router to Sink
● Duplicates possible
○ messages reprocessed - retry after batch S3 upload failure
○ Loss of checkpointed offset (message processed marker)
○ Event GUID helps dedup
Measuring Duplicates
● Diff of producer sent count vs. Kafka messages received
● Router checkpointed offset monitored over time
Note: the GUID can be used to dedup at the sink (see the sketch below)
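A minimal sketch of sink-side dedup keyed on the injected GUID; the bounded LRU cache and its size are assumptions, and a real sink would size the window to the worst-case reprocessing span.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Drops events whose GUID was already seen recently (at-least-once upstream => possible duplicates). */
public class GuidDeduper {
    private static final int MAX_TRACKED = 1_000_000; // assumed cache size; tune to the retry window

    // LRU set of recently seen GUIDs.
    private final Map<String, Boolean> seen = new LinkedHashMap<>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
            return size() > MAX_TRACKED;
        }
    };

    /** Returns true if the event should be written to the sink, false if it is a duplicate. */
    public synchronized boolean firstTime(String guid) {
        return seen.put(guid, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        GuidDeduper dedup = new GuidDeduper();
        System.out.println(dedup.firstTime("evt-123")); // true: write to sink
        System.out.println(dedup.firstTime("evt-123")); // false: duplicate, skip
    }
}
```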
End to End Metrics
● Producer to router to sink average latencies
○ Batch processing [S3 sink]: ~3 sec
○ Stream processing [Consumer Kafka sink]: ~1 sec
○ Log analysis [Elasticsearch]: ~400 seconds (with back pressure)
End to End Metrics
● End-to-end latencies
○ S3:
■ 50th percentile under 1 sec
■ 80th percentile under 8 seconds
○ Consumer Kafka:
■ 50th percentile under 800 ms
■ 80th percentile under 4 seconds
○ Elasticsearch:
■ 50th percentile under 13 seconds
■ 80th percentile under 53 seconds
Alerts
● Producer drop rate over 1%
● Consumer lag > 0.1%
● Next offset after the checkpointed offset not found
● Consumer stuck at the partition level
Keystone Dashboard
And Then???
There’s more in the pipeline...
● Self-service tools
● Better management of scaling Kafka
● More capable control plane
● JSON support exists; Avro support on the horizon
● Multi-tenant Messaging as a Service - MaaS
● Multi-tenant Stream Processing as a Service - SPaaS
Questions?