Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013
DESCRIPTION
This session explains how Netflix uses the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows rapid, independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we balance additions to (and subtractions from) the customer experience while aggressively scraping off barnacle features that add complexity for little value.
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Netflix Development Patterns for Rapid Iteration, Scale, Performance, & Availability
Neil Hunt, Netflix
November 13, 2013
Are You Designing Systems That Are:
• Web-scale
• Global
• Highly available
• Consumer-facing
• Cloud native
Cloud Native
• Service-oriented architecture
• Redundancy
• Statelessness
• NoSQL
• Eventual consistency
Assumptions
[Chart: a 2×2 of speed of change vs. scale (slowly changing vs. rapid change, small scale vs. large scale). Enterprise IT and telcos assume everything works; startups and web-scale systems assume everything is broken: hardware will fail, software will fail.]
Netflix Cloud Goals: Availability, Scale, Performance
Performance
• Reduce session start by 1s: save 1 human lifetime per day! Win more moments of truth
• Suggest choices 1% better: 500k hours/day additional value delivered
Scale
• 50% y/y traffic growth
• 50 countries, 3 continents
• Tens of thousands of instances at peak
• 4 AWS regions, 12 datacenters
• ~$0.001 per start
Availability
• Aspire to 4 nines (99.99% of starts successful)
• Per quarter:
– Downtime: < 3 mins (peak time)
– Successful starts: 9.999B
– Failures: 1M (frustration, calls, lost business)
Availabilities Compound
With N service dependencies, each 99.99% available, overall availability is 99.99%^N:

N dependencies    Overall availability
2                 .9998
10                .999
100               .99
1000              .9
Availabilities Compound
To achieve 99.99% availability with 1000 components requires either:
• 99.9999% availability for each dependency, where component failure leads to system failure, or
• Isolation for independence, where component failure leads to degradation rather than system failure
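A worked check of the arithmetic (mine, not the slide's): with N independent serial dependencies, each available with probability a, system availability is a^N; the table above is a = 0.9999. Inverting for a target availability gives the per-dependency requirement; strictly, 99.99% over 1000 serial components needs about seven nines per dependency, so the slide's six-nines figure is a round approximation.

```latex
% Availability over N independent serial dependencies, each available
% with probability a:
%   A_system = a^N
% With a = 0.9999:
%   a^2 ~ 0.9998,  a^10 ~ 0.999,  a^100 ~ 0.99,  a^1000 ~ 0.905
% Inverting for a target availability A over N dependencies:
%   a_required = A^(1/N);  0.9999^(1/1000) ~ 0.9999999  (seven nines)
\[
  A_{\text{system}} = a^{N}, \qquad a_{\text{required}} = A_{\text{target}}^{1/N}
\]
```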
Availability, Scale, Performance Are Not Enough!
Rapid Iteration – Rate of Change
• Running tests
• Rolling out tests
– Engineering the winning test experience for scale
• Adding features
• Scaling up
• Removing features, simplifying, minimizing
Testing
• Up to 1,000 changes per day!
Rate of Change
• Change leads to bugs:
– New features
– New configurations
– New types of inputs
– Scaling up
• Availability is in tension with rate of change
Availability / Rate of Change Tradeoff
[Chart: availability (99% to 99.999%) against rate of change (1 to 1000), with a curve marking the frontier of availability/change.]
Shifting the Curve…
[Chart: the same availability vs. rate-of-change axes, with the frontier shifted outward so more change is possible at the same availability.]
Shifting the Curve
• Must break the chained dependencies that compound in cascading system failure
• Subsystem isolation: failure in one component should never result in cascading system failure
Isolating Subsystems: Redundant systems with timeout & failover
• Failure of instance
• Failure of network
• Latency Monkey to test
[Diagram: a dependent system calls its dependence with a timeout and fails over to a redundant instance; a higher-tier system applies a short timeout while the call below it uses a longer timeout.]
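A minimal sketch of this pattern in Java (names and structure are mine, not Netflix's): bound every remote call with a timeout, and on timeout or error retry against a redundant instance.

```java
import java.util.concurrent.*;

public class FailoverClient {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Bound every remote call with a deadline so a slow dependency
    // cannot stall the caller.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // abandon the slow instance
            throw e;
        }
    }

    // Instance failure or network failure: retry once against a redundant copy.
    static <T> T callWithFailover(Callable<T> primary, Callable<T> redundant,
                                  long timeoutMs) throws Exception {
        try {
            return callWithTimeout(primary, timeoutMs);
        } catch (Exception failure) {
            return callWithTimeout(redundant, timeoutMs);
        }
    }
}
```

Latency Monkey injects artificial delays in production to verify that these timeouts and failovers actually engage.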
Isolating Subsystems: Timeout with fallback default response
• Network failure
• Software bug
[Diagram: the dependent system calls its dependence with a timeout and, on expiry, serves a default response: { status=mem, plan=4, device=true }]
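Netflix open-sourced this pattern as the Hystrix library; below is a sketch using its API, where the command class, the remote call, and the default fields (mirroring the slide's example) are illustrative assumptions.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class SubscriberInfoCommand extends HystrixCommand<String> {

    public SubscriberInfoCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("SubscriberService"));
    }

    @Override
    protected String run() throws Exception {
        // The real remote call; may be slow or fail outright.
        return fetchFromSubscriberService();
    }

    @Override
    protected String getFallback() {
        // Served on timeout, error, or open circuit: a safe default that
        // keeps the customer experience working in a degraded mode.
        return "{ status=mem, plan=4, device=true }";
    }

    private String fetchFromSubscriberService() throws Exception {
        throw new Exception("subscriber service unavailable");  // placeholder
    }
}
```

Calling `new SubscriberInfoCommand().execute()` returns the result of run() normally and the fallback on failure or timeout, so a broken subscriber service degrades the experience instead of breaking it.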
Isolating Subsystems: Canary push
• Network failure
• Software bug
[Diagram: the dependent system calls its dependence with a timeout; one canary instance in the pool runs the new code.]
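A sketch of the canary idea (the 1% fraction and host-list shape are assumptions, not from the talk): route a small slice of production traffic to one instance running the new code, and compare its error and latency metrics against the baseline before pushing everywhere.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class CanaryRouter {
    private static final double CANARY_FRACTION = 0.01;  // ~1% of traffic

    private final List<String> baselineHosts;  // current production code
    private final String canaryHost;           // single instance, new code

    public CanaryRouter(List<String> baselineHosts, String canaryHost) {
        this.baselineHosts = baselineHosts;
        this.canaryHost = canaryHost;
    }

    public String pickHost() {
        if (ThreadLocalRandom.current().nextDouble() < CANARY_FRACTION) {
            return canaryHost;  // metrics from this slice decide the rollout
        }
        return baselineHosts.get(
            ThreadLocalRandom.current().nextInt(baselineHosts.size()));
    }
}
```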
Isolating Subsystems: Red/black deployment
• Software bugs
[Diagram: bad code pushed to dependence V2.3; traffic fails back to the old code on dependence V2.2.]
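The red/black switch can be as simple as repointing which cluster receives traffic while the old cluster stays warm; a toy sketch (cluster names assumed):

```java
import java.util.concurrent.atomic.AtomicReference;

public class RedBlackSwitch {
    private final String oldCluster = "asg-v2.2";   // kept running, idle
    private final String newCluster = "asg-v2.3";   // receives traffic after push
    private final AtomicReference<String> active = new AtomicReference<>(newCluster);

    public String activeCluster() { return active.get(); }

    // Bad code detected in V2.3: fail back to the old code in one step,
    // with no redeploy needed because V2.2 never went away.
    public void rollBack() { active.set(oldCluster); }
}
```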
Isolating Subsystems: Standby blue system
• Independent implementation
• Simplified logic
[Diagram: the dependent system calls dependence V2.3 and, on failure, fails to a static reference implementation.]
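A sketch of the standby blue idea (interface and names assumed): the fallback is not just a default value but an independent, simplified implementation, here serving static, unpersonalized results, so a bug in the primary's logic is unlikely to be shared by the standby.

```java
public class RecommendationsWithStandby {
    interface Recommender { String topPicksFor(String member); }

    private final Recommender primary;              // full personalized logic (V2.3)
    private final Recommender staticReference =     // independent, simplified logic
        member -> "top-100-overall";                // precomputed, unpersonalized

    public RecommendationsWithStandby(Recommender primary) {
        this.primary = primary;
    }

    public String topPicksFor(String member) {
        try {
            return primary.topPicksFor(member);
        } catch (RuntimeException e) {
            return staticReference.topPicksFor(member);  // fail to static version
        }
    }
}
```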
Isolating Subsystems: Zone isolation
• Infrastructure failure (e.g. power outage)
• Chaos Gorilla
[Diagram: a load balancer in front of two full stacks (dependent system plus dependence), one in Zone A and one in Zone B.]
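A client-side sketch of zone isolation (assumed, loosely in the spirit of a zone-aware load balancer): prefer healthy instances in the caller's own zone and cross zones only when the local stack has none, so a zone outage stays contained.

```java
import java.util.List;
import java.util.Map;

public class ZoneAwareSelector {
    private final String myZone;                          // e.g. "us-east-1a"
    private final Map<String, List<String>> healthyByZone;

    public ZoneAwareSelector(String myZone, Map<String, List<String>> healthyByZone) {
        this.myZone = myZone;
        this.healthyByZone = healthyByZone;
    }

    public String pick() {
        // Stay in-zone while the local stack is healthy.
        List<String> local = healthyByZone.getOrDefault(myZone, List.of());
        if (!local.isEmpty()) return local.get(0);
        // Zone failed (e.g. power outage): cross to any other healthy zone.
        return healthyByZone.values().stream()
                .flatMap(List::stream)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no healthy zone"));
    }
}
```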
Isolating Subsystems: Region isolation
• Infrastructure software bugs (e.g. load balancer fail)
• Chaos Kong
[Diagram: DNS directs traffic between Region E and Region W; each region runs its own load balancer in front of full stacks (dependent system plus dependence) in Zone A and Zone B.]
Isolating Subsystems: Summary

Failure mode                         Isolating technique
Instance failure, network failure    Redundant systems with failover and timeout
Network failure, software bug        Timeout with default response
Software bug                         Canary push; red/black deployment; blue systems
Infrastructure failure               Zone isolation
Cross-zone software bugs             Region isolation
Trying Harder Won't Cut It
• Trying harder gets a linear return on an exponential problem
• Need to be great at execution AND have the right architecture
• What architectural features are you using to ensure availability, scale, performance, and rapid rate of change?