Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013


DESCRIPTION

This session explains how Netflix uses the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows rapid, independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we manage additions to (and subtractions from) the customer experience while aggressively scraping off barnacle features that add complexity for little value.

TRANSCRIPT

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Netflix Development Patterns for Rapid Iteration, Scale, Performance, & Availability

Neil Hunt, Netflix

November 13, 2013

Are You Designing Systems That Are: • Web-scale • Global • Highly available • Consumer-facing • Cloud native

Cloud Native • Service oriented architecture • Redundancy • Statelessness • NoSQL • Eventual consistency

Assumptions

[2x2 chart, axes: Scale vs. Speed of change. Quadrants: Slowly Changing / Large Scale; Rapid Change / Large Scale; Slowly Changing / Small Scale; Rapid Change / Small Scale. Enterprise IT and telcos sit on the slowly changing side and assume everything works; startups and web-scale companies sit on the rapid change side and assume everything is broken: hardware will fail, software will fail.]

Netflix Cloud Goals: Availability, Scale, Performance

Performance • Reduce session start by 1s – save 1 human lifetime per day! Win more moments of truth • Suggest choices 1% better – 500k hours/day additional value delivered

Scale • 50% y/y traffic growth • 50 Countries, 3 continents • Tens of thousands of instances at peak • 4 AWS regions, 12 datacenters • ~$.001 per start

Availability • Aspire to 4 x nines (99.99% of starts successful) • Per Quarter: – Downtime: < 3 mins (peak time) – Successful starts: 9.999B – Failures: 1M (frustration, calls, lost business)

Availabilities Compound – N Service Dependencies

A request chained through N dependencies, each at 99.99%, has overall availability of roughly 0.9999^N:

N dependencies    Availability
2                 99.98%
10                99.9%
100               99%
1000              90%

Availabilities Compound

To achieve 99.99% availability with 1000 components requires either:
• 99.9999% availability for each dependency (component failure leads to system failure), or
• Isolation for independence (component failure leads to degradation rather than system failure)
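A quick back-of-the-envelope check (not from the deck) of how per-dependency availability compounds; the 99.99% figure and the dependency counts come from the slides above:

```python
# Compound availability of a request that must touch N serial dependencies,
# each independently available with probability a (illustrative calculation).
def compound_availability(a: float, n: int) -> float:
    return a ** n

for n in (2, 10, 100, 1000):
    print(f"{n:>4} dependencies at 99.99% each -> {compound_availability(0.9999, n):.2%}")
# ~99.98%, ~99.90%, ~99.00%, ~90.48% -- matching the table above
```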

Availability, Scale, Performance Are Not Enough!

Rapid Iteration – Rate of Change • Running tests • Rolling out tests

– Engineering the winning test experience for scale

• Adding features • Scaling up • Removing features, simplifying, minimizing

Testing • Up to 1,000 changes per day!

Rate of Change • Change leads to bugs

– New features – New configurations – New types of inputs – Scaling up

• Availability is in tension with rate of change

Availability / Rate of Change Tradeoff

[Chart: Availability (99% to 99.999%) vs. Rate of Change (1 to 1000), showing the frontier of availability/change]


Shifting the Curve…

[Chart: the same Availability (99% to 99.999%) vs. Rate of Change (1 to 1000) axes, with the frontier shifted outward]

Shifting the Curve • Must break the chained dependencies that compound in cascading system failure • Subsystem isolation: failure in one component should never result in cascading system failure

Isolating Subsystems: Redundant systems with timeout & failover
• Failure of instance
• Failure of network
• Latency Monkey to test

[Diagram: dependent system calls its dependence with a timeout and fails over to a redundant system; a higher-tier system uses a short timeout while the lower tier uses a longer timeout]
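A minimal sketch of the timeout-and-failover pattern on this slide, assuming the Python `requests` library; the replica URLs and timeout values are illustrative, not Netflix's actual configuration:

```python
import requests

# Illustrative redundant replicas of the same dependence; a real deployment
# would discover these from a service registry rather than a hard-coded list.
REPLICAS = ["http://dependence-a.internal", "http://dependence-b.internal"]

def call_with_failover(path: str):
    """Bound each attempt with a timeout; on instance or network failure,
    fail over to a redundant replica (short timeout first, longer on retry)."""
    last_error = None
    for base_url, timeout_s in zip(REPLICAS, (0.25, 1.0)):
        try:
            resp = requests.get(base_url + path, timeout=timeout_s)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err  # this replica failed or timed out: try the next one
    raise last_error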

Isolating Subsystems: Timeout with fallback default response
• Network failure
• Software bug

[Diagram: dependent system calls its dependence with a timeout; on failure it returns a default response, e.g. { status=mem, plan=4, device=true }]
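A sketch of the fallback-default pattern, again assuming `requests`; the URL and timeout are illustrative, and the default payload mirrors the slide's example:

```python
import requests

# Static default returned when the dependence times out or fails
# (field values taken from the slide's example response).
DEFAULT_RESPONSE = {"status": "mem", "plan": 4, "device": True}

def call_with_fallback(url: str) -> dict:
    """Bound the call with a timeout; on timeout, network failure, or a bad
    response (e.g. a software bug in the dependence), degrade to a static
    default instead of failing the whole request."""
    try:
        resp = requests.get(url, timeout=0.25)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return DEFAULT_RESPONSE
```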

Isolating Subsystems: Canary push
• Network failure
• Software bug

[Diagram: dependent system calls the dependence fleet with a timeout; one canary instance runs the new code]
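A sketch of how a canary push might be wired at the routing layer; the 1% traffic fraction and instance names are assumptions for illustration, not the deck's numbers:

```python
import random

CANARY_FRACTION = 0.01  # illustrative: send ~1% of requests to the canary

def pick_instance(stable_instances: list[str], canary_instance: str) -> str:
    """Route a small slice of traffic to the single instance running new code;
    everything else stays on the known-good fleet. Compare error rates and
    latency on the canary before rolling the new code out any further."""
    if random.random() < CANARY_FRACTION:
        return canary_instance
    return random.choice(stable_instances)
```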

Isolating Subsystems: Red/black deployment
• Software bugs

[Diagram: dependent system points at dependence v2.3; when bad code is pushed, traffic fails back to the old-code v2.2 fleet]
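A sketch of the red/black idea: both fleets stay deployed, so failing back to the old code is just repointing traffic. The fleet names and version numbers are illustrative:

```python
# Both versions remain deployed behind the load balancer; "red" runs the new
# v2.3 code, "black" keeps the previous v2.2 fleet warm for instant rollback.
FLEETS = {"red": "myservice-v2_3", "black": "myservice-v2_2"}

def target_fleet(bad_code_pushed: bool) -> str:
    """Route to the new fleet normally; fail back to the old code on a bad push."""
    return FLEETS["black"] if bad_code_pushed else FLEETS["red"]
```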

Isolating Subsystems: Standby blue system
• Independent implementation
• Simplified logic

[Diagram: dependent system normally calls dependence v2.3; on failure it fails to a static reference implementation (static version)]

Isolating Subsystems: Zone isolation
• Infrastructure failure (e.g. power outage)
• Chaos Gorilla to test

[Diagram: a load balancer fronts a full dependent system + dependence stack in Zone A and another in Zone B]

Isolating Subsystems: Region isolation
• Infrastructure software bugs (e.g. load balancer fail)
• Chaos Kong to test

[Diagram: DNS routes to Region E and Region W; each region has its own load balancer in front of zone-isolated dependent system + dependence stacks in Zones A and B]
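A sketch of the two levels of traffic steering described above: inside a region, the load balancer absorbs a zone failure; when a whole region fails (e.g. its load balancer breaks), DNS shifts traffic to the surviving region. Region names and weights are illustrative:

```python
# Normal state: DNS splits traffic evenly across two regions; inside each
# region a load balancer spreads requests across zones A and B.
dns_weights = {"region-e": 50, "region-w": 50}

def evacuate_region(bad_region: str) -> dict[str, int]:
    """Chaos Kong drill: shift all DNS weight away from the failed region."""
    return {region: (0 if region == bad_region else 100)
            for region in dns_weights}

print(evacuate_region("region-e"))  # {'region-e': 0, 'region-w': 100}
```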

Isolating Subsystems

Dependency mode                      Isolating technique
Instance failure, network failure    Redundant systems with failover and timeout
Network failure                      Timeout with default response
Software bug                         Canary push; red-black deployment; blue systems
Infrastructure failure               Zone isolation
Cross-zone software bugs             Region isolation

Trying Harder Won’t Cut It
• Trying harder gets a linear return on an exponential problem
• Need to be great at execution AND have the right architecture
• What architectural features are you using to ensure availability, scale, performance, & rapid rate of change?

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

DMG206
