Edge Architecture - IEEE International Conference on Cloud Engineering
Post on 16-Sep-2014
Over 44 million subscribers in over 40 countries
Netflix accounts for over 30% of peak internet traffic in North America
One billion hours per month, roughly 100,000 years
Netflix supports over 1000 device types
Edge Services
● Front door to Netflix
● Edge Routing - Zuul
● API - Edge Server
● Playback services
How does Netflix Streaming work? (A simplified view)
[Diagram components: Your CE Device, CDN, Netflix Services in Amazon Cloud.]
Device Under the Hood
[Diagram components: Your CE Device, User Interface, Netflix Streaming Platform, DRM, encoding, CE integration, CDN, Edge Services, Netflix Services in Amazon Cloud. The diagram steps through the sequence below; the label in parentheses is the call shown at each step.]
● User Interface loaded, data retrieved from Netflix Edge Service
● Movie Authorization (Authorize)
● Obtaining License (License)
● Movie starts streaming (PlayData)
● Periodic “bookmark” calls note place in movie (bookmark)
Edge Services - What we are talking about today
Edge’s lofty mission
● High Availability
● Good performance
● Data broker between many services and devices in a global, high volume, rapidly innovating, highly dynamic service
● Clients and services are constantly changing
Edge stats
● Billions of incoming requests per day
  ○ Over 10X outgoing service calls per request
● About 10 device changes per day
● Daily service pushes
● Daily routing changes
Architecture Goals
● Infrastructure
  ○ Availability
  ○ Resiliency
  ○ Scalability
● Application
  ○ Platform diversity
  ○ Rapid innovation
  ○ A/B Testing
● Delivery
  ○ Automation
  ○ Insights
Netflix’s Global Cloud Architecture
High Level Regional Edge Architecture
[Diagram components, shown once per region: ELB, Zuul, Edge Service, Playback Service, Website Service, Netflix Services.]
What is Zuul?
● Open source framework for dynamically reading, writing, and executing filters that act on incoming HTTP requests
● Dynamically compiled filters written in Groovy
  ○ Any JVM language supported
● Filters share state through a request scoped context
How we use Zuul
● Authentication
● Insights
● Stress Testing
● Canary Testing
● Dynamic Routing
● Service Migration
● Load Shedding
● Security
● Static Response handling
● Active/Active traffic management
Zuul Filter Characteristics
● Type
● Execution Order
● Criteria
● Action
Zuul Filter Lifecycle
[Diagram: an HTTP Request passes through "pre" filters, "routing" filter(s), and "post" filters, with the Origin Server reached from the routing stage and an HTTP Response returned; "error" filters and "custom" filters are also shown.]
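Example Filter
File: DeviceDelayFilter.groovy

import com.netflix.zuul.ZuulFilter
import com.netflix.zuul.context.RequestContext

class DeviceDelayFilter extends ZuulFilter {

    static Random rand = new Random()

    @Override
    String filterType() {
        return 'pre'    // run before the request is routed to the origin
    }

    @Override
    int filterOrder() {
        return 5        // position among the other "pre" filters
    }

    @Override
    boolean shouldFilter() {
        // Apply only to requests that identify themselves as the broken device type
        String deviceType = RequestContext.getCurrentContext().getRequest().getParameter("deviceType")
        return "BrokenDevice".equals(deviceType)
    }

    @Override
    Object run() {
        // Sleep for a random time between 0 and 20 seconds (the argument is in milliseconds)
        sleep(rand.nextInt(20000))
        return null
    }
}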
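The filter characteristics (type, execution order, criteria, action) and the pre / routing / post / error lifecycle can be sketched without the Zuul library. This is a minimal illustration under stated assumptions, not Zuul's implementation: `SketchFilter` and `FilterPipelineSketch` are hypothetical names, the request-scoped context is modelled as a plain map, and running "post" filters after an error is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical stand-in for a filter: type, execution order, criteria, action.
interface SketchFilter {
    String type();                                   // "pre", "routing", "post", or "error"
    int order();                                     // lower values run first within a type
    boolean shouldFilter(Map<String, Object> ctx);   // criteria
    void run(Map<String, Object> ctx);               // action

    // Convenience factory so filters can be declared inline.
    static SketchFilter of(String filterType, int filterOrder,
                           Predicate<Map<String, Object>> criteria,
                           Consumer<Map<String, Object>> action) {
        return new SketchFilter() {
            public String type() { return filterType; }
            public int order() { return filterOrder; }
            public boolean shouldFilter(Map<String, Object> ctx) { return criteria.test(ctx); }
            public void run(Map<String, Object> ctx) { action.accept(ctx); }
        };
    }
}

final class FilterPipelineSketch {
    private final List<SketchFilter> filters = new ArrayList<>();

    void register(SketchFilter f) { filters.add(f); }

    // Runs every applicable filter of one type, in execution order.
    private void runType(String type, Map<String, Object> ctx) {
        filters.stream()
               .filter(f -> f.type().equals(type))
               .sorted(Comparator.comparingInt(SketchFilter::order))
               .filter(f -> f.shouldFilter(ctx))
               .forEach(f -> f.run(ctx));
    }

    // pre -> routing -> post; if a pre or routing filter throws,
    // the error filters run, followed by the post filters (sketch assumption).
    Map<String, Object> handle(Map<String, Object> request) {
        Map<String, Object> ctx = new HashMap<>(request);   // request-scoped shared state
        try {
            runType("pre", ctx);
            runType("routing", ctx);
        } catch (RuntimeException e) {
            ctx.put("error", e.getMessage());
            runType("error", ctx);
        }
        runType("post", ctx);
        return ctx;
    }
}
```

Filters communicate only through the shared context map, which mirrors the "filters share state through a request scoped context" point above.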
Filter deployment
Active/Active
Multiple Active Regions
[Diagram components, shown for each of two regions: DNS, ZUUL, API, Cassandra, Services.]
● Multiple Active Regions - NM vs GE
● Cassandra replication across regions
DNS Misrouting
● Geo lookup resolves IP in west
● Zuul east routes to Zuul west
● Response is from west
Regional Failure
● Catastrophe in US-East
● East Coast is down
● Switch DNS to point to US-West
● East traffic flows to West
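The misrouting and regional-failure cases above come down to one routing decision per request. The following is a minimal sketch of that decision as it might look inside a Zuul tier, assuming a geo lookup has already mapped the client IP to a region. `RegionRouterSketch`, the region strings, and the health set are hypothetical; in the slides the failover itself is done by switching DNS, which this sketch only approximates by treating the failed region as unhealthy.

```java
import java.util.Set;

final class RegionRouterSketch {
    private final String localRegion;          // region this router runs in
    private final Set<String> healthyRegions;  // regions currently able to serve traffic

    RegionRouterSketch(String localRegion, Set<String> healthyRegions) {
        this.localRegion = localRegion;
        this.healthyRegions = healthyRegions;
    }

    // geoRegion: the region the geo lookup resolved for the client's IP.
    String targetRegion(String geoRegion) {
        if (geoRegion.equals(localRegion)) {
            return localRegion;                // DNS routed the client correctly
        }
        if (healthyRegions.contains(geoRegion)) {
            return geoRegion;                  // DNS misrouting: forward to the client's region
        }
        return localRegion;                    // client's region is down: serve it here
    }

    // True when the request should be forwarded to another region's Zuul tier.
    boolean shouldForward(String geoRegion) {
        return !targetRegion(geoRegion).equals(localRegion);
    }
}
```

With both regions healthy, an east router that sees a west client forwards the request; with east down, a west router serves east clients locally, which is only safe because Cassandra data is replicated across regions.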
Edge Server (API)
The Edge Service - Netflix’s API Tier
[Diagram components: ELB, Zuul, Edge Service, Playback Service, Website Service, Netflix Services.]
What’s wrong with REST for Netflix?
REST
● One Size Fits All
● One Data Format Fits All
● REST tends to be atomic
● Average 25 REST requests to build up a page
Netflix’s Groovy Scripting Layer
Edge Scripting Tier
● Device teams write scripts for their device
  ○ Control content, format, endpoints
● Code injected directly into Edge Service at runtime
  ○ Scripts are in production in about 30 seconds
Edge Server Architecture
[Diagram components: Endpoint Code (Groovy), Endpoint Controller, Endpoint Manager, RxJava, Async Service Layer API, Hystrix (Fault tolerance), JVM, Netflix Services.]
Pushing a Script
[Diagram: the Edge Server stack (Endpoint Code, Endpoint Controller, Endpoint Manager, RxJava, Async Service Layer API, Hystrix, JVM, Netflix Services) with a UI Engineer and a /ps3/home script, stepping through the sequence below.]
● UI Engineer pushes the /ps3/home script
● Controller pulls new script / compiles
● Script Activated (UI Engineer: Activate)
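The push, compile, and activate steps separate uploading a script from putting it in front of traffic. A minimal sketch of that two-stage registry follows, assuming a script can be modelled as an already compiled function from request to response. `EndpointRegistrySketch` and its methods are hypothetical names, and the Groovy compilation step is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class EndpointRegistrySketch {
    // Scripts that were pushed (and, in the real system, compiled) but not yet live.
    private final Map<String, Function<String, String>> staged = new ConcurrentHashMap<>();
    // Scripts currently serving traffic, keyed by endpoint path.
    private final Map<String, Function<String, String>> active = new ConcurrentHashMap<>();

    // Step 1: a device team pushes a script for an endpoint path.
    void push(String path, Function<String, String> script) {
        staged.put(path, script);
    }

    // Step 2: activation swaps the staged script in; returns false if nothing was staged.
    boolean activate(String path) {
        Function<String, String> script = staged.remove(path);
        if (script == null) {
            return false;
        }
        active.put(path, script);
        return true;
    }

    // Requests are served only by activated scripts.
    String handle(String path, String request) {
        Function<String, String> script = active.get(path);
        if (script == null) {
            throw new IllegalArgumentException("no active script for " + path);
        }
        return script.apply(request);
    }
}
```

Keeping staged and active scripts apart means a push never changes live behaviour until someone explicitly activates it.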
Service Layer
Purpose of the Service Layer
● Interface to business logic (our API)
● Shield data consumers from service changes
● Combine and expose business data in a logical and consistent manner
● All Service Layer methods are async using RxJava
  ○ Hides concurrency and underlying implementation from callers
RxJava
● Why?
  ○ How do you expose an async service as an API?
  ○ Solution to compose async flows and sequences of data
  ○ Rich set of operators to filter and interact with data
How RxJava Helps
● Need to hide concurrency from script writers
  ○ Minimize the “bad things” consumers of our API on box can do.
  ○ Hide the internal implementation
    ■ Change concurrency of any given call
    ■ Switch to non-blocking IO
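The idea of an async service layer that hides concurrency from script writers can be illustrated with the JDK alone. This sketch uses `CompletableFuture` as a stand-in for RxJava's types, which is an assumption made to keep the example dependency-free; `ServiceLayerSketch` and its methods are hypothetical and return canned data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class ServiceLayerSketch {
    // Runs on another thread. Callers cannot tell, and do not need to know.
    static CompletableFuture<String> userName(int userId) {
        return CompletableFuture.supplyAsync(() -> "user-" + userId);
    }

    // Already completed, as if served from a cache. Same return type, so the
    // implementation can change its concurrency without touching callers.
    static CompletableFuture<List<String>> recommendations(int userId) {
        return CompletableFuture.completedFuture(Arrays.asList("title-A", "title-B"));
    }

    // A device endpoint composes several service calls into one response,
    // instead of the device issuing many separate REST requests.
    static CompletableFuture<String> homePage(int userId) {
        return userName(userId).thenCombine(
                recommendations(userId),
                (name, recs) -> name + ": " + String.join(", ", recs));
    }
}
```

For example, `ServiceLayerSketch.homePage(7).join()` yields one combined string; the endpoint author never creates a thread or blocks on an individual call.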
Hystrix
How Hystrix helps
● Latency and Fault Tolerance
  ○ Stop cascading failures. Fallbacks and graceful degradation. Fail fast and rapid recovery.
  ○ Thread and semaphore isolation with circuit breakers.
● Realtime Operations
  ○ Realtime monitoring and configuration changes. Watch service and property changes take effect immediately as they spread across a fleet.
  ○ Be alerted, make decisions, effect change and see results in seconds.
● Concurrency
  ○ Parallel execution. Concurrency aware request caching. Automated batching through request collapsing.
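The "fail fast", "fallback", and "rapid recovery" points can be made concrete with a small circuit breaker. This is a simplified sketch and not Hystrix's algorithm or API: `CircuitBreakerSketch` is a hypothetical class that trips after a fixed number of consecutive failures, and the clock is passed in explicitly so the behaviour is deterministic.

```java
import java.util.function.Supplier;

final class CircuitBreakerSketch {
    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final long openMillis;        // how long to fail fast before trying again
    private int consecutiveFailures = 0;
    private long openedAt = -1;           // -1 means the circuit is closed

    CircuitBreakerSketch(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen(long now) {
        return openedAt >= 0 && now - openedAt < openMillis;
    }

    // Convenience overload using the wall clock.
    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        return execute(command, fallback, System.currentTimeMillis());
    }

    synchronized <T> T execute(Supplier<T> command, Supplier<T> fallback, long now) {
        if (isOpen(now)) {
            return fallback.get();        // fail fast: do not touch the struggling service
        }
        try {
            T result = command.get();     // closed, or a trial call after the open window
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = now;           // trip (or re-trip) the breaker
            }
            return fallback.get();        // graceful degradation
        }
    }
}
```

While the circuit is open the failing dependency receives no calls at all, which is what stops one slow service from tying up every caller in a cascade.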
Hystrix Dashboard Example
DELIVERY
Edge Delivery
● Continuous deployment
● Automated system integrity analysis
● Tools for facilitating delivery
Automated Deployment Pipeline
Edge Cluster Organization
[Diagram components: ELB; Zuul clusters ZUUL, ZUUL-CANARY, ZUUL-DEBUG, ZUUL-SQUEEZE; origin clusters MAIN ORIGIN, CANARY ORIGIN, DEBUG ORIGIN, SQUEEZE ORIGIN. The diagram steps through the cases below.]
● Most requests to Main Origin
● Some requests to Canary
Canary Analysis
Canary Analysis Detail
Response Validation
● Fork response to Main and Canary
● Validate response integrity
Targeted Debugging
Squeezing the Origin
● Finding service capacity
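Sending "some requests" to the canary and validating its responses against the main cluster can be sketched as two small functions. This is an illustration under assumptions, not the routing used in production: `CanaryRouterSketch` is hypothetical, requests are bucketed by the hash of a request id, and validation is reduced to comparing status code and body.

```java
final class CanaryRouterSketch {
    private static final int BUCKETS = 10_000;
    private final double canaryFraction;   // share of traffic for the canary, between 0.0 and 1.0

    CanaryRouterSketch(double canaryFraction) {
        this.canaryFraction = canaryFraction;
    }

    // Deterministic bucketing: the same request id always maps to the same origin.
    String chooseOrigin(String requestId) {
        int bucket = Math.floorMod(requestId.hashCode(), BUCKETS);
        return bucket < canaryFraction * BUCKETS ? "CANARY ORIGIN" : "MAIN ORIGIN";
    }

    // Response validation: the same request is sent to both origins and the
    // results are compared; a mismatch flags the canary build for inspection.
    static boolean responsesMatch(int mainStatus, String mainBody,
                                  int canaryStatus, String canaryBody) {
        return mainStatus == canaryStatus && mainBody.equals(canaryBody);
    }
}
```

The same bucketing idea extends to the debug and squeeze clusters: a squeeze test raises the fraction sent to one origin step by step until its capacity limit is found.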
Scryer - Predictive auto-scaling
● Why?
  ○ Reactive doesn’t work in all cases
  ○ Reacting is sometimes too late
    ■ Sunday morning cartoons
  ○ Reactive overreacts
    ■ Super Bowl, World Cup, outages
    ■ Fixed size scaling
  ○ All in all - more reliable and saves money
Daily Traffic Patterns
Scryer Predictions
How does Scryer work?
● Traffic shape analysis
  ○ Monday vs Monday
  ○ Sunday vs Sunday, etc.
  ○ FFT based smoothing
Filtering out Noise
Ignoring outages
Accounting for regular spikey traffic
Iteratively apply FFT
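FFT-based smoothing amounts to transforming the traffic series to the frequency domain, dropping the high-frequency components (noise and short spikes), and transforming back. The sketch below shows one such pass, assuming a simple low-pass cutoff. It uses a naive O(n²) discrete Fourier transform rather than a true FFT so that it stays short and dependency-free; `FftSmoothingSketch`, `lowPass`, and the `keep` parameter are hypothetical and do not describe how Scryer picks its thresholds.

```java
final class FftSmoothingSketch {
    // Keeps frequencies 0..keep (and their mirrored counterparts) of a real signal
    // and zeroes the rest, then transforms back to the time domain.
    static double[] lowPass(double[] x, int keep) {
        int n = x.length;
        double[] re = new double[n];
        double[] im = new double[n];

        // Forward DFT: X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/n)
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = -2 * Math.PI * k * t / n;
                re[k] += x[t] * Math.cos(angle);
                im[k] += x[t] * Math.sin(angle);
            }
        }

        // Zero every component above the cutoff. Bins k and n-k hold the same
        // frequency, so both are dropped together and the result stays real.
        for (int k = 0; k < n; k++) {
            int frequency = Math.min(k, n - k);
            if (frequency > keep) {
                re[k] = 0;
                im[k] = 0;
            }
        }

        // Inverse DFT, real part only: x[t] = (1/n) * sum over k of X[k] * exp(+2*pi*i*k*t/n)
        double[] out = new double[n];
        for (int t = 0; t < n; t++) {
            for (int k = 0; k < n; k++) {
                double angle = 2 * Math.PI * k * t / n;
                out[t] += (re[k] * Math.cos(angle) - im[k] * Math.sin(angle)) / n;
            }
        }
        return out;
    }
}
```

Applying the pass repeatedly, with outlier points replaced by the smoothed values between passes, is one plausible reading of "iteratively apply FFT" and of ignoring outages.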
Other Scryer Factors
● Traffic volume analysis
  ○ At least 4 weeks of data
  ○ Linear regression based on time of day
  ○ Correct the prediction based on today’s trend
● Instance factors
  ○ Instance startup time
  ○ Instance capacity (obtained by squeeze testing)
● Scale (up/down) actions scheduled based on prediction
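The volume and instance factors combine into a short calculation: fit a line through past weeks' traffic for a time slot, extrapolate one week, correct by today's trend, then convert to an instance count and schedule the scaling action early enough to cover startup time. The sketch below is one possible reading of those bullets. `ScalingPlanSketch`, its method names, and the exact way the trend correction is applied are assumptions, not Scryer's implementation.

```java
final class ScalingPlanSketch {
    // Ordinary least squares fit of y = a + b*x; returns {a, b}.
    static double[] fitLine(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXX += x[i] * x[i];
            sumXY += x[i] * y[i];
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        return new double[] {intercept, slope};
    }

    // pastWeeks: requests per second in the same time-of-day slot, oldest week first.
    // todayTrendRatio: today's observed traffic divided by what was predicted for today.
    static double predict(double[] pastWeeks, double todayTrendRatio) {
        double[] weekIndex = new double[pastWeeks.length];
        for (int i = 0; i < weekIndex.length; i++) {
            weekIndex[i] = i;
        }
        double[] line = fitLine(weekIndex, pastWeeks);
        double nextWeek = line[0] + line[1] * pastWeeks.length;   // extrapolate one week ahead
        return nextWeek * todayTrendRatio;                         // correct for today's trend
    }

    // rpsPerInstance: capacity of one instance, e.g. as measured by squeeze testing.
    static int instancesNeeded(double predictedRps, double rpsPerInstance) {
        return (int) Math.ceil(predictedRps / rpsPerInstance);
    }

    // Scale-up must start early enough for new instances to finish booting.
    static long scheduleAtMillis(long neededAtMillis, long startupMillis) {
        return neededAtMillis - startupMillis;
    }
}
```

With four weeks at 100, 110, 120, and 130 requests per second and no trend correction, the fit extrapolates to 140; at 50 requests per second per instance that rounds up to 3 instances.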
The Future
Future - Large Projects on Edge
● Async, non-blocking servers
● Service layer redesign
● Internal Insights
● Global Insights
Edge Architecture Today
[Diagram components: ELB, Zuul (three instances shown), API Service, Streaming Service, Website Service, Netflix Services.]
Future Edge Architecture
[Diagram components: ELB, Zuul, API/Edge Service, Website, Netflix Services, Playback Services.]
Global Insights
[Diagram components: Zuul, API/Edge Service, Netflix Services, Playback Services, User Interface, Client Data, Event Stream, Insight Engine.]
User Interface Designs
Netflix in the Cloud - 5 years later
Lessons learned
What Did We Learn?
Failure is Assured!
● Code failure - Continuous delivery
● Service failure - Fallbacks and redundancy
● Instance and Zone failure - Redundancy
● Cloud infrastructure failure - Multiple active regions
● Human failure - Automation
Building for Failure
Drawbacks of the cloud
● The cause of some failures is difficult to detect
  ○ Huge variability in instance performance that is almost impossible to explain
  ○ Network barriers
  ○ Multi tenancy
  ○ Firewalls
● Very limited access to information / ability to fix issues
Software focus: Cloud’s greatest strength
● Scale our business
● Automate processes
● Radically experiment
● Remain resilient
● Move quickly
Netflix Culture - Our secret sauce
● Freedom and responsibility
● Highly aligned teams
● Aversion to process
● Design for necessity
● Design for failure
● Engineering teams operating their services
Netflix OSS
● Zuul - Smart edge router
● RxJava - Functional reactive libraries
● Hystrix - SOA resiliency
● + a lot more!
For more Info on Netflix Cloud Technology:
Read our Technology Blog: http://techblog.netflix.com/
Check out our Open Source Cloud Projects: http://netflix.github.io