Edge Architecture: IEEE International Conference on Cloud Engineering

Post on 16-Sep-2014

Netflix’s Global Cloud Edge Architecture

Mikey Cohen, mikey@netflix.com

Edge Engineering Platform

Netflix

Over 44 million subscribers

in over 40 countries

Netflix accounts for over 30% of peak internet traffic in North America

One billion hours (~100,000 years) per month

Netflix supports over 1000 device types

Edge Services

● Front door to Netflix
● Edge Routing - Zuul
● API - Edge Server
● Playback services

How does Netflix Streaming work?*

* A simplified view

[Diagram: Your CE Device, the CDN, and Netflix Services in Amazon Cloud]

Device Under the Hood

[Diagram: the CE device contains the User Interface and the Netflix Streaming Platform (DRM, encoding, CE integration); Edge Services appear with Netflix Services in Amazon Cloud]

User Interface loaded, data retrieved from Netflix Edge Service

Movie Authorization

[Diagram label: Authorize]

Obtaining License

[Diagram label: License]

Movie starts streaming

[Diagram label: PlayData]

Periodic “bookmark” calls note place in movie

[Diagram label: bookmark]

Edge Services - What we are talking about today

Edge’s lofty mission

● High availability
● Good performance
● Data broker between many services and devices in a global, high volume, rapidly innovating, highly dynamic service
● Clients and services are constantly changing

Edge stats

● Billions of incoming requests per day
○ Over 10X outgoing service calls per request
● About 10 device changes per day
● Daily service pushes
● Daily routing changes

Architecture Goals

● Infrastructure
○ Availability
○ Resiliency
○ Scalability
● Application
○ Platform diversity
○ Rapid innovation
○ A/B Testing
● Delivery
○ Automation
○ Insights

Netflix’s Global Cloud Architecture

High Level Regional Edge Architecture

[Diagram components: ELB, Zuul, Edge Service, Playback Service, Website Service, Netflix Services]

What is Zuul?

● Open source framework for dynamically reading, writing, and executing filters that act on incoming HTTP requests

● Dynamically compiled filters written in Groovy
○ Any JVM language supported

● Filters share state through a request scoped context

How we use Zuul

● Authentication
● Insights
● Stress Testing
● Canary Testing
● Dynamic Routing
● Service Migration
● Load Shedding
● Security
● Static Response handling
● Active/Active traffic management

Zuul Filter Characteristics

● Type
● Execution Order
● Criteria
● Action

Zuul Filter Lifecycle

[Diagram: an HTTP request passes through "pre" filters and then "routing" filter(s), which forward it to the origin server; "post" filters run on the way back to the HTTP response. "error" filters and "custom" filters are shown alongside the main path.]
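The lifecycle above can be sketched in plain Java. This is a minimal illustration, not Zuul's implementation: `SketchFilter`, `SketchFilterPipeline`, and the `always` factory are hypothetical names, the request context is modelled as a simple map, and the choice to run "post" filters after an error is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical filter contract mirroring the four characteristics on the slide:
// type, execution order, criteria (shouldFilter), and action (run).
interface SketchFilter {
    String type();                                  // "pre", "routing", "post", or "error"
    int order();                                    // lower values run first within a type
    boolean shouldFilter(Map<String, Object> ctx);  // criteria
    void run(Map<String, Object> ctx);              // action
}

class SketchFilterPipeline {
    private final List<SketchFilter> filters = new ArrayList<>();

    void register(SketchFilter f) { filters.add(f); }

    // Convenience factory for a filter whose criteria always match.
    static SketchFilter always(String type, int order, Consumer<Map<String, Object>> action) {
        return new SketchFilter() {
            public String type() { return type; }
            public int order() { return order; }
            public boolean shouldFilter(Map<String, Object> ctx) { return true; }
            public void run(Map<String, Object> ctx) { action.accept(ctx); }
        };
    }

    // Runs every matching filter of one type, sorted by execution order.
    private void runType(String type, Map<String, Object> ctx) {
        filters.stream()
               .filter(f -> f.type().equals(type))
               .sorted(Comparator.comparingInt(SketchFilter::order))
               .filter(f -> f.shouldFilter(ctx))
               .forEachOrdered(f -> f.run(ctx));
    }

    // pre -> routing -> post; a failure in pre or routing diverts to "error" filters.
    Map<String, Object> handle(Map<String, Object> request) {
        Map<String, Object> ctx = new HashMap<>(request); // request-scoped shared state
        try {
            runType("pre", ctx);
            runType("routing", ctx);
        } catch (RuntimeException e) {
            ctx.put("error", e.getMessage());
            runType("error", ctx);
        }
        runType("post", ctx);
        return ctx;
    }
}
```

Registering filters in any order and calling `handle` runs them grouped by type and sorted by `order()`, with all filters sharing the one context map for the duration of the request.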

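Example Filter

File: DeviceDelayFilter.groovy

import com.netflix.zuul.ZuulFilter
import com.netflix.zuul.context.RequestContext

class DeviceDelayFilter extends ZuulFilter {

    static Random rand = new Random()

    @Override
    String filterType() {
        return 'pre'   // run before the request is routed to the origin
    }

    @Override
    int filterOrder() {
        return 5       // position among the other 'pre' filters
    }

    @Override
    boolean shouldFilter() {
        // Null-safe: false when the deviceType parameter is absent
        def request = RequestContext.getCurrentContext().getRequest()
        return "BrokenDevice".equals(request.getParameter("deviceType"))
    }

    @Override
    Object run() {
        // Sleep for a random duration of up to 20 seconds (the argument is in milliseconds)
        sleep(rand.nextInt(20000))
        return null
    }
}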

Filter deployment

Active/Active

Multiple Active Regions

[Diagram: two regions, each with ZUUL, API, Cassandra, and Services; DNS sits in front of each region]

Multiple Active Regions - NM vs GE

Multiple Active Regions - Cassandra Replication across regions

DNS Misrouting

● Geo lookup resolves IP in west
● Zuul east routes to Zuul west
● Response is from west

Regional Failure

● Catastrophe in US-East
● East Coast is Down
● Switch DNS to point to US-West
● East traffic flows to West
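The two scenarios above (a misrouted request forwarded to its geo-resolved region, and traffic staying where DNS sent it when the other region is down) can be sketched as one routing decision. This is an illustration only: `RegionRouter`, the region strings, and the idea of a locally known set of healthy regions are assumptions, not details from the talk.

```java
import java.util.Set;

// Hypothetical cross-region routing decision made by the edge router in one region.
class RegionRouter {
    private final String localRegion;
    private final Set<String> healthyRegions;

    RegionRouter(String localRegion, Set<String> healthyRegions) {
        this.localRegion = localRegion;
        this.healthyRegions = healthyRegions;
    }

    // Returns the region that should serve a request received in localRegion.
    // geoRegion is the region that a geo lookup on the client IP resolves to.
    String target(String geoRegion) {
        if (geoRegion.equals(localRegion)) {
            return localRegion;              // DNS routed the client correctly
        }
        if (healthyRegions.contains(geoRegion)) {
            return geoRegion;                // DNS misrouting: forward to the client's home region
        }
        return localRegion;                  // home region is down: serve the request here
    }
}
```

With `new RegionRouter("us-east", healthy)`, a client whose IP resolves to the west is forwarded to `us-west` while that region is healthy. A router in `us-west` keeps east-coast traffic local once `us-east` is removed from the healthy set.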

Edge Server (API)

The Edge Service - Netflix’s API Tier

[Diagram components: ELB, Zuul, Edge Service, Playback Service, Website Service, Netflix Services]

What’s wrong with REST for Netflix?

REST

● One size fits all
● One data format fits all
● REST tends to be atomic
● Average 25 REST requests to build up a page

Netflix’s Groovy Scripting Layer

Edge Scripting Tier

● Device teams write scripts for their device
○ Control content, format, endpoints
● Code injected directly into Edge Service at runtime
○ Scripts are in production in about 30 seconds

Edge Server Architecture

[Diagram: Edge Server stack on the JVM: Endpoint Code (Groovy), Endpoint Controller, Endpoint Manager, RxJava, Async Service Layer API, Hystrix (fault tolerance), calling Netflix Services]

Pushing a Script

● UI engineer pushes the /ps3/home script
● Controller pulls new script / compiles
● Script activated
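The push, compile, and activate steps can be sketched as a registry that stages a new handler for a path and swaps it in only on activation. This is a simplified illustration: `EndpointRegistry` is a hypothetical name, and Java lambdas stand in for the dynamically compiled Groovy scripts described in the deck.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical registry of endpoint handlers keyed by path, e.g. "/ps3/home".
class EndpointRegistry {
    private final Map<String, Function<String, String>> staged = new ConcurrentHashMap<>();
    private final Map<String, Function<String, String>> active = new ConcurrentHashMap<>();

    // Push: store the new handler without serving traffic from it yet.
    void push(String path, Function<String, String> handler) {
        staged.put(path, handler);
    }

    // Activate: promote the staged handler so new requests use it.
    boolean activate(String path) {
        Function<String, String> handler = staged.remove(path);
        if (handler == null) {
            return false;                    // nothing was pushed for this path
        }
        active.put(path, handler);
        return true;
    }

    // Serve a request with whichever handler is currently active.
    String handle(String path, String request) {
        Function<String, String> handler = active.get(path);
        return handler == null ? "404" : handler.apply(request);
    }
}
```

Keeping staged and active handlers separate means a pushed script changes nothing until it is activated, and the previous version keeps serving requests in the meantime.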

Service Layer

Purpose of the Service Layer

● Interface to business logic (our API)
● Shield data consumers from service changes
● Combine and expose business data in a logical and consistent manner
● All Service Layer methods are async using RxJava
○ Hides concurrency and underlying implementation from callers


RxJava

● Why?
○ How do you expose an async service as an API?
○ Solution to compose async flows and sequences of data
○ Rich set of operators to filter and interact with data

How RxJava Helps

● Need to hide concurrency from script writers
○ Minimize the “bad things” consumers of our API on the box can do
○ Hide the internal implementation
■ Change concurrency of any given call
■ Switch to non-blocking IO
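To illustrate how an async service layer lets a device script compose several calls into one response, here is a small sketch. It deliberately uses the JDK's `CompletableFuture` rather than RxJava so that it runs without external libraries. `HomeScreenService` and its three methods are hypothetical stand-ins for service-layer calls.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical service layer: every method returns immediately with a future,
// so callers never see which thread (if any) does the work.
class HomeScreenService {
    CompletableFuture<String> user(int id) {
        return CompletableFuture.supplyAsync(() -> "user-" + id);
    }

    CompletableFuture<String> recommendations(String user) {
        return CompletableFuture.supplyAsync(() -> "recs-for-" + user);
    }

    CompletableFuture<String> bookmark(String user) {
        return CompletableFuture.supplyAsync(() -> "bookmark-for-" + user);
    }

    // One endpoint call replaces several client round trips: look up the user,
    // then fetch recommendations and the bookmark concurrently and combine them.
    CompletableFuture<String> home(int id) {
        return user(id).thenCompose(u ->
                recommendations(u).thenCombine(bookmark(u),
                        (recs, mark) -> u + "|" + recs + "|" + mark));
    }
}
```

The caller of `home` only sees a future of the combined result. Whether the underlying calls run on a thread pool or on non-blocking IO can change without touching the composing code, which is the property the bullets above describe.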


Hystrix

How Hystrix helps

● Latency and Fault Tolerance
○ Stop cascading failures. Fallbacks and graceful degradation. Fail fast and rapid recovery.
○ Thread and semaphore isolation with circuit breakers.
● Realtime Operations
○ Realtime monitoring and configuration changes. Watch service and property changes take effect immediately as they spread across a fleet.
○ Be alerted, make decisions, effect change and see results in seconds.
● Concurrency
○ Parallel execution. Concurrency aware request caching. Automated batching through request collapsing.
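The fail-fast and fallback ideas can be illustrated with a minimal circuit breaker. This is not the Hystrix API: `SketchCircuitBreaker` is a hypothetical class, it trips on consecutive failures instead of a rolling error rate, and the current time is passed in explicitly to keep the example deterministic.

```java
import java.util.function.Supplier;

// Minimal circuit breaker: after `failureThreshold` consecutive failures the
// circuit opens for `openMillis`, during which calls go straight to the fallback.
class SketchCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1;              // -1 means the circuit is closed

    SketchCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen(long now) {
        return openedAt >= 0 && now - openedAt < openMillis;
    }

    <T> T call(Supplier<T> primary, Supplier<T> fallback, long now) {
        if (isOpen(now)) {
            return fallback.get();           // fail fast: do not touch the struggling service
        }
        try {
            T result = primary.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure(now);
            return fallback.get();           // graceful degradation
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure(long now) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = now;                  // open (or re-open) the circuit
        }
    }
}
```

Once the open window has elapsed, the next call is allowed through as a trial: a success closes the circuit again, and a failure re-opens it for another window.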

Hystrix Dashboard Example

DELIVERY

Edge Delivery

● Continuous deployment
● Automated system integrity analysis
● Tools for facilitating delivery

Automated Deployment Pipeline

Edge Cluster Organization

[Diagram: ELB; Zuul clusters ZUUL, ZUUL-CANARY, ZUUL-DEBUG, ZUUL-SQUEEZE; origin clusters MAIN ORIGIN, CANARY ORIGIN, DEBUG ORIGIN, SQUEEZE ORIGIN]

● Most requests to Main Origin
● Some requests to Canary

Canary Analysis

Canary Analysis Detail

Response Validation

● Fork response to Main and Canary
● Validate response integrity

Targeted Debugging

Squeezing the Origin

● Finding service capacity
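A routing rule of the kind described above (most requests to the main origin, a small share to the canary, selected devices to the debug origin) might look like the sketch below. `OriginRouter`, the percentage parameter, and hashing the request id into buckets are assumptions made for illustration. Only the origin names come from the slides.

```java
import java.util.Set;

// Hypothetical origin selection for one request.
class OriginRouter {
    private final int canaryPercent;         // share of requests sent to the canary, 0..100
    private final Set<String> debugDevices;  // devices whose traffic is pinned to the debug origin

    OriginRouter(int canaryPercent, Set<String> debugDevices) {
        if (canaryPercent < 0 || canaryPercent > 100) {
            throw new IllegalArgumentException("canaryPercent must be between 0 and 100");
        }
        this.canaryPercent = canaryPercent;
        this.debugDevices = debugDevices;
    }

    String route(String requestId, String deviceId) {
        if (debugDevices.contains(deviceId)) {
            return "DEBUG ORIGIN";           // targeted debugging
        }
        // Stable bucket in [0, 100): the same request id always lands in the same bucket.
        int bucket = Math.floorMod(requestId.hashCode(), 100);
        return bucket < canaryPercent ? "CANARY ORIGIN" : "MAIN ORIGIN";
    }
}
```

Hashing rather than drawing a random number keeps routing deterministic per request id. The same weighting idea could be reused to raise the share sent to a squeeze origin step by step until its capacity limit shows up.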

Scryer - Predictive auto-scaling

● Why?
○ Reactive doesn’t work in all cases
○ Reacting is sometimes too late
■ Sunday morning cartoons
○ Reactive overreacts
■ Superbowl, World Cup, Outages
■ Fixed size scaling
○ All in all - more reliable and saves money

Daily Traffic Patterns

Scryer Predictions

How does Scryer work?

● Traffic shape analysis
○ Monday vs Monday
○ Sunday vs Sunday, etc
○ FFT based smoothing

Filtering out Noise

Ignoring outages

Accounting for regular spikey traffic

Iteratively apply FFT
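Frequency-domain smoothing of a traffic curve can be illustrated with a low-pass filter. The sketch below is not Scryer's algorithm: it uses a naive O(n²) discrete Fourier transform in place of an FFT, applies a single pass instead of iterating, and `DftSmoother` and the `keep` cutoff are hypothetical.

```java
// Low-pass smoothing: transform, drop high-frequency components, transform back.
class DftSmoother {
    // Keeps the mean (frequency 0) and the `keep` lowest frequencies; zeroes the rest.
    static double[] lowPass(double[] x, int keep) {
        int n = x.length;
        double[] re = new double[n];
        double[] im = new double[n];
        // Forward transform: X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/n)
        for (int k = 0; k < n; k++) {
            for (int t = 0; t < n; t++) {
                double angle = -2 * Math.PI * k * t / n;
                re[k] += x[t] * Math.cos(angle);
                im[k] += x[t] * Math.sin(angle);
            }
        }
        // Inverse transform over the kept coefficients only.
        double[] out = new double[n];
        for (int t = 0; t < n; t++) {
            double sum = 0;
            for (int k = 0; k < n; k++) {
                int freq = Math.min(k, n - k);   // k and n-k carry the same frequency
                if (freq > keep) {
                    continue;                    // discard high-frequency noise
                }
                double angle = 2 * Math.PI * k * t / n;
                sum += re[k] * Math.cos(angle) - im[k] * Math.sin(angle);
            }
            out[t] = sum / n;
        }
        return out;
    }
}
```

Keeping each coefficient together with its mirror at `n - k` keeps the output real-valued. A small `keep` preserves the broad daily shape and removes short spikes, which is the effect the slides describe as filtering out noise.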

Other Scryer Factors

● Traffic volume analysis
○ At least 4 weeks of data
○ Linear regression based on time of day
○ Correct the prediction based on today’s trend
● Instance factors
○ Instance startup time
○ Instance capacity (obtained by squeeze testing)
● Scale (up/down) actions scheduled based on prediction
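These factors combine into a simple calculation, sketched here under stated assumptions: for one time of day, fit a least-squares line through the request rates of previous weeks, extrapolate to the coming week, divide by per-instance capacity, and schedule the launch early enough to cover startup time. `ScalingPlanner` and all of its method names are hypothetical. The correction for today's trend is omitted.

```java
// Hypothetical planner for one scheduled scaling action.
class ScalingPlanner {
    // Ordinary least-squares fit of y = a + b * x; returns {a, b}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double a = (sy - b * sx) / n;
        return new double[] {a, b};
    }

    // Predicted value of the fitted line at x.
    static double predict(double[] line, double x) {
        return line[0] + line[1] * x;
    }

    // Instances required for the predicted load, rounded up.
    static int instancesNeeded(double predictedRps, double rpsPerInstance) {
        return (int) Math.ceil(predictedRps / rpsPerInstance);
    }

    // Launch early enough that instances are serving when the load arrives.
    static long launchAtMillis(long neededAtMillis, long startupMillis) {
        return neededAtMillis - startupMillis;
    }
}
```

For example, four weeks of data points (1, 100), (2, 110), (3, 120), (4, 130) fit the line 90 + 10x, which predicts 140 requests per second for week 5. At 50 requests per second per instance that rounds up to 3 instances.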

The Future

Future - Large Projects on Edge

● Async, non-blocking servers
● Service layer redesign
● Internal Insights
● Global Insights

Edge Architecture Today

[Diagram components: ELB, separate Zuul clusters, API Service, Streaming Service, Website Service, Netflix Services]

Future Edge Architecture

[Diagram components: ELB, Zuul, API/Edge Service, Netflix Services, Playback Services, Website]

Global Insights

[Diagram components: API/Edge Service, Netflix Services, Playback Services, Zuul, User Interface, Insight Engine, Event Stream, Client Data, User Interface Designs]

Netflix in the Cloud - 5 years later

Lessons learned

What Did We Learn?

Failure is Assured!

● Code failure - Continuous delivery
● Service failure - Fallbacks and redundancy
● Instances and Zone failure - Redundancy
● Cloud infrastructure failure - Multiple active regions
● Human failure - Automation

Building for Failure

Drawbacks of the cloud

● The cause of some failures is difficult to detect
○ Huge variability in instance performance that is almost impossible to explain
○ Network barriers
○ Multi tenancy
○ Firewalls
● Very limited access to information and limited ability to fix issues

Software focus: Cloud’s greatest strength

● Scale our business
● Automate processes
● Radically experiment
● Remain resilient
● Move quickly

Netflix Culture - Our secret sauce

● Freedom and responsibility
● Highly aligned teams
● Aversion to process
● Design for necessity
● Design for failure
● Engineering teams operating their services

Netflix OSS

● Zuul - Smart edge router
● RxJava - Functional reactive libraries
● Hystrix - SOA resiliency
● + a lot more!

For more Info on Netflix Cloud Technology:

Read our Technology Blog: http://techblog.netflix.com/

Check out our Open Source Cloud Projects: http://netflix.github.io
