The AWS Cloud: Leveraging the State of the Art

Sid Anand (@r39132), SAP Cloud Inside Track 2012, Thursday, February 16, 2012


DESCRIPTION

Keynote at the SAP Cloud Conference, February 2012

TRANSCRIPT

Page 1: The AWS Cloud : Leveraging the State of the Art

The AWS Cloud Leveraging the State of the Art

Sid Anand (@r39132)

SAP Cloud Inside Track 2012


Thursday, February 16, 2012

Page 2: The AWS Cloud : Leveraging the State of the Art

What is the AWS Cloud? A Real World Scenario


Page 3: The AWS Cloud : Leveraging the State of the Art


Question

If you were to build your own website today, what would you need?

Answer

You need a machine!

For simplicity, we will assume that your web server and application server code run on the same box!

AWS offers EC2 instances (i.e. virtual instances) to host your code

- Various sizes (e.g. IOPS, # of spindles, CPUs, memory, network bandwidth)

- Various configurations (e.g. Virtual Private Cloud, High Performance Cluster)

- Various pricing schemes (e.g. on-demand, reserved, spot, etc.)

A Real World Scenario


Page 4: The AWS Cloud : Leveraging the State of the Art


Question

Is one machine enough to handle traffic from all of your users?

What if that machine were to fall over or need maintenance (i.e. a restart)?

Answer

Add many machines!

A Real World Scenario


Page 5: The AWS Cloud : Leveraging the State of the Art


Question

This handles more traffic, but what if your servers were to fall over or need maintenance?

Answer

AWS offers AutoScaleGroups (a.k.a. ASG)!

You can deploy your servers under the protection of an ASG with a min and max pool size set.

The ASG ensures that machines are replaced when they die to guarantee your “min” pool size

ASGs monitor the health of your machines by polling an http port on each machine
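The replace-on-failure behaviour described above can be sketched as a small in-memory model. This is an illustration only, not the AWS API: the `AutoScaleGroup` class, the `health_check` callback (standing in for the HTTP poll), and the instance ID format are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Iterator, List


@dataclass
class AutoScaleGroup:
    """Toy ASG: keeps at least `min_size` instances that pass a health check."""
    min_size: int
    max_size: int
    health_check: Callable[[str], bool]  # stands in for polling an HTTP port
    instances: List[str] = field(default_factory=list)
    _ids: Iterator[int] = field(default_factory=lambda: count(1))

    def launch(self) -> str:
        if len(self.instances) >= self.max_size:
            raise RuntimeError("max pool size reached")
        instance_id = f"i-{next(self._ids):04d}"  # hypothetical ID format
        self.instances.append(instance_id)
        return instance_id

    def reconcile(self) -> None:
        # Drop every instance that fails its health check...
        self.instances = [i for i in self.instances if self.health_check(i)]
        # ...then launch replacements until the minimum pool size is restored.
        while len(self.instances) < self.min_size:
            self.launch()


dead = set()
asg = AutoScaleGroup(min_size=3, max_size=6, health_check=lambda i: i not in dead)
asg.reconcile()        # initial launch of three instances
dead.add("i-0002")     # one machine falls over
asg.reconcile()        # it is dropped and a replacement is launched
print(asg.instances)   # ['i-0001', 'i-0003', 'i-0004']
```

The point of the sketch is that the group converges back to its minimum size on every pass, without an operator deciding which machine to replace.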

A Real World Scenario


Page 6: The AWS Cloud : Leveraging the State of the Art


Question

How do you distribute traffic to all of your machines evenly?

Answer

Deploy your favorite software load balancer!

And write some custom code to register/deregister your machine instances with the load balancer
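To make the "custom code" concrete, here is a minimal sketch of a round-robin balancer with explicit register/deregister calls. The `RoundRobinBalancer` class and the backend addresses are made up for illustration; a real software load balancer exposes its own configuration or API for this.

```python
from typing import List


class RoundRobinBalancer:
    """Hands out registered backends in rotation."""

    def __init__(self) -> None:
        self._backends: List[str] = []
        self._next = 0

    def register(self, address: str) -> None:
        if address not in self._backends:
            self._backends.append(address)

    def deregister(self, address: str) -> None:
        if address in self._backends:
            self._backends.remove(address)
            self._next = 0  # restart the rotation over the remaining backends

    def pick(self) -> str:
        if not self._backends:
            raise RuntimeError("no backends registered")
        backend = self._backends[self._next % len(self._backends)]
        self._next = (self._next + 1) % len(self._backends)
        return backend


lb = RoundRobinBalancer()
for address in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):  # hypothetical addresses
    lb.register(address)
first_round = [lb.pick() for _ in range(4)]
lb.deregister("10.0.0.2")                              # instance taken out for maintenance
second_round = [lb.pick() for _ in range(3)]
```

Every instance launch and termination has to call `register` or `deregister`, which is exactly the bookkeeping burden the slide is pointing at.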

A Real World Scenario


Page 7: The AWS Cloud : Leveraging the State of the Art


Question

What if the load balancer were to fall over or to need maintenance or to become a traffic choke point?

Answer

Add multiple servers and deploy them under an ASG!

This is not ideal for a few reasons

- Need to register/deregister your Load Balancer instances with DNS

- Need to sync with the ASG's view of what is alive and dead, being added or removed, etc.

A Real World Scenario


Page 8: The AWS Cloud : Leveraging the State of the Art


Answer

AWS offers Elastic Load Balancers (i.e. ELB)

- Conceptually similar to having many LBs in an ASG, with some additional features:

- Provides DNS hostname (e.g. mysite-11111111.us-east-1.elb.amazonaws.com)

- Maps all of the load balancer instances to this hostname

- Takes care of maintenance of the load balancer machines and the requisite DNS registrations/deregistrations

- Syncs with the ASG -- if the ASG replaces one of your instances, the ELB will also remove that instance

- Let's see how it works in action!

A Real World Scenario


Page 9: The AWS Cloud : Leveraging the State of the Art


Page 10: The AWS Cloud : Leveraging the State of the Art

A Real World Scenario

Question

What about a DB to persist my data?

Answer

Multiple AWS hosted/managed options!

- DynamoDB (the new SimpleDB replacement) offers key-value semantics

Netflix replaced Oracle with SimpleDB and ran on it 2010-2011

- 4.5 billion user-facing requests a day

- S3 offers key-value semantics for very large files (e.g. up to 5 TB each). Typically used for Map-Reduce files, media files, or Oracle BLOBs/CLOBs

- RDS - hosted Oracle or MySQL if you need relations and complex queries


Page 11: The AWS Cloud : Leveraging the State of the Art

A Real World Scenario

Question

What if I have high-volume writes, but don't care when they are written -- e.g. event streams?

Answer

Simple Queue Service

- Think Enterprise Message Bus

- Highly available, infinitely scalable

- Handles application/system monitoring event traffic and social graph events at Netflix
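The pattern behind this answer is to decouple the writer from the slow store with a queue. The sketch below uses Python's in-process `queue.Queue` as a stand-in for SQS; it is not the SQS API, and the event shape and the `stored` list (standing in for a database) are assumptions. A real standard SQS queue also gives weaker ordering guarantees than this in-memory FIFO.

```python
import queue
import threading
from typing import List, Optional

events: "queue.Queue[Optional[dict]]" = queue.Queue()
stored: List[dict] = []  # stands in for the slow persistent store


def record_event(event: dict) -> None:
    """Producer side: enqueue and return immediately."""
    events.put(event)


def drain_worker() -> None:
    """Consumer side: persist events at its own pace."""
    while True:
        event = events.get()
        if event is None:       # sentinel: stop the worker
            events.task_done()
            break
        stored.append(event)    # a slow database write would go here
        events.task_done()


worker = threading.Thread(target=drain_worker, daemon=True)
worker.start()
for seq in range(5):
    record_event({"type": "play", "seq": seq})  # hypothetical event shape
events.put(None)
events.join()                   # wait until every queued item was processed
```

The producer never waits on the store, so a burst of writes only makes the queue longer instead of slowing down the user-facing path.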


Page 12: The AWS Cloud : Leveraging the State of the Art


A Real World Scenario

Question

What if the whole Data Center goes down? How do I keep my service available?

Answer

Amazon Data Center = Availability Zone


Page 13: The AWS Cloud : Leveraging the State of the Art


A Real World Scenario

Answer

Always deploy your code in multiple Availability Zones!

- Netflix deploys in 3 AZs in Virginia

- Best Practice : Always deploy enough capacity in each AZ to handle losing one AZ during peak

- Netflix follows this best practice!
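The best practice above comes down to simple arithmetic: if one of N zones disappears, the remaining N-1 must carry the full peak. A small worked example follows; the peak figure of 300 instances is invented for illustration and is not a Netflix number.

```python
import math


def instances_per_az(peak_instances: int, az_count: int) -> int:
    """Instances to run in each AZ so the other AZs still cover peak after losing one."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return math.ceil(peak_instances / (az_count - 1))


peak = 300                                   # hypothetical peak requirement
per_az = instances_per_az(peak, az_count=3)  # 150 per zone
total = per_az * 3                           # 450 in total
overhead = total / peak - 1                  # 0.5, i.e. 50% extra capacity
```

With three zones the safety margin costs 50% extra capacity; with four zones the same calculation gives 100 per zone, or about 33% extra.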


Page 14: The AWS Cloud : Leveraging the State of the Art

A Real World Scenario

Question

What if your Asian and European customers complain of slow response times?

Recall : Higher Response times, lower scalability

Answer

AWS has 8 global regions! Each region has between 3 and 4 AZs

- Netflix's launches in the UK and Ireland were out of the AWS EU-West Region


Page 15: The AWS Cloud : Leveraging the State of the Art


A Real World Scenario


Page 16: The AWS Cloud : Leveraging the State of the Art


A Real World Scenario

Other AWS Services:

- Elastic Map Reduce : Map-Reduce as a Service for analytics. Supports Pig and Hive

- ElastiCache : A hosted cache service (think Memcached as a Service)

What's Missing (or Coming Soon)?

- Discovery & Load Balancing for N-tier applications!

- In effect, we'd like ELB for internal traffic

- Crypto as a Service

- Currently, none of the services are cross-region! It's left to the user to transfer data or proxy requests between regions


Page 17: The AWS Cloud : Leveraging the State of the Art

Who Uses AWS? Netflix’s Cloud Architecture


Page 18: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture

Components

Many (~100) applications, organized in clusters (a.k.a. ASGs)

Clusters can be at different levels in the call stack

Clusters can call each other

[Diagram: call stack layers, top to bottom: ELB, NES, Discovery, NMTS, NBES, IAAS]

Page 19: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture Levels

NES : Netflix Edge Services

NMTS : Netflix Mid-tier Services

NBES : Netflix Back-end Services

IAAS : AWS IAAS Services

Discovery : Help services discover NMTS and NBES services


Page 20: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture Components (NES)

Overview

Any service that browsers and streaming devices connect to over the internet

They sit behind AWS Elastic Load Balancers (a.k.a. ELB)

They call clusters at lower levels


Page 21: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture Components (NES)

Examples

API Servers

Support the video browsing experience

Also allow users to modify their queue

Serve 1.4 billion calls/day

Streaming Control Servers

Support streaming video playback

Authenticate your Wii, PS3, etc...

Download DRM to the Wii, PS3, etc...

Return a list of CDN urls to the Wii, PS3, etc...


Page 22: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture

Components (NMTS)

Overview

Can call services at the same or lower levels

Other NMTS

NBES, IAAS

Not NES

Exposed through our Discovery service
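What "exposed through Discovery" means in practice can be sketched as a registry that maps a service name to its live instances. The talk does not describe how Netflix's Discovery service is implemented, so the `DiscoveryRegistry` class, the service name, and the addresses below are purely hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List


class DiscoveryRegistry:
    """Maps a service name to the addresses of its running instances."""

    def __init__(self) -> None:
        self._instances: DefaultDict[str, List[str]] = defaultdict(list)

    def register(self, service: str, address: str) -> None:
        if address not in self._instances[service]:
            self._instances[service].append(address)

    def deregister(self, service: str, address: str) -> None:
        if address in self._instances[service]:
            self._instances[service].remove(address)

    def lookup(self, service: str) -> List[str]:
        # Return a copy so callers cannot mutate the registry by accident.
        return list(self._instances.get(service, []))


registry = DiscoveryRegistry()
registry.register("queue-service", "10.0.1.5:7001")   # hypothetical mid-tier instances
registry.register("queue-service", "10.0.1.6:7001")
registry.deregister("queue-service", "10.0.1.5:7001")
live = registry.lookup("queue-service")
```

Callers look up a name rather than a fixed host, so instances can come and go behind an ASG without any caller configuration changing.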


Page 23: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture Components (NMTS)

Examples

Netflix Queue Servers

Modify items in the user's movie queue

Viewing History Servers

Record and track all streaming movie watching

SIMS Servers

Compute and serve user-to-user and movie-to-movie similarities


Page 24: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture Components (NBES)

Overview

A back-end, usually 3rd party, open-source service

Leaf in the call tree. Cannot call anything else
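The layering rules from these slides (NES calls lower levels, NMTS calls the same or lower levels but never NES, NBES is a leaf) can be written down as one small predicate. This is a sketch for illustration: the numeric level values are arbitrary, and treating NES-to-NES calls as disallowed and IAAS as a non-caller are assumptions the slides do not spell out.

```python
LEVELS = {"NES": 0, "NMTS": 1, "NBES": 2, "IAAS": 3}  # 0 is the top of the call stack


def may_call(caller: str, callee: str) -> bool:
    """Return True if the layering rules allow caller -> callee."""
    if caller == "NES":
        # Edge services call clusters at lower levels only (NES -> NES is assumed disallowed).
        return LEVELS[callee] > LEVELS["NES"]
    if caller == "NMTS":
        # Mid-tier services may call the same or lower levels, never NES.
        return LEVELS[callee] >= LEVELS["NMTS"]
    # NBES is a leaf in the call tree; IAAS is treated as one too (assumption).
    return False


allowed = [(a, b) for a in LEVELS for b in LEVELS if may_call(a, b)]
```

A check like this could run in a test suite or at dependency-declaration time to catch a back-end service that starts calling upward.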


Page 25: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture

Components (NBES)

Examples

Cassandra Clusters

Cassandra is our new cloud database and stores all sorts of data to support application needs

Zookeeper Clusters

Our distributed lock service and sequence generator

Memcached Clusters

Typically caches things that we store in S3 but need to access quickly or often


Page 26: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture Components (IAAS)

Examples

AWS S3

Large-sized data (e.g. video encodes, application logs, etc.) is stored here, not in Cassandra

AWS SQS

Amazon's message queue, used to send events (e.g. Facebook network updates are processed asynchronously over SQS)


Page 27: The AWS Cloud : Leveraging the State of the Art

Netflix’s Cloud Architecture

Architecture Pros

Horizontally scalable at every level

Should give us maximum availability

Architecture Cons

A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation

Latency can be a concern

EC2 instances in AWS can die at any time!

A lot of moving parts


Page 28: The AWS Cloud : Leveraging the State of the Art

Dealing with the Cons!

We have a little help


Page 29: The AWS Cloud : Leveraging the State of the Art

Simian Army

Prevention (& Early Detection) is the best medicine


Page 30: The AWS Cloud : Leveraging the State of the Art

Simian Army

• Chaos Monkey

• Simulates hard failures in AWS by killing a few instances per ASG (i.e. Auto Scale Group)

• Similar to how EC2 instances can be killed by AWS with little warning

• Tests Netflix's ability to gracefully deal with broken connections, interrupted calls, etc.

• Verifies that all services are running within the protection of AWS Auto Scale Groups, which reincarnate killed instances

• If not, the Chaos Monkey will win!
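The core of the idea, picking a few random victims in every ASG, fits in a few lines. This is a sketch of the concept rather than Netflix's tool: the function names, the ASG names, and the instance IDs are hypothetical, and nothing here talks to AWS.

```python
import random
from typing import Dict, List


def pick_victims(asgs: Dict[str, List[str]], kills_per_asg: int,
                 rng: random.Random) -> Dict[str, List[str]]:
    """Choose up to `kills_per_asg` random instances in every ASG."""
    return {
        name: rng.sample(instances, min(kills_per_asg, len(instances)))
        for name, instances in asgs.items()
    }


def kill(asgs: Dict[str, List[str]], victims: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the surviving instances per ASG after the victims are terminated."""
    return {
        name: [i for i in instances if i not in victims.get(name, [])]
        for name, instances in asgs.items()
    }


asgs = {                                   # hypothetical clusters
    "api": ["i-a1", "i-a2", "i-a3"],
    "queue": ["i-q1", "i-q2"],
}
victims = pick_victims(asgs, kills_per_asg=1, rng=random.Random(42))
survivors = kill(asgs, victims)
```

If every cluster sits inside an ASG, the survivors are topped back up to the minimum pool size automatically; a service running outside one simply stays smaller, which is how the monkey "wins".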


Page 31: The AWS Cloud : Leveraging the State of the Art

Simian Army

• Latency Monkey

• Simulates soft failures -- i.e. a service gets slower

• Injects random delays in servers!

• Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder problem of delays

• Delays cause Thundering Herds (outside of the scope of this talk!)
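Graceful degradation under an injected delay can be sketched with a timeout and a fallback value. The example below is an assumption-laden illustration: `slow_service`, the returned strings, and the timeout values are invented, and the delay is passed in explicitly where a real Latency Monkey would choose it at random.

```python
import concurrent.futures
import time


def slow_service(injected_delay_s: float) -> str:
    time.sleep(injected_delay_s)  # the delay a Latency Monkey would inject
    return "personalized list"


def call_with_fallback(injected_delay_s: float, timeout_s: float) -> str:
    """Call the service, but give up and degrade if it takes too long."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(slow_service, injected_delay_s)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return "default list"     # degraded but still useful answer
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


healthy = call_with_fallback(injected_delay_s=0.0, timeout_s=1.0)
degraded = call_with_fallback(injected_delay_s=0.3, timeout_s=0.05)
```

The caller's response time is bounded by the timeout rather than by the slow dependency, which is the property the Latency Monkey is meant to test.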


Page 32: The AWS Cloud : Leveraging the State of the Art

Simian Army

Does this solve all of our issues?


Page 33: The AWS Cloud : Leveraging the State of the Art

Simian Army

The infinite cloud is infinite when your needs are moderate!

To ensure fairness among tenants, AWS meters or limits every resource

Hence, we hit limits quite often. Our “velocity” is limited by how long it takes for AWS to turn around and raise the limit -- a few hours!


Page 34: The AWS Cloud : Leveraging the State of the Art

Simian Army

• Limits Monkey

• Checks once an hour whether we are approaching one of our limits and triggers alerts for us to proactively reach out to AWS!

• Conformity & Janitor Monkeys

• Find and clean up orphaned resources (e.g. EC2 instances that are not in an ASG, unreferenced security groups, ELBs, ASGs, etc.) to increase head-room

• Buys us more time before we run out of resources and also saves us $$$$
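Both checks reduce to simple comparisons once the inventory is in hand. The sketch below assumes the usage counts, limits, and instance lists have already been collected somehow; the resource names, the 80% alert threshold, and all the numbers are hypothetical.

```python
from typing import Dict, List, Set


def approaching_limits(usage: Dict[str, int], limits: Dict[str, int],
                       threshold: float = 0.8) -> List[str]:
    """Resources whose usage has reached `threshold` of the account limit."""
    return sorted(name for name, used in usage.items()
                  if used >= threshold * limits[name])


def orphaned_instances(all_instances: Set[str],
                       asg_members: Dict[str, Set[str]]) -> Set[str]:
    """Instances that do not belong to any ASG (cleanup candidates)."""
    protected = set().union(*asg_members.values())
    return all_instances - protected


usage = {"ec2_instances": 950, "elbs": 10, "security_groups": 85}   # made-up numbers
limits = {"ec2_instances": 1000, "elbs": 50, "security_groups": 100}
alerts = approaching_limits(usage, limits)

orphans = orphaned_instances(
    {"i-1", "i-2", "i-3", "i-4"},
    {"api": {"i-1", "i-2"}, "queue": {"i-3"}},
)
```

Run on a schedule (the slide mentions once an hour for the limits check), the first function feeds alerts and the second feeds a cleanup list.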


Page 35: The AWS Cloud : Leveraging the State of the Art

Questions?

Sid Anand

@r39132

http://www.linkedin.com/in/siddharthanand
