Building a cloud service on a cloud infrastructure. Also, cloud.

Mikhail Panchenko, Surge 2011

Upload: mikhail-panchenko

Post on 16-Apr-2017

Who Am I?

Pancakes
Infrastructure Engineer at SimpleGeo
Backend Engineer at Flickr before that
Backend and Frontend Engineer at Yahoo! Ops/Tools before that
Philosophy, Economics, and French major before that

@mihasya
[email protected]

Tools for mobile/geo developers
Primarily focused on services, some data-oriented APIs
PaaS, I guess? I've lost track a bit
Availability, redundancy part of brand

Our outage = your outage
No pressure

Agenda

Goals

A little bit of theory

Challenges in The Cloud

General Architecture

Implementation Details

Architectural Goals

High availability

Linear scalability

Elasticity/Flexibility

Redundancy/Fault Tolerance

Read: don't wake me up, please

Sound Familiar?

Some Theory, Food for Thought

The Internets as Complex Systems

http://www.amazon.com/Normal-Accidents-Living-High-Risk-Technologies/dp/0691004129

"Complex interactions are those of unfamiliar sequences, or unplanned and unexpected sequences, and either not visible or not immediately comprehensible."

Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 78). Kindle Edition.

"The notion of baffling interactions is increasingly familiar to all of us. [...] As systems grow in size and in the number of diverse functions they serve, and are built to function in ever more hostile environments, increasing their ties to other systems, they experience more and more incomprehensible or unexpected interactions. They become more vulnerable to unavoidable system accidents."

Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 72). Kindle Edition.

Fortunately, This Is Only The Internet

"The beauty of this is its simplicity. Once a plan gets too complex, everything can go wrong."

Walter Sobchak, The Big Lebowski

Interactions

Linear vs Complex

Coupling

Tight vs Loose

Three Mile Island

"... they found that radioactive water was not traveling to the tank they intended, but because of complex flow and pressure interactions, was going to a different, wrong tank, which also overflowed, this time in the auxiliary building."

Charles Perrow. Normal Accidents: Living with High-Risk Technologies (pp. 22-23). Kindle Edition.

Amazon Web Services

"The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."

"Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region"

http://aws.amazon.com/message/65648/

Common Theme

Previously independent systems become coupled as a result of unanticipated interactions, leading to fundamentally surprising results

When pumping radioactive water into the wrong tank, the behavior of the program is undefined

But where does The Cloud come in??

The Trifle Analogy

Photo by mathematically_impossible

A complex system consisting of complex subsystems

Photo by wwarby

The Trifle Analogy

Original photos by mathematically_impossible and miheco

Tightly coupled to a complex system over which you have no control and into which you have no insight

Photo by 20after4

Recall "Baffling Interactions"

"The notion of baffling interactions is increasingly familiar to all of us. [...] As systems grow in size and in the number of diverse functions they serve, and are built to function in ever more hostile environments, increasing their ties to other systems, they experience more and more incomprehensible or unexpected interactions. They become more vulnerable to unavoidable system accidents."

Charles Perrow. Normal Accidents: Living with High-Risk Technologies (p. 72). Kindle Edition.

DECOUPLE DECOUPLE DECOUPLE

( also, simplify )

Photo by erikcharlton

Decouple Your Subsystems

Shared resources are the most common source of unexpected interaction

Resist temptation to double up on roles

Use queues, caches as buffers
NOTE: those are complex subsystems of their own
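
One way to read "caches as buffers" is a read-through cache that keeps answering from its last good value while the backend is failing. The sketch below is an illustration only, not something shown in the talk; the class name `StaleOnErrorCache`, the `fetch` callback, and the TTL are all assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Read-through cache that serves the last good value if the backend fails."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical backend lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, value)

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: the backend is not touched
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[1]  # backend failed: degrade to the stale value
            raise  # nothing cached, so the failure has to surface
        self._entries[key] = (now, value)
        return value
```

As the slide warns, the cache is now a subsystem with its own failure modes: stale reads are the price of keeping the reader decoupled from the backend.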

Decouple Your Subsystems

Explicit Decoupling

CPU Affinity
Webserver on 1-7; SSH etc on 8
Crude, but gets the job done

More robust solutions - containers
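
The CPU affinity split can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the helper names are made up, Linux numbers CPUs from 0 (so the slide's "1-7 and 8" becomes 0-6 and 7), and `os.sched_setaffinity` exists only on Linux.

```python
import os
from typing import Iterable, Set, Tuple


def split_cpus(cpus: Iterable[int], reserved: int = 1) -> Tuple[Set[int], Set[int]]:
    """Split CPU ids into (service_cpus, admin_cpus); the highest ids are reserved."""
    ordered = sorted(set(cpus))
    if not 0 < reserved < len(ordered):
        raise ValueError("need at least one CPU on each side of the split")
    return set(ordered[:-reserved]), set(ordered[-reserved:])


def pin_current_process(cpus: Set[int]) -> bool:
    """Pin this process to `cpus`; returns False where the call is unavailable."""
    if not hasattr(os, "sched_setaffinity"):  # Linux-only API
        return False
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return True
```

On an 8-core box, `split_cpus(range(8))` leaves CPUs 0-6 for the webserver and CPU 7 for SSH and other administrative processes, so a runaway service cannot starve the shell you need to fix it.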

Decouple Your Functionality

Service architecture

Each service does one thing well

Easier to measure, understand, and accommodate resource demands

Reduce potential for interactions, cross-functional failure

Decouple from Your Environment with Configuration Management

Decouple from your platform (OS/kernel)
Easy to test/bench potential candidates
Easy to migrate if you find a winner
This is especially important when dealing with cloud

Automate as much of deploy/bootstrap process as possible

Probably won't help much during a provider outage due to stampede
BUT: DirectConnect
You might not always be in the cloud...

Decouple Your Datacenters

Most robust redundancy mechanism

Hot-hot keeps you on your toes

Simplifies, not just for the cloud
Yahoo! now foregoing datacenter features like HVAC
"If it gets too hot in Washington, turn that DC off for a while"
I'm sure they're not the only ones

Decouple Your Datacenters

"AZ" - Basic building block for EC2

This is the level they (theoretically) decouple at

They are probably thinking along the same lines we are - must be able to turn off one AZ without impact in the other
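
Being able to turn off one AZ without impact implies a capacity budget. The arithmetic below is a back-of-the-envelope sketch, not a figure from the talk; it assumes identical zones and traffic that redistributes evenly across the survivors.

```python
def max_safe_utilization(zones: int, zones_lost: int = 1) -> float:
    """Highest per-zone utilization that still fits all traffic after losing zones."""
    if zones_lost < 0 or zones_lost >= zones:
        raise ValueError("at least one zone must survive")
    return (zones - zones_lost) / zones


def load_after_failure(utilization: float, zones: int, zones_lost: int = 1) -> float:
    """Per-zone utilization once traffic from the lost zones is spread evenly."""
    survivors = zones - zones_lost
    if survivors <= 0:
        raise ValueError("at least one zone must survive")
    return utilization * zones / survivors
```

With two zones, each has to stay at or below half of its capacity to absorb the other; with four, the budget rises to three quarters.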

( there's a hidden interaction there )

Every datacenter as an independent microcosm of your overall architecture

The Birds 'n' the Bees

Bird's Eye View

Photo by reschroederimages

Bird's Eye View

( note the absence of specifics )

Bird's Eye View

Maintenance - Divide & Conquer

Local Degradation - Divide & Conquer

Incompatible Upgrade - Guess!

Incompatible Upgrade - Yay!

Baffling Single Node Failure

202 Accepted

Spike in Write Traffic

Really simple operational steps for stressful tasks & situations

Temporally decouple the problem from the resolution
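
A rough illustration of the "202 Accepted" idea: the front end acknowledges a write as soon as it is buffered, and the write is applied to the datastore later, once the store is healthy. The `WriteBuffer` class and its limits are hypothetical, and the in-memory deque stands in for a durable queue; the sketch is single-threaded.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple

Write = Dict[str, str]


class WriteBuffer:
    """Accept writes now, apply them later."""

    def __init__(self, max_pending: int) -> None:
        self._pending: Deque[Write] = deque()
        self._max_pending = max_pending

    def accept(self, write: Write) -> Tuple[int, str]:
        """Return an HTTP-style status without touching the datastore."""
        if len(self._pending) >= self._max_pending:
            return 503, "Service Unavailable"  # the buffer itself is full
        self._pending.append(write)
        return 202, "Accepted"

    def drain(self, store: Callable[[Write], None]) -> int:
        """Apply buffered writes in order; a failing write stays queued."""
        applied = 0
        while self._pending:
            store(self._pending[0])  # may raise; nothing is removed in that case
            self._pending.popleft()
            applied += 1
        return applied
```

A spike in write traffic or a sick storage node then shows up as a growing backlog rather than as failed requests, which is what lets the fix wait until morning.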

Go back to sleep

Photo by joshme17

Now, how about those specifics?

Write Path

ELB

Dynamic Load Balancing

Flexible virtual IP

Easy to add/remove AZs

Uses healthchecks to automatically evict nodes
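
Healthcheck-driven eviction can be sketched generically as below. This is not ELB's implementation; the class, the `probe` callback, and the threshold defaults are assumptions chosen to show the idea of consecutive-result thresholds.

```python
from typing import Callable, Dict, List


class HealthCheckedPool:
    """Evict a node after N consecutive failed probes, restore it after M passes."""

    def __init__(self, nodes: List[str], unhealthy_after: int = 2,
                 healthy_after: int = 2) -> None:
        self._unhealthy_after = unhealthy_after
        self._healthy_after = healthy_after
        self._in_service: Dict[str, bool] = {node: True for node in nodes}
        self._streak: Dict[str, int] = {node: 0 for node in nodes}

    def run_checks(self, probe: Callable[[str], bool]) -> None:
        """Probe every node once and update its in-service state."""
        for node, in_service in self._in_service.items():
            ok = probe(node)
            if ok == in_service:
                self._streak[node] = 0  # result agrees with the current state
                continue
            self._streak[node] += 1
            needed = self._healthy_after if ok else self._unhealthy_after
            if self._streak[node] >= needed:
                self._in_service[node] = ok
                self._streak[node] = 0

    def in_service(self) -> List[str]:
        return [node for node, up in self._in_service.items() if up]
```

Requiring several consecutive results in either direction keeps a single slow response from flapping a node in and out of rotation.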

Gate - "Layer 8 Proxy"

Lightweight Node.js daemon

OAuth

Rate Limiting

Basic routing to actual services
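
The three responsibilities listed for Gate can be sketched together. Gate itself is a Node.js daemon whose code is not shown here, so what follows is a Python stand-in under assumptions: the API-key check is a placeholder for OAuth verification, the token bucket is one common rate-limiting technique rather than Gate's actual one, and the route table and upstream URLs are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TokenBucket:
    """Allow `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._tokens = float(burst)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class Gate:
    """Authenticate, rate limit per key, then pick an upstream by path prefix."""

    def __init__(self, routes: Dict[str, str], rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._routes = routes
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def handle(self, api_key: Optional[str], path: str) -> Tuple[int, Optional[str]]:
        if not api_key:
            return 401, None  # placeholder for real OAuth signature checks
        if api_key not in self._buckets:
            self._buckets[api_key] = TokenBucket(self._rate, self._burst, self._clock)
        if not self._buckets[api_key].allow():
            return 429, None
        for prefix in sorted(self._routes, key=len, reverse=True):
            if path.startswith(prefix):  # longest matching prefix wins
                return 200, self._routes[prefix]
        return 404, None
```

Keeping these concerns in one thin layer means the services behind it never see unauthenticated or over-limit traffic, and each can be written in whatever language suits it.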

Recall "Decouple Your Functionality"

Services - Pick Your Own Adventure

Node.js and Python
Some people just hate Node.js

Can be anything, as long as Gate can talk to it

( another reason to decouple )

Highly specialized

RabbitMQ

A grenade for our knife-fight

Very flexible - more than we need
Simplification candidate

New persistor in >= 1.3 - degradation over failure

See talk at 1:30PM

Cassandra

A mostly-textbook DHT

Homogenous distributed model

Random load distribution

Partition tolerance
A perfect foundation for our architecture
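
For readers who have not met a DHT, the textbook idea is a hash ring: every node and every key hashes to a token, and a key belongs to the first nodes clockwise from its token. The sketch below shows that generic technique and is not Cassandra's code; the `Ring` class, the node names, and the replica count are made up, and MD5 is used only as a convenient uniform hash.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    """Hash ring in which every node is a peer and hashing spreads the keys."""

    def __init__(self, nodes: List[str], replicas: int = 2) -> None:
        unique = set(nodes)
        if not 0 < replicas <= len(unique):
            raise ValueError("replicas must be between 1 and the number of nodes")
        self._replicas = replicas
        self._ring = sorted((token(node), node) for node in unique)
        self._tokens = [t for t, _ in self._ring]

    def owners(self, key: str) -> List[str]:
        """Return the nodes holding `key`, walking clockwise from its token."""
        start = bisect.bisect_left(self._tokens, token(key))
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self._replicas)]
```

Because any node can compute `owners(key)` on its own, there is no coordinator to lose, and a key's replicas stay put when an unrelated node leaves the ring.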

Partition Tolerance

It's not just for outages

Recall "Divide & Conquer"

This too is a partition

Thank You!

@mihasya

[email protected]