high-availability infrastructure in the cloud - evan cooke - web 2.0 expo nyc 2011

twilioCLOUD COMMUNICATIONS

SCALING HIGH-AVAILABILITYINFRASTRUCTURE

IN THE CLOUD

OCT 11, 2011, WEB 2.0EVAN COOKECO-FOUNDER & CTO

High-AvailabilitySounds good, we need that!

Yummmm Technical Meat!

Availability = Uptime

Uptime + Downtime

Availability % Downtime/yr Downtime/mo

99.9% ("three nines") 8.76 hours 43.2 minutes

99.99% ("four nines") 52.56 minutes 4.32 minutes

99.999% ("five nines") 5.26 minutes 25.9 seconds

99.9999% ("six nines") 31.5 seconds 2.59 seconds

Availability % Downtime/yr Downtime/mo

99.9% ("three nines") 8.76 hours 43.2 minutes

99.99% ("four nines") 52.56 minutes 4.32 minutes

99.999% ("five nines") 5.26 minutes 25.9 seconds

99.9999% ("six nines") 31.5 seconds 2.59 seconds

Can’t rely on human to respond in a 5 min window! Must use automation.

2.5 Hours Down

“...we had to stop all traffic to this database cluster, which meant turning off the site. Once t h e d a t a b a s e s h a d recovered and the root cause had been fixed, we slowly al lowed more people back onto the site.”

September 23, 2010

11 Hours DownOctober 4, 2010

“. . .At 6:30pm EST, we d e t e r m i n e d t h e m o s t effective course of action w a s t o r e - i n d e x t h e [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”

“...Before every run of our test suite we destroy t h e n r e - c r e a t e t h e database... Due to the c o n fi g u r a t i o n e r r o r GitHub's production d a t a b a s e w a s dest royed t hen re -created. Not good.”

HoursNovember 14, 2010

Happens to the best

Causes of DowntimeLack of best practice change controlLack of best practice monitoring of the relevant componentsLack of best practice requirements and procurementLack of best practice operationsLack of best practice avoidance of network failuresLack of best practice avoidance of internal application failuresLack of best practice avoidance of external services that failLack of best practice physical environmentLack of best practice network redundancyLack of best practice technical solution of backupLack of best practice process solution of backupLack of best practice physical locationLack of best practice infrastructure redundancyLack of best practice storage architecture redundancy

E. Marcus and H. Stern, Blueprints for high availability, second edition. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.

DataPersistence

ChangeControl

change control monitoring of the relevant components

requirements procurement

operations

Operations

avoidance of internal app

failuresavoidance of

external services that fail

storage architecture redundancy

technical solution of

backup

process solution of

backup

Datacenter

avoidance of network failures

physical environment

network redundancy

physical location

infrastructure redundancy

Cloud Non-Cloud

2.5 Hours Down

“...we had to stop all traffic to this database cluster, which meant turning off the site. Once t h e d a t a b a s e s h a d recovered and the root cause had been fixed, we slowly al lowed more people back onto the site.”

September 23, 2010

11 Hours DownOctober 4, 2010

“. . .At 6:30pm EST, we d e t e r m i n e d t h e m o s t effective course of action w a s t o r e - i n d e x t h e [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”

“...Before every run of our test suite we destroy t h e n r e - c r e a t e t h e database... Due to the c o n fi g u r a t i o n e r r o r GitHub's production d a t a b a s e w a s dest royed t hen re -created. Not good.”

HoursNovember 14, 2010

Happens to the best

Database Database Database

ChangeControl

DataPersistence

ChangeControl

operations

Operations

backup

process solution of

backup

Datacenter

network redundancy

physical location

TodayData PersistenceChange Control

lessons learned@twilio

Developer

End User

Carriers Inbound CallsOutbound Calls

Mobile/Browser VoIPVoice

PhoneNumbers

Send To/From Phone Numbers

Short Codes

Dynamically Buy Phone Numbers

Twilio provides web service APIs to automate Voice and SMS communications

1 Year

100x Growth in Tx/Day over 1 Year

10Servers

2009100’s ofServers

10’s ofServers

• 100’s of prod hosts in continuous operation

• 80+ service types running in prod

• 50+ prod database servers

• Prod deployments several times/day across 7 engineering teams

• Frameworks

- PHP for frontend components

- Python Twisted & gevent for async network services

- Java for backend services

• Storage technology

- MySQL for core DB services

- Redis for queuing and messaging

Data persistence is hard(especially in the cloud)

Data persistence is hardData persistence is the hardest

technical problem most scalable SaaS businesses face

What is data persistence?

Stuff that looks like this

What is data persistence?

DatabasesQueues

Files K/VC C D D

Tier 3

Tier 2 B B B B

QTier 1

Incoming Requests

DataPersistence!

• Difficult to change structure

- Huge inertia e.g., large schema migrations

• Painful to recover from disk/node failures

- “just boot a new node” doesn’t work

• Woeful performance/scalability

- I/O is huge bottleneck in modern servers (e.g. EC2)

• Freak’in complex!!!

- Atomic transactions/rollback, ACID, blah blah blah

Why is persistence so hard?

Difficult to Change Structure

Id Name Value

1 Bob 12

2 Jane 78

3 Steve 56

Id Name

2 Jane

3 Steve

...500 million rows

ALTER TABLE names DROP COLUMN Value

HOURS later...

‣ You live with data decisions for a long time

Painful to Recover from Failures

Primary

Secondary

Data on secondary?How much data?R/W consistency?

‣ Because of complexity, failover is human process

Woeful Performance/Scalability

‣ Poor I/O on cloud today, 100x slower than real HW

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %utilsda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00sdb 169.31 111.88 57.43 469.31 0.90 2.25 12.24 2.29 4.36 1.12 59.01sdc 178.22 110.89 59.41 396.04 0.93 1.98 13.08 1.58 3.50 1.18 53.56sdd 145.54 102.97 50.50 384.16 0.78 1.90 12.63 1.00 2.34 1.03 44.85sde 166.34 95.05 54.46 337.62 0.85 1.69 13.27 1.12 2.84 1.22 47.92md0 0.00 0.00 880.20 2007.92 3.44 7.82 7.99 0.00 0.00 0.00 0.00

~10 MB/s write

m1.xlargeraid0 4x ephemeral

DB DB DB DB DB DB

‣ Difficult to horizontally scale in the cloud

BUFFER POOL AND MEMORY----------------------Total memory allocated 11655168000; in additional pool allocated 0Internal hash tables (constant factor + variable factor) Adaptive hash index 223758224 (179959576 + 43798648) Page hash 11248264 Dictionary cache 45048690 (44991344 + 57346) File system 84400 (82672 + 1728) Lock system 28180376 (28119464 + 60912) Recovery system 0 (0 + 0) Threads 428608 (406936 + 21672)Dictionary memory allocated 57346Buffer pool size 693759Buffer pool size, bytes 11366547456Free buffers 1Database pages 691085Old database pages 255087Modified db pages 326490Pending reads 0Pending writes: LRU 0, flush list 0, single page 0Pages made young 497782847, not young 024.78 youngs/s, 0.00 non-youngs/sPages read 447257683, created 16982810, written 40515343324.82 reads/s, 1.14 creates/s, 33.36 writes/sBuffer pool hit rate 993 / 1000, young-making rate 7 / 1000 not 0 / 1000Pages read ahead 0.00/s, evicted without access 0.39/sLRU len: 691085, unzip_LRU len: 0I/O sum[2753]:cur[2], unzip sum[0]:cur[0]

• Incredibly complex configuration

- Billion knobs and buttons

- Whole companies exist just to tune DB’s

• Lots of consistency/transactional models

• Multi-region data is unsolved - Facebook and Google struggle

@!#$%^&* Complex

Deep breath, step backThink about each problem(use @twilio examples)

•Software that runs in the cloud•Open source

• Don’t have structure

- key/value databases (SimpleDB, Cassandra)

- document-orient databases (CouchDB, MongoDB)

• Don’t store a lot of data...

Difficult to Change Structure1

• Outsource data as much as possible

• But NOT to your customers

Don’t Store Stuff1

• Aggressively archive and move data offline

Don’t Store Stuff

~500MRows

S3/SimpleDB

(keep indices in memory)

Build UX that supports longer/restricted access times to older data

• Avoid stateful systems/architectures where possible

Don’t Store Stuff

Browser

SessionDB

Cookie:SessionID

• Avoid stateful systems/architectures where possible

Don’t Store Stuff

Browser

SessionDB

Cookie:enc($session)

Store state in client browser

• Avoid single points of failure

- E.g., master-master (active/active)

- Complex to set up, complex failure modes

- Sometimes it’s the only solution

- Lots of great docs on web

• Minimize number of stateful node, separate stateful & stateless components...

Separate Stateful and Stateless Components

App AReq App B App C

App BOn failure, even if we boot replacement, we lose data

On failure, even if we boot replacement, we lose data

Keep connection open for whole app path!(hint: use evented framework)

App BApp AApp A App B App CApp C

On failure, we don’t lose a single request

Twilio’s SMS stackuses this approach

• Avoid single points of failure

- E.g., master-master (active/active)

- Complex to set up, complex failure modes

- Sometimes it’s the only solution

- Lots of great blog docs on web

• Minimize number of stateful nodes, separate stateful & stateless components

• Build a data change control process to avoid mistakes and errors...

• 100’s of prod hosts in continuous operation

• 80+ service types running in prod

• 50+ prod database servers

• Prod deployments several times/day across 7 engineering teams

Components deployed at different frequencies: Partially Continuous Deployment

WebsiteContent

WebsiteCode

PHP/Rubyetc.

RESTAPI

Python/Javaetc.

Big DBSchema

DeploymentFrequency(Risk)

4 buckets

WebsiteContent

One Click

WebsiteCode

One ClickCI Tests

RESTAPI

One Click

CI TestsHuman Sign-off

Big DBSchema

Human Assisted Click

CI TestsHuman Sign-off

DeploymentProcesses

• If disk I/O is poor, avoid disk

- Tune tune tune. Keep your indices in memory

- Use an in-memory datastore e.g., Redis and configure replication such that if you have a master failure, you can always promote a slave

• When disk I/O saturates, shard

- LOTs of sharding info on web

- Method of last resort, single point of failure becomes multiple single points of failure

@#$%^&* Complex

• Bring the simplest tool to the job

- Use a strictly consistent store only if you need it

- If you don’t need HA, don’t add the complexity

• There is no magic database. Decompose requirements, mix-and-match datastores as needed...

Magic Database does it all. Consistency, Availability, Partition-tolerance, it's got allthree.

Magic Database

Twilio Data Lifecycle

CREATE

name:foo

status:

ret:INIT

UPDATE

name:foo

status:

ret:QUEUED

UPDATE

name:foo

status:

ret:GOING

name:foo

status:

ret:DONE

Twilio Examples: Call, SMS, Conference Other Examples: Order, Workflow, $

CREATE

name:foo

status:

ret:INIT

UPDATE

name:foo

status:

ret:QUEUED

UPDATE

name:foo

status:

ret:GOING

name:foo

status:

ret:DONE

In-Flight Post-Flight

Twilio Data Lifecycle4

• Atomically update part of a workflow

• Billing

• Log Access

• Analytics

• Reporting

ApplicationsTwilio Data Lifecycle

• Strict Consistency

• Key/Value

• ~20ms

• Eventual Consistency

• Range Queries w/ Filters

• ~200ms

High-AvailabilityProperties

Data Store BData Store A

Systems with very different access semantics

Strict Consistency

Key/Value

10k-1M

Logs(REST API)

Eventual consistencyRange queriesFiltered queries~200msBillions

ReportingEventual consistencyArbitrary queriesHigh LatencyBillions

BillingIdempotentAggregationKey/ValueBillions

PostgreSQL

Billing

Logs(REST API)

Reporting

SQL ShardedCassandra/AcunuMongoDbRiakCouchDb

SQL ShardedRedis

Hadoop

• Difficult to change structure

- Huge inertia e.g., large schema migrations

• Painful to recover from disk/node failures

- “just boot a new node” doesn’t work

• Woeful performance/scalability

- I/O is huge bottleneck in modern servers (e.g. EC2)

• Freak’in complex!!!

- Atomic transactions/rollback, ACID, blah blah blah

Why is persistence so hard?

Don’t store stuff!Go schema-less

Separate stateful/statelessChange control processes

Memory FTWShard

Decompose data lifecycleMinimize complexity

Files K/VC C D D

Tier 3

Tier 2 B B B B

QTier 1

Incoming Requests

ATier 1

B B B BTier 2

C C D D

Tier 3

SQLSQLQ

SimpleDBS3

Aggregate into HA queuesMaster-MasterMySQL

Move file store to S3

Move K/V toSimpleDB w/local cache

Idempotentrequest path

DataPersistence

ChangeControl

operations

Operations

backup

process solution of

backup

Datacenter

network redundancy

physical location

SCALING HIGH-AVAILABILITYINFRASTRUCTURE IN THE CLOUD

Focus on dataHow you store it

When you can delete itControl changes to it

Where you store it

Billing

Logs(REST API)

Reporting

Hadoop

HAqueue

Simplemulti-AZ

multi-regionconsistent

Open Problems...Massively scalable

range queriesfilterable~200ms

Simple HA

Hadoop

Massively scalable

aggregator

twiliohttp://www.twilio.com

@emcooke

high-availability infrastructure in the cloud - evan cooke - web 2.0 expo nyc 2011

Technology

brandi cooke

breen cooke

data reduction for the scalable automated analysis of...

- 1 - data reduction for the scalable automated analysis of...

graham cooke, maxime

fall 2 cooke

kate tilbury cooke

n-version antivirus in the network cloud - oberheide ·...

data reduction for the scalable automated analysis...

building a great api - evan cooke, cloudstock, december 2010

the internet motion sensor: a distributed blackhole...

spring 3 cooke

the internet motion sensor: a distributed global scoped...

h8 jay cooke

cooke methodology

cooke indictment

cooke greatpianistsonpianoplaying

share a cooke

austin cooke

authors:jon oberheide, kaushik veeraraghavan, evan cooke,...