Transcript
Page 1:

SECONDSITE: DISASTER TOLERANCE AS A SERVICE

Shriram Rajagopalan, Brendan Cully, Ryan O’Connor, Andrew Warfield

Page 2:

FAILURES IN A DATACENTER

Page 3:

TOLERATING FAILURES IN A DATACENTER

The initial idea behind Remus was to tolerate datacenter-level failures.

REMUS
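Remus replicates a VM by streaming frequent checkpoints to a backup host, buffering the VM's outbound network packets each epoch and releasing them only after the backup acknowledges the checkpoint, so clients never observe state the backup does not have. A toy sketch of that ordering (the classes and method names here are illustrative, not the actual Xen/Remus API):

```python
# Sketch of Remus-style checkpoint replication (illustrative only).
# Key invariant: speculative network output is held until the
# checkpoint it depends on is acknowledged by the backup.

class NetBuffer:
    def __init__(self):
        self.held, self.released = [], []
    def buffer(self, pkt):
        self.held.append(pkt)          # hold output produced this epoch
    def release(self):
        self.released += self.held     # safe to emit after the ack
        self.held = []

class Backup:
    def __init__(self):
        self.state = {}
    def apply(self, dirty):            # receive + acknowledge a checkpoint
        self.state.update(dirty)
        return True                    # ack

def replicate_epoch(vm_dirty_pages, backup, net_buffer):
    """One epoch: copy dirty state, send to backup, release output on ack."""
    acked = backup.apply(dict(vm_dirty_pages))  # stream dirty pages
    if acked:
        net_buffer.release()                    # speculative output now safe
    vm_dirty_pages.clear()                      # start tracking next epoch

buf, backup = NetBuffer(), Backup()
dirty = {0x1000: b"page-A"}
buf.buffer("http-response-1")   # output produced during the epoch is held
replicate_epoch(dirty, backup, buf)
print(backup.state[0x1000], buf.released)
```

The usage at the bottom shows the invariant: the response packet only appears in `released` after the backup holds the page it depends on.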

Page 4:

CAN A WHOLE DATACENTER FAIL?

Yes! It’s a “Disaster”!

Page 5:

DISASTERS

Illustrative Image courtesy of TangoPango, Flickr.

“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.”

- Om Malik, GigaOM

“Truck driver in Texas kills all the websites you really use”

…Southlake FD found that he had low blood sugar

- valleywag.com

Page 6:

DISASTERS..

Water-main break cripples Dallas County computers, operations

The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays - keeping some prisoners in jail longer than normal.

- Dallas Morning News, Jun 2010

Page 7:

DISASTERS..

Page 8:

MORE FODDER BACK HOME

“An explosion … near our server bank … electrical box containing 580 fiber cables. The electrical box … was covered in asbestos … mandated the wearing of hazmat suits .... Worse yet, the dynamic rerouting — which is the hallmark of the internet … did not function. In other words, the perfect storm. Oh well. S*it happens.”

-Dan Empfield, Slowswitch.com - a Gossamer Threads customer.

Page 9:

DISASTER RECOVERY – THE OLD FASHIONED WAY

Storage replication between a primary and backup site.

Manually restore physical servers from backup images.

Data Loss and Long Outage periods.

Expensive Hardware – Storage Arrays, Replicators, etc.

Page 10:

STATE OF THE ART DISASTER RECOVERY

[Diagram: VirtualCenter Site Recovery Manager runs at both the Protected Site and the Recovery Site; Datastore Groups are kept in sync via Array Replication. When the Protected Site fails, its online VMs become unavailable and the offline VMs at the Recovery Site are powered on.]

Source: VMWare Site Recovery Manager – Technical Overview

Page 11:

PROBLEMS WITH EXISTING SOLUTIONS

Data Loss & Service Disruption (RPO ~15min, RTO ~few hours)

Complicated Recovery Planning (e.g. service A needs to be up before B, etc.)

Application Level Recovery

Bottom Line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.

Page 12:

DISASTER TOLERANCE AS A SERVICE ?

Our Vision

Page 13:

OVERVIEW

A Case for Commoditizing Disaster Tolerance

SecondSite – System Design

Evaluation & Experiences

Page 14:

PRIMARY & BACKUP SITES

5ms RTT

Page 15:

FAILOVER & FAILBACK WITHOUT OUTAGE

1. Primary Site: Vancouver; Backup Site: Kamloops

2. After failover: Primary Site: Kamloops

3. After resynchronization: Primary Site: Kamloops; Backup Site: Vancouver

Complete State Recovery (CPU, disk, memory, network)

No Application Level Recovery

Page 16:

MAIN CONTRIBUTIONS

Remus (NSDI ’08): checkpoint-based state replication; fully transparent HA; recovery consistency; no application-level recovery.

RemusDB (VLDB ’11): optimized server latency; replication bandwidth reduced by up to 80% using page delta compression and disk read tracking.

SecondSite (VEE ’12): failover arbitration in the wide area; stateful network failover over the wide area.
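The page delta compression idea above can be sketched simply: keep a copy of each page as last sent, compare the new contents against it, and transmit only the changed runs instead of the full page. A toy sketch of the idea (the encoding below is illustrative, not RemusDB's actual wire format):

```python
def delta_encode(old: bytes, new: bytes):
    """Encode `new` as (offset, changed_bytes) runs against the last-sent copy."""
    assert len(old) == len(new)
    runs, i = [], 0
    while i < len(new):
        if old[i] != new[i]:
            j = i
            while j < len(new) and old[j] != new[j]:
                j += 1                     # extend the run of changed bytes
            runs.append((i, new[i:j]))
            i = j
        else:
            i += 1
    return runs

def delta_decode(old: bytes, runs):
    """Reconstruct the new page by patching the old copy with the runs."""
    page = bytearray(old)
    for off, data in runs:
        page[off:off + len(data)] = data
    return bytes(page)

old = b"A" * 4096
new = b"A" * 100 + b"XYZ" + b"A" * (4096 - 103)
runs = delta_encode(old, new)
assert delta_decode(old, runs) == new
# only the 3 changed bytes (plus an offset) cross the wire, not the 4 KiB page
```

For checkpoint workloads that repeatedly dirty the same pages with small changes (database buffer pools, for instance), this is where the large bandwidth savings come from.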

Page 17:

CONTRIBUTIONS..

Page 18:

FAILURE DETECTION IN REMUS

[Diagram: Primary and Backup hosts on a LAN, each with two NICs. NIC2 on each host carries checkpoint traffic over a dedicated replication link; NIC1 connects to the external network.]

• A pair of independent dedicated NICs carries replication traffic.
• The Backup declares the Primary failed only if:
  • it cannot reach the Primary via NIC1 or NIC2, and
  • it can reach the external network via NIC1.
• Failure of the replication link alone results in Backup shutdown.
• Split brain occurs only when both NICs/links fail.

Page 19:

FAILURE DETECTION IN WIDE AREA DEPLOYMENTS

Cannot distinguish between link and node failure.

Higher chance of split brain, as the network is no longer reliable.

[Diagram: the LAN setup from the previous slide scaled out to a Primary Datacenter and a Backup Datacenter; the replication channel now crosses the Internet (WAN) instead of a dedicated link.]

Page 20:

FAILOVER ARBITRATION

Local Quorum of Simple Reachability Detectors.

Stewards can be placed on third party clouds.

Google App Engine implementation in ~100 LoC.

Provider/User could have other sophisticated implementations.
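A steward in this design is just a simple reachability detector: it records when each site last polled it and reports whether a site has been heard from recently. A minimal in-memory sketch (the real stewards ran on third-party clouds; this class and its default timeout are illustrative):

```python
import time

class Steward:
    """Reachability detector: answers 'have I heard from site X lately?'"""

    def __init__(self, timeout=10.0):
        # 10s here mirrors the configurable failure-detection timeout
        self.timeout = timeout
        self.last_poll = {}          # site name -> last poll timestamp

    def poll(self, site):
        """Called by a site to prove it is alive and can reach this steward."""
        self.last_poll[site] = time.time()

    def reachable(self, site):
        """True iff `site` has polled within the timeout window."""
        t = self.last_poll.get(site)
        return t is not None and (time.time() - t) < self.timeout

s = Steward(timeout=10.0)
s.poll("primary")
assert s.reachable("primary")
assert not s.reachable("backup")   # backup never polled this steward
```

The simplicity is the point: because a steward holds no protected state, it can be replicated cheaply across independent clouds and only needs to be trusted for liveness votes.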

Page 21:

FAILOVER ARBITRATION..

[Diagram: Stewards 1–5 sit between the Primary and Backup sites, which are connected by the replication stream. The quorum logic at each site polls every steward (POLL 1–POLL 5), under an a priori steward set agreement. Primary: “I need majority to stay alive.” Backup: “I need exclusive majority to failover.”]
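The two quorum rules on the slide can be sketched directly: the primary keeps running only while it can reach a majority of stewards, and the backup fails over only when it holds an exclusive majority, i.e. a majority of stewards that it can reach and that report the primary as unreachable. A sketch under those stated assumptions (function names and poll-result encoding are illustrative):

```python
def primary_may_stay_alive(poll_results):
    """poll_results: one bool per steward -- did the primary's poll succeed?"""
    reachable = sum(poll_results)
    return reachable > len(poll_results) // 2          # simple majority

def backup_may_failover(poll_results):
    """poll_results: per steward, None if the steward is unreachable from
    the backup, else the steward's answer to 'is the primary reachable?'."""
    n = len(poll_results)
    # an 'exclusive' steward is one the backup reaches but the primary does not
    exclusive = sum(1 for r in poll_results if r is False)
    return exclusive > n // 2                          # exclusive majority

# 5 stewards: the backup reaches 4 of them, 3 of which report the primary gone.
assert primary_may_stay_alive([True, True, True, False, False])
assert backup_may_failover([False, False, False, True, None])
assert not backup_may_failover([False, False, True, True, None])  # only 2 exclusive
```

Requiring the backup's majority to be *exclusive* is what rules out split brain under these assumptions: the primary's majority and the backup's exclusive majority cannot both exist over the same steward set at once.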

Page 22:

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION

Remus (LAN): gratuitous ARP from the Backup Host.

SecondSite (WAN/Internet): BGP route update from the Backup Datacenter.

Needs support from upstream ISP(s) at both Datacenters.

IP migration achieved through BGP multi-homing.

Page 23:

NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION..

[Diagram: both sites advertise the same stub AS-64678 (134.87.3.0/24) upstream to BCNet (AS-271) via BGP multi-homing. The Primary Site in Vancouver (134.87.2.173/.174) and the Backup Site in Kamloops (207.23.255.237/.238) each host VMs and are linked by the replication channel. Traffic is routed to the Primary Site by advertising a shorter AS path there (as-path prepend 64678 64678) than at the Backup (as-path prepend 64678 64678 64678 64678); on failover, the Backup re-routes traffic to itself by advertising the shortest path (as-path prepend 64678).]
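The re-routing in the diagram relies purely on relative AS-path lengths. A hedged sketch of what the Cisco-style configuration might look like (the route-map name is an illustrative placeholder; the prepend strings follow the diagram):

```
! Primary Site (normal operation): shorter path, attracts inbound traffic
route-map TO-UPSTREAM permit 10
 set as-path prepend 64678 64678
!
! Backup Site (normal operation): longer path, traffic avoids it
route-map TO-UPSTREAM permit 10
 set as-path prepend 64678 64678 64678 64678
!
! Backup Site (after failover): shortest path, traffic re-routes here
route-map TO-UPSTREAM permit 10
 set as-path prepend 64678
```

Because the failover only shortens the backup's advertised path rather than withdrawing the prefix, established client connections can survive the route change once BGP converges.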

Page 24:

OVERVIEW

A Case for Commoditizing Disaster Tolerance

SecondSite – System Design

Evaluation & Experiences

Page 25:

EVALUATION

“I want periodic failovers with no downtime!”

“Did you run regression tests?”

“Failover works!!”

“More than one failure?”

“I will have to restart HA!”

Page 26:

RESTARTING HA

Need to resynchronize storage.

Avoiding service downtime requires online resynchronization.

Leverage DRBD – it resynchronizes only the blocks that have changed.

Integrate DRBD with Remus: add a checkpoint-based asynchronous disk replication protocol.
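The "only resynchronize blocks that have changed" behavior can be sketched with a dirty bitmap: while the peer is disconnected, every local write marks its block; on reconnect, only the marked blocks are copied. A toy sketch of the idea (this is the concept, not DRBD's actual bitmap format or protocol):

```python
class DirtyBitmapDisk:
    """Tracks which blocks changed while the replica was disconnected."""

    def __init__(self, num_blocks):
        self.blocks = [b"\x00"] * num_blocks
        self.dirty = set()          # block numbers written while disconnected
        self.connected = True

    def write(self, block_no, data):
        self.blocks[block_no] = data
        if not self.connected:
            self.dirty.add(block_no)   # remember for the later resync

    def resync(self, replica):
        """Online resync: copy only the blocks written while disconnected."""
        sent = 0
        for block_no in sorted(self.dirty):
            replica.blocks[block_no] = self.blocks[block_no]
            sent += 1
        self.dirty.clear()
        self.connected = True
        return sent                    # blocks copied, not the whole disk

primary = DirtyBitmapDisk(1024)
backup = DirtyBitmapDisk(1024)
primary.connected = False              # replication link goes down
primary.write(7, b"seven")
primary.write(42, b"forty-two")
copied = primary.resync(backup)
print(copied)                          # 2 blocks copied, not 1024
```

Because the resync cost scales with the number of dirtied blocks rather than the disk size, HA can be restarted after a failure without taking the service offline for a full copy.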

Page 27:

REGRESSION TESTS

Synthetic Workloads to stress test the Replication Pipeline

Failovers every 90 minutes

Discovered some interesting corner cases

Page-table corruptions in memory checkpoints

Write-after-write I/O ordering in disk replication

Page 28:

SECONDSITE – THE COMPLETE PICTURE

• Service downtime includes the timeout for failure detection (10s).
• The failure detection timeout is configurable.

4 VMs x 100 Clients/VM

Page 29:

REPLICATION BANDWIDTH CONSUMPTION

4 VMs x 100 Clients/VM

Page 30:

DEMO

Expect a real disaster (conference demos are not a good idea!)

Page 31:

APPLICATION THROUGHPUT VS. REPLICATION LATENCY

[Graph: application throughput (SPECWeb w/ 100 Clients) vs. replication latency, with the Kamloops link latency marked]

Page 32:

RESOURCE UTILIZATION VS. APPLICATION LOAD

Cost of HA as a function of Application Load (OLTP w/ 100 Clients)

[Graphs: Domain-0 CPU utilization; bandwidth usage on the Replication Channel]

Page 33:

RESYNCHRONIZATION DELAYS VS. OUTAGE PERIOD

OLTP Workload

Page 34:

The user creates a recovery plan, which is associated with one or more protection groups.

SETUP WORKFLOW – RECOVERY SITE

Source: VMWare Site Recovery Manager – Technical Overview

Page 35:

RECOVERY PLAN

[Diagram: ordered recovery steps: High Priority VM Shutdown → VM Shutdown → Prepare Storage → High Priority VM Recovery → Normal Priority VM Recovery → Low Priority VM Recovery]

Source: VMWare Site Recovery Manager – Technical Overview

