SECONDSITE: DISASTER TOLERANCE AS A SERVICE
Shriram Rajagopalan
Brendan Cully
Ryan O’Connor
Andrew Warfield
2
FAILURES IN A DATACENTER
3
TOLERATING FAILURES IN A DATACENTER
The initial idea behind Remus was to tolerate datacenter-level failures.
REMUS
4
CAN A WHOLE DATACENTER FAIL?
Yes! It’s a “Disaster”!
5
DISASTERS
Illustrative Image courtesy of TangoPango, Flickr.
“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.”
- Om Malik, GigaOM
“Truck driver in Texas kills all the websites you really use”
…Southlake FD found that he had low blood sugar
- valleywag.com
6
DISASTERS..
Water-main break cripples Dallas County computers, operations
The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays - keeping some prisoners in jail longer than normal.
- Dallas Morning News, Jun 2010
7
DISASTERS..
8
MORE FODDER BACK HOME
“An explosion … near our server bank … electrical box containing 580 fiber cables. … electrical box … was covered in asbestos … mandated the wearing of hazmat suits …. Worse yet, the dynamic rerouting — which is the hallmark of the internet … did not function. In other words, the perfect storm. Oh well. S*it happens.”
- Dan Empfield, Slowtwitch.com, a Gossamer Threads customer.
9
DISASTER RECOVERY – THE OLD FASHIONED WAY
Storage replication between a primary and backup site.
Manually restore physical servers from backup images.
Data Loss and Long Outage periods.
Expensive Hardware – Storage Arrays, Replicators, etc.
10
STATE OF THE ART DISASTER RECOVERY
[Figure: VMware Site Recovery Manager spanning a Protected Site and a Recovery Site, with array replication between datastore groups. VMs are online in the Protected Site while the Recovery Site's VMs sit offline; when the Protected Site's VMs become unavailable, the Recovery Site's VMs are powered on.]
Source: VMware Site Recovery Manager – Technical Overview
11
PROBLEMS WITH EXISTING SOLUTIONS
Data Loss & Service Disruption (RPO ~15 min, RTO ~a few hours)
Complicated Recovery Planning (e.g. service A needs to be up before B)
Application-Level Recovery
Bottom Line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.
12
DISASTER TOLERANCE AS A SERVICE?
Our Vision
13
OVERVIEW
A Case for Commoditizing Disaster Tolerance
SecondSite – System Design
Evaluation & Experiences
14
PRIMARY & BACKUP SITES
5ms RTT
15
FAILOVER & FAILBACK WITHOUT OUTAGE
[Figure: site roles across failover and failback: (Primary: Vancouver, Backup: Kamloops) → failover → (Primary: Kamloops) → resynchronization → (Primary: Kamloops, Backup: Vancouver)]
Complete State Recovery (CPU, disk, memory, network)
No Application Level Recovery
16
MAIN CONTRIBUTIONS
Remus (NSDI ’08): checkpoint-based state replication; fully transparent HA; recovery consistency; no application-level recovery.
RemusDB (VLDB ’11): optimizes server latency; reduces replication bandwidth by up to 80% using page delta compression and disk read tracking.
SecondSite (VEE ’12): failover arbitration in the wide area; stateful network failover over the wide area.
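Page delta compression can be illustrated with a short sketch. This is a hypothetical illustration, not the RemusDB code: it assumes 4 KB pages and a fixed 16-byte chunk granularity, and sends only the chunks of a dirty page that differ from the copy the backup already holds.

```python
# Hypothetical sketch of page delta compression (not the RemusDB
# implementation): instead of shipping a whole dirty page at each
# checkpoint, ship only the chunks that changed since the last one.

PAGE_SIZE = 4096

def delta_encode(old: bytes, new: bytes, chunk: int = 16):
    """Return (offset, data) runs where `new` differs from `old`."""
    runs = []
    for off in range(0, PAGE_SIZE, chunk):
        if old[off:off + chunk] != new[off:off + chunk]:
            runs.append((off, new[off:off + chunk]))
    return runs

def delta_apply(old: bytes, runs):
    """Rebuild the new page at the backup from its old copy plus runs."""
    page = bytearray(old)
    for off, data in runs:
        page[off:off + len(data)] = data
    return bytes(page)

old = bytes(PAGE_SIZE)                  # page as the backup last saw it
new = bytearray(old)
new[100:104] = b"\xde\xad\xbe\xef"      # the guest dirtied a few bytes
runs = delta_encode(old, bytes(new))
assert delta_apply(old, runs) == bytes(new)
```

With only four bytes dirtied, one 16-byte run is sent instead of 4 KB, which is the kind of saving behind the "up to 80%" bandwidth reduction claimed above.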
17
CONTRIBUTIONS..
18
FAILURE DETECTION IN REMUS
[Figure: Primary and Backup hosts on a LAN, each with two NICs; checkpoints flow over the dedicated replication NICs, and NIC1 also reaches the external network.]
• A pair of independent, dedicated NICs carries the replication traffic.
• The Backup declares a Primary failure only if:
  • it cannot reach the Primary via either NIC1 or NIC2, and
  • it can reach the external network via NIC1.
• Failure of the replication link alone results in Backup shutdown.
• Split brain occurs only when both NICs/links fail.
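The detection rules above amount to a small predicate. A hypothetical sketch (the probe results are assumed boolean heartbeat/ping outcomes; this is not the Remus implementation):

```python
# Hypothetical encoding of the Remus LAN failure-detection rules.
# Each argument is the result of a reachability probe in one round.

def backup_action(reach_primary_nic1: bool,
                  reach_primary_nic2: bool,
                  reach_external_nic1: bool) -> str:
    """Decide what the backup should do after a probe round."""
    if reach_primary_nic1 and reach_primary_nic2:
        return "keep-replicating"        # primary healthy on both links
    if not reach_primary_nic1 and not reach_primary_nic2:
        if reach_external_nic1:
            return "declare-primary-failed"   # primary down, network up
        return "shutdown"                # backup isolated: avoid split brain
    return "shutdown"                    # replication link alone failed

assert backup_action(False, False, True) == "declare-primary-failed"
assert backup_action(True, True, False) == "keep-replicating"
assert backup_action(True, False, True) == "shutdown"
```

Note the split-brain hazard the slide mentions: if both links between the hosts fail while both hosts stay up, this predicate still (wrongly) declares the primary failed.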
19
FAILURE DETECTION IN WIDE AREA DEPLOYMENTS
Cannot distinguish between link and node failure.
Higher chance of split brain, since the network is no longer reliable.
[Figure: the same two-NIC Primary/Backup setup, but the Primary Datacenter and Backup Datacenter are now connected by a replication channel that crosses the Internet rather than a LAN.]
20
FAILOVER ARBITRATION
Local Quorum of Simple Reachability Detectors.
Stewards can be placed on third party clouds.
Google App Engine implementation with ~100 LoC.
Providers/users could supply other, more sophisticated implementations.
21
FAILOVER ARBITRATION..
[Figure: the Primary and Backup sites run quorum logic at each end of the replication stream and poll an a priori agreed set of five stewards (POLL 1–POLL 5). The Primary's rule: "I need majority to stay alive." The Backup's rule: "I need exclusive majority to failover." Crossed-out polls mark stewards a site cannot reach.]
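The two quorum rules can be sketched with a toy steward that grants itself to at most one site, which models the "exclusive majority" the backup needs. The lease-style `Steward` class is an illustrative assumption; the real stewards are simple reachability detectors, and the exclusivity logic lives in the sites' quorum protocol.

```python
# Hypothetical sketch of the steward-quorum rules (not SecondSite code).
# A steward "belongs" to whichever site claims it first, so at most one
# site can ever assemble a majority in a given round.

class Steward:
    """Grants itself exclusively to the first site that polls it."""
    def __init__(self):
        self.owner = None

    def poll(self, site: str) -> bool:
        if self.owner is None:
            self.owner = site
        return self.owner == site

def has_majority(stewards, site: str) -> bool:
    votes = sum(1 for s in stewards if s.poll(site))
    return votes > len(stewards) // 2

# Normal operation: the primary claims all five stewards, so it has the
# majority it needs to stay alive, and the backup cannot obtain the
# exclusive majority it needs to fail over.
stewards = [Steward() for _ in range(5)]
assert has_majority(stewards, "primary")
assert not has_majority(stewards, "backup")
```

If the primary site dies (a fresh round where it claims nothing), the backup wins all five stewards and is allowed to fail over; a network partition that splits the stewards 3/2 still yields at most one winner.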
22
NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION
Remus (LAN): gratuitous ARP from the backup host.
SecondSite (WAN/Internet): BGP route update from the backup datacenter.
Needs support from the upstream ISP(s) at both datacenters.
IP migration is achieved through BGP multi-homing.
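For the LAN case, a gratuitous ARP is just a broadcast ARP reply announcing that the service IP now lives at the backup's MAC, so switches and neighbors update their tables. A minimal stdlib-only sketch that builds such a frame (actually sending it would need a raw socket and root privileges; this is an illustration, not the Remus code):

```python
import struct

def gratuitous_arp(mac: bytes, ip: bytes) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP reply:
    'IP <ip> is now at MAC <mac>', broadcast to the whole segment."""
    # Ethernet header: dst = broadcast, src = our MAC, EtherType 0x0806 (ARP)
    eth = b"\xff" * 6 + mac + b"\x08\x06"
    # ARP header: hw=Ethernet(1), proto=IPv4(0x0800), hlen=6, plen=4, op=2 (reply)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    # Gratuitous: sender and target addresses are both ours
    arp += mac + ip + mac + ip
    return eth + arp

frame = gratuitous_arp(b"\x02\x00\x00\x00\x00\x01", bytes([10, 0, 0, 5]))
assert len(frame) == 42                  # 14-byte Ethernet + 28-byte ARP
assert frame[12:14] == b"\x08\x06"       # EtherType is ARP
```

In the WAN case no single broadcast domain exists, which is why SecondSite must instead withdraw/re-announce the prefix via BGP, as the next slide shows.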
23
NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION..
[Figure: BGP multi-homing through BCNet (AS-271). Both the Primary Site (Vancouver, 134.87.2.173, peer 134.87.2.174) and the Backup Site (Kamloops, 207.23.255.237, peer 207.23.255.238) announce the VMs' prefix as stub AS-64678 (134.87.3.0/24), with the replication channel running between the sites. Traffic is routed to the Primary Site, and re-routed to the Backup Site on failover, by varying the announced AS-path length: as-path prepend 64678; as-path prepend 64678 64678; as-path prepend 64678 64678 64678 64678.]
24
OVERVIEW
A Case for Commoditizing Disaster Tolerance
SecondSite – System Design
Evaluation & Experiences
25
“I want periodic failovers with no downtime!”
“Did you run regression tests?”
“Failover works!!”
“More than one failure?”
“I will have to restart HA!”
EVALUATION
26
RESTARTING HA
Need to Resynchronize Storage.
Avoiding Service Downtime requires Online Resynchronization.
Leverage DRBD – it only resynchronizes blocks that have changed.
Integrate DRBD with Remus: add a checkpoint-based asynchronous disk replication protocol.
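The checkpoint-based disk protocol can be sketched as follows. This is a hypothetical model, not the actual DRBD/Remus integration: the backup buffers each epoch's writes and applies them only when the matching memory checkpoint commits, so a failover always resumes from a disk state consistent with the last committed checkpoint.

```python
# Hypothetical sketch of checkpoint-based asynchronous disk replication.
# The backup never applies a write until the checkpoint that produced it
# has committed, keeping disk and memory state mutually consistent.

class BackupDisk:
    def __init__(self, nblocks: int):
        self.disk = [b"\x00"] * nblocks   # last committed, consistent state
        self.pending = []                 # writes of the in-flight epoch

    def receive_write(self, block: int, data: bytes):
        self.pending.append((block, data))   # buffer only; don't touch disk

    def checkpoint_commit(self):
        for block, data in self.pending:     # epoch complete: apply it
            self.disk[block] = data
        self.pending = []

    def failover(self):
        self.pending = []                 # discard the incomplete epoch
        return self.disk                  # resume from the last checkpoint

bd = BackupDisk(4)
bd.receive_write(0, b"A"); bd.checkpoint_commit()   # epoch 1 commits
bd.receive_write(1, b"B")                           # epoch 2 in flight
disk = bd.failover()                                # primary dies mid-epoch
assert disk[0] == b"A" and disk[1] == b"\x00"       # epoch 2 discarded
```

On restart of HA, only the blocks written since the sites diverged need resynchronizing, which is exactly the dirty-block tracking DRBD provides.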
27
REGRESSION TESTS
Synthetic Workloads to stress test the Replication Pipeline
Failovers every 90 minutes
Discovered some interesting corner cases
Page-table corruptions in memory checkpoints
Write-after-write I/O ordering in disk replication
28
SECONDSITE – THE COMPLETE PICTURE
• Service Downtime includes the timeout for failure detection (10 s).
• The Failure Detection Timeout is configurable.
4 VMs x 100 Clients/VM
29
REPLICATION BANDWIDTH CONSUMPTION
4 VMs x 100 Clients/VM
30
DEMO
Expect a real disaster (conference demos are not a good idea!)
31
APPLICATION THROUGHPUT VS. REPLICATION LATENCY
SPECWeb w/ 100 Clients
Kamloops
32
RESOURCE UTILIZATION VS. APPLICATION LOAD
Domain-0 CPU Utilization Bandwidth usage on Replication Channel
Cost of HA as a function of Application Load (OLTP w/ 100 Clients)
33
RESYNCHRONIZATION DELAYS VS. OUTAGE PERIOD
OLTP Workload
34
The user creates a recovery plan, which is associated with one or more protection groups.
SETUP WORKFLOW – RECOVERY SITE
Source: VMware Site Recovery Manager – Technical Overview
35
RECOVERY PLAN
[Figure: recovery plan steps: High Priority VM Shutdown, VM Shutdown, Prepare Storage, High Priority VM Recovery, Normal Priority VM Recovery, Low Priority VM Recovery]
Source: VMware Site Recovery Manager – Technical Overview