SecondSite: Disaster Tolerance as a Service
Shriram Rajagopalan, Brendan Cully, Ryan O'Connor, Andrew Warfield


TRANSCRIPT

Slide 1: SecondSite: Disaster Tolerance as a Service
Shriram Rajagopalan, Brendan Cully, Ryan O'Connor, Andrew Warfield

Slide 2: Failures in a Datacenter

Slide 3: Tolerating Failures in a Datacenter
The initial idea behind Remus was to tolerate datacenter-level failures.

Slide 4: Can a Whole Datacenter Fail?
Yes! It's a disaster!

Slide 5: Disasters
[Illustrative image courtesy of TangoPango, Flickr.]
"Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap." - Om Malik, GigaOM
"Truck driver in Texas kills all the websites you really use ... Southlake FD found that he had low blood sugar." - valleywag.com

Slide 6: Disasters..
"Water-main break cripples Dallas County computers, operations ... The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays - keeping some prisoners in jail longer than normal." - Dallas Morning News, Jun 2010

Slide 7: Disasters..

Slide 8: More Fodder Back Home
"An explosion near our server bank ... electrical box containing 580 fiber cables ... electrical box was covered in asbestos ... mandated the wearing of hazmat suits. ... Worse yet, the dynamic rerouting which is the hallmark of the internet did not function. In other words, the perfect storm. Oh well. S*it happens." - Dan Empfield, Slowswitch.com, a Gossamer Threads customer

Slide 9: Disaster Recovery the Old-Fashioned Way
- Storage replication between a primary and a backup site.
- Manually restore physical servers from backup images.
- Data loss and long outage periods.
- Expensive hardware: storage arrays, replicators, etc.

Slide 10: State of the Art Disaster Recovery
[Diagram: a protected site and a recovery site, each running VirtualCenter and Site Recovery Manager; datastore groups are copied between sites via array replication. When VMs become unavailable at the protected site, they are powered on at the recovery site.]
Source: VMware Site Recovery Manager Technical Overview

Slide 11: Problems with Existing Solutions
- Data loss & service disruption (RPO ~15 min, RTO ~a few hours).
- Complicated recovery planning (e.g. service A needs to be up before service B, etc.).
- Application-level recovery.
Bottom line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.

Slide 12: Disaster Tolerance as a Service?
Our vision.

Slide 13: Overview
- A case for commoditizing disaster tolerance
- SecondSite system design
- Evaluation & experiences

Slide 14: Primary & Backup Sites
5 ms RTT between the sites.

Slide 15: Failover & Failback without Outage
- Primary site: Vancouver; backup site: Kamloops. After failover, primary site: Kamloops; backup site: Vancouver.
- Complete state recovery (CPU, disk, memory, network).
- No application-level recovery.

Slide 16: Main Contributions
- Remus (NSDI '08): checkpoint-based state replication; fully transparent HA; recovery consistency; no application-level recovery.
- RemusDB (VLDB '11): optimizes server latency; reduces replication bandwidth by up to 80% using page delta compression; disk read tracking.
- SecondSite (VEE '12): failover arbitration in the wide area; stateful network failover over the wide area.

Slide 17: Contributions..

Slide 18: Failure Detection in Remus
[Diagram: primary and backup hosts on a LAN, each with NIC1 and NIC2; checkpoints flow from primary to backup, and both hosts connect to the external network.]
- A pair of independent, dedicated NICs carries replication traffic.
- The backup declares a primary failure only if it cannot reach the primary via either NIC1 or NIC2, and it can reach the external network via NIC1.
- Failure of the replication link alone results in backup shutdown.
- Split brain occurs only when both NICs/links fail.
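The decision rule on Slide 18 is easy to sketch. The Python below is not part of Remus; it is a minimal illustration of the stated rule, assuming placeholder interface names, placeholder PRIMARY/GATEWAY addresses, and a single ping as a stand-in for whatever heartbeat the replication engine actually uses.

```python
import subprocess

PRIMARY = "10.0.0.1"    # placeholder: address of the primary host
GATEWAY = "10.0.0.254"  # placeholder: an external-network target reachable via NIC1

def reachable(nic: str, target: str, timeout_s: int = 1) -> bool:
    """Probe `target` out of a specific NIC (Linux ping as a stand-in heartbeat)."""
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), "-I", nic, target]
    return subprocess.run(cmd, stdout=subprocess.DEVNULL).returncode == 0

def backup_decision(checkpoints_stalled: bool) -> str:
    """The backup's view of the rule on Slide 18, evaluated when checkpoints stop."""
    if not checkpoints_stalled:
        return "keep replicating"

    primary_up = reachable("eth0", PRIMARY) or reachable("eth1", PRIMARY)
    external_ok = reachable("eth0", GATEWAY)

    if not primary_up and external_ok:
        return "declare primary failed and take over"
    # Either the replication link alone has failed, or the backup itself is
    # cut off; in both cases it shuts down rather than risk split brain.
    return "shut down"
```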
Slide 19: Failure Detection in Wide-Area Deployments
- Cannot distinguish between link and node failure.
- Higher chance of split brain, since the network is no longer reliable.
[Diagram: primary and backup datacenters, each a LAN like the one above, connected across the Internet/WAN by the replication channel.]

Slide 20: Failover Arbitration
- Local quorum of simple reachability detectors ("stewards").
- Stewards can be placed on third-party clouds.
- Google App Engine implementation in ~100 LoC.
- The provider or user could substitute other, more sophisticated implementations.

Slide 21: Failover Arbitration..
[Diagram: the primary and backup sites each poll stewards 1 through 5; the steward set is agreed upon a priori. The primary needs a majority of stewards to stay alive; the backup needs an exclusive majority to fail over.]
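The arbitration rule on Slides 20-21 can also be sketched in a few lines. This is one plausible reading of the slides rather than the paper's exact protocol: the steward URLs, the JSON response shape, and the 10-second detection window are all assumptions, and a real steward would be the small App Engine application mentioned on Slide 20.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical steward endpoints; each is assumed to answer a GET with JSON like
# {"last_seen": {"primary": <seconds since last poll>, "backup": <seconds>}}.
STEWARDS = [f"https://steward{i}.example.com/status" for i in range(1, 6)]
MAJORITY = len(STEWARDS) // 2 + 1   # 3 of 5
WINDOW_S = 10                       # assumed failure-detection window

def poll(url: str) -> Optional[dict]:
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return json.load(resp)
    except OSError:
        return None                 # steward unreachable from this site

def primary_may_stay_alive() -> bool:
    """Primary's rule: keep running only while a majority of stewards is reachable."""
    return sum(poll(u) is not None for u in STEWARDS) >= MAJORITY

def backup_may_fail_over() -> bool:
    """Backup's rule: an *exclusive* majority, i.e. stewards the backup can reach
    that have not heard from the primary within the detection window."""
    exclusive = 0
    for u in STEWARDS:
        status = poll(u)
        if status and status["last_seen"].get("primary", 1e9) > WINDOW_S:
            exclusive += 1
    return exclusive >= MAJORITY
```

The intent of the exclusivity condition is that a steward counted toward the backup's quorum cannot also be counted toward the primary's, so at most one site can satisfy its rule at a time.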
Slide 22: Network Failover without Service Interruption
- Remus (LAN): gratuitous ARP from the backup host.
- SecondSite (WAN/Internet): BGP route update from the backup datacenter.
- Needs support from the upstream ISP(s) at both datacenters.
- IP migration is achieved through BGP multi-homing.

Slide 23: Network Failover without Service Interruption..
[Diagram: VMs in a stub network, AS-64678 (134.87.3.0/24), multi-homed to BCNet (AS-271) via Vancouver (134.87.2.173/.174) and Kamloops (207.23.255.237/.238). The backup site's announcement carries AS-path prepending (as-path prepend 64678 ...), so traffic normally routes to the primary site; on failover, the announcements are adjusted to re-route traffic to the backup site.]

Slide 24: Overview
- A case for commoditizing disaster tolerance
- SecondSite system design
- Evaluation & experiences

Slide 25: Evaluation
"I want periodic failovers with no downtime!" "Did you run regression tests?" "Failover works!!" "More than one failure? I will have to restart HA!"

Slide 26: Restarting HA
- Need to resynchronize storage; avoiding service downtime requires online resynchronization.
- Leverage DRBD: it only resynchronizes blocks that have changed.
- Integrate DRBD with Remus: add a checkpoint-based asynchronous disk replication protocol.

Slide 27: Regression Tests
- Synthetic workloads to stress-test the replication pipeline, with failovers every 90 minutes.
- Discovered some interesting corner cases: page-table corruptions in memory checkpoints; write-after-write I/O ordering in disk replication.

Slide 28: SecondSite: The Complete Picture
- Service downtime includes the failure-detection timeout (10 s); the timeout is configurable.
- 4 VMs x 100 clients/VM.

Slide 29: Replication Bandwidth Consumption
4 VMs x 100 clients/VM.

Slide 30: Demo
Expect a real disaster (conference demos are not a good idea!).

Slide 31: Application Throughput vs. Replication Latency
SPECweb with 100 clients (Kamloops).

Slide 32: Resource Utilization vs. Application Load
Domain-0 CPU utilization and bandwidth usage on the replication channel: the cost of HA as a function of application load (OLTP with 100 clients).

Slide 33: Resynchronization Delays vs. Outage Period
OLTP workload.

Slide 34: Setup Workflow: Recovery Site
The user creates a recovery plan, which is associated with one or more protection groups.
Source: VMware Site Recovery Manager Technical Overview

Slide 35: Recovery Plan
[Diagram: recovery plan steps, including high-priority VM shutdown, storage preparation, and high-, normal-, and low-priority VM recovery.]
Source: VMware Site Recovery Manager Technical Overview
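Slides 34-35 (and the "service A before B" point on Slide 11) describe the ordered recovery planning that SecondSite sidesteps by recovering complete VM state. For contrast, a dependency-ordered plan is easy to express generically; the service names and dependency map below are invented for illustration and are not tied to VMware's SRM API.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services that must be
# online before it can be brought up at the recovery site.
DEPENDS_ON = {
    "database":  [],
    "app":       ["database"],
    "web":       ["app"],
    "reporting": ["database"],
}

def recovery_order(deps: dict) -> list:
    """Return one valid power-on order for the recovery site."""
    return list(TopologicalSorter(deps).static_order())

if __name__ == "__main__":
    print(recovery_order(DEPENDS_ON))
    # e.g. ['database', 'app', 'reporting', 'web']
```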