When Disaster Strikes the Cloud: Who, What, When, Where and How to Recover


Upload: sean-cohen

Post on 15-Jul-2015

TRANSCRIPT

Page 1: When disaster strikes the cloud:  Who, what, when, where and how to recover

Accelerating Enterprise OpenStack

When Disaster Strikes the Cloud
Who, What, When, Where and How to Recover

Michael Factor, IBM Research - Haifa, [email protected]
Ronen Kat, IBM Research - Haifa, [email protected]
Sean Cohen, Red Hat, [email protected]

Page 2: When disaster strikes the cloud:  Who, what, when, where and how to recover

Talk Outline

- What is disaster recovery?
- Concepts and basics
- Protecting data and applications from disasters
  - OpenStack Cinder toolbox for disaster recovery
  - Applications are more than just data
- The road ahead: Kilo and beyond

Page 3: When disaster strikes the cloud:  Who, what, when, where and how to recover

What is Disaster Recovery?

According to Wikipedia, Disaster Recovery (DR) is "the process, policies and procedures . . . for recovery . . . of technology infrastructure . . . after a natural or human-induced disaster."

[Figure labels: Servers, Storage, Network, Software, Configuration]

Surviving a disaster requires geographic dispersion

Page 4: When disaster strikes the cloud:  Who, what, when, where and how to recover

Recovery Point Objective and Recovery Time Objective

- Recovery Point Objective (RPO): how far back in time a disaster takes one
- Recovery Time Objective (RTO): how long until operational after a disaster

[Figure: timeline with the disaster at 0; the RPO and RTO axes are each marked in seconds, minutes, hours, days and weeks; techniques placed along the timeline: Replication, Backup restore, Active site, Hot site]
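The two objectives can be made concrete with a little date arithmetic. The following is a minimal sketch, not part of the talk: the function names and all timestamps are invented for illustration, and it assumes the achieved recovery point is the gap between the last protected copy and the disaster.

```python
from datetime import datetime, timedelta


def achieved_recovery_point(last_protected_copy: datetime, disaster: datetime) -> timedelta:
    """How far back in time the disaster takes us (data written in this window is lost)."""
    return disaster - last_protected_copy


def achieved_recovery_time(disaster: datetime, service_restored: datetime) -> timedelta:
    """How long the service was down after the disaster."""
    return service_restored - disaster


def meets_objectives(recovery_point: timedelta, recovery_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True when both the RPO and the RTO are satisfied."""
    return recovery_point <= rpo and recovery_time <= rto


# Hypothetical nightly-backup scenario; every timestamp here is made up.
last_backup = datetime(2015, 1, 10, 2, 0)
disaster = datetime(2015, 1, 10, 14, 30)
restored = datetime(2015, 1, 10, 20, 30)

recovery_point = achieved_recovery_point(last_backup, disaster)
recovery_time = achieved_recovery_time(disaster, restored)
```

With these made-up numbers, a nightly backup satisfies an RPO of one day but not an RPO of one hour, which is the kind of gap that replication is meant to close.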

Page 5: When disaster strikes the cloud:  Who, what, when, where and how to recover

Data and Metadata Consistency

- Data consistency
  - If a modified datum is available, all data it depends upon is also available
- Metadata consistency
  - Configuration updates are seen in the same order relative to one another and to data updates

[Figure labels: Application VM with DB and LOG volumes; DB and LOG copies at the Remote Site]
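The data-consistency rule can be checked mechanically once dependencies are written down. This is a minimal sketch under invented assumptions: the datum names and the dependency map are hypothetical, standing in for a database page that depends on its log record.

```python
def is_data_consistent(available, depends_on):
    """True if every available datum has all of its dependencies available too."""
    return all(
        dependency in available
        for datum in available
        for dependency in depends_on.get(datum, ())
    )


# Hypothetical dependency: the updated DB page relies on its log record.
depends_on = {"db_page_7_v2": {"log_record_41"}}

consistent_replica = {"log_record_41", "db_page_7_v2"}  # both arrived
log_only_replica = {"log_record_41"}                    # update not yet visible, still fine
torn_replica = {"db_page_7_v2"}                         # update visible without its log record
```

Only the last replica violates the rule: the modified datum is available at the remote site but the data it depends upon is not.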

Page 6: When disaster strikes the cloud:  Who, what, when, where and how to recover

OpenStack Cloud Metadata

- Virtual networks between the cloud VMs
- External network access
- Attached volumes
- Volume types
- Virtual machine flavors
- SSH keys for VM access
- Virtual machine images
- Identities of users

Page 7: When disaster strikes the cloud:  Who, what, when, where and how to recover

Protecting Data and Applications from Disasters

Page 8: When disaster strikes the cloud:  Who, what, when, where and how to recover

Data Protection: Cinder Backup and Restore

- Cinder backup
  - Back up a volume to backup storage

[Figure labels: Primary Cloud, backup-create, Swift]

Page 9: When disaster strikes the cloud:  Who, what, when, where and how to recover

Data Protection: Cinder Backup and Restore

- Can Cinder restore on the secondary cloud?
- Problem: Cinder on the secondary cloud is not aware of the backup

[Figure labels: Primary Cloud, Secondary Cloud, Swift, backup-restore]

Page 10: When disaster strikes the cloud:  Who, what, when, where and how to recover

Data Protection: Cinder Backup and Restore

- Solution: "electronic tape shipping"
  - backup-export
  - backup-import
- Supported in Cinder since Icehouse

[Figure labels: Primary Cloud, backup-export, Backup reference, backup-import, Secondary Cloud, Swift]

Page 11: When disaster strikes the cloud:  Who, what, when, where and how to recover

Data Protection: Cinder Backup and Restore

- After backup-import, Cinder can restore on the secondary cloud
  - backup-restore

[Figure labels: Primary Cloud, Secondary Cloud, Swift, backup-restore]
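The backup, export, import and restore sequence from the last four slides can be summarized as the commands run on each cloud. This is a hedged sketch that only builds argument lists for the legacy `cinder` command-line client and runs nothing. All IDs and the backup URL are placeholders, and the ID to restore on the secondary cloud is taken as a separate parameter because it should be read from the output of `backup-import` rather than assumed.

```python
def backup_create(volume_id):
    """Primary cloud: back up a volume to backup storage."""
    return ["cinder", "backup-create", volume_id]


def backup_export(backup_id):
    """Primary cloud: print the backup reference (backup service and backup URL)."""
    return ["cinder", "backup-export", backup_id]


def backup_import(backup_service, backup_url):
    """Secondary cloud: register the exported backup reference with the local Cinder."""
    return ["cinder", "backup-import", backup_service, backup_url]


def backup_restore(backup_id):
    """Secondary cloud: restore the now-known backup to a volume."""
    return ["cinder", "backup-restore", backup_id]


def tape_shipping_plan(volume_id, backup_id, backup_service, backup_url, imported_backup_id):
    """Group the commands by the cloud on which they would be run."""
    return {
        "primary": [backup_create(volume_id), backup_export(backup_id)],
        "secondary": [backup_import(backup_service, backup_url),
                      backup_restore(imported_backup_id)],
    }


# Placeholder values for illustration only.
plan = tape_shipping_plan(
    volume_id="VOLUME_ID",
    backup_id="BACKUP_ID",
    backup_service="BACKUP_SERVICE_FROM_EXPORT",
    backup_url="BACKUP_URL_FROM_EXPORT",
    imported_backup_id="IMPORTED_BACKUP_ID",
)
```

Only the small backup reference is shipped between clouds; the backup data itself stays in the object store, which both clouds must be able to reach.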

Page 12: When disaster strikes the cloud:  Who, what, when, where and how to recover

Data Protection: Cinder Volume Replication

- Cinder has initial support for volume replication in the Juno release
- Cinder back ends can "advertise" support for replication
- A volume created with the replication extra spec is allocated on a back end that supports replication, and is replicated
- Supporting back ends:
  - IBM Storwize; more expected in Kilo

Volume-type extra specs: "capabilities:replication <is> True"

[Figure labels: Cinder back-end, Cinder back-end]
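The matching of a volume type's extra spec against advertised back-end capabilities can be pictured with a small filter. This is a simplified model written for illustration, not Cinder's scheduler code: it understands only the `<is>` operator, and the back-end names and capability dictionaries are invented.

```python
def matches_is_spec(requirement, reported):
    """Evaluate an '<is> True' or '<is> False' requirement against a reported capability."""
    operator, _, operand = requirement.partition(" ")
    if operator != "<is>":
        raise ValueError("this sketch only understands the <is> operator")
    wanted = operand.strip().lower() == "true"
    return bool(reported) is wanted


def backends_for(extra_specs, backends):
    """Return the names of back ends whose capabilities satisfy every 'capabilities:' spec."""
    prefix = "capabilities:"
    chosen = []
    for name, capabilities in backends.items():
        if all(
            matches_is_spec(requirement, capabilities.get(key[len(prefix):], False))
            for key, requirement in extra_specs.items()
            if key.startswith(prefix)
        ):
            chosen.append(name)
    return chosen


# Hypothetical back ends and the capabilities they advertise.
backends = {
    "storwize-1": {"replication": True},
    "lvm-1": {"replication": False},
    "generic-2": {},  # advertises nothing about replication
}
extra_specs = {"capabilities:replication": "<is> True"}

eligible = backends_for(extra_specs, backends)
```

A back end that does not advertise the capability at all is treated here as not supporting it, which is an assumption of the sketch.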

Page 13: When disaster strikes the cloud:  Who, what, when, where and how to recover

Data Protection: Cinder Volume Replication

- A secondary volume can become primary when promoted
  - replication-promote
- Replication can be reversed following a replication-promote
  - replication-reenable

[Figure labels: Cinder back-end, Cinder back-end]
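The promote and re-enable steps can be read as a small state machine over a replicated pair. The class below is a toy model based only on the two bullets above; the class name, field names and site names are invented, and real back-end drivers may behave differently.

```python
from dataclasses import dataclass


@dataclass
class ReplicatedPair:
    """Toy model of a replicated volume: which side is primary, and is replication running."""
    primary: str
    secondary: str
    replicating: bool = True

    def promote(self):
        """The secondary becomes primary; replication stays off until re-enabled."""
        self.primary, self.secondary = self.secondary, self.primary
        self.replicating = False

    def reenable(self):
        """Resume replication, now flowing from the new primary to the old one."""
        self.replicating = True


# Hypothetical site names.
pair = ReplicatedPair(primary="site-a", secondary="site-b")
pair.promote()    # disaster at site-a: site-b takes over
pair.reenable()   # site-a is back: replicate in the reverse direction
```

After both calls the roles are swapped and replication is running again, which is the "reversed" replication the slide refers to.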

Page 14: When disaster strikes the cloud:  Who, what, when, where and how to recover

Consistency Groups

- New in Juno
- Support for volume grouping for consistency
- Grouping of volumes is based on the volume type
- Currently supports:
  - Consistency group snapshots
- Needs to be extended to support:
  - Cinder backup
  - Cinder volume replication

[Figure labels: DB, LOG]
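Why a database volume and its log volume need to be snapshotted together can be shown with plain lists standing in for volumes. This is an illustrative sketch with invented names; it does not use the Cinder API.

```python
def snapshot(blocks):
    """Point-in-time copy of a single volume."""
    return list(blocks)


def group_snapshot(volumes):
    """Point-in-time copy of every volume in the group, taken at the same instant."""
    return {name: list(blocks) for name, blocks in volumes.items()}


def commit(volumes, transaction):
    """The application writes the log record first, then the database update."""
    volumes["LOG"].append(transaction)
    volumes["DB"].append(transaction)


volumes = {"DB": [], "LOG": []}
commit(volumes, "txn-1")

# Per-volume snapshots taken at different moments: a transaction slips in between.
log_snapshot = snapshot(volumes["LOG"])
commit(volumes, "txn-2")
db_snapshot = snapshot(volumes["DB"])

# Group snapshot: both volumes are captured at one point in time.
group = group_snapshot(volumes)
```

The separate snapshots disagree (the DB copy contains `txn-2`, the LOG copy does not), while the group snapshot holds the same transactions on both volumes.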

Page 15: When disaster strikes the cloud:  Who, what, when, where and how to recover

Protecting Applications from Disasters

[Figure labels: Servers, Storage, Network, Software, Configuration; Disaster Recovery Orchestration]

Page 16: When disaster strikes the cloud:  Who, what, when, where and how to recover

OpenStack Tools

- Applications are defined in OpenStack by
  - Heat Orchestration Templates
- However:
  - Not all applications are template based
  - Deployments (including configuration) change over time
  - Some definitions are cloud specific, e.g., networks, types
  - Heat templates and stacks don't stay consistent
- There are tools that can create a template from a deployment, e.g., Flame, ReHeat
  - But the template will only fit the current cloud
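One way to picture the "template only fits the current cloud" problem is a pass that rewrites cloud-specific values before the template is used on the recovery cloud. The sketch below is a hypothetical helper, not a feature of Heat, Flame or ReHeat; the template is a plain dictionary and the network and flavor names in the mapping are invented.

```python
def remap_template(template, id_map):
    """Return a copy of the template with cloud-specific string values replaced."""
    def remap(value):
        if isinstance(value, dict):
            return {key: remap(item) for key, item in value.items()}
        if isinstance(value, list):
            return [remap(item) for item in value]
        if isinstance(value, str):
            return id_map.get(value, value)
        return value

    return remap(template)


# Hypothetical template generated on the primary cloud.
primary_template = {
    "heat_template_version": "2013-05-23",
    "resources": {
        "app_server": {
            "type": "OS::Nova::Server",
            "properties": {
                "flavor": "primary.large",
                "image": "app-image",
                "networks": [{"network": "primary-net-id"}],
            },
        }
    },
}

# Hypothetical mapping from primary-cloud names to their recovery-cloud equivalents.
id_map = {"primary-net-id": "secondary-net-id", "primary.large": "secondary.large"}

secondary_template = remap_template(primary_template, id_map)
```

The mapping itself is the hard part in practice: someone, or some tool, has to know which network and flavor on the recovery cloud correspond to those on the primary.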

Page 17: When disaster strikes the cloud:  Who, what, when, where and how to recover

OpenStack Tools and Beyond

- Demo: A technology preview for disaster recovery with IBM Cloud Manager

Page 18: When disaster strikes the cloud:  Who, what, when, where and how to recover

THE ROAD AHEAD

Page 19: When disaster strikes the cloud:  Who, what, when, where and how to recover

Ceph Multi-Site & Disaster Recovery (Block) example

- Export snapshots to geographically dispersed data centers
  - Provides disaster recovery
- Export incremental snapshots
  - Minimizes network bandwidth by sending only the changes
- Kilo cycle focus: extending the multi-site and disaster recovery options
  - RBD mirroring
  - Cinder volume replication
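The bandwidth saving from incremental snapshots comes from shipping only the blocks that differ between two snapshots. The following is a minimal model with lists of blocks standing in for images; it is not Ceph code, the function names are invented, and it assumes images only grow or change in place.

```python
def diff_snapshots(old, new):
    """Return {block_index: data} for blocks that changed or were added since `old`."""
    return {
        index: data
        for index, data in enumerate(new)
        if index >= len(old) or old[index] != data
    }


def apply_diff(replica, diff):
    """Apply an incremental diff to the remote copy and return the updated copy."""
    result = list(replica)
    for index in sorted(diff):
        if index < len(result):
            result[index] = diff[index]
        else:
            result.append(diff[index])
    return result


# Hypothetical snapshots of one image; the remote site already holds snapshot_1.
snapshot_1 = ["a", "b", "c", "d"]
snapshot_2 = ["a", "B", "c", "d", "e"]

changes = diff_snapshots(snapshot_1, snapshot_2)
remote_copy = apply_diff(snapshot_1, changes)
```

Two blocks cross the wire instead of five, and the remote copy ends up identical to the second snapshot.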

Page 20: When disaster strikes the cloud:  Who, what, when, where and how to recover

Ceph Multi-Site & Disaster Recovery (Object) example

- Zones and region support
  - Deploy topologies similar to S3 and others with a global namespace
- Data center synchronization
  - Back up full or partial sets of data between regions
- Read affinity
  - Serve local copies of data to local users

Page 21: When disaster strikes the cloud:  Who, what, when, where and how to recover

Disaster Recovery as a Service Catalog

- Pluggable disaster recovery policies
- Replication targets can specify different RPO/RTO levels, offered according to the capabilities of the supported back ends
- Disaster recovery policies:
  - Active - Cold standby
  - Active - Hot standby
  - Active - Active (requires application awareness and transaction integrity)
  - Backup to Cloud / From the Cloud

Page 22: When disaster strikes the cloud:  Who, what, when, where and how to recover

Extending Heat Orchestration for Disaster Recovery

- Heat can be used to automate disaster recovery
  - Add support for Cinder replication
- Consistency groups need to span OpenStack projects
  - Nova, Cinder, Trove, ...
- Stack snapshot: backup / rollback
- Enable customization of workload components at the recovery site
  - Networks, VM configuration changes, guest agent, etc.

Page 23: When disaster strikes the cloud:  Who, what, when, where and how to recover

The Road Toward Application Consistency

First phase: File system consistency

- Integrate into OpenStack to allow consistent snapshots and backups
  - Nova needs to request the QEMU Guest Agent to freeze the file systems (and applications, if fsfreeze-hook is installed) during the snapshot
- Patches have been proposed for Nova and Cinder, targeting the Kilo release

Source: Hitachi
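The freeze, snapshot, thaw ordering can be sketched as a context manager that always thaws, even when the snapshot fails. This is not the proposed Nova patch: the transport to the guest agent and the snapshot call are caller-supplied functions invented for this sketch, and only the `guest-fsfreeze-freeze` and `guest-fsfreeze-thaw` command names come from the QEMU Guest Agent protocol.

```python
from contextlib import contextmanager


@contextmanager
def frozen_filesystems(send_to_guest_agent):
    """Freeze guest file systems for the duration of the block, then always thaw."""
    send_to_guest_agent({"execute": "guest-fsfreeze-freeze"})
    try:
        yield
    finally:
        send_to_guest_agent({"execute": "guest-fsfreeze-thaw"})


def consistent_snapshot(send_to_guest_agent, take_snapshot):
    """Take a snapshot while the guest file systems are frozen."""
    with frozen_filesystems(send_to_guest_agent):
        return take_snapshot()


# Stand-ins for a real guest-agent channel and a real snapshot call.
events = []


def fake_guest_agent(command):
    events.append(command["execute"])


def fake_snapshot():
    events.append("snapshot")
    return "snapshot-1"


snapshot_id = consistent_snapshot(fake_guest_agent, fake_snapshot)
```

The `finally` clause matters: a guest left frozen after a failed snapshot would hang every write inside the VM.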

Page 24: When disaster strikes the cloud:  Who, what, when, where and how to recover

The Road Toward Application Consistency

Next phase: Consistency at the application level

- Application-aware on Windows with VSS support in qemu-ga
  - Application notification via Microsoft Volume Shadow Copy Service (VSS)
- Application-aware on Linux using qemu-ga hooks
  - Application-consistent snapshots can be created with scripts interacting with the QEMU guest agent
  - The scripts can notify applications to flush their data
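A hook script on Linux can be as small as a dispatcher on the action it is given. The sketch below assumes the hook is invoked with `freeze` or `thaw` as its first argument, which is how the fsfreeze hook mechanism is commonly described; the flush and resume functions are placeholders for whatever the application actually needs.

```python
def flush_application():
    """Placeholder: ask the application to flush its data and pause writes."""
    return "flushed"


def resume_application():
    """Placeholder: let the application resume normal writes."""
    return "resumed"


def main(argv, flush=flush_application, resume=resume_application):
    """Dispatch on the hook action; return a process exit status (0 means success)."""
    action = argv[1] if len(argv) > 1 else ""
    if action == "freeze":
        flush()
        return 0
    if action == "thaw":
        resume()
        return 0
    return 1  # unknown or missing action


# Simulated invocations; the script name is hypothetical.
freeze_status = main(["app-hook", "freeze"])
thaw_status = main(["app-hook", "thaw"])
```

A deployed hook would end with `sys.exit(main(sys.argv))` so that a failed flush is reported back through the exit status.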

Page 25: When disaster strikes the cloud:  Who, what, when, where and how to recover

Disaster Recovery at Scale

- The holy grail of site evacuation is an automatic, planned migration of workloads and data from one cloud-scale data center to another
- New OpenStack HA approaches to help recovery from infrastructure failures:
  - Leveraging Pacemaker to provide automated detection of a failed hypervisor and recovery of the VMs that were running there
  - Evacuating an instance to a scheduled host was added in Juno
  - A simple tagging API for instances in Nova was accepted for the Kilo release
    - Can support a new automatic-recovery tag
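Combining failure detection with instance tags could look like the selection step below. This is a speculative sketch: the tag name `automatic-recovery`, the instance records and the host names are all invented, and nothing here calls Pacemaker or Nova.

```python
def instances_to_evacuate(instances, failed_host, tag="automatic-recovery"):
    """Names of instances on the failed hypervisor that opted in to automatic recovery."""
    return [
        instance["name"]
        for instance in instances
        if instance["host"] == failed_host and tag in instance["tags"]
    ]


# Hypothetical inventory.
instances = [
    {"name": "web-1", "host": "compute-3", "tags": {"automatic-recovery"}},
    {"name": "batch-7", "host": "compute-3", "tags": set()},
    {"name": "db-1", "host": "compute-5", "tags": {"automatic-recovery"}},
]

to_recover = instances_to_evacuate(instances, failed_host="compute-3")
```

Only the tagged instance on the failed host is selected; untagged workloads and instances on healthy hosts are left alone.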


Page 26: When disaster strikes the cloud:  Who, what, when, where and how to recover

OpenStack Documentation needs to catch up...

- Join the OpenStack Disaster Recovery Guide effort
  - We have a basic OpenStack High Availability Guide
    - http://docs.openstack.org/high-availability-guide/content/
  - There is a very outdated "Recover cloud after disaster" section in the Admin guide
    - http://docs.openstack.org/admin-guide-cloud/content/section_nova-disaster-recovery-process.html

Page 27: When disaster strikes the cloud:  Who, what, when, where and how to recover

Q&A

THANK YOU

Michael Factor, IBM Research - Haifa, [email protected]
Ronen Kat, IBM Research - Haifa, [email protected]
Sean Cohen, Red Hat, [email protected]