
Accelerating Enterprise OpenStack

When Disaster Strikes the Cloud: Who, What, When, Where and How to Recover

Michael Factor, IBM Research - Haifa, [email protected]
Ronen Kat, IBM Research - Haifa, [email protected]
Sean Cohen, Red Hat, [email protected]

Talk Outline

- What is disaster recovery?
  - Concepts and basics
- Protecting data and applications from disasters
  - OpenStack Cinder toolbox for disaster recovery
  - Applications are more than just data
- The road ahead: Kilo and beyond

What is Disaster Recovery?

According to Wikipedia, Disaster Recovery (DR) is “the process, policies and procedures . . . for recovery . . . of technology infrastructure . . . after a natural or human-induced disaster.”

Servers, storage, network, software, configuration

Surviving a disaster requires geographic dispersion

Recovery Point Objective and Recovery Time Objective

- Recovery Point Objective (RPO): how far back in time a disaster takes one
- Recovery Time Objective (RTO): how long until operational after a disaster

[Figure: RPO and RTO timelines around the disaster (time 0), each scaled from seconds through minutes, hours and days to weeks, annotated with replication, backup, restore, active site and hot site]
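As a worked example of the two definitions above, the sketch below computes the RPO and RTO actually achieved in one incident and compares them with targets. The dates and the 24-hour and 4-hour targets are made-up values for illustration only.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_protected_copy: datetime, disaster: datetime) -> timedelta:
    """How far back in time the disaster takes us: newer data is lost."""
    return disaster - last_protected_copy


def achieved_rto(disaster: datetime, back_in_service: datetime) -> timedelta:
    """How long the service was down before it became operational again."""
    return back_in_service - disaster


# Hypothetical timeline: nightly backup at midnight, disaster at noon.
disaster = datetime(2014, 11, 5, 12, 0)
rpo = achieved_rpo(datetime(2014, 11, 5, 0, 0), disaster)
rto = achieved_rto(disaster, datetime(2014, 11, 5, 16, 30))

# Hypothetical targets: lose at most 24 hours of data, be back within 4 hours.
meets_rpo = rpo <= timedelta(hours=24)
meets_rto = rto <= timedelta(hours=4)
print(f"RPO {rpo} (ok={meets_rpo}), RTO {rto} (ok={meets_rto})")
```

In this made-up incident the backup schedule satisfies the RPO target, but the restore takes longer than the RTO target allows, which is the kind of gap that pushes a design from backup and restore toward replication and a hot site.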

Data and Metadata Consistency

Data consistency
- If a modified datum is available, all data it depends upon is also available

Metadata consistency
- Configuration updates are seen in the same order relative to one another and to data updates

[Figure: an application VM with DB and LOG volumes, and DB and LOG copies at the remote site]
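The data consistency rule above can be sketched as a simple check over what has reached the remote site. This is a toy model, assuming dependencies are known explicitly; the datum names are hypothetical.

```python
def is_consistent(available, depends_on):
    """True if every available datum has all of its dependencies available too."""
    available = set(available)
    return all(
        dep in available
        for datum in available
        for dep in depends_on.get(datum, ())
    )


# Hypothetical names: a database page update depends on its log record.
depends_on = {"db:page-7": ["log:record-42"]}

print(is_consistent({"log:record-42"}, depends_on))               # log only
print(is_consistent({"log:record-42", "db:page-7"}, depends_on))  # log and page
print(is_consistent({"db:page-7"}, depends_on))                   # page without its log
```

Only the last case violates the rule: the remote site holds a database page whose log record never arrived, so the database could not recover from that copy.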

OpenStack Cloud Metadata

- Virtual networks between the cloud VMs
- External network access
- Attached volumes
- Volume types
- Virtual machine flavors
- SSH keys for VM access
- Virtual machine images
- Identities of users


Protecting Data and Applications from Disasters

Data Protection: Cinder Backup and Restore

- Cinder backup
  - Back up a volume to backup storage

[Figure: backup-create from the Primary Cloud to Swift]

- Can Cinder restore on the secondary cloud?
  - Problem: Cinder on the secondary cloud is not aware of the backup

[Figure: Primary Cloud, Secondary Cloud and Swift; backup-restore]

- Solution: “electronic tape shipping”
  - backup-export
  - backup-import
- Supported in Cinder since Icehouse

[Figure: backup-export on the Primary Cloud produces a backup reference; backup-import brings it into the Secondary Cloud; the backup data stays in Swift]

- After backup-import, Cinder can restore on the secondary cloud
  - backup-restore

[Figure: backup-restore from Swift on the Secondary Cloud]
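The export, import and restore sequence above can be illustrated with a toy model. This is not Cinder code: the `BackupService` class, its methods and the record fields are hypothetical stand-ins that only mirror the command names on the slides, assuming both clouds can reach the same object store.

```python
class BackupService:
    """Toy stand-in for one cloud's Cinder backup service (not Cinder code)."""

    def __init__(self, object_store):
        self.object_store = object_store   # shared Swift-like store
        self.known_backups = {}            # this cloud's own backup records

    def backup_create(self, backup_id, volume_data):
        self.object_store[backup_id] = volume_data
        self.known_backups[backup_id] = {"id": backup_id, "size": len(volume_data)}

    def backup_export(self, backup_id):
        # The "tape label": a small record describing the backup, not the data.
        return dict(self.known_backups[backup_id])

    def backup_import(self, record):
        self.known_backups[record["id"]] = dict(record)

    def backup_restore(self, backup_id):
        if backup_id not in self.known_backups:
            raise LookupError(f"backup {backup_id!r} is unknown to this cloud")
        return self.object_store[backup_id]


swift = {}                                  # object store reachable from both clouds
primary, secondary = BackupService(swift), BackupService(swift)

primary.backup_create("bk-1", b"volume contents")
try:
    secondary.backup_restore("bk-1")        # data is in Swift, but there is no record
except LookupError as err:
    print("before import:", err)

secondary.backup_import(primary.backup_export("bk-1"))
print("after import:", secondary.backup_restore("bk-1"))
```

Only the small backup reference travels between the clouds. The backup data itself is already in the shared object store, which is what makes the "tape shipping" electronic.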

Data Protection: Cinder Volume Replication

- Cinder has initial support for volume replication in the Juno release
- Cinder back ends can “advertise” support for replication
- A volume created with the replication extra spec is allocated on a back end that supports replication, and is replicated
- Supporting back ends:
  - IBM Storwize, more expected in Kilo

Volume-type extra specs: “capabilities:replication <is> True”

[Figure: replication between two Cinder back ends]
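To show how an extra spec such as the one above can steer a volume to a replication-capable back end, here is a minimal sketch of capability matching. It is a simplified illustration, not the Cinder scheduler's code; the back-end names, the advertised capability dictionaries and the `backend_matches` helper are assumptions.

```python
def backend_matches(extra_specs, capabilities):
    """Return True if a back end's advertised capabilities satisfy the extra specs."""
    for key, wanted in extra_specs.items():
        scope, _, name = key.partition(":")
        if scope != "capabilities":
            continue                      # ignore specs outside this toy's scope
        advertised = capabilities.get(name)
        if wanted.startswith("<is> "):
            # "<is> True" / "<is> False": compare as booleans
            if bool(advertised) != (wanted[5:].strip().lower() == "true"):
                return False
        elif str(advertised) != wanted:
            return False
    return True


# Hypothetical back ends and the capabilities they advertise.
backends = {
    "backend-a": {"replication": True},
    "backend-b": {"replication": False},
    "backend-c": {},
}
volume_type_extra_specs = {"capabilities:replication": "<is> True"}

eligible = [name for name, caps in backends.items()
            if backend_matches(volume_type_extra_specs, caps)]
print(eligible)
```

A back end that advertises nothing about replication is treated here the same as one that advertises `False`, so only the back end that explicitly advertises replication is eligible.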

Data Protection: Cinder Volume Replication

- A secondary volume can become primary when promoted
  - replication-promote
- Replication can be reversed following a replication-promote
  - replication-reenable

[Figure: replication between two Cinder back ends]
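The promote and re-enable steps above can be pictured as a small state model. The class, its states and the site names below are purely illustrative assumptions and do not reflect how Cinder or any back-end driver tracks replication internally.

```python
class ReplicatedVolume:
    """Toy model of a replicated volume pair; names and states are illustrative."""

    def __init__(self, primary_site, secondary_site):
        self.primary_site = primary_site
        self.secondary_site = secondary_site
        self.replicating = True

    def promote(self):
        # The secondary copy takes over as primary; replication stops until re-enabled.
        self.primary_site, self.secondary_site = self.secondary_site, self.primary_site
        self.replicating = False

    def reenable(self):
        # Resume replication from whichever site is primary now.
        self.replicating = True

    def direction(self):
        arrow = "->" if self.replicating else "-x-"
        return f"{self.primary_site} {arrow} {self.secondary_site}"


volume = ReplicatedVolume("site-A", "site-B")
print(volume.direction())   # normal operation
volume.promote()
print(volume.direction())   # after the disaster: site-B serves the volume
volume.reenable()
print(volume.direction())   # replication resumed in the reverse direction
```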

Consistency Groups

- New in Juno
- Support for volume grouping for consistency
- Grouping of volumes is based on the volume type
- Supported:
  - Consistency group snapshots
- Needs to be extended to support:
  - Cinder backup
  - Cinder volume replication

[Figure: DB and LOG volumes]

Protecting Applications from Disasters

Servers, storage, network, software, configuration

Disaster Recovery Orchestration

OpenStack Tools

- Applications are defined in OpenStack by
  - Heat Orchestration Templates
- However
  - Not all applications are template based
  - Deployments (including configuration) change over time
  - Some definitions are cloud specific, e.g., networks, types
  - Heat templates and stacks don’t stay consistent
- Tools can create a template from a deployment, e.g., Flame, ReHeat
- But the template will only fit the current cloud
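One way to picture the "only fits the current cloud" problem is a pass that rewrites cloud-specific values before the template is used at the recovery site. The sketch below is a simplified assumption of such a step, not a feature of Heat, Flame or ReHeat; the template fragment, the identifiers and the `remap` helper are hypothetical.

```python
def remap(node, mapping):
    """Return a copy of a template tree with cloud-specific values substituted."""
    if isinstance(node, dict):
        return {key: remap(value, mapping) for key, value in node.items()}
    if isinstance(node, list):
        return [remap(item, mapping) for item in node]
    if isinstance(node, str):
        return mapping.get(node, node)
    return node


# Hypothetical fragment of a template exported from the primary cloud.
primary_template = {
    "resources": {
        "db_volume": {
            "type": "OS::Cinder::Volume",
            "properties": {"size": 10, "volume_type": "primary-gold"},
        },
        "db_server": {
            "type": "OS::Nova::Server",
            "properties": {"flavor": "m1.small",
                           "networks": [{"network": "primary-net-id"}]},
        },
    }
}
# Hypothetical primary-to-secondary name mapping maintained by the operator.
mapping = {"primary-gold": "secondary-gold", "primary-net-id": "secondary-net-id"}

recovery_template = remap(primary_template, mapping)
print(recovery_template["resources"]["db_server"]["properties"]["networks"])
```

The original template is left untouched, so the same mapping can be reapplied whenever the deployment on the primary cloud changes.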

OpenStack Tools and Beyond

- Demo: A technology preview for disaster recovery with IBM Cloud Manager


THE ROAD AHEAD

Ceph Multi-Site & Disaster Recovery (Block) example

- Export snapshots to geographically dispersed data centers
  - Provides disaster recovery
- Export incremental snapshots
  - Minimizes network bandwidth by only sending changes
- The Kilo cycle focus is to extend the multi-site and disaster recovery options
  - RBD mirroring
  - Cinder volume replication
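The bandwidth saving from incremental snapshots can be illustrated with a toy block-level diff. This is a conceptual sketch only, assuming snapshots are represented as a mapping from block index to contents; it does not reproduce how Ceph stores or exports snapshots.

```python
def diff(old_snapshot, new_snapshot):
    """Blocks that changed or appeared between two snapshots, plus deleted block ids."""
    changed = {i: data for i, data in new_snapshot.items()
               if old_snapshot.get(i) != data}
    deleted = [i for i in old_snapshot if i not in new_snapshot]
    return changed, deleted


def apply_diff(remote_copy, changed, deleted):
    """Bring the remote copy from the old snapshot's state to the new one."""
    remote_copy.update(changed)
    for i in deleted:
        remote_copy.pop(i, None)


# Hypothetical 4-block volume; snapshots map block index -> contents.
snap1 = {0: b"boot", 1: b"data-v1", 2: b"logs", 3: b"tmp"}
snap2 = {0: b"boot", 1: b"data-v2", 2: b"logs"}

remote = dict(snap1)                 # the remote site already holds snap1
changed, deleted = diff(snap1, snap2)
apply_diff(remote, changed, deleted)
print(f"sent {len(changed)} of {len(snap2)} blocks; remote matches: {remote == snap2}")
```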

Ceph Multi-Site & Disaster Recovery (Object) example

- Zones and region support
  - Deploy topologies similar to S3 and others with a global namespace
- Data center synchronization
  - Back up full or partial sets of data between regions
- Read affinity
  - Serve local copies of data to local users

Disaster Recovery as a Service Catalog

- Pluggable disaster recovery policies
- Replication targets can specify different RPO/RTO levels, offered based on the supported back-end capabilities
- Disaster recovery policies
  - Active - cold standby
  - Active - hot standby
  - Active - active (requires application awareness and transaction integrity)
  - Backup to cloud / from the cloud

Extending Heat Orchestration for Disaster Recovery

- Heat can be used to automate disaster recovery
  - Add support for Cinder replication
- Consistency groups need to span OpenStack projects
  - Nova, Cinder, Trove, ...
- Stack snapshot backup / rollback
- Enable customization of workload components at the recovery site
  - Networks, VM configuration changes, guest agent, etc.

The Road Toward Application Consistency

First phase: File system consistency

- Integrate into OpenStack to allow consistent snapshots and backups
  - Nova needs to request the QEMU Guest Agent to freeze the file systems (and applications, if fsfreeze-hook is installed) during the snapshot
  - Patches have been proposed for Nova and Cinder, targeting the Kilo release

Source: Hitachi
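The freeze, snapshot, thaw sequence described above might look like the following on the host side. This is a sketch under stated assumptions, not the proposed Nova patch: `guest-fsfreeze-freeze` and `guest-fsfreeze-thaw` are QEMU Guest Agent commands, while the `send` callable and the snapshot placeholder are hypothetical and the real transport to the agent is not shown.

```python
import json
from contextlib import contextmanager


@contextmanager
def frozen_filesystems(send):
    """Freeze guest file systems for the duration of the block, then always thaw.

    `send` is a hypothetical callable that delivers one JSON command string to
    the guest agent; the real transport is not shown.
    """
    send(json.dumps({"execute": "guest-fsfreeze-freeze"}))
    try:
        yield
    finally:
        # Thaw even if the snapshot fails, so the guest is never left frozen.
        send(json.dumps({"execute": "guest-fsfreeze-thaw"}))


# Usage with a fake transport that only records what would be sent.
sent = []
with frozen_filesystems(sent.append):
    sent.append("<take snapshot here>")
print(sent)
```

Putting the thaw in a `finally` clause reflects the main design concern: a failed snapshot must not leave the guest's file systems frozen.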

The Road Toward Application Consistency

Next phase: Consistency at the application level

- Application-aware on Windows with VSS support in qemu-ga
  - Application notification via Microsoft Volume Shadow Copy Service (VSS)
- Application-aware on Linux using qemu-ga hooks
  - Application-consistent snapshots can be created with scripts interacting with the QEMU guest agent
  - The scripts can notify applications to flush their data
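On the guest side, a hook script of the kind mentioned above could be structured as follows. This is a minimal sketch, assuming the hook is invoked with a single action argument of `freeze` or `thaw`; the `handle` function and the two callbacks are hypothetical placeholders for whatever the application needs in order to flush and resume.

```python
def handle(action, flush_to_disk, resume_writes):
    """Dispatch one hook invocation and return a process exit code."""
    if action == "freeze":
        flush_to_disk()      # e.g. ask the database to flush and pause writes
        return 0
    if action == "thaw":
        resume_writes()      # let the application continue normally
        return 0
    return 1                 # unknown action


# Simulate the two invocations that bracket a snapshot.
calls = []
for action in ("freeze", "thaw"):
    handle(action, lambda: calls.append("flush"), lambda: calls.append("resume"))
print(calls)
```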

Disaster Recovery at Scale

- The holy grail of site evacuation is an automatic, planned migration of workloads and data from one cloud-scale data center to another
- New OpenStack HA approaches help recovery from infrastructure failures:
  - Leveraging Pacemaker to provide automated detection of a failed hypervisor and recovery of the VMs that were running there
  - Evacuating an instance to a scheduled host was added in Juno
  - A simple tagging API for instances in Nova was accepted for the Kilo release
    - Can support a new automatic-recovery tag


OpenStack Documentation needs to catch up…

- Join the OpenStack Disaster Recovery Guide
  - We have a basic OpenStack High Availability Guide
    - http://docs.openstack.org/high-availability-guide/content/
  - A very outdated “Recover cloud after disaster” section in the Admin Guide
    - http://docs.openstack.org/admin-guide-cloud/content/section_nova-disaster-recovery-process.html


Q&A


THANK YOU

