Disaster Recovery with VMware Infrastructure
VMware Infrastructure for Rapid, Reliable, and Cost-Effective Disaster Recovery
Agenda
Challenges of Traditional DR
Properties of Virtualization for DR
Using VMware Virtualization in DR
SRM Technical Overview
What We Hear…Is This Familiar?
“We don’t have a DR plan for mission critical x86 systems – it would be too expensive and complex”
“It is very difficult to test our DR plan because of all the extra hardware, configuration and special processes”
“In our last disaster recovery test we missed our recovery objectives by days”
Only 31% of CIOs surveyed rate their plans as extremely or very effective (IDG)
40% of all companies that experience a major disaster will go out of business if they cannot gain access to their data within 24 hours (Gartner)
DR Pain Points
Lack of a reliable disaster recovery plan
27-30% of businesses have no disaster recovery plan (VMworld, Imation)
Inability to meet RTO and RPO requirements with current plan
Business needs and/or regulatory needs
Need to improve RTO from days to minutes or hours
Need to improve RPO from 24 hours to 1 hour or less
Idle hardware at recovery site
Unable to instantly repurpose machines at the secondary site
Management effort required to maintain recovery site
Need to maintain system and application images at secondary site
Usually only data is regularly and cleanly updated
Multiple slow processes to transfer data to DR site for OS, application installation, configuration, data files
Requires 1:1 duplication of servers and infrastructure at DR site
Makes x86 physical DR strategies complex and expensive
Expensive and Complex
DR Challenges Today
[Diagram: production (Prod) and DR sites, each with an application and OS on x86 hardware, OS files on local storage, and separate data storage, linked by a WAN]
Challenges of Traditional DR: Infrastructure
• Bound to HW
• 5-10% utilized
Complex to physically recover OS, applications & data
Separate processes for system and application data
OS & applications have dependencies on hardware configuration
Tier 2 & 3 applications left unprotected, adding to Tier 1 RTO risk
Slow and Unreliable Process
DR Challenges Today
[Diagram: the same production (Prod) and DR stacks linked by a WAN; the DR system is rebuilt from CD, tape, or ghost image, a “Boot & Pray” recovery]
Challenges of Traditional DR: Recovery
Agenda
Challenges of Traditional DR
Properties of Virtualization for DR
Using VMware Virtualization in DR
SRM Technical Overview
DR: The Killer App for Virtualization!
2006 Customer Survey (n=2265)
…85% use VMware in production; 43% set as a default policy for production servers*
Press
“Best Disaster Recovery Product of 2006” (TechTarget)
Customers
55% of customers using virtualization for BC/DR*
*Source: VMware customer survey, 9/2006. N=2265
What is Server Virtualization
VMware server virtualization packages hardware, OS, and applications into a portable virtual machine package
Before Virtualization
• Software tied to hardware
• Single OS image per machine
• One application workload per OS
After Virtualization
• Multiple workloads per machine
• Software independent of hardware
• System, data, apps are files
Copyright © 2006 VMware, Inc. All rights reserved.
VMware Virtualization Enablers for DR
Hardware Independence
Run a virtual machine on any server without modification
• Eliminate need for 1:1 hardware duplication for DR
• Eliminate risk of hardware “configuration drift”
• Re-use older servers for DR
VMware Virtualization Enablers for DR
Encapsulation
Encapsulate entire systems in simple files
• Simplify backup and replication
• Simplify copying and cloning of systems
• Simplify provisioning
[Diagram: a physical server whose system, apps, and data are files in VMFS]
VMware Virtualization Enablers for DR
Isolation
Each virtual machine is isolated from other virtual machines
• Provide easier testing of DR plan
• Utilize DR hardware for other tasks
• Leverage resource pools to separate workload groups
[Diagram: VMware Infrastructure running App/OS virtual machines alongside a batch job and a DR test]
VMware Virtualization Enablers for DR
Partitioning
Safely run multiple virtual machines simultaneously on a single physical server
• Consolidate servers
• Boost utilization
• Provide significant cost savings
% Utilization
Agenda
Challenges of Traditional DR
Properties of Virtualization for DR
Using VMware Virtualization in DR
Data and system protection
Replication
DR testing
Protecting physical servers with virtual machines
SRM Technical Overview
VMware Availability Products And Features
Avoid planned outages Quick recovery from unplanned outages
Component
Server
Storage
Data N/A
Site
VMware HA
VMotion, DRS + Maintenance Mode
NIC Teaming, Multipathing
Encapsulation, VCB
Storage VMotion
Encapsulation, boot from shared storage, instant reprovisioning, HW independence, resource pools, snapshots, VLANs
Encapsulation, VCB
VMware Site Recovery Manager
Data and System Protection – Physical vs. Virtual
Data and system protection with physical infrastructure
• Separate processes for protecting data and system disks
• Require identical hardware for guaranteed restore
• Complex processes to ensure protection
Data and system protection with VMware Infrastructure
• Same process for data and system disks
• Entire system stored as data
• Hardware-independent virtual machines are easy to restore to any hardware
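[Diagram: physical infrastructure keeps system, data, and system configuration separately; VMware Infrastructure keeps system, data, and system config together]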
Backup Options with VMware – Reduce Backup Windows
Agent in Service Console
Simplified backup of full-disk images
Any storage
Agent in each VM
Same architecture as physical system backup
File-level incremental backup possible
Any storage
[Diagram: backup agents running in each VM (App/OS) and in the Service Console, sending data to a backup server and tape]
Consolidated Backup - Agent on Proxy Server
Move backup out of VM
Provide LAN-free backup
Eliminate backup windows
Pre-integrated with 3rd party backup products
[Diagram panels: In-VM, In-Console, VCB]
Copyright © 2005 VMware, Inc. All rights reserved.
VMware Consolidated Backup – How it Works
Move backup out of the virtual machine
Run midday backups – LAN Free
Integrated with 3rd party backup
Centralized file and image level backup
1. Take VM Snapshot
2. Mount SAN Snapshot
3. Back up files or disk images with leading backup tools
Replication with VMware: Array-Based Replication
[Diagram: array-based replication over WAN or dark fiber between the storage arrays at the PRIMARY and DR sites, from the source VMFS to the target VMFS; labeled “Site Failure”]
Simpler Disaster Recovery Testing with Virtualization
1. Snapshot and clone replicated data to create testing VMs
2. Connect test VMs to an isolated network
3. Power up testing VMs to validate recovery
4. Delete VM clones used for testing
- Rapid DR setup and removal
- Dual-use of DR site for batch, test and other workloads
[Diagram: at the DR site, replicated data (OS, application, and data images) on the target VMFS is snapshotted to power on test DR VMs alongside live DR; labels: SAN, 15 GHz, 9 GHz]
Recovery Process in a Virtualized Environment
RTO of minutes to a few hours, not days to weeks!
Example recovery process comparison
P-P (40+ hrs): Configure hardware, Install OS, Configure OS, Install backup agent, Start “Single-step automatic recovery”
V-V (< 4+ hrs): Restore VM, Power on VM
VMware Site Recovery Manager: Technical Overview
July 2008
VMware
Agenda
Introduction and Key Concepts
Site Recovery Manager 1.0 Prerequisites and SAN Integration
Site Recovery Manager Workflows
Site Recovery Manager Roles and Privileges
Alarms and Site Status Monitoring
Summary
What is a Disaster?
Complete loss of a data center for an extended period of time
Declaration of a disaster usually requires consensus from multiple parts of the organization (at the C*O level)
What is not a disaster?
Failure of an individual host
A temporary service interruption
The Current State of Physical Disaster Recovery
DR services tiered according to business needs
Physical DR is challenging
Maintain identical hardware at both locations
Apply upgrades and patches in parallel
Little automation
Error-prone and difficult to test
Tier   RPO         RTO         Cost
I      Immediate   Immediate   $$$
II     24+ hrs.    48+ hrs.    $$
III    7+ days     5+ days     $
Advantages of Virtual Disaster Recovery
Virtual machines are portable
Virtual hardware can be automatically configured
Test and failover can be automated (minimizes human error)
The need for idle hardware is reduced
Costs are lowered, and the quality of service is raised
Simplifies and automates disaster recovery workflows:
Setup, testing, failover
Turns manual recovery runbooks into automated recovery plans
Provides central management of recovery plans from VirtualCenter
Introducing VMware Site Recovery Manager
Works with VMware Infrastructure to make disaster recovery rapid, reliable, manageable, affordable
Site Recovery Manager leverages VMware Infrastructure to deliver advanced disaster recovery management and automation
Site Recovery Manager at a Glance
[Diagram: Protected Site (Site A) and Recovery Site (Site B), each with VirtualCenter and Site Recovery Manager; datastore groups are copied by array replication. When the Protected Site becomes unavailable, the protected VMs go offline there and are powered on at the Recovery Site.]
Supports bi-directional site protection
Server Side Components *
Site 1: VC Server 1, VCMS 1 DB, SRM Server 1, SRM 1 DB, Storage Replication Adapter, Block Replication SW, Array 1
Site 2: VC Server 2, VCMS 2 DB, SRM Server 2, SRM 2 DB, Storage Replication Adapter, Block Replication SW, Array 2
* Note: Conceptual drawing only. The Site Recovery Manager Server may run on a different system than VCMS
Site Recovery Manager Concept Relationship “Cheat Sheet”
Site        Concept            Relationship
Protected   LUN                Indivisible unit of storage that can be replicated
Protected   Datastore          Contains one or more LUNs (i.e. VMFS)
Protected   Datastore Groups   Auto-generated collection of one or more datastores. Indivisible unit of storage failover.
Protected   Protection Group   Collection of all VMs stored in a datastore group
Recovery    Recovery Plan      Contains one or more protection groups
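The containment relationships in the cheat sheet can be sketched as a small data model. This is only an illustration: the class names, the example layout, and the VM names are assumptions made for the sketch, not SRM's internal object model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datastore:
    name: str
    luns: tuple[str, ...]  # a datastore contains one or more LUNs


@dataclass(frozen=True)
class DatastoreGroup:
    name: str
    datastores: tuple[Datastore, ...]  # fails over as one indivisible unit


@dataclass(frozen=True)
class ProtectionGroup:
    name: str
    datastore_group: DatastoreGroup
    vms: tuple[str, ...]  # all VMs stored in the datastore group


@dataclass(frozen=True)
class RecoveryPlan:
    name: str
    protection_groups: tuple[ProtectionGroup, ...]

    def vms(self) -> list[str]:
        """Every VM the plan covers, in protection-group order."""
        return [vm for pg in self.protection_groups for vm in pg.vms]

    def luns(self) -> list[str]:
        """Every LUN that has to fail over for this plan."""
        return [
            lun
            for pg in self.protection_groups
            for ds in pg.datastore_group.datastores
            for lun in ds.luns
        ]


# Hypothetical layout, for illustration only.
dg1 = DatastoreGroup("Datastore Group 1", (Datastore("VMFS 1", ("LUN 1",)),))
dg2 = DatastoreGroup("Datastore Group 2", (Datastore("VMFS 2", ("LUN 2", "LUN 3")),))
pg1 = ProtectionGroup("Protection Group 1", dg1, ("app_vm1", "app_vm2"))
pg2 = ProtectionGroup("Protection Group 2", dg2, ("app_vm3",))
whole_site = RecoveryPlan("Recovery Plan 1 (Whole Site)", (pg1, pg2))
subset = RecoveryPlan("Recovery Plan 2 (Subset)", (pg1,))
```

Because a protection group is tied to exactly one datastore group here, a plan's storage footprint follows directly from its list of protection groups.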
Key Concepts And Their Relationships
[Diagram: at the Protected Site, LUNs 1-5 back VMFS datastores 1-4, which form Datastore Groups 1-3; each datastore group maps to one of Protection Groups 1-3. At the Recovery Site, Recovery Plan 1 (Whole Site) contains Protection Groups 1, 2, and 3, and Recovery Plan 2 (Subset) contains Protection Group 1.]
Array Integration with Site Recovery Manager
Vendor-specific scripts support:
Array discovery
Replicated LUN discovery
Test initiation (simulated failover in an isolated environment)
Failover initiation (actual failover of services to the recovery site)
Storage vendors create the storage replication adapters for their respective storage arrays, in cooperation with VMware and with its full support
[Diagram: at both the Protected Site and the Recovery Site, VirtualCenter and the SRM Server (Replication Manager, Array Manager) call a vendor-specific script, which talks to the vendor management interface of each array]
VMware Site Recovery Manager Licensing
SRM is licensed per CPU socket on the ESX servers that host the protected virtual machines in the Protected Site
[Diagram: Site 1 and Site 2, distinguishing SRM Protected VMs from VMs not protected by Site Recovery Manager]
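As a rough worked example of the per-socket rule, the sketch below counts the sockets on protected-site hosts that run at least one protected VM. It relies only on the statement on this slide; the `EsxHost` type and its fields are hypothetical, and actual licensing terms should be confirmed with VMware.

```python
from dataclasses import dataclass


@dataclass
class EsxHost:
    name: str
    cpu_sockets: int
    protected_vms: int  # VMs on this host that belong to an SRM protection group


def sockets_to_license(hosts: list[EsxHost]) -> int:
    """Count CPU sockets on protected-site hosts that run a protected VM."""
    return sum(h.cpu_sockets for h in hosts if h.protected_vms > 0)
```

For example, two 2-socket hosts with protected VMs plus one 4-socket host with none would count as 4 sockets under this reading.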
Safety Tip: DNS Validation – The Rule of ‘Four’
Validate that DNS is working as expected by performing the following DNS lookups for the VC, SRM and ESX servers:
Short name
Long name
Reverse
Forward
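The four lookups can be scripted. The sketch below is one possible reading of the rule, not a VMware tool: it resolves the short name and the long name, checks that both forward lookups agree, and checks that the reverse lookup returns the long name. The function name and result keys are assumptions; the default resolvers are Python's standard `socket.gethostbyname` and `socket.gethostbyaddr`, and they can be swapped out for testing.

```python
import socket
from typing import Callable


def rule_of_four(
    short_name: str,
    long_name: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> dict[str, bool]:
    """Run short-name, long-name, forward and reverse DNS checks for one server."""

    def resolve(name: str):
        try:
            return forward(name)
        except OSError:  # covers socket.gaierror and socket.herror
            return None

    short_ip = resolve(short_name)  # relies on the resolver's search domain
    long_ip = resolve(long_name)

    names: list[str] = []
    if long_ip is not None:
        try:
            primary, aliases, _addresses = reverse(long_ip)
            names = [primary, *aliases]
        except OSError:
            names = []

    return {
        "short name": short_ip is not None,
        "long name": long_ip is not None,
        "forward": short_ip is not None and short_ip == long_ip,
        "reverse": long_name.lower() in (n.lower() for n in names),
    }
```

Running it for each VC, SRM and ESX server at both sites and flagging any `False` entry gives a quick pre-installation check.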
Site Recovery Manager 1.0 Prerequisites
ESX 3.0.2, ESX 3.5
VirtualCenter (VC) server version 2.5 installed at the protected site and at the recovery site
Site Recovery Manager server installed at the protected and at the recovery site
Site Recovery Manager plug-in installed on the VMware Infrastructure Clients that will access the protected and recovery site
Network configuration that allows TCP connectivity between VC servers and SRM servers
An Oracle or SQL Server database that uses ODBC for connectivity in the protected site and in the recovery site
A Site Recovery Manager license file installed on the VC license server at the protected site and at the recovery site
Pre-configured array-based replication between the protected site and the recovery site
Site Recovery Manager Installation Workflow
At the protected site the following activities are completed:
Installation of the SRM server
Installation of the SRM Plugin into the VI Client
Installation of the Storage Replication Adapter (SRA)
At the recovery site the following activities are completed:
Installation of the SRM server
Installation of the SRM Plugin into the VI Client *
Installation of the Storage Replication Adapter (SRA)
It is important to complete the workflows in the order detailed in this presentation
* Note: Optional step, only required if a different instance of the VI Client is used to access the recovery site
Protected and Recovery Site Datacenters
PROTECTED SITE
RECOVERY SITE
Site Recovery Manager User Interface
Local and Paired Site
Protection Setup
RecoverySetup
SRM UI Access
Setup Workflow – Protection Site
At the protection site the following setup activities are completed:
The user pairs the SRM servers at the protected and recovery sites
Security certificates are established between the SRM servers and the VC servers
Certificates that are not properly signed will result in yellow warning signs. Reciprocity will still be established, allowing you to continue to the next step in the workflow.
Setup Workflow – Protection Site (continued)
Array Managers Configuration
Select the correct Manager Type from the Manager type drop-down box
Storage Partner Participation
VMware provides the SRA specification
Storage Partners create the SRA
Storage Partners test the SRA
VMware reviews the SRA test results
SRA support with SRM is granted if all tests are passed
SRM identifies available arrays in the Protection and Recovery Side and the replicated datastores and determines the datastore groups
Protection Side Array Discovery
Recovery Side Array Discovery
Replicated Datastores and Datastore Groups
Setup Workflow – Protection Site (continued)
Using the Inventory Preferences Mapper, the user maps resources in the protected site to their counterparts in the recovery site.
Setup Workflow – Protection Site (continued)
A protection group is a group of VMs that will be failed over together to the recovery site
Working through the Protection Group wizard you will need to select a temporary location for placeholder VM configuration files for the protected VMs at the recovery site.
Setup Workflow – Protection Site (continued)
Working through the Protection Group wizard a user selects which VMs need to be protected and assigns them to a protection group
The creation of a protection group results in VC inventory updates in the recovery site
Setup Workflow – Recovery Site
At the recovery site the following setup activity is completed:
The user creates a recovery plan which is associated to a single or multiple protection groups
Site Recovery Manager Recovery Plan
VM Shutdown
High Priority VM Recovery
Prepare Storage
High Priority VM Shutdown
Normal Priority VM Recovery
Site Recovery Manager Recovery Plan (continued)
Site Recovery Manager Recovery Plan Benefits:
Turn manual BC/DR run books into an automated process
Specify the steps of the recovery process in VirtualCenter
Provide a way to test your BC/DR plan in an isolated environment at the recovery site without impacting the protected VMs in the protected site
Low Priority VM Recovery
Post Test Cleanup
Storage Reset
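The priority tiers named in the recovery plan suggest a simple ordering rule. As a minimal sketch, assuming VMs are recovered high priority first, then normal, then low (the `Priority` enum, the function, and the VM names are illustrative and not part of SRM):

```python
from enum import IntEnum


class Priority(IntEnum):
    HIGH = 0
    NORMAL = 1
    LOW = 2


def recovery_order(vms: dict[str, Priority]) -> list[str]:
    """Order VM names by priority tier.

    sorted() is stable, so VMs in the same tier keep the order in which
    they were listed in the plan.
    """
    return sorted(vms, key=lambda name: vms[name])
```

A plan listing `web` (low), `db` (high), `app` (normal) and `dns` (high) would come back as `db`, `dns`, `app`, `web`.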
Testing a Recovery Plan
SRM enables you to ‘Test’ a recovery plan by simulating a failover with zero downtime to the protected VMs in the protected site
Storage configuration during an SRM Test failover from Site A to Site B for datastore ‘shared-san-2’
Site A - Protected Site
Source LUN (shared-san-2): Read Write Enabled
Protected VMs (app_vm7 to app_vm12): Protected VMs that will be recovered to Site B
Site B - Recovery Site
Target LUN (shared-san-2): Write Disabled (read only)
Clone LUN (shared-san-2): Read Write Enabled
Protected VMs (app_vm7 to app_vm12): Protected VMs powered on in Site B during the SRM Test failover
Data Replication continues between the Source LUN and Target LUN
The data synchronization between the Target LUN and the Clone LUN is suspended
Note: Datastore ‘shared-san-1’ will be in the same configuration state as ‘shared-san-2’
Testing a Recovery Plan (continued)
Status
Success
Errors
Waiting for Input
Recovery Only
Test Only
Success
Executing an Actual Failover
WARNING - Executing an actual failover will permanently alter virtual machines and infrastructure of both the protected and recovery sites
Storage configuration after running a Recovery in SRM (Actual Failover) from Site A to Site B
Site A - Protected Site
Source LUN (shared-san-2): Write Disabled (read only)
Protected VMs (app_vm7 to app_vm12): All powered off by SRM at start of SRM Recovery
Site B - Recovery Site
Target LUN (shared-san-2): Read Write Enabled
Protected VMs (app_vm7 to app_vm12): All powered on by SRM during the SRM Recovery
Data Replication is suspended
Note: A Clone LUN is not used during an actual failover in SRM.
Executing an Actual Failover (continued)
WARNING - Executing an actual failover will permanently alter virtual machines and infrastructure of both the protected and recovery sites
WARNING - Failback to the protected site is not an automated process in SRM 1.0
SRM performs a Datastore re-signature
SRM will automatically perform a re-signature on the Datastores in the Recovery Site that were replicated from the SRM Protected Site
LVM.EnableResignature=1
With a re-signature, Datastore names will change to snapxxxx_datastorename, for example:
snap-00000002-shared-san-1
snap-00000002-shared-san-2
WARNING - The re-signature of the target datastore has implications during a failback (resync) of data back to the SRM Protected Site
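When planning a failback it can help to map re-signatured datastore names back to their originals. The sketch below assumes the name format seen in the examples on this slide, `snap-` followed by eight hexadecimal digits, a hyphen, and the original name; that pattern is inferred from the examples only, and the helper name is hypothetical.

```python
import re

# Pattern inferred from the slide's examples, e.g. "snap-00000002-shared-san-1".
_RESIGNATURED = re.compile(r"^snap-([0-9a-fA-F]{8})-(.+)$")


def original_datastore_name(name: str) -> str:
    """Return the pre-resignature datastore name, or the name unchanged."""
    match = _RESIGNATURED.match(name)
    return match.group(2) if match else name
```

Names that do not match the pattern are passed through untouched, so the helper is safe to run over a mixed datastore list.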
Failback Options with Site Recovery Manager 1.0
SRM 1.0 does not provide a push-button automated failback process
Failback Options
Without SRM (no Recovery Plan, no Testing capabilities, no audit trail)
Unregister the protected virtual machines in the Protected Site VC
Work with your storage team, reverse data replication
VM re-inventory in Protected Site VC, restart and re-ip (manual or scripted)
With SRM (Recovery Plan, Test before Recovery, built-in audit trail)
Delete the protection groups in the Protected Site VC
Unregister the protected virtual machines in the Protected Site VC
Work with your storage team, reverse data replication
Leverage SRM, complete SRM workflows in the reverse direction from Recovery Site back to the Protected Site
Repeat the above steps from the Protected Site back to the Recovery Site to complete the re-protection of the virtual machines in the Protected Site
Default Roles and Privileges in Site Recovery Manager
Alarms and Site Status Monitoring
SRM will support the following alarm notification actions:
Send e-mail to specified address
Send SNMP trap to VC trap receivers
Execute specified command on VC host
We recommend you complete setup of alarm notifications for:
Remote Site Down
Remote Site Ping Failed
Replication Group Removed
Recovery Plan Destroyed
License Server Unreachable
Site Recovery Manager Server Monitoring
SRM will raise VC events for the following conditions:
Disk Space Low
CPU use exceeded limit
Memory low
Remote Site not responding
Remote Site heartbeat failed
Recovery Plan Test started, ended, succeeded, failed, or cancelled
Virtual Machine Recovery started, ended, succeeded, failed, or reports a warning
Site Recovery Manager Core Benefits
Expand disaster recovery protection
Now any workload in a VM can be protected with minimal incremental effort and cost
Reduce time to recovery
As soon as disaster is declared, a single button kicks off recovery sequence for hundreds of VMs
Increase reliability of recovery
Replication of system state ensures a VM has all it needs to startup
Hardware independence eliminates failures due to different hardware
Easier testing based off of actual failover sequence allows more frequent and more realistic tests
Summary
Site Recovery Manager Leverages VMware Infrastructure to Make Disaster Recovery
Rapid
Automate disaster recovery process
Eliminate complexities of traditional recovery
Reliable
Ensure proper execution of recovery plan
Enable easier, more frequent tests
Manageable
Centrally manage recovery plans
Make plans dynamic to match environment
Affordable
Utilize recovery site infrastructure
Reduce management costs
Backup Slides
Protected Site Topology Map
Setup Workflow – Recovery Site VC Updates
The creation of the protection group results in VC Inventory updates in the recovery site.
Protected VMs app_vm1 to app_vm12 are created in the VC inventory in the recovery site with the creation of their respective protection groups in the protected site
Questions?