TRANSCRIPT
Ready, Set, Plan
Defeat Disaster: It Is Your Move
Richard Dolewski
Gateway/400, November 08, 2007
Disaster Recovery Planning
Only 70% of today’s businesses have fully documented Disaster Recovery Plans.
Of these companies with plans:
Pre-9/11: 64% NEVER test their plan
Post-9/11: 30% NEVER test their plan
Common Misconceptions
It will never happen to me!
Business as usual after a disaster
We have special requirements
Too many other priorities
Murphy’s law: Disaster strikes when, where and because you are not prepared
Common Issues
We all tend to let our guard down when times improve
As Planners we must always be ready & be prepared.
We are all not Safe from Weather Related Disasters !!!
Impact on Cost of Downtime
Tangible Costs
• Lost Revenue
• Lost Wages
• Lost Inventory
• Regulatory Violations
• Legal Fees
Intangible Costs
• Lost Opportunity
• Employee Retention
• Goodwill
• Brand Damage
• Customer Respect
Average Cost Per Hour of Downtime —By Industry
Finance: Brokerage Operations $3.15 Million
Finance: Credit Card Auth. $2.1 Million
Online Retail: $113,000
Communications: Internet Provider $90,000
Transportation: $89,500
Media: Ticket Sales $69,000
Transportation: Package Shipping $28,000
Source: Contingency Planning Research 2005
Last Sobering Statistic:
93% of businesses that suffer a significant loss of data are out of business within 5 years.
Source: The Bureau of Labor
No Disaster Recovery Plan
Guarantees:
• Confusion
• Lack of direction
• Conflict
• Lost Customers
Definition of a Disaster
A sudden, unplanned event that causes great damage and loss to an organization.
The time factor determines whether an interruption in service is an inconvenience or a disaster. The time factor
varies from organization to organization.
What is Disaster Recovery
Reaction to a sudden, unplanned event that enables an organization to continue critical
business functions until normal business operations resume.
“…It is not enough to arrange for hardware replacement;… planning must address continuation of
business operations, or business continuation.”
Consider the business impact of downtime!
Why is this area “Vital”?
Expectations of the Services are demanding
Technology is an enabler of business
Penalties are becoming more severe
Business is becoming more competitive
Can serve as both a source of competitive advantage as well as competitive disadvantage
Vulnerability IT Assessment
Key Steps to DR Planning
IT Capabilities Assessment
– Overview our current IT capabilities
– Align the Business Needs
– List the gaps between the Business needs and current solutions
– What solutions are needed to bridge the gap
Vulnerability Methodology – BP Audit
Objective is to drive down the duration of outages
A systematic approach towards:
Reducing the frequency of outages by eliminating all single points of failure.
Reducing the duration of outages by configuring both hardware and software for the fastest possible recovery.
Vulnerability Methodology
Best Practices Audit
Analyze
Identify Potential Exposures
Provide Alternatives & Solutions
Implement Solutions
Power Redundancy
Physical Security
Open Door Policy
Entry points – Door Management
Cipher locks
IP Cameras
Save/Restore Strategy
System saves must be reviewed
Ensure complete recovery is possible from a mid-week, mid-day or weekend failure
Electronic notification of exceptions
Review restoration procedures
Backup software: BRMS
Save/Restore Strategy
Partial saves are used because of shrinking backup window.
Save While Active may be the solution for you.
Introduce faster tape technology
Less than 50% of companies have complete backups
Reliable Backups
Backups are the backbone to any recovery situation
– In most recovery situations, the backups are not adequate
– Excessive time is spent recreating parts of operating system
– System State is typically not complete
– QUSRSYS
Design & Test Your Backups
Testing your recovery strategy ensures you have a good back up strategy!
– Your backup is only as good as your recovery
– Your recovery is only as good as your backups
Hint:
Design recovery strategy before your backup strategy
Checklist for Backup & Recovery
Examine current save strategy for all mission critical servers.
Map out how you would rebuild multiple servers. Is there a specific order required? Consider enterprise recovery.
Check the backup logs. Missing objects, folders, directories.
Examine Backup software : Veritas, ArchSrv, BRMS, TSM logging
Tape Management Software
Strategic Backup Management Product
Manages your media
Automates your Backups
Simplifies your Restores and Recoveries
Provides Detailed Reporting
…and more
Business Impact Analysis
Mission Statement
IT Services Mandate:
To protect systems from risk.
To ensure continuing confidence
To monitor and protect corporate computing assets.
Protect Important Assets
Four Primary Assets needed to operate Information Systems:
Hardware and Networks can be replaced
Facilities can be rebuilt or relocated
Data is Priceless !!!
Business & IT DR Planning
– Defining Business Objectives
– Prioritizing Business Objectives
– Overview your current IT capabilities
– Alignment between IT and the Business
– Minimize the gap between Business needs & IT deliverables
– What solutions are needed to bridge the gap
– Acceptable length of downtime - High Level
3 Steps to Business Preparedness
1) PLAN to stay in business
2) TALK to your Business & IT Folks
3) PROTECT your investment
Key Steps to DR Planning
Business Impact Costs
– Create cost estimates for each agreed upon risk scenario
– Define acceptable amount of downtime
– Define acceptable amount of data loss
– Create budget costs to implement agreed upon solution
Business Impact Analysis
A summary of critical IT applications.
- Application name
- Application priority
- Special Requirements
- Maximum outage (hours, days)
These applications to be included in the BIA presentation and report.
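The application summary above maps naturally onto a small record type. The following is a minimal sketch, assuming a numeric priority where 1 means "recover first" and a maximum outage expressed in hours; the class name, the sample applications, and their values are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class CriticalApplication:
    """One row of a BIA application summary (field names are illustrative)."""
    name: str
    priority: int              # assumed convention: 1 = recover first
    special_requirements: str
    max_outage_hours: float    # maximum tolerable outage, in hours


def recovery_sequence(apps):
    """Order applications by priority, then by shortest tolerable outage."""
    return sorted(apps, key=lambda a: (a.priority, a.max_outage_hours))


# Hypothetical sample data for illustration only.
apps = [
    CriticalApplication("Payroll", 2, "Month-end cutoff", 72),
    CriticalApplication("Order Entry", 1, "EDI link to customers", 8),
    CriticalApplication("Email", 2, "None", 24),
]

for app in recovery_sequence(apps):
    print(f"{app.priority}  {app.name:<12} max outage {app.max_outage_hours} h")
```

Sorting on priority first and maximum outage second is one possible tie-breaking rule; a real BIA would confirm the sequence with the business owners.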
Business Impact Analysis
Define cost of outage
Total revenue
Customer base
Fines and penalties
Graphic representation of revenue loss
Define recovery sequence of the vital processes
Review loss per hour, day, week and/or month
Obtain senior management confirmation
How much money would your company lose if a major outage occurred?
Questions for the Business
Include: Manufacturing, Finance, Purchasing, Sales,
Warehousing.
What Services do you provide them !
What Services do they provide !
Risk Analysis
Key Steps to DR Planning
Identify Risks
– Identify scenarios where a recovery is required
– Identify key business requirements necessary during these potential interruptions
– Incident or Disaster
Objectives of a Risk Analysis
Answer: Four basic questions. . .
1. What could go wrong? Threat/Event
2. How often can it happen? Frequency
3. What will be the consequences? Impact
4. How certain are the answers above? Confidence
Statement of Risk
Quantitative
Assigning values, such as $$$$, to something
Identifying the cost of a particular effect, incident or phenomenon
ALE - Annualized Loss Exposure
Objective
Annual Loss Exposure
RISK = FREQUENCY times EXPOSURE
R = f * e
Where:
f = FREQUENCY
e = EXPOSURE
EXAMPLE = POWER FAILURE
Frequency = 5 times a year
Result = Uncontrolled loss of $70,000 (Dept A)
       = Uncontrolled loss of $10,000 (Dept B)
5 x $70,000 = $350,000
5 x $10,000 = $50,000
Produces a Total ALE = $400,000
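The R = f * e calculation can be checked with a few lines of code. This is a minimal sketch that computes each department's exposure directly from the stated frequency (5 events a year) and per-event losses ($70,000 and $10,000); the function and variable names are illustrative.

```python
def annual_loss_exposure(frequency_per_year, loss_per_event):
    """R = f * e: expected yearly loss for one threat and one department."""
    return frequency_per_year * loss_per_event


# Inputs taken from the power failure example.
frequency = 5  # power failures per year
loss_per_event = {"Dept A": 70_000, "Dept B": 10_000}

ale_by_dept = {
    dept: annual_loss_exposure(frequency, loss)
    for dept, loss in loss_per_event.items()
}
total_ale = sum(ale_by_dept.values())

for dept, ale in ale_by_dept.items():
    print(f"{dept}: ${ale:,}")
print(f"Total ALE: ${total_ale:,}")
```

Keeping the per-department figures separate makes it easier to see which area drives the total when comparing mitigation costs.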
Recovery Options
Key Steps to DR Planning
Where will you go
Determine your recovery site
• Internal alternative
• Commercial Hotsite
• Hosted High Availability (Internal or Commercial)
• Location
• Geographic Separation
The Disaster Recovery Challenge
Resume Time Sensitive Business Operations with NO warning and:
At another (remote?) location/facility
Smaller server with less capacity & capability
Using only information stored off site
Within a designated recovery time objective
Without some key personnel
Recovery Time Objective
The time within which Business Processes must be Restored at acceptable Levels of Operational
Capability to Minimize the Impact of an outage.
[Figure: RTO timeline. Labels: Point of Disruption; Resumption of Operations (Business or Data Processing); Time-Sensitive Systems Operational with Current & Accurate Data; Business Processes Functional; Time; RTO]
Recovery Time Objective
Recovery Tasks: Time to Complete Task in hours
Assess the disaster situation: 3 hours
Declare a disaster: 2 hours
Retrieve tapes from our offsite supplier: 1 hour
Transport key staff & backup tapes to the recovery site: 2 hours***
Restore all Mission Critical Servers: 20 hours
Configure and redirect networks: 1 hour
Apply incremental data: 2 hours
Testing & validation: 1 hour
Total Time: 32 hours
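The recovery task timeline above can be totalled and compared against a target in code. This is a minimal sketch, assuming the tasks run strictly one after another; the 48-hour and 24-hour objectives passed to `meets_rto` are hypothetical values chosen for illustration, and the function names are not from the presentation.

```python
# Task durations in hours, taken from the recovery task list above.
recovery_tasks = [
    ("Assess the disaster situation", 3),
    ("Declare a disaster", 2),
    ("Retrieve tapes from our offsite supplier", 1),
    ("Transport key staff & backup tapes to the recovery site", 2),
    ("Restore all Mission Critical Servers", 20),
    ("Configure and redirect networks", 1),
    ("Apply incremental data", 2),
    ("Testing & validation", 1),
]


def total_recovery_hours(tasks):
    """Sum task durations, assuming tasks are performed sequentially."""
    return sum(hours for _, hours in tasks)


def meets_rto(tasks, rto_hours):
    """True when the sequential recovery time fits inside the objective."""
    return total_recovery_hours(tasks) <= rto_hours


total = total_recovery_hours(recovery_tasks)
longest_task = max(recovery_tasks, key=lambda task: task[1])

print(f"Total recovery time: {total} hours")
print(f"Longest task: {longest_task[0]} ({longest_task[1]} hours)")
print(f"Fits a 48-hour objective: {meets_rto(recovery_tasks, 48)}")  # hypothetical RTO
print(f"Fits a 24-hour objective: {meets_rto(recovery_tasks, 24)}")  # hypothetical RTO
```

Identifying the longest task shows where faster technology or parallel restores would shorten the timeline most.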
Disaster Recovery Hotsite
High Availability
Disk Protection
Transaction Integrity
Tape Backup
Backup & Recovery Hierarchy
Risk Management
Data Resiliency Level
Vendor Exercise
BULL**** METER
Evaluating Hotsite Vendors
Location, Location, Location
– Proximity to public transportation
– Separate Power Grid & CO
Facility
– Appropriate computing hardware
– Compatible communication networks
– Adequate workspace
Support Service
Evaluating Hotsite Vendors
Support Staff & Availability
Experience
Test Time
Number of Customers
Cost
Additional services
Declaration fees
What’s included and what’s not?
Recovery Solution
Hotsite vs. High Availability
Depends on Recovery Time & Recovery Point Business Objectives
Downtime vs. Availability
Downtime Cost Variable is $/Hour
Understanding downtime $/Hour is the most important key to understanding your availability requirements.
Labor costs, lost productivity & revenue
The cost of downtime continues to rise.
The cost of computing is falling.
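Because the downtime cost variable is expressed in $/Hour, an outage can be priced with simple multiplication, and a loss budget can be turned back into a tolerable outage length. This is a minimal sketch: the $113,000 hourly figure is the Online Retail value quoted earlier in the deck, while the 4-hour outage, the $226,000 loss budget, and the function names are hypothetical.

```python
def downtime_cost(cost_per_hour, outage_hours):
    """Total cost of an outage at a flat hourly rate."""
    if cost_per_hour < 0 or outage_hours < 0:
        raise ValueError("cost and duration must be non-negative")
    return cost_per_hour * outage_hours


def max_tolerable_outage_hours(cost_per_hour, loss_budget):
    """Longest outage (in hours) whose cost stays within the loss budget."""
    if cost_per_hour <= 0:
        raise ValueError("cost_per_hour must be positive")
    return loss_budget / cost_per_hour


ONLINE_RETAIL_PER_HOUR = 113_000  # hourly figure quoted in the deck

outage_cost = downtime_cost(ONLINE_RETAIL_PER_HOUR, 4)                    # hypothetical 4 h outage
tolerable = max_tolerable_outage_hours(ONLINE_RETAIL_PER_HOUR, 226_000)   # hypothetical budget

print(f"4-hour outage: ${outage_cost:,}")
print(f"Tolerable outage for a $226,000 budget: {tolerable} hours")
```

A flat hourly rate is a simplification; real losses often grow faster as an outage lengthens.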
Building a Team Means Getting the Right People
Perspective
The difference between a GREAT recovery team and the one that falls down on the job is the:
Caliber of the Team members !
Heroes Step Forward
Perspective
Too often companies populate their DR teams with raw, inexperienced staffers:
volunteers chosen to satisfy an auditor or, worse, the sacrificial lamb
The Right People
Not only are these folks leaders, and the most capable:
They are trusted
Confident
Able to correct mistakes
Dedicated to the success of the team
Characteristics of a Good Team
Ideal Characteristics:
Considered an Expert by his/her peers
A go-to Person for anything and/or Everything
Works well under Pressure
Controls Emotions
Characteristics to Avoid:
Hands-off Individual (Avoids Work)
New to the Organization
Totally unfamiliar with the systems
Folds under Pressure
Hot Head
Characteristics of a Good Team
Ideal Characteristics:
Confident
Trusted by Peers
Dedicated – A company person
Willing to fix problems created by others
Characteristics to Avoid:
Lacks sense of Urgency
Tendency to blame others
Excuses, Excuses, Excuses
Totally unfamiliar with the systems
Pure 9-to-5er
First one out the Door
Nowhere to be found
Roles & Responsibilities
Key Steps to DR Planning
Roles
– Educate staff on their roles in the DR plan
– Clearly state expectations in a disaster situation
– Who is in Charge
– Pre-define methods you will utilize to contact staff
Types of Teams
IT Management Team
Executive Management Team
Damage Assessment Team
Media Relations Team
Recovery Management Team
Technical Recovery Team
Platform Recovery Team:
iSeries Recovery
pSeries Recovery
Intel Server Recovery
Unix/Linux Recovery
Network Recovery
Applications Team
Security
Insurance Recovery Team
Site Restoration
Facilities Build Team
Types of Teams
The Role of Executives
Executives are not typically involved and should NOT be. Why:
Executive Pressure is hindering
Intimidating
Often Nasty
Glorified Experts
The team must provide regular status reports & the Executives should be accessible if required.
Team Building
IT Recovery Team:
Initial Assessment
Facilities Recovery and Restoration (Hardware Specific?)
Communications Recovery Teams (voice & data)
Data Processing Functional Recovery
Vital Records (Off site storage)
IT Recovery Team Role
Recovery includes iSeries/400, and all mission critical Intel Servers, Applications, and Communications.
Responsible for initiating damage assessment, recovery actions, and notification procedures until such time as one of the Senior Executives is available.
All reporting will flow to the IT Management Team.
IT Team Leader Role
This individual needs to be technically skilled
Have a strong background in all the server hardware, software and complete IT infrastructure.
Communicate with vendor technical reps and hardware engineers on performance issues and hardware problem resolution, and interface with the management team.
Be able to schedule and manage people.
Responsible for initiating damage assessment activities, recovery actions, notification procedures.
Disaster Recovery Teams
DISASTER RECOVERY
MANAGEMENT TEAM Pri: ________________ Alt:
ERP APPLICATION RECOVERY TEAM
Pri:
JDE APPLICATION RECOVERY TEAM
Pri: __________________ Alt:
HELP DESK TEAM Pri: __________________ Alt:
Administration - HUMAN RESOURCES - CLAIMS - ADMINISTRATION - INSURANCE - REGULATORY
Unix
Pri:______ Alt:______
AS/400
Pri:______ Alt:______
AIX Pri:______ Alt:________
Network LAN
Pri:______ Alt:_______
Selecting Plan Manager
Designate a DR Plan Manager (DR Coordinator) to manage the DR initiative.
The DR Plan Manager acts as a focal point for the project.
Organize, plan, and facilitate the development of DR plan based on the prioritization from the Business.
Comply with standards and utilize the methodology for recovery plan development, maintenance and testing.
Plan Manager Activities
Provide DR Plan maintenance activities.
Air Travel Arrangements and Hotel
Tape Media Arrangements / Air or Ground Cargo
Assist in detailed damage assessment and insurance.
Coordinate with HR to provide counseling for staff or family.
Coordinate food and sleeping arrangements
Coordinate testing Activities
The role of the Plan Manager during a Technical test is to:
Manage the conduct of the test.
Develop Table Top Scenarios.
Ensure that each objective is fully realized.
Ensure that each test participant follows the procedures.
Record problems and their resolutions as they arise.
Record the duration of each of the procedures.
Liaise with the Hot Site staff.
Plan Manager Activities
The Plan Manager is also responsible for writing the summary report for the test.
Review the objectives of the technical test.
Summarize the changes for the DR Plan and distribute.
Summarize any recommendations resulting from the test.
Post Test meeting with Participants.
State the schedule for the next test.
Plan Manager Activities
How Disasters Affect Staff & the Caring
Regional Disaster Elements
Recovery is only possible if someone is available to put IT back together again.
Equipment may be accessible, but your recovery will be ineffective if your IT staff cannot access the recovery site.
Key Personnel are often displaced or unavailable during a major regional disaster
Disasters affect people in unpredictable ways
It can devastate people
Can affect others around them
Make the individual unable to function
Emotional breakdown
Respect the situation
The effects are real
Best Practices During a Disaster
Alleviate recovery team workloads with support staff
Organize communications so that team members have only one person to report to
Eliminate one-on-one updates
Let your staff do what they do best
Executives make strategic decisions
Managers coordinate staff & resources
Technicians fix the problems
Ensure Team Leaders are sensitive to team members' personal needs
Ensure team members' families are cared for
Locate missing team members and notify others of their whereabouts
Provide support services to the families of injured staff
Broadcast all positive accomplishments
Best Practices During a Disaster
Feed your staff - Have snacks handy
Stay away from High Energy Drinks
Enforce no shift duration to exceed 12 hours
Encourage people to take breaks
Provide distractions during breaks
Best Practices During a Disaster
Family Comes First
Family comes first. Always!!
Until the basic personal needs are met:
Staff members will not focus on the Enterprise Recovery
Staff members may not or will not be available
Family Comes First
Recovery Plans must provide for family needs as well as staff members. Offer the basic needs.
Ensure your organization demonstrates they care about the Recovery members and their families
Offer temporary Shelter and food
Medical care
Housing
Day Care
Transportation
Home Disaster Plan
EMERGENCY SUPPLY KIT
• Supplies to include in any emergency kit:
– Water
– Food
– Battery-powered radio and extra batteries
– Flashlight and extra batteries
– First Aid kit
– Whistle to signal for help
– Dust or filter masks
– Moist towelettes for sanitation
– Wrench or pliers to turn off utilities
– Can opener for food (if kit contains canned food)
– Plastic sheeting and duct tape to "seal the room"
– Garbage bags and plastic ties for personal sanitation
Disaster Recovery Planning
Recovery Plan Format
What Should you use to write the Plan???
Is the trustworthy word processing software enough?
Most convenient
No training required
DRP Software Planning Tools
How big is too big?
Planning Methodology
•Identify exposures
•Provide alternatives
•Define recovery strategy
•Develop solutions
•Document
Customer Environment Analysis
Business Impact Analysis
Analyze/Validate
The purpose is to ensure IT can provide services required to meet the business objectives in the event of a disaster
Business Analysis: Data Gathering
IT Analysis: Data Gathering
Plan, Test, Maintenance, Relocation
Cannot be approached casually
The Plan must be ....
– Well organized
– Action Oriented
– Comprehensive
Objective: Total restoration of Services in a timely manner
Disaster Recovery Planning
Develop and Implement the DRP
Disaster Recovery Planning Design Concerns
Minimize dependency on specific individuals
Ensure completeness
Ensure establishment of critical decisions
Minimize dependency on specific outside entities
Ensure the plan is current - Living Document
Signoff
Information Gathering Sample
1) Hardware configurations of all servers.
2) Software running on all equipment.
3) For each system: IP address and system name.
4) Backup procedures, rotations, tape naming convention
5) Vendor list supporting equipment.
6) Restoration procedures for system(s) if different.
7) Location of all required software to recover server(s).
8) Supporting network hardware.
Plan Elements
Plan Contents:
• Mission statement
• Disaster definition
• Team responsibilities
• Contact information
• Critical documentation
• Unique procedures
• Recovery site inventory
• Backup/recovery process
• Implementation plan
• Test plan
• Maintenance
• Relocation/migration plan
Mission Statement
The Disaster Recovery Plan has been developed to recover critical computer based applications and services within 2 days of a full scale disaster.
Carrier Services
Relocate Operations
Hotsite
WAN
Plan Elements
Executive Summary and Table of Contents
Scope & Assumptions & Definitions
Emergency and Notification Procedures
Disaster Declaration
Recovery & Resumption Site Procedures
Voice and Data Communications Requirements
Scope of the DR Plan
The DR Plan will provide immediate response to, and subsequent recovery from, any unplanned computing service interruption, such as a critical server failure, or a catastrophic event such as a loss of facility.
Provide an organized and consolidated approach to managing response and recovery activities.
Recover essential operations in a timely manner.
DR Plan Assumptions
Only the primary site has been disabled by the disruption, all other facilities are unaffected.
The Off-site storage location for critical backup files is accessible.
Qualified personnel as identified in this document are available to perform Disaster Recovery responsibilities.
Plan Elements
People assignments, responsibilities & training
Site: Selection and environment preparation
Vital records: Inventory and Backup
Software Systems: Inventory and Backup
Application Systems: Inventory and Backup
OS/400 Recovery Sample
Licensed Internal Code Restore
Licensed Internal Code Restore at Hotsite
Building your Disk Configuration at the Hotsite
Licensed Internal Code Restore at Data Center
Building your Disk Configuration at the Data Center
Restoring the Operating System
Recovering the BRMS Product
Initialize BRMS Device
User profiles
Devices
NONSYS
IBM LICPGM
Plan Elements
Hardware Inventory, Agreements, Documentation
Communications, Current, Backup
Transportation: Emergency Requirements
Supplies: Critical items - Vendors
Documentation: Inventory & Off-site Backup
Other Equipment
Vendor Contracts, Etc.
Test Plans
Plan Elements
Appendix
Vendor Agreements & Contracts
Telephone calling tree
Team notification procedures
Recovery and resumption sites, addresses, telephone numbers
Call Tree Information
Name
Title
Address (Street address, not post office box number)
Office telephone number
Home telephone number
Pager number, if available
Cellular telephone number, if available
Personal Email
Alternate telephone number
Blackberry or PDA
FAX
Determine Personnel Status
Plan Activation Procedures
First Alert Response
Disaster verification
Placing Hotsite on Alert
Activate Damage Assessment team
Command Center
Hot-Site Call up Procedures
Team Responsibilities During a Disaster
Plan Activation
Declaring a Disaster
Disaster Declaration Personnel
Directions to primary Hotsite
Directions to Alternate Hotsite
Travel Information
Recalling Offsite Tape
Hot-Site Opening
High Availability Role – Swap
Hot-Site Activation
Conclusion
•Identify exposures
•Provide alternatives
•Define recovery strategy
•Develop solutions
•Document
Customer Environment Analysis
Business Impact Analysis
Analyze/Validate
The purpose is to ensure I/S can provide services required to meet the business objectives in the event of a disaster
Business Analysis: Data Gathering
I/S Analysis: Data Gathering
Plan, Test, Maintenance, Relocation
Testing
Key Steps to DR Planning
Test Your Plan
– Test your stated recovery scenarios
– Test your restoration capabilities
– Train your staff for response
– Validate all assumptions
– Timeline validation
– Document required changes to your solutions and business needs
Implementation Review
Review Documentation:
Organization
Consistency
Is it Clear?
Staffing
Lack of Documentation (Missing)
Validation Exercise
1. Does the plan meet the Recovery Time Objectives?
2. What is your RPO?
3. Has anything changed?
Passive Testing - Format
Participants bring their copy of the DR plan
Plan Manager reviews objectives of the test
Plan Manager starts the test
The Recovery team discusses the scenario
Recovery team executes the Recovery tasks
Key Objectives of a Passive Test
Validate Completeness and Accuracy of the Plan
– Team Organization– Call lists– Checklist for all team members– Contacts - Internal & External– Recovery Resources
Passive Testing
The exercise should:
State the objectives of the walk through
List the participants
Select a scenario relative to your company
Include the scenario in definition handouts
Summarize the changes for the Computer Contingency Plan and schedule for their completion
Passive Testing
Reduce the team !!!
Reduce the Recovery team by 20%.
Examine Logistics
Put the logistics into action
Can we make it happen as written???
Telephoning vendors after normal business hours to ensure that their hotline and service numbers are correct and manned
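One way to apply the "reduce the Recovery team by 20%" idea in a passive test is to pick the absent members at random, so the exercise cannot be planned around known absences. This is a minimal sketch, assuming random selection is acceptable for the exercise; the function name, the seed, and the roster of ten roles are hypothetical.

```python
import random


def reduce_team(members, percent=20, seed=None):
    """Randomly mark `percent` of the team (rounded up) as unavailable.

    Returns (present, absent). A seed makes the draw repeatable.
    """
    rng = random.Random(seed)
    n_absent = -(-len(members) * percent // 100)  # ceiling division
    absent = set(rng.sample(members, n_absent))
    present = [m for m in members if m not in absent]
    return present, sorted(absent)


# Hypothetical roster of ten recovery roles, for illustration only.
team = [
    "Plan Manager", "IT Team Leader", "iSeries Recovery", "pSeries Recovery",
    "Intel Server Recovery", "Unix/Linux Recovery", "Network Recovery",
    "Applications", "Security", "Help Desk",
]

present, absent = reduce_team(team, percent=20, seed=1)
print("Unavailable for this exercise:", ", ".join(absent))
print("Roles that must be covered by:", ", ".join(present))
```

Running the draw with a different seed for each exercise varies which roles have to be covered by alternates.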
Simulation Exercises
Active Testing
A hands-on exercise focused on determining how well the plan works.
Tests how heavily the Business is impacted in a disaster.
Simulate Reality
Do not endanger your primary source of revenue.
Testing must not affect normal day to day business.
Active Testing
It’s not all bad news when your plan fails during a test…
…frequent testing identifies gaps in your recovery plan
– Hot-site (Alternate Site)
– Actual Reload (Full scale, single system)
– Communications switch
– Assessment of Backup & Recovery procedures
Active Testing
Introducing Murphy
Does this sound familiar ?
Testing only done from full backups
Special backups performed to ensure success
Backup tapes pre-shipped
Same staff perform recovery steps – Alternates back running the shop
Never test the whole thing – Communications, All servers in Enterprise
Tips
Test because your Business depends on it.
Pre-arrange for serial-dependent software keys
Ensure software & PTFs are current
Test a mid-week scenario
Install latest Backup/Recovery group PTFs
Keep your hot-site aware of your hardware profile
Know where your LIC CD is located
Have you performed a FULL save since the last upgrade?
Successful Testing Requires Teamwork
“ The desire to recover is High. The time to test frequently is not there.
Testing is a partnership and ONLY successful
when everyone works together.”
Thank You
Questions
Richard Dolewski, CDRP
VP Business Continuity
Tel. 1-206-436-3321
www.wts.com