Business Continuation Strategies: How to Take a Licking and Keep On Ticking
Martha Fateman, Director, Central Computing Services
University of California, Berkeley
August 5, 2002
Copyright © 2002, The Regents of the University of California. Permission is granted for this material to be shared for non-commercial, educational purposes, provided that this copyright statement appears on the reproduced materials.
Obstacles
• Lack of staff time
• Lack of funding
• Lack of interest
• Belief you have to do it all … and do it right
• Lack of knowledge about what it is
Business Continuity Planning
Definition:
Advance planning and preparation needed to minimize loss and ensure the continuity of critical business functions
You Can Make Progress
• Know what it is
• Know your risks
• Divide it up
• Take advantage of opportunities
• Keep piecing it together
How Berkeley Got Started
Changes and near disasters = opportunities
• 1991 Oakland Hills Fire
• 1995 New Director
• 1997 New Chancellor
• Multiple minor disasters
• External Auditors
Business Continuation Steps
1. Risk Assessment
2. Prevention and Mitigation
3. Emergency Response
4. Business Resumption Plan
Business Continuation Steps
1. Risk Assessment
2. Prevention and Mitigation
3. Emergency Response
4. Business Resumption Plan
Identify Vulnerabilities
Identify vulnerabilities:
• Specific to your region of the country
• Specific to your city and campus
• Specific to your building
• Specific to your facility within the building
Your Region of the Country
The Hayward Fault runs through the Berkeley campus
• Earthquakes
• Hurricanes
• Floods
• Tornadoes
• Volcanoes
Your City and Campus
• Terrorism
• Disruptive demonstrations
• Strikes
• Bomb threats
• Evacuations (false fire alarms)
Your Building
• Level of public or campus traffic
• Age of building and campus infrastructure
• Facility design
• Proximity to
– Construction
– Research labs
– Animal care facilities
– Embassies or federal buildings
Your Facility
• Water above?
• Redundancy
– Power, cooling
• Placement & separation of building systems
– Water supply
– Steam pipes
• Maintenance level of building/utility systems
Self Assessment
• Use common sense
• Review campus experience
• Use IBM’s free online Safe Site Test
https://www-1.ibm.com/services/continuity/recover2.nsf/forms/safe+site+test
Berkeley’s IBM Score
On IBM’s Scale 0-40:
U.C. Berkeley Scored 60
– Retrofit the facility to mitigate problems
– Move to a less dangerous environment
Impact of Assessment
Independent Report:
• A dose of reality
• Clarified our thinking about the worst case
• Road map for mitigation
• Got campus management attention
– Funding for improvements
– A new facility
Business Continuation Steps
1. Risk Assessment
2. Prevention and Mitigation
3. Emergency Response
4. Business Resumption Plan
An Ounce of Prevention
Common precautions taken in computer rooms:
– Fire detection systems
– Fire suppression systems
– Temperature measurement/control
– Water intrusion detectors
– Emergency power
$500,000 Went a Long Way
Berkeley Life-Safety Measures:
• Secured overhead lighting fixtures
• Improved emergency lighting & exit signs
• Purchased emergency supplies
$500,000 Went a Long Way
Berkeley Life-Safety Measures:
• Installed diagonal bracing under machine room floor and removed old wiring
Business Continuation Steps
1. Risk Assessment
2. Prevention and Mitigation
3. Emergency Response
4. Business Resumption Plan
Mission of Emergency Response
• Protection of life
• Assessment of damages
• Restoration of general campus operations
Handle the First Hours/Days
• Use a model
• Develop your plan
• Practice your plan
• Have a plan for communications
How Berkeley Got Started
Oakland Hills Fire of 1991 was an opportunity that led to:
• 1996 State law on emergency management
• 1997 Campus emergency response planner
• 1998 First campus disaster exercise
Use a Model
Berkeley uses:
Standardized Emergency Response System, based on the Incident Command System used by firefighters and similar to the military model
Campus-wide Planning
Chancellor’s Cabinet & Policy Group
Emergency Operations Center
• Police Department
• Physical Plant
• Environmental Health & Safety
• Housing & Dining
• Capital Projects
• Health Services
• Information Systems & Technology
Departmental Operations Centers
Replicate the Structure
Emergency Operations Center
Departmental Operations Center
• Operations
• Planning
• Resources
• Finance
Develop Your Plan
Each Departmental Operations Center has its own emergency response plan
• Pre-assigned meeting place
• Pre-determined priorities
• Written checklists for the unit’s response to emergencies
• At least two people assigned to each task
• Call lists for team members, vendors, other needed contacts
Sample Checklist – MVS Services
1. Report to the IS&T Secondary DOC.
2. Conduct a roll call of all recovery staff. Determine who is available for emergency operations.
3. Provide staff status report to the Operations Team Leader – indicate staff who are injured or missing.
4. Track the progress of staff. Ensure each team provides updates to the Secondary DOC every 30 minutes.
5. If systems are non-recoverable, advise the Emergency Operations Team Leader. Meet and determine alternatives or immediate solutions, e.g. procurement of replacements, manual systems, etc.
6. Relay information and situation updates to the Operations Team Leader.
7. Receive emergency assignments for recovery teams from the Operations Team Leader – per the direction of the Primary DOC and the Campus EOC. Assign recovery resources as needed, and report progress and updates to the Operations Team Leader.
8. Keep track of staff, labor hours, location worked, and equipment and supplies used (or have this done at the supervisor level). Provide reports at the end of each shift to the Operations Team Leader for documentation.
9. Check with the Operations Team Leader regarding the EOC’s plan for providing food, water and rest areas for staff.
10. As resources – equipment and supplies – are used, advise the Operations Team Leader, who will relay the request to the Secondary DOC Manager. The Secondary DOC Manager will request additional supplies and equipment through the Primary DOC.
11. When de-activated, ensure time and materials records are completed and forwarded to the Operations Team Leader.
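Item 4 of the checklist (updates to the Secondary DOC every 30 minutes) is the kind of rule that is easy to track with a small script. The following is only a sketch under stated assumptions: the team names, timestamps, and the function name `overdue_teams` are hypothetical and do not come from Berkeley's plan.

```python
from datetime import datetime, timedelta

# Reporting interval taken from checklist item 4.
UPDATE_INTERVAL = timedelta(minutes=30)


def overdue_teams(last_update, now):
    """Return the sorted names of teams whose last update is older than the interval."""
    return sorted(
        team for team, stamp in last_update.items()
        if now - stamp > UPDATE_INTERVAL
    )


# Hypothetical team names and report times, for illustration only.
now = datetime(2002, 8, 5, 10, 0)
last_update = {
    "storage": datetime(2002, 8, 5, 9, 45),  # reported 15 minutes ago
    "network": datetime(2002, 8, 5, 9, 20),  # reported 40 minutes ago
}
print(overdue_teams(last_update, now))
```

A paper log works just as well during an outage; the point is that someone is explicitly responsible for noticing a missed update.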
CCS System Restoration Priority
Group 1: Affects major campus-wide operations; causes severe disruptions to the campus
System Name | Sys Admin | Hardware | Power Source | UPS / Type | System Purpose
Active Directory | SDA | | Liebert 1 | Y/A | Windows 2000 Active Directory.
Actdir1 | | 1 Dell PowerEdge 2550 | | |
Actdir2 | | 1 Dell PowerEdge 2550 | | |
Actdir3 | | 1 Dell PowerEdge 2550 | | |
Actdir4 | | 1 Dell PowerEdge 2550 | | |
Arachne | ACS | 1 Sun E450 | Liebert 1 | Y/A | Campus Home Page, Web Server, Schedule of Classes, Course Descriptions, Job Vacancy Listings, Deans and Directors memos.
Berkeley Restart Priority List
What Went Right
• We had the right equipment
• We had the plan
• We had practiced the plan
• Everyone knew what to do
Emergency Shopping List
The campus data center stocks:
– Sandbags, wet-dry vacs, and fans
– Crowbars, shovels, flashlights & batteries
– Blankets and first aid kits
– Cell phones & radio equipment
– Fanny packs with food & water for 3 days
– Emergency operations center tent & office supplies
Business Continuation
1. Risk Assessment
2. Prevention and Mitigation
3. Emergency Response
4. Business Resumption Plan
Business Resumption Plan
The Business Resumption Plan is your formal plan and written procedures to restore IT operations, so that you can support the recovery and resumption of business on your campus.
Business Resumption Plan
You should start the Business Resumption Plan only after, or at a minimum in parallel with:
• Mitigation
• Emergency Response
Just Like Any Other Project
• Control the scope
• Manage expectations and educate the stakeholders
• Manage budget
What Berkeley Is Doing
Not a textbook model
• Scale to suit budget
• Adjust for academia
• Take advantage of your opportunities
Scope limited by budget
Business Resumption Plan
1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan
Business Resumption Plan
1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan
Berkeley Seismic Evaluation
Seismic Evaluation of Berkeley campus
7 buildings rated very poor
50 buildings rated poor, including the data center
27% of the main campus ASF newly identified as poor or very poor
$700M and 20 years to fix
The Chancellor’s Response
In addition to an aggressive building upgrade program… established committees to identify risks, suggest mitigations and recovery strategies for:
• Classrooms
• Research
• Utilities Infrastructure
• Business Operations
Management Follow-Up
SAFER
• Recommended specific areas for remediation
• Established an Oversight Committee to follow up on recommendations
• Established a central campus Business Resumption Group
Business Resumption Plan
1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan
Business Impact Analysis
• Estimates the costs of losing business
• Establishes priorities for recovery
• Determines
– Recovery Time Objectives (RTO)
– Recovery Point Objectives (RPO)
• Drives selection of the recovery strategy
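The slide names the two objectives without defining them. In common usage, the RTO is the longest outage a function can tolerate, and the RPO is the oldest data it can afford to fall back to. A minimal sketch of checking a recovery setup against both follows; the class, the function name, and all numbers are hypothetical, and it assumes the worst-case data loss equals the interval between backups.

```python
from dataclasses import dataclass


@dataclass
class Objectives:
    rto_hours: float  # Recovery Time Objective: longest tolerable outage
    rpo_hours: float  # Recovery Point Objective: oldest tolerable data


def meets_objectives(obj, restore_hours, backup_interval_hours):
    """Compare a recovery setup with the objectives.

    Assumes the worst-case data loss equals the backup interval: a failure
    just before the next backup loses everything since the previous one.
    """
    return {
        "rto_met": restore_hours <= obj.rto_hours,
        "rpo_met": backup_interval_hours <= obj.rpo_hours,
    }


# Hypothetical function: 7-day RTO, 24-hour RPO, nightly backups, 4-day restore.
result = meets_objectives(
    Objectives(rto_hours=7 * 24, rpo_hours=24),
    restore_hours=4 * 24,
    backup_interval_hours=24,
)
print(result)
```

If either check fails, the business impact analysis is signalling that a faster (and costlier) recovery strategy or more frequent backups are needed.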
Business and Service Units Defined Needs
• Identified campus business functions
• Determined the critical recovery periods
• Established priorities for resumption
• Assigned lead department to functions
IT is Critical to Recovery
• Priority 1 function linked to applications
• Supporting hardware
• Supporting software
Inventory for IT Recovery
Host | Platform | OS | CPUs | Memory | Storage | Applications | Data Bases
MAINFRAME | IBM 9672 R46 x 1 | OS390 2.10 | 4 | 4 GB | 1.2 T and 2 STK silos 10K slots plus 12x 9490 tape drives | Misc. Stu & Fin Systems | DB2 6.1 and CICS 4.1
GRAD | Sun 250 | Solaris 2.6 | 2 x 300 MHz UltraSPARC II | 512 MB | 88 GB | Grad Admissions | Oracle 8.0.5
SIS450 | Sun 450 | Solaris 2.6 | 4 x 400 MHz UltraSPARC II | 2 GB | 70 GB | Stu Info Sys | na
VPS7500 | Periphonis VPS750 | Solaris 2.6 | 1 x 300 MHz UltraSPARC II | 256 MB | 18 GB | Phone Enrollment | na
CCS-SDA-MF5 | Dell 6300 | | 4 x 450 MHz Pentium II | 1 GB | 4 x 9 GB | Purchasing | na
SPO-A | Dell 4200 | | 2 x 266 MHz Pentium II | | 6 x 4 GB | Sponsored Projects | Oracle 7.3.3
Business Resumption Plan
1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan
IT Recovery Strategies
1. Mirroring or duplexing
2. E-Vaulting
3. Vendor hot site
4. Mobile hot site
5. Cold Site
6. No recovery option
IT Recovery Times
RECOVERY OPTION RECOVERY TIME
Mirroring 30 minutes
E-Vaulting 8-24 hours
Vendor Hot Site 1-4 days
Mobile Hot Site 5-8 days
Cold Site 9-15 days
No recovery plan 16+ days
Faster is (a lot) more expensive!
[Chart: Cost versus Recovery Time]
Berkeley’s Recovery Choice
“The object of the Plan is to restore Priority 1 systems within 7 days of a disaster”
Berkeley chose hot-site recovery for its medium timeframe, medium price, and the recovery expertise provided by the hot-site vendor
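The recovery-time table and the 7-day objective can be tied together with a small selection rule. This is an illustrative sketch, not Berkeley's actual decision process: it uses the upper bound of each range in the IT Recovery Times table, and it assumes that cost falls as recovery time rises, so the slowest option that still meets the objective is treated as the cheapest acceptable one. The function name is hypothetical.

```python
# Worst-case recovery times in days, from the upper bound of each range in the
# "IT Recovery Times" table. Listed fastest (assumed most expensive) first.
RECOVERY_OPTIONS = [
    ("Mirroring", 30 / (60 * 24)),  # 30 minutes
    ("E-Vaulting", 1.0),            # 8-24 hours
    ("Vendor Hot Site", 4.0),       # 1-4 days
    ("Mobile Hot Site", 8.0),       # 5-8 days
    ("Cold Site", 15.0),            # 9-15 days
]


def cheapest_option_meeting(rto_days):
    """Return the slowest option whose worst case fits within rto_days, or None."""
    acceptable = [name for name, worst in RECOVERY_OPTIONS if worst <= rto_days]
    return acceptable[-1] if acceptable else None


# Priority 1 systems must be restored within 7 days.
print(cheapest_option_meeting(7))  # prints "Vendor Hot Site"
```

A mobile hot site is excluded here only because its worst case of 8 days exceeds the 7-day objective; the slides also cite price and vendor expertise as reasons for the choice.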
Business Resumption Plan
1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan
Define Your Scope: Plan for Success
Wanted a Quick Win
Chose an application that was:
• On a single platform
• Used by another campus
• Already documented and tested at the hot site
Choosing a Consultant
• Gartner research
– Who are the major players?
– What should we look for?
• Existing contracts within University system
• Experience with the campus, its equipment
What Consultants Will Do
• Assess operations
• Review technical recovery procedures
• Train applications staff
• Facilitate plan development
[Bar chart, scale $0 to $300,000: Develop Procedures $242K; Review Procedures $20K]
What’s in the Business Resumption Plan?
• Overall
– Disaster declaration, responsibilities, call lists, vendor lists, restore sequence (part of Emergency Response)
• System recovery procedures
– Operating systems, utilities, backups, security, data bases
• Applications recovery procedures
– Application programs, data recovery, coordination with users
• Distribution & control mechanisms
– Control distribution & maintenance of the plan
Business Resumption Software?
• Software or text files?
• Computer center or whole campus?
• Single user or web system?
Cost is the underlying consideration
Business Resumption Plan
1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan
Business Resumption Plan
1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan
The Road to Hell is paved with good intentions
• Incidence of recovery failure
• Procedures for maintenance
• Integration with change management
Changes How You Think
About:
• Physical environment
• Complexity of IT solutions
• Value of vendor relationships
Face the Facts
• Are your tapes being sent off site often enough?
• Is everything you need for a full recovery in offsite storage?
• In a limited situation, do you have procedures for bare metal restores?
• Are your applications staff still thinking about recovery of batch systems?
Look for Opportunities to Make Progress
• Know the textbook steps
• Find ways to do what is important for your campus
• Start with what is easy for your campus
• Build confidence
Staff Response
• Someone realizes my work is valuable
• Someone wants to listen
• Someone wants to help
• Someone knows what to do
Staff Empowerment
• Training and Practice
– Build confidence
– Let more staff participate in the solution
• Think like an emergency team
– Apply structure to lesser emergencies
– Identify opportunities to improve
Don’t Go It Alone
• Get a project sponsor
• Get a campus group to decide business priorities
• Dedicate someone to work on this
• Hire consultants