
Business Continuation Strategies: How to Take a Licking and Keep On Ticking

Martha Fateman
Director, Central Computing Services

University of California, Berkeley

August 5, 2002

Copyright © 2002, The Regents of the University of California. Permission is granted for this material to be shared for non-commercial, educational purposes, provided that this copyright statement appears on the reproduced materials.

Up a Creek?

Business Continuity to the Rescue

We All Know

• We need a business continuity plan
• Most of us don’t have one

Obstacles

• Lack of staff time
• Lack of funding
• Lack of interest
• Belief you have to do it all . . . and do it right
• Lack of knowledge about what it is

Business Continuity Planning

Definition:

Advance planning and preparation needed to minimize loss and ensure the continuity of critical business functions

You Can Make Progress

• Know what it is
• Know your risks
• Divide it up
• Take advantage of opportunities
• Keep piecing it together

How Berkeley Got Started

Changes and near disasters = opportunities

• 1991 Oakland Hills Fire
• 1995 New Director
• 1997 New Chancellor
• Multiple minor disasters
• External Auditors

How Can You Get Started?

• Know your risks
• Find opportunities
• Follow your interests

Business Continuation Steps

1. Risk Assessment

2. Prevention and Mitigation

3. Emergency Response

4. Business Resumption Plan

Business Continuation Steps

1. Risk Assessment

2. Prevention and Mitigation

3. Emergency Response

4. Business Resumption Plan

Identify Vulnerabilities

Identify vulnerabilities:
• Specific to your region of the country
• Specific to your city and campus
• Specific to your building
• Specific to your facility within the building

Your Region of the Country

The Hayward Fault runs through the Berkeley campus

• Earthquakes
• Hurricanes
• Floods
• Tornadoes
• Volcanoes

Your City and Campus

• Terrorism
• Disruptive demonstrations
• Strikes
• Bomb threats
• Evacuations (false fire alarms)

Your Building

• Level of public or campus traffic
• Age of building and campus infrastructure
• Facility design
• Proximity to
– Construction
– Research labs
– Animal care facilities
– Embassies or federal buildings

Nearby Construction

More Nearby Construction

Your Facility

• Water above?
• Redundancy
– Power, cooling
• Placement & separation of building systems
– Water supply
– Steam pipes
• Maintenance level of building/utility systems

Routine Power Failures

Building Design

Transformers close to water pipes, both adjacent to computer room

Risk Assessment – How To

• Self evaluation
• Professional evaluation

Self Assessment

• Use common sense
• Review campus experience
• Use IBM’s free online Safe Site Test

https://www-1.ibm.com/services/continuity/recover2.nsf/forms/safe+site+test

Berkeley’s IBM Score

On IBM’s Scale 0-40:

U.C. Berkeley Scored 60

– Retrofit the facility to mitigate problems
– Move to a less dangerous environment

Berkeley’s Outside Assessment

• Worse than we thought
• Improvements a waste
• Move!

Impact of Assessment

Independent Report:

• A dose of reality
• Clarified our thinking about the worst case
• Road map for mitigation
• Got campus management attention
– funding for improvements
– a new facility

The Power of “Life-Safety”

• Funding for improvements came rapidly
• New facility planned

New Building

Business Continuation Steps

1. Risk Assessment
2. Prevention and Mitigation
3. Emergency Response
4. Business Resumption Plan

An Ounce of Prevention

Common precautions taken in computer rooms:

– Fire detection systems
– Fire suppression systems
– Temperature measurement/control
– Water intrusion detectors
– Emergency power

$500,000 Went a Long Way

Berkeley Life-Safety Measures:

• Secured overhead lighting fixtures
• Improved emergency lighting & exit signs
• Purchased emergency supplies
• Installed diagonal bracing under machine room floor and removed old wiring
• Installed and anchored machine racks

Mitigating Berkeley Risks

• Fan-blower
• Portable generator
• Portable lighting
• Well-used sandbags

Safety Routines

• Safety routines
• Facilities response

Business Continuation Steps

1. Risk Assessment

2. Prevention and Mitigation

3. Emergency Response

4. Business Resumption Plan

Mission of Emergency Response

• Protection of life
• Assessment of damages
• Restoration of general campus operations

Handle the First Hours/Days

• Use a model
• Develop your plan
• Practice your plan
• Have a plan for communications

How Berkeley Got Started

Oakland Hills Fire of 1991 was an opportunity that led to:
• 1996 State law on emergency management
• 1997 Campus emergency response planner
• 1998 First campus disaster exercise

Use a Model

Berkeley uses:

Standardized Emergency Response System based on Incident Command System used by fire fighters and similar to military model

Campus-wide Planning

Chancellor’s Cabinet & Policy Group

Emergency Operations Center

• Police Department
• Physical Plant
• Environmental Health & Safety
• Housing & Dining
• Capital Projects
• Health Services
• Information Systems & Technology

Departmental Operations Centers

Replicate the Structure

Emergency Operations Center

Departmental Operations Center

Operations

Planning

Resources

Finance

During an Emergency

• Assess the damage
• Control the damage
• Make decisions
• Assess again

Develop Your Plan

Each Departmental Operations Center has its own emergency response plan

• Pre-assigned meeting place
• Pre-determined priorities
• Written checklists for the unit’s response to emergencies
• At least two people assigned to each task
• Call lists for team members, vendors, other needed contacts
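
The "at least two people per task" rule is easy to check mechanically once the plan is written down. The following is a minimal sketch, assuming the plan is stored as a hypothetical mapping of task names to assigned staff; the task names, staff names, and the `find_understaffed_tasks` helper are invented for illustration and are not part of Berkeley's actual plan.

```python
# Hypothetical illustration: verify that every task in a departmental
# emergency response plan has at least two distinct people assigned.
# Task and staff names below are invented for the example.

MIN_ASSIGNEES = 2  # "At least two people assigned to each task"


def find_understaffed_tasks(assignments, minimum=MIN_ASSIGNEES):
    """Return task names with fewer than `minimum` distinct assignees, sorted."""
    return sorted(
        task
        for task, people in assignments.items()
        if len(set(people)) < minimum
    )


plan = {
    "Conduct roll call": ["A. Lee", "B. Ortiz"],
    "Report staff status": ["C. Patel"],
    "Track labor hours": ["D. Kim", "D. Kim"],  # same person listed twice
}

for task in find_understaffed_tasks(plan):
    print(f"Needs a backup assignee: {task}")
```

Counting distinct names rather than list entries catches the case where one person is listed twice for the same task.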

Sample Checklist – MVS Services

1. Report to the IS&T Secondary DOC.
2. Conduct a roll call of all recovery staff. Determine who is available for emergency operations.
3. Provide staff status report to the Operations Team Leader – indicate staff who are injured or missing.
4. Track the progress of staff. Ensure each team provides updates to the Secondary DOC every 30 minutes.
5. If systems are non-recoverable, advise the Emergency Operations Team Leader. Meet and determine alternatives or immediate solutions, e.g. procurement of replacements, manual systems, etc.
6. Relay information and situation updates to the Operations Team Leader.
7. Receive emergency assignments for recovery teams from the Operations Team Leader – per the direction of the Primary DOC and the Campus EOC. Assign recovery resources as needed, and report progress and updates to the Operations Team Leader.
8. Keep track of staff, labor hours, location worked, and equipment and supplies used (or have this done at the supervisor level). Provide reports at the end of each shift to the Operations Team Leader for documentation.
9. Check with the Operations Team Leader regarding the EOC’s plan for providing food, water and rest areas for staff.
10. As resources – equipment and supplies – are used, advise the Operations Team Leader, who will relay the request to the Secondary DOC Manager. The Secondary DOC Manager will request additional supplies and equipment through the Primary DOC.
11. When de-activated, ensure time and materials records are completed and forwarded to the Operations Team Leader.

CCS System Restoration Priority

Group 1: Affects major campus-wide operations, causes severe disruptions to the campus

System Name | Sys Admin | Hardware | Power Source | UPS / Type | System Purpose
Active Directory | SDA | | Liebert 1 | Y/A | Windows 2000 Active Directory.
Actdir1 | | 1 Dell PowerEdge 2550 | | |
Actdir2 | | 1 Dell PowerEdge 2550 | | |
Actdir3 | | 1 Dell PowerEdge 2550 | | |
Actdir4 | | 1 Dell PowerEdge 2550 | | |
Arachne | ACS | 1 Sun E450 | Liebert 1 | Y/A | Campus Home Page, Web Server, Schedule of Classes, Course Descriptions, Job Vacancy Listings, Deans and Directors memos.

Berkeley Restart Priority List
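
A restart priority list of this kind can also be kept as structured data, so that the restart order is derived rather than remembered. The sketch below is illustrative only: the `System` record, the `restart_order` helper, and the "Test Server" entry with its Group 3 rating are assumptions made for the example, not Berkeley's actual list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    priority_group: int  # 1 = restore first
    purpose: str


def restart_order(systems):
    """Sort by priority group, then by name for a predictable order."""
    return sorted(systems, key=lambda s: (s.priority_group, s.name.lower()))


systems = [
    System("Test Server", 3, "Hypothetical low-priority system"),
    System("Arachne", 1, "Campus home page and web server"),
    System("Active Directory", 1, "Windows 2000 Active Directory"),
]

for position, system in enumerate(restart_order(systems), start=1):
    print(f"{position}. [Group {system.priority_group}] {system.name}")
```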

Practice Your Plans

The Importance of Practice

July 1, 1999

What Went Right

• We had the right equipment
• We had the plan
• We had practiced the plan
• Everyone knew what to do

Emergency Shopping List

The campus data center stocks:
– Sandbags, wet-dry vacs, and fans
– Crowbars, shovels, flashlights & batteries
– Blankets and first aid kits
– Cell phones and radio equipment
– Fanny packs with food & water for 3 days
– Emergency operations center tent & office supplies

Fashion Accessories

• Radio
• Cell phone
• Palm Pilot
• Flashlight
• Emergency Plan
• Fanny pack
• Sun block

Business Continuation

1. Risk Assessment

2. Prevention and Mitigation

3. Emergency Response

4. Business Resumption Plan

Business Resumption Plan

The Business Resumption Plan is your formal plan and written procedures for restoring IT operations, so that you can support the recovery and resumption of business on your campus.

Business Resumption Plan

You should start the Business Resumption Plan only after, or at a minimum in parallel with,
• Mitigation
• Emergency Response

What’s at Stake?

[Pie chart. Labels: Never resume business, Fail in 2 years, Succeed. Slices: 40%, 40%, 20%]

What Does It Take?

IBM estimates:

• Time – At least 22 months
• Cost – 2% of annual IT spending

To Start or Not to Start

My early thoughts – WAIT

• For applications staff
• For business partners

If Not Now, When?

My thoughts now –

• Do what you can
• Do the systems work first

DO SOMETHING

Just Like Any Other Project

• Control the scope
• Manage expectations and educate the stakeholders
• Manage budget

What Berkeley Is Doing

Not a textbook model

• Scale to suit budget
• Adjust for academia
• Take advantage of your opportunities

Scope limited by budget

Business Resumption Plan

1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan

Business Resumption Plan

1. Obtain Management Support

2. Perform Risk Assessment

3. Conduct Business Impact Analysis

4. Select a Recovery Strategy

5. Develop the Plan

6. Test and Train

7. Maintain the Plan

Berkeley Seismic Evaluation

Seismic Evaluation of Berkeley campus

7 buildings rated very poor

50 buildings rated poor, including the data center

27% of the main campus assignable square footage (ASF) newly identified as poor or very poor

$700M and 20 years to fix

The Chancellor’s Response

In addition to an aggressive building upgrade program… established committees to identify risks, suggest mitigations and recovery strategies for:

Classrooms Research Utilities Infrastructure Business Operations

Management Follow-Up

SAFER
• Recommended specific areas for remediation
• Established an Oversight Committee to follow up on recommendations
• Established a central campus Business Resumption Group

Business Resumption Plan

1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan

Business Impact Analysis

• Estimates the costs of losing business
• Establishes priorities for recovery
• Determines
– Recovery Time Objectives (RTO)
– Recovery Point Objectives (RPO)
• Drives selection of the recovery strategy
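
To make the two objectives concrete: the RPO bounds how much data you can afford to lose, so backups must be at least that frequent, and the RTO bounds how long a service may stay down. A minimal sketch follows, assuming hypothetical numbers and a hypothetical `meets_objectives` helper; none of the figures come from Berkeley's analysis.

```python
from datetime import timedelta


def meets_objectives(backup_interval, estimated_restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for one system.

    Worst-case data loss is roughly the time between backups, so the
    backup interval must not exceed the RPO. The estimated restore time
    must not exceed the RTO.
    """
    return backup_interval <= rpo, estimated_restore_time <= rto


# Hypothetical example: nightly backups, restore estimated at 3 days,
# checked against a 1-day RPO and a 7-day RTO.
rpo_ok, rto_ok = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_restore_time=timedelta(days=3),
    rpo=timedelta(days=1),
    rto=timedelta(days=7),
)
print(f"RPO met: {rpo_ok}, RTO met: {rto_ok}")
```

In practice the effective backup interval should also account for how often media actually leave the building, since a backup that is destroyed along with the data center does not count.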

Business and Service Units Defined Needs

• Identified campus business functions
• Determined the critical recovery periods
• Established priorities for resumption
• Assigned lead department to functions

IT is Critical to Recovery

• Priority 1 function linked to applications
• Supporting hardware
• Supporting software

Inventory for IT Recovery

Host | Platform | OS | CPUs | Memory | Storage | Applications | Data Bases
MAINFRAME | IBM 9672 R46 x 1 | OS390 2.10 | 4 | 4 GB | 1.2 TB and 2 STK silos, 10K slots, plus 12 x 9490 tape drives | Misc. Stu & Fin Systems | DB2 6.1 and CICS 4.1
GRAD | Sun 250 | Solaris 2.6 | 2 x 300 MHz UltraSPARC II | 512 MB | 88 GB | Grad Admissions | Oracle 8.0.5
SIS450 | Sun 450 | Solaris 2.6 | 4 x 400 MHz UltraSPARC II | 2 GB | 70 GB | Stu Info Sys | na
VPS7500 | Periphonis VPS750 | Solaris 2.6 | 1 x 300 MHz UltraSPARC II | 256 MB | 18 GB | Phone Enrollment | na
CCS-SDA-MF5 | Dell 6300 | | 4 x 450 MHz Pentium II | 1 GB | 4 x 9 GB | Purchasing | na
SPO-A | Dell 4200 | | 2 x 266 MHz Pentium II | | 6 x 4 GB | Sponsored Projects | Oracle 7.3.3

Business Resumption Plan

1. Obtain Management Support

2. Perform Risk Assessment

3. Conduct Business Impact Analysis

4. Select a Recovery Strategy

5. Develop the Plan

6. Test and Train

7. Maintain the Plan

IT Recovery Strategies

1. Mirroring or duplexing

2. E-Vaulting

3. Vendor hot site

4. Mobile hot site

5. Cold Site

6. No recovery option

IT Recovery Times

RECOVERY OPTION | RECOVERY TIME
Mirroring | 30 minutes
E-Vaulting | 8-24 hours
Vendor Hot Site | 1-4 days
Mobile Hot Site | 5-8 days
Cold Site | 9-15 days
No recovery plan | 16+ days

Faster is (a lot) more expensive!

[Chart: Cost plotted against Recovery Time]

Berkeley’s Recovery Choice

“The object of the Plan is to restore Priority 1 systems within 7 days of a disaster”

Berkeley chose hot-site recovery for its medium timeframe, medium price, and the recovery expertise provided by the hot-site vendor
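
The selection logic can be sketched in a few lines. This is a simplified illustration, not Berkeley's actual decision process: it uses the worst-case recovery times listed earlier, assumes cost falls steadily as recovery time grows, and picks the slowest option that still fits the recovery time objective. The `slowest_option_within` helper is hypothetical, and factors such as vendor expertise are ignored.

```python
from datetime import timedelta

# Worst-case recovery times from the IT Recovery Times slide, ordered
# from fastest (assumed most expensive) to slowest (assumed cheapest).
RECOVERY_OPTIONS = [
    ("Mirroring", timedelta(minutes=30)),
    ("E-Vaulting", timedelta(hours=24)),
    ("Vendor Hot Site", timedelta(days=4)),
    ("Mobile Hot Site", timedelta(days=8)),
    ("Cold Site", timedelta(days=15)),
]


def slowest_option_within(rto):
    """Return the slowest option whose worst case still fits the RTO.

    Returns None if even the fastest option is too slow.
    """
    candidates = [name for name, worst_case in RECOVERY_OPTIONS if worst_case <= rto]
    return candidates[-1] if candidates else None


print(slowest_option_within(timedelta(days=7)))  # Vendor Hot Site
```

With a 7-day objective, the mobile hot site (up to 8 days) and cold site are ruled out, which leaves the vendor hot site as the least expensive option that qualifies under these assumptions.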

Business Resumption Plan

1. Obtain Management Support
2. Perform Risk Assessment
3. Conduct Business Impact Analysis
4. Select a Recovery Strategy
5. Develop the Plan
6. Test and Train
7. Maintain the Plan

Define Your Scope

Plan for Success

Wanted a Quick Win

Chose an application that was:
• On a single platform
• Used by another campus
• Already documented and tested at hotsite

Help Wanted

• A foreign language
• Part time or full time?
• Programmer or analyst?

Choosing a Consultant

• Gartner research
– Who are the major players?
– What should we look for?
• Existing contracts within University system
• Experience with the campus, its equipment

What Consultants Will Do

• Assess operations
• Review technical recovery procedures
• Train applications staff
• Facilitate plan development

[Bar chart, axis $0 to $300,000: Develop Procedures $242K; Review Procedures $20K]

What’s in the Business Resumption Plan?

• Overall
– Disaster declaration, responsibilities, call lists, vendor lists, restore sequence (part of Emergency Response)
• System recovery procedures
– Operating systems, utilities, backups, security, data bases
• Applications recovery procedures
– Application programs, data recovery, coordination with users
• Distribution & control mechanisms
– Control distribution & maintenance of the plan

Business Resumption Software?

• Software or text files?
• Computer center or whole campus?
• Single user or web system?

Cost is the underlying consideration

Business Resumption Plan

1. Obtain Management Support

2. Perform Risk Assessment

3. Conduct Business Impact Analysis

4. Select a Recovery Strategy

5. Develop the Plan

6. Test and Train

7. Maintain the Plan

Test, Revise, Test Again

• Do a hotsite test
• Revise the plan
• Train the staff

Distribution

• Multiple copies
• Safe copies

Business Resumption Plan

1. Obtain Management Support

2. Perform Risk Assessment

3. Conduct Business Impact Analysis

4. Select a Recovery Strategy

5. Develop the Plan

6. Test and Train

7. Maintain the Plan

The Road to Hell is paved with good intentions

• Incidence of recovery failure
• Procedures for maintenance
• Integration with change management

How to Take a Licking and Keep On Ticking

Final Thoughts

Changes How You Think

About
• Physical environment
• Complexity of IT solutions
• Value of vendor relationships

Face the Facts

• Are your tapes being sent off site often enough?
• Is everything you need for a full recovery in offsite storage?
• In a limited situation, do you have procedures for bare metal restores?
• Are your applications staff still thinking about recovery of batch systems?

Look for Opportunities to Make Progress

• Know the textbook steps
• Find ways to do what is important for your campus
• Start with what is easy for your campus
• Build confidence

Staff Response

• Someone realizes my work is valuable
• Someone wants to listen
• Someone wants to help
• Someone knows what to do

Staff Empowerment

• Training and Practice
– Build confidence
– Let more staff participate in the solution
• Think like an emergency team
– Apply structure to lesser emergencies
– Identify opportunities to improve

Don’t Go It Alone

• Get a project sponsor
• Get a campus group to decide business priorities
• Dedicate someone to work on this
• Hire consultants

Berkeley As a Model!