PPD Computing “Business Continuity” David Kelsey 3 May 2012


Page 1

PPD Computing “Business Continuity”

David Kelsey
3 May 2012

Page 2

Kelsey, PPD IT continuity 2

The RAL electrical work and risks

• SSE will replace two old HV switch-boards in RAL main sub-station
  – Will take ~6 months from mid May 2012
• Normally we have two 132 kV supplies and 11 kV transformers
  – One is sufficient to power RAL so we have a live spare
• During the work
  – Only one transformer is live
  – If that fails we have no fast failover
  – But no digging allowed near the underground cables from Harwell
• Estimated time for SSE to patch to second supply is <48 hours
• Increased risk of power outages during this period
  – Increased risk is difficult to quantify
• Bottom line
  – Need to plan for short breaks in electrical power and possibly up to ~48 hours

03/05/2012

Page 3

PPD Business Continuity planning

• PPD has a Business Continuity Plan
  – Started with the Y2K problem
  – And Disaster Recovery plan
  – This is good practice and useful anyway
    • E.g. What do we do if R1 burns down?
    • Or RAL is closed for other reasons?
• As part of this plan
  – PPD Computing Group has plans
  – for different time-scales
    • 1-2 days; ~1 week; several weeks or more
• This is a good time to review and revise the plans!

Page 4

If RAL power is off …

Services UP (generators)

• Core network
  – Parts of R26, parts of R89
  – Off-site connections
    • (JANET and DL)
• CLRC Windows Domain
• Exchange mail servers
• VPN? (not yet sure?)
• Also failover of some services to DL (e.g. Exchange servers)
  – We can VPN in to DL to access SSC services (from home)
• Central STFC web server
  – For advice about RAL status

Most Services are DOWN

• Telephones
  – Landlines, Vodafone mast
• Access control & gates
• Fire Alarms
• Catering
• Water pumps
• Many computer services
• Etc etc etc
• NO COFFEE :=(
• RAL WILL BE SHUT!
  – Access only for small number of authorised staff

Page 5

What will be down in PPD (R1)?

• R1 will have no power
• We (Computing Group) will not be here!
  – Unless coming in to retrieve machines and/or backups
• Machine rooms will be down (we have no generators)
• No PPD Windows or Linux servers (including file servers)
  – No H drive, No T drive, etc.
  – No web servers
• PPD Windows domain will be down
• No network
• No printers
• No Scientific Computing Tier 2/3 compute service
• No dCache service – no access to scientific data
• No video conferencing
• Pointsec recovery will be unavailable

Page 6

What is computing group doing?

• Identifying those things that can be done now in advance
  – E.g. Check and test configuration of our UPS units (for orderly shutdown)
• We will provide best efforts support to keep PPD working from homes or other institutes
  – But without PPD compute servers being up
• Make changes in advance to help make laptops useable from elsewhere while PPD is down
  – E.g. Sophos (Windows) already reconfigured to failover to Sophos site for updates
• Provide documentation in advance
  – How to re-configure devices
    • Windows security updates etc
  – Advice on failover to Exchange at DL
  – Etc
  – To be automatically copied to laptops

Page 7

What should PPD groups do?

• We (CG) cannot make IT service plans for individuals or groups
• Develop your own Business Continuity Plan
  – Only you know which services are critical
• Establish communication means with all members of your group
  – Phone, non-STFC email
• Plan for lack of PPD computing services
  – Mission-critical software, data, computer power
    • E.g. just before conferences!
• Access to high-speed networking, videoconferencing, printing, web services not available
  – Negotiate alternative work locations for staff
• This is all part of the wider PPD Business Continuity Plan

Page 8

What do individuals need to do?

• Have access to a laptop (or home PC)
• Have a copy of all important files (H and T drives)
  – E.g. via Windows Offline Files
  – or rsync copy on Macs
  – And paper files from your office!
• Have current documentation and contact details
• For regular PPD Tier 3 analysis users
  – Make a plan
    • What data do you need? How much CPU?
    • Can you submit elsewhere? (the Grid or CERN or Amazon?)
  – Do not leave everything until the very last minute :=)
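
As an illustration of keeping a local copy of important files, here is a minimal Python sketch of a one-way copy in the spirit of the rsync suggestion above. It is not rsync and not a Computing Group tool: the function name mirror_tree, the mount point /Volumes/H and the destination folder ppd-offline are all assumptions made for the example, and it only adds or refreshes files (it never deletes anything from the local copy).

```python
import shutil
from pathlib import Path


def mirror_tree(src: Path, dst: Path) -> list[Path]:
    """Copy files from src to dst that are missing or older in dst.

    Returns the relative paths that were copied. Files deleted from src
    are NOT removed from dst, so this is a one-way, additive copy.
    """
    copied = []
    for source_file in sorted(src.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(src)
        target = dst / relative
        # Copy when the target is missing or the source was modified later.
        if (not target.exists()
                or source_file.stat().st_mtime > target.stat().st_mtime):
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source_file, target)  # copy2 preserves timestamps
            copied.append(relative)
    return copied


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever the H drive is mounted
    # and where the local copy should live.
    h_drive = Path("/Volumes/H")
    local_copy = Path.home() / "ppd-offline" / "H"
    if h_drive.is_dir():
        for path in mirror_tree(h_drive, local_copy):
            print("copied", path)
    else:
        print("H drive not mounted; nothing copied")
```

Running it again after a successful copy should copy nothing, since only missing or newer files are transferred. It has to be run while the file servers are still up, which is the point of doing it in advance.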

Page 9

Communication

• Cascade: STFC senior management -> Director -> Div Heads -> Group leaders -> all staff
• Collect and store important contact details
  – Phone numbers
  – Non-STFC email addresses
  – Contact details for Computing Group
  – And not just kept on the PPD file server!

Page 10

PPD IT Forum

• A meeting of the “PPD IT Forum” (i.e. All Staff and Visitors welcome!) planned for
  – Thursday 17th May 2012
  – CR03 R61
  – 11:00 to 12:30
• To present more details and discuss issues and concerns
• Please come!
