
HA & DR Strategy

Giles Gamon of High-Availability.Com

Practical Approaches

July 2007


Business Continuity

A system of planning for, recovering and maintaining both the IT and business environments within an organisation regardless of the type of interruption. In addition to the IT infrastructure, it covers people, facilities, workplaces, equipment, business processes, and more.

Defining High-Availability

Provision of end-to-end access to a service and data without interruption
The elimination of all Single Points of Failure (SPoFs)
Objective - Zero/Near Zero downtime
  Includes handling scheduled downtime

Defining Disaster Recovery

The process of restoring and maintaining the data, equipment, applications and other technical resources on which a business depends

Response to complete loss of a facility
May include dealing with loss of key staff
Disaster may also affect alternate facilities that were assumed to be available

Achieving Business Continuity

Identification of threats to service
  Systems failures, human errors, sabotage, software bugs, acts of God, etc.
Management of risk
  Building in redundancy, taking backups, training staff, testing systems, active management solutions

Causes of Down Time

Source - IEEE


Causes - Disaster

Planning to cope with disasters is an important component of a High-Availability strategy
  Flood, fire, power grid failure, terrorism, etc.
Most ‘disasters’ are classified as environmental causes of downtime
  Collectively, environmental causes account for approximately 5% of downtime

Causes - Environmental

Power cuts and brownouts
  UPS & generator
    What do they power?
Cooling system errors
  Humidification regulation errors can cause hardware failures

Southampton University 2005


UK – Jan 2005 & June 2007


Causes – Hardware Failure

Probably the most recognised cause of downtime
Server failures
  Disk, CPU, internal cooling fans, memory faults, …
Network failures
  DNS, DHCP, router, ISP, switches, cables cut, …
Other
  Tape backup corruption, client hardware, …

Causes - Planned

Hardware upgrades
OS version upgrades
Software version upgrades
Data migration / transformation
Backups
Batch processing
Preventative maintenance
Testing

Causes – Human Factor

Failure to maintain
  File systems full
  Database tables full
  Patches for known bugs not applied
Accidents
  root# rm -rf / tmp/tempstuff   (note the stray space after /)
  Network misconfiguration
  Incorrect cable removed
Inexperience
  root# reboot
  Cleaner knocks cables out
Malice
  root# uadmin 1 5 or halt
  Physical sabotage
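‘File systems full’ is one of the easier maintenance failures to catch early. A minimal sketch of a periodic check, assuming a 90% warning threshold (the threshold and the function names are illustrative, not from the talk):

```python
import shutil
from collections import namedtuple

# Hypothetical threshold: warn once a file system is 90% full.
WARN_FRACTION = 0.90

# Same field layout as the tuple returned by shutil.disk_usage().
Usage = namedtuple("Usage", "total used free")


def fraction_used(usage):
    """Return the used fraction (0.0 to 1.0) of a file system."""
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def nearly_full(usage, warn_fraction=WARN_FRACTION):
    """True when usage has reached the warning threshold."""
    return fraction_used(usage) >= warn_fraction


def check_paths(paths, warn_fraction=WARN_FRACTION):
    """Return the paths whose file systems have reached the threshold."""
    return [p for p in paths if nearly_full(shutil.disk_usage(p), warn_fraction)]
```

Run from a scheduler, something like `check_paths(["/", "/var"])` could feed whatever notification channel is already in place.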

Causes – Software Error

Code crashes
  Application suddenly stops with a core dump
Memory leaks
  Slowly consumes all memory until the system crashes
Runaway code
  Takes all CPU time in a loop
Hanging code
  Code pauses waiting for a reply that never comes
Resource shortfalls
  Overflowing logs, failure to allocate memory or process
Buffer overflows
  Possibly exploited or just bad code
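One common guard against ‘hanging code’ is to put a time limit on every wait for a reply. The sketch below is an illustration only; `call_with_timeout` is a hypothetical helper, and real code would more often use the timeout options of its network or database library:

```python
import queue
import threading
import time


def call_with_timeout(fn, timeout_s):
    """Run fn() in a worker thread; give up after timeout_s seconds.

    Returns fn's result, or raises TimeoutError if no reply arrives in time
    (this includes the case where fn itself fails and never replies).
    The worker is a daemon thread, so a call that never returns cannot keep
    the process alive on exit.
    """
    replies = queue.Queue(maxsize=1)
    worker = threading.Thread(target=lambda: replies.put(fn()), daemon=True)
    worker.start()
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"no reply within {timeout_s} s") from None
```

The caller can then log the timeout and retry or fail over instead of pausing indefinitely.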

Managing Risks

Identify critical services
Describe service level targets
Map risks to services
Quantify the level of threat
Design and cost solutions
Compromise in a rational way
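One simple way to ‘map risks to services’ and ‘quantify the level of threat’ is a risk register scored as likelihood times impact. This is a sketch under that assumption; the 1 to 5 scales and every number below are made up for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    service: str      # critical service the risk maps to
    threat: str       # what could go wrong
    likelihood: int   # hypothetical scale: 1 (rare) to 5 (frequent)
    impact: int       # hypothetical scale: 1 (minor) to 5 (severe)

    @property
    def score(self):
        return self.likelihood * self.impact


def rank(risks):
    """Sort risks from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Illustrative entries only; the numbers are invented for the example.
register = [
    Risk("Email", "disk failure", likelihood=3, impact=3),
    Risk("Revenue & Benefits", "site flood", likelihood=1, impact=5),
    Risk("Housing Server", "file system full", likelihood=4, impact=4),
]

for risk in rank(register):
    print(f"{risk.score:>2}  {risk.service}: {risk.threat}")
```

The ranked list gives a starting order for designing and costing solutions; it does not replace judgement about where to compromise.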

Identify Critical Services

How long can the web server be down?
  Think – internal & public
How about Email?
  Can some Emails be lost?
How about finance, HR, …?
  How much downtime is acceptable?
Who will be affected?
  Admin, public, suppliers …
What is the impact on the ‘business’?
  Reputation, income, disruption, political …

Describe Service Level Targets

Email, Web (external)
  Downtime < 2 hours per month, 8 a.m. – 2 a.m.
Housing Server
  Downtime < 30 mins per month – 24x7
Revenue & Benefits
  Downtime < 5 mins per year – 24x7
Statistical Server
  Fix when you can – not really required
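Downtime budgets like these can be restated as availability percentages, which makes targets easier to compare. A minimal sketch, assuming a 30-day month and a 365-day year (both assumptions, not from the talk):

```python
def availability(allowed_downtime_min, window_min):
    """Fraction of the service window during which the service must be up."""
    return 1.0 - allowed_downtime_min / window_min


# Assumptions: a 30-day month and a 365-day year.
DAYS_PER_MONTH = 30
DAYS_PER_YEAR = 365

targets = {
    # 8 a.m. to 2 a.m. is an 18-hour daily service window.
    "Email, Web (external)": availability(2 * 60, DAYS_PER_MONTH * 18 * 60),
    "Housing Server": availability(30, DAYS_PER_MONTH * 24 * 60),
    "Revenue & Benefits": availability(5, DAYS_PER_YEAR * 24 * 60),
}

for service, a in targets.items():
    print(f"{service}: {a * 100:.4f}%")
```

Under these assumptions the three targets come to roughly 99.63%, 99.93% and 99.999% respectively, so the Revenue & Benefits target is in ‘five nines’ territory.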

Balancing Risk and Reward

Unless you have an infinite budget you will have to make ‘trade-offs’
Identify and remove SPoFs for critical services
  SPoF = Single Point of Failure
Identify the least reliable components – MTBFs
  Moving parts typically have the lowest MTBF
Identify the most difficult components to repair/rebuild
  e.g. security server, database
Identify what will have the biggest impact on failure
  Usually a core server
    Database, Email, Web, authentication server, etc.
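MTBF and repair time can be combined into one figure using the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below assumes that formula; MTTR (mean time to repair) is not named in the talk, and all figures are hypothetical:

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures (hours), for illustration only.
components = {
    "cooling fan": {"mtbf": 50_000, "mttr": 4},
    "disk": {"mtbf": 300_000, "mttr": 8},
    "database server": {"mtbf": 100_000, "mttr": 48},
}

# Least reliable first (lowest MTBF).
by_mtbf = sorted(components, key=lambda name: components[name]["mtbf"])

# Lowest availability first: slow repairs can matter more than rare failures.
by_availability = sorted(
    components,
    key=lambda name: steady_state_availability(
        components[name]["mtbf"], components[name]["mttr"]
    ),
)
```

With these made-up numbers the fan fails most often, but the database server is the least available because it takes longest to rebuild, which is why both lists are worth drawing up.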

Technical Approaches

Clustering
Replication
  Transaction / block level
Emerging technologies
  iSCSI
  Multi-domain clusters
  Oracle RAC

Typical Multi-Tier Architecture

View the service in a holistic fashion
List all SPoFs
  Network
  Load balancers
  Switches
  Application server
  Database server
  Data disks
  Etc.
Design in redundancy where possible
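The effect of a chain of SPoFs, and of duplicating each link, can be estimated with basic probability. This sketch assumes component failures are independent, which real faults such as a site flood are not, and the per-component availabilities are hypothetical:

```python
from math import prod


def series(availabilities):
    """All components must be up: multiply availabilities."""
    return prod(availabilities)


def redundant(availability, copies):
    """At least one of `copies` identical components must be up."""
    return 1.0 - (1.0 - availability) ** copies


# Hypothetical per-component availabilities for a single-path service.
chain = {
    "network": 0.999,
    "load balancer": 0.999,
    "switch": 0.999,
    "application server": 0.995,
    "database server": 0.995,
    "data disks": 0.999,
}

single_path = series(chain.values())
duplicated = series(redundant(a, copies=2) for a in chain.values())

print(f"single path: {single_path:.5f}")
print(f"every component duplicated: {duplicated:.5f}")
```

The single path is always less available than its weakest component, which is the argument for viewing the service as a whole rather than tier by tier.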

Resilient Architecture

Multi-site solution
  Replication to remote site
  Load balancers shown actually provide each other with redundant functionality
  Multiple switches used but not shown
SPoFs reduced near to zero
  Multiple active blade centres
  Multiple active application servers
  Clustered database servers
This architecture is resilient to almost every conceivable fault



High-Availability Clustering

Intelligent management solution
Software only
Deployed on critical servers
Can be active-active or active-passive
Constant monitoring
  Application availability
  Server health
  Network availability
  Other defined components
Automated restart / move in the event of a fault
Notifications to administrative staff
  GUI, Email, SMS
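The monitor, restart, move and notify cycle can be sketched as a small state machine. This is an illustration of the general idea only, not the logic of any particular clustering product; the class, its fields and the restart limit are all hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class ClusterService:
    """Hypothetical model of one service in an active-passive pair."""
    name: str
    active_node: str
    standby_node: str
    max_restarts: int = 2
    events: list = field(default_factory=list)

    def notify(self, message):
        # Stand-in for GUI, email or SMS notification.
        self.events.append(message)

    def handle_check(self, healthy, restart):
        """React to one monitoring result and return the active node.

        healthy: bool result of the latest health check.
        restart: callable that tries a local restart, returning True on success.
        """
        if healthy:
            return self.active_node
        for attempt in range(1, self.max_restarts + 1):
            if restart():
                self.notify(
                    f"{self.name}: restarted on {self.active_node} (attempt {attempt})"
                )
                return self.active_node
        # Local restarts failed: move the service to the standby node.
        self.active_node, self.standby_node = self.standby_node, self.active_node
        self.notify(f"{self.name}: moved to {self.active_node}")
        return self.active_node
```

Trying a local restart before moving the service is a design choice made here for illustration; a real cluster would also need to fence the failed node before the standby takes over shared data.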

High-Availability Clustering

Active-Passive
  Simple setup
Externalise ‘shared’ data
Use RAID and/or mirroring
Low cost, fast and simple
Very reliable

High-Availability Replication

Traditional cluster locally
Replicate to remote node
Replication at transaction level
Remote node probably included in cluster
  Automatic locally
  Manual remotely


Typically replication does a ‘log scrape’
  Although some newer versions have closer integration
Takes committed transactions and copies them across to the other node(s)
Other nodes ‘apply’ the transactions to a read-only copy of the database
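The ‘log scrape’ idea can be shown with a toy transaction log. This is a sketch only: the record format is invented for the example and does not correspond to any real database's log, and a dictionary stands in for the read-only copy:

```python
def scrape_committed(log):
    """Yield (txid, operations) for each committed transaction, in commit order.

    `log` is a hypothetical transaction log: a list of tuples such as
    ("BEGIN", txid), ("SET", txid, key, value), ("COMMIT", txid),
    ("ROLLBACK", txid).
    """
    pending = {}
    for record in log:
        kind, txid = record[0], record[1]
        if kind == "BEGIN":
            pending[txid] = []
        elif kind == "SET":
            pending[txid].append((record[2], record[3]))
        elif kind == "COMMIT":
            yield txid, pending.pop(txid)
        elif kind == "ROLLBACK":
            pending.pop(txid, None)


def apply_to_replica(replica, transactions):
    """Apply shipped transactions to the replica's copy of the data."""
    for _txid, operations in transactions:
        for key, value in operations:
            replica[key] = value
    return replica


primary_log = [
    ("BEGIN", 1), ("SET", 1, "balance", 100),
    ("BEGIN", 2), ("SET", 2, "balance", 999),
    ("COMMIT", 1),
    ("ROLLBACK", 2),
    ("BEGIN", 3), ("SET", 3, "owner", "alice"),  # never committed
]

replica = apply_to_replica({}, scrape_committed(primary_log))
```

Only transaction 1 reaches the replica; the rolled-back and uncommitted work never leaves the primary, which is what keeps the remote copy consistent.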


Block level replication
  Suitable for user files
  Not ideal for databases
    Many better approaches that understand database data
  Available in different guises, such as
    Sun’s SNDR (remote mirror) – in kernel
      Sync / async
      Streams type module
    Rsync – user space
      Periodic checking and copy
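The user-space ‘periodic checking and copy’ approach can be sketched as a single pass that copies new or changed files. This is not rsync's algorithm (rsync transfers only the differing parts of a file); it is a simplified whole-file stand-in, and `sync_once` is a hypothetical name:

```python
import hashlib
import shutil
from pathlib import Path


def digest(path):
    """SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def sync_once(source_dir, target_dir):
    """Copy new or changed files from source_dir to target_dir.

    Returns the relative paths that were copied. Deletions are not propagated.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    copied = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = target_dir / rel
        if not dst.exists() or digest(src) != digest(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied
```

Run on a timer, each pass leaves the target no more than one interval behind the source. A file that is being written while it is copied may arrive in an inconsistent state, which is one reason this style is a poor fit for live databases.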


Use database replication for databases when possible
Use block-level replication for other file types and legacy applications that have no replication option available

Practical Examples

Carlisle
  Some lessons learned
Surrey Ambulance
  999 call handling centre
North Yorkshire Police
  Tasking & operational management

Carlisle – Jan 2005

Extensive flooding Jan 2005
  Civic centre, hub of all operations, hit
  Backup generators in basement (flooded first)
  Guardian IT ‘insurance’ not used
  All major systems down for a week
Flooded in Jan 2005 and still dealing with substantial issues today

Carlisle - Lessons

Don’t assume just because you have ‘a plan’ it will actually work
  Guardian IT / SunGard provide a warm feeling but not useful – Carlisle terminating
  Test it
  Keep testing and updating
Recovery takes longer than you imagine
  Administration relating to recovery and the process of recovery itself are a huge drain on resources

Surrey Ambulance Service

999 call centre
24x7 live operations environment
Handling calls from the public
Live feeds from ambulance GPS devices
Automatic escalation and logging

North Yorkshire Police

24x7 live CAD system
  Command and control
  Custody management
  Crime management
  Duty rostering
  Imaging and biometrics
Oracle backend to ‘STORM’ application
Highly integrated systems
  Mapping systems
  PNC links
  DVLA links
  Firearms database
  Neighbouring force systems


Contacts

Giles Gamon
High-Availability.Com

[email protected]

[email protected]

01565 754 459
