HA & DR Strategy
Giles Gamon of High-Availability.Com
Practical Approaches
July 2007

Business Continuity
A system of planning for, recovering and maintaining both the IT and business environments within an organisation regardless of the type of interruption. In addition to the IT infrastructure, it covers people, facilities, workplaces, equipment, business processes, and more.

Defining High-Availability
Provision of end-to-end access to a service and data without interruption
The elimination of all Single Points of Failure (SPoF)
Objective - Zero/Near Zero downtime
Includes handling scheduled downtime

Defining Disaster Recovery
The process of restoring and maintaining the data, equipment, applications and other technical resources on which a business depends
Response to complete loss of a facility
May include dealing with loss of key staff
Disaster may also affect alternate facilities that were assumed to be available

Achieving Business Continuity
Identification of threats to service
  Systems failures, human errors, sabotage, software bugs, acts of God etc.
Management of risk
  Building in redundancy, taking backups, training staff, testing systems, active management solutions
Causes of Down Time
Source - IEEE

Causes - Disaster
Planning to cope with disasters is an important component of a High-Availability strategy
  Flood, fire, power grid failure, terrorism etc.
Most ‘disasters’ are classified as environmental causes of downtime
  Collectively, environmental causes account for approximately 5% of downtime

Causes - Environmental
Power cuts and brownouts
  UPS & Generator
    What do they power?
Cooling system errors
  Humidification regulation errors can cause hardware failures
Southampton University 2005
UK – Jan 2005 & June 2007

Causes – Hardware Failure
Probably the most recognised cause of downtime
Server failures
  Disk, CPU, internal cooling fans, memory faults, …
Network failures
  DNS, DHCP, router, ISP, switches, cables cut, …
Other
  Tape backup corruption, client hardware, …

Causes - Planned
Hardware upgrades
OS version upgrades
Software version upgrades
Data migration / transformation
Backups
Batch processing
Preventative maintenance
Testing

Causes – Human Factor
Failure to maintain
  File systems full
  Database tables full
  Patches for known bugs not applied
Accidents
  root# rm -rf / tmp/tempstuff (note the stray space after the first /)
  Network mis-configuration
  Incorrect cable removed
Inexperience
  root# reboot
  Cleaner knocks cables out
Malice
  root# uadmin 1 5 or halt
  Physical sabotage

Causes – Software Error
Code crashes
  Application suddenly stops with a core dump
Memory leaks
  Slowly consumes all memory until the system crashes
Runaway code
  Taking all CPU time in a loop
Hanging code
  Code pauses waiting for a reply that never comes
Resource shortfalls
  Overflowing logs, failure to allocate memory or process
Buffer overflows
  Possibly exploited or just bad code

Managing Risks
Identify critical services
Describe service level targets
Map risks to services
Quantify the level of threat
Design and cost solutions
Compromise in a rational way

Identify Critical Services
How long can the web server be down?
  Think – internal & public
How about Email?
  Can some Emails be lost?
How about finance, HR, …?
  How much downtime is acceptable?
Who will be affected?
  Admin, public, suppliers …
What is the impact on the ‘business’?
  Reputation, income, disruption, political …

Describe Service Level Targets
Email, Web (external)
  Downtime < 2 hours per month, 8 a.m. – 2 a.m.
Housing Server
  Downtime < 30 mins per month – 24x7
Revenue & Benefits
  Downtime < 5 mins per year – 24x7
Statistical Server
  Fix when you can – not really required
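
To compare these targets with the familiar "number of nines", each downtime budget can be converted into a required availability percentage. The sketch below is illustrative only: the 30-day month, the 365-day year and the function name `availability_pct` are assumptions, not part of the original slides.

```python
# Convert a downtime budget into a required availability percentage.
HOURS_PER_MONTH = 30 * 24   # assumption: a 30-day month
HOURS_PER_YEAR = 365 * 24   # assumption: a non-leap year


def availability_pct(allowed_downtime_hours: float, service_window_hours: float) -> float:
    """Percentage of the service window during which the service must be up."""
    if service_window_hours <= 0:
        raise ValueError("service window must be positive")
    return 100.0 * (1.0 - allowed_downtime_hours / service_window_hours)


# Targets from the slide. Email/Web is only measured 8 a.m. - 2 a.m. (18 hours a day).
targets = {
    "Email, Web (external)": availability_pct(2.0, 30 * 18),
    "Housing Server": availability_pct(0.5, HOURS_PER_MONTH),
    "Revenue & Benefits": availability_pct(5 / 60, HOURS_PER_YEAR),
}

for service, pct in targets.items():
    print(f"{service}: {pct:.4f}% availability required")
```

Under these assumptions the three targets come out at roughly 99.63%, 99.93% and 99.999% respectively, which shows how much stricter the Revenue & Benefits target is than the others.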

Balancing Risk and Reward
Unless you have an infinite budget you will have to make ‘trade-offs’
Identify and remove SPoFs for critical services
  SPoF = Single Point of Failure
Identify the least reliable components – MTBFs
  Moving parts typically have the lowest MTBF
Identify the most difficult components to repair/rebuild
  e.g. security server, database
Identify what will have the biggest impact on failure
  Usually a core server
    Database, Email, Web, authentication server etc.
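
The effect of MTBF and of removing a SPoF can be estimated with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below assumes independent failures, and the MTBF and repair-time figures in it are hypothetical values chosen purely for illustration.

```python
from math import prod


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def in_series(*parts: float) -> float:
    """Every part must work, so each one is a single point of failure."""
    return prod(parts)


def redundant(*parts: float) -> float:
    """At least one part must work; assumes the parts fail independently."""
    return 1.0 - prod(1.0 - a for a in parts)


# Hypothetical figures, for illustration only.
disk = availability(mtbf_hours=50_000, mttr_hours=8)
server = availability(mtbf_hours=20_000, mttr_hours=4)

single = in_series(server, disk)                     # one server, one disk
mirrored = in_series(server, redundant(disk, disk))  # same server, mirrored disks

print(f"single disk:    {single:.6f}")
print(f"mirrored disks: {mirrored:.6f}")
```

Mirroring the disk raises the overall figure, but the single server remains the limiting component, which is the kind of trade-off this slide describes.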

Technical Approaches
Clustering
Replication
  Transaction / block level
Emerging technologies
  iSCSI
  Multi-domain clusters
  Oracle RAC

Typical Multi-Tier Architecture
View the service in a holistic fashion
List all SPoFs
  Network
  Load balancers
  Switches
  Application server
  Database server
  Data disks
  Etc.
Design in redundancy where possible
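
Listing SPoFs amounts to walking each tier of the service and flagging any tier with fewer than two independent instances. A minimal sketch, assuming a hypothetical inventory (the instance counts below are invented for illustration):

```python
from typing import Dict, List


def find_spofs(tiers: Dict[str, int]) -> List[str]:
    """Return the tiers that have fewer than two independent instances."""
    return [name for name, instances in tiers.items() if instances < 2]


# Hypothetical inventory: tier name -> number of redundant instances.
service = {
    "Network uplink": 1,
    "Load balancers": 2,
    "Switches": 2,
    "Application server": 3,
    "Database server": 1,
    "Data disks": 2,
}

print(find_spofs(service))
```

With this inventory the network uplink and the database server are reported, so those are the tiers where redundancy would be designed in first.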

Resilient Architecture
Multi-site solution
  Replication to remote site
  Load balancers shown actually provide each other with redundant functionality
  Multiple switches used but not shown
SPoFs reduced near to zero
  Multiple active blade centres
  Multiple active application servers
  Clustered database servers
This architecture is resilient to almost every conceivable fault

High-Availability Clustering
Intelligent management solution
Software only
Deployed on critical servers
Can be active-active or active-passive
Constant monitoring
  Application availability
  Server health
  Network availability
  Other defined components
Automated restart / move in the event of a fault
Notifications to administrative staff
  GUI, Email, SMS
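
The monitor, restart, move and notify cycle described above can be sketched as a single monitoring pass. This is not how any particular clustering product works; `Node`, `monitor_once` and the callback parameters are hypothetical names, and a real cluster would also need heartbeats, fencing and shared-storage handling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    name: str
    healthy: bool = True  # result of the server/network health checks


def monitor_once(active: Node, standbys: List[Node],
                 service_ok: Callable[[], bool],
                 restart: Callable[[], bool],
                 notify: Callable[[str], None],
                 max_restarts: int = 2) -> Node:
    """Run one monitoring pass and return the node that should host the service."""
    if active.healthy and service_ok():
        return active  # nothing to do

    # Fault detected. If the server itself is fine, try restarting in place.
    if active.healthy:
        for _ in range(max_restarts):
            if restart():
                notify(f"service restarted on {active.name}")
                return active

    # Restart was not possible or did not help: move to the first healthy standby.
    for node in standbys:
        if node.healthy:
            notify(f"service moved from {active.name} to {node.name}")
            return node

    notify("service down: no healthy node available")
    raise RuntimeError("no healthy node available")


# Example pass: the application has failed and will not restart on node-a.
alerts: List[str] = []
owner = monitor_once(Node("node-a"), [Node("node-b")],
                     service_ok=lambda: False, restart=lambda: False,
                     notify=alerts.append)
print(owner.name, alerts)
```

In an active-passive pair the standby list has one entry; the `notify` callback stands in for the GUI, Email or SMS channels.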

High-Availability Clustering
Active-Passive
  Simple setup
Externalise ‘shared’ data
Use RAID and/or mirroring
Low cost, fast and simple
Very reliable

High-Availability Replication
Traditional cluster locally
Replicate to remote node
Replication at transaction level
Remote node probably included in cluster
  Automatic locally
  Manual remotely

High-Availability Replication
Typically replication does a ‘log scrape’
  Although some newer versions have closer integration
Takes committed transactions and copies them across to the other node(s)
Other nodes ‘apply’ the transactions to a read-only copy of the database
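
The ‘log scrape’ idea can be illustrated with a toy transaction log: only the writes of committed transactions are shipped, and the other node applies them in commit order. The record layout and function names below are assumptions made for the sketch and do not correspond to any real database's log format.

```python
from typing import Dict, Iterable, List, Tuple

# A log record is (transaction id, operation, key, value). Hypothetical format.
LogRecord = Tuple[int, str, str, str]


def committed_transactions(log: Iterable[LogRecord]) -> List[List[LogRecord]]:
    """Scrape the log and return the writes of each committed transaction, in commit order."""
    pending: Dict[int, List[LogRecord]] = {}
    committed: List[List[LogRecord]] = []
    for record in log:
        txid, op, _key, _value = record
        if op == "COMMIT":
            committed.append(pending.pop(txid, []))
        elif op == "ROLLBACK":
            pending.pop(txid, None)
        else:
            pending.setdefault(txid, []).append(record)
    return committed  # work still in `pending` is uncommitted and is not shipped


def apply_to_replica(replica: Dict[str, str], transactions: List[List[LogRecord]]) -> None:
    """Apply shipped transactions to the read-only copy."""
    for tx in transactions:
        for _txid, op, key, value in tx:
            if op == "SET":
                replica[key] = value
            elif op == "DELETE":
                replica.pop(key, None)


primary_log = [
    (1, "SET", "balance:alice", "100"),
    (2, "SET", "balance:bob", "50"),
    (1, "COMMIT", "", ""),
    (3, "SET", "balance:carol", "75"),   # never committed
    (2, "ROLLBACK", "", ""),
]
replica: Dict[str, str] = {}
apply_to_replica(replica, committed_transactions(primary_log))
print(replica)  # {'balance:alice': '100'}
```

Transactions 2 and 3 never reach the replica because one was rolled back and the other had not committed when the log was scraped.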

High-Availability Replication
Block level replication
  Suitable for user files
  Not ideal for databases
    Many better approaches that understand DB data
  Available in different guises, such as
    Sun’s SNDR (remote mirror) – in kernel; sync / async; Streams type module
    Rsync – user space; periodic checking and copy
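
The ‘periodic checking and copy’ style of block-level replication can be sketched as comparing per-block checksums and copying only the blocks that differ. This is a simplified illustration, not the actual rsync algorithm (which uses rolling checksums) and not SNDR; the tiny block size and the function names are assumptions for the example.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def changed_blocks(source: bytes, target: bytes) -> List[int]:
    """Indexes of source blocks whose checksum differs from the target's block."""
    src, dst = split_blocks(source), split_blocks(target)
    changed = []
    for i, block in enumerate(src):
        if i >= len(dst) or hashlib.sha256(block).digest() != hashlib.sha256(dst[i]).digest():
            changed.append(i)
    return changed


def sync(source: bytes, target: bytes) -> bytes:
    """Copy only the changed blocks, then trim the target to the source length."""
    src, dst = split_blocks(source), split_blocks(target)
    for i in changed_blocks(source, target):
        if i < len(dst):
            dst[i] = src[i]
        else:
            dst.append(src[i])  # indexes past the end arrive in ascending order
    return b"".join(dst[:len(src)])


primary = b"AAAABBBBCCCC"
remote = b"AAAAXXXXCCCC"
print(changed_blocks(primary, remote))  # [1]
print(sync(primary, remote) == primary)  # True
```

Because the copy works on raw blocks with no knowledge of transactions, a database file copied this way may be caught mid-update, which is one reason the slide calls the approach not ideal for databases.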

High-Availability Replication
Use DB replication for databases when possible
Use block level for other file types and legacy applications that have no replication option available

Practical Examples
Carlisle
  Some lessons learned
Surrey Ambulance
  999 call handling centre
North Yorkshire Police
  Tasking & operational management

Carlisle – Jan 2005
Extensive flooding Jan 2005
  Civic centre hub of all operations hit
  Backup generators in basement (flooded 1st)
  Guardian IT ‘insurance’ not used
  All major systems down for a week
Flooded in Jan 2005 and still dealing with substantial issues today

Carlisle - Lessons
Don’t assume just because you have ‘a plan’ it will actually work
  Guardian IT / Sun Guard provide a warm feeling but are not useful – Carlisle terminating
  Test it
  Keep testing and updating
Recovery takes longer than you imagine
  Administration relating to recovery and the process of recovery itself are a huge drain on resources

Surrey Ambulance Service
999 call centre
24x7 live operations environment
Handling calls from the public
Live feeds from ambulance GPS devices
Automatic escalation and logging

North Yorkshire Police
24x7 live CAD system
  Command and control
  Custody management
  Crime management
  Duty rostering
  Imaging and biometrics
Oracle backend to ‘STORM’ application
Highly integrated systems
  Mapping systems
  PNC links
  DVLA links
  Firearms database
  Neighbouring force systems

Contacts
Giles Gamon
High-Availability.Com
[email protected]
01565 754 459