five nines - to dream the impractical dream?
Post on 19-Jan-2016
Five Nines - To Dream the Impractical Dream?
Presentation to the CSG
Bruce Vincent
Agenda
What is driving high-availability?
How do we judge which services need HA?
How to achieve HA services?
What's working and what's not
What's driving high-availability?
Frankly, [we] are, and should be.
Central IT services have gone from popular to essential
Interdependencies of services
The hassle of outages!
Choices of providers
Failure without loss of service
Failure with loss of service
How do we judge which services need high-availability?
If a service isn't that important, why are we running it?
Turn it around: why doesn't it need to be fault-tolerant and scalable?
How to achieve HA services?
Build in fault-tolerance and scalability by design
Monitoring and metrics
Learn from outages service
Manage Risk - Balance efforts
Service - Don't focus on SLA legalese
Major Failure with No Outage
Incident Summary: Active Directory server rebooted after determining that it was in an impaired state; cause under investigation.
Incident Started: 2004-10-12 08:59
Incident Stopped: 2004-10-12 09:05
Systems Affected: Godzilla. Server functions are replicated on other servers, no end-user outage.
Incident Detail: Server rebooted and has resumed operations. Investigating logged error messages.
As Opposed To
Incident Summary: A failed memory board on the Production Oracle machine caused Production to be unavailable. The machine was rebooted to clear memory errors.
Incident Started: 2004-01-03 12:15
Incident Stopped: 2004-01-03 12:50
Systems Affected: Oracle Applications Production
Incident Detail: A failed memory board on the Production machine caused Production to be unavailable. The machine was rebooted to clear memory errors.
DOOH!
At Tue Jan 4 10:31:30 PST 2005, the following responses were received:
Incident Summary: Ofdbprd1 was down because of a failed CPU/memory board.
Incident Started: 2004-01-03 10:55
Incident Stopped: 2004-01-03 12:10
Systems Affected: Ofdbprd1 (Oracle Financials transaction database server).
Incident Detail: The failed CPU/memory board was removed from the configuration and the system was brought up on the remaining boards. The failed board has been replaced and will be returned to service on Jan 6.
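To set the incident reports above against the "five nines" of the title, the arithmetic can be sketched as follows. The 365-day year and the helper names (`downtime_budget_minutes`, `outage_minutes`) are assumptions made for illustration and are not part of the original presentation; the timestamps are copied from the Oracle Applications Production report.

```python
from datetime import datetime

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def outage_minutes(started: str, stopped: str) -> float:
    """Length of an outage, given report-style 'YYYY-MM-DD HH:MM' timestamps."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(stopped, fmt) - datetime.strptime(started, fmt)
    return delta.total_seconds() / 60


budget = downtime_budget_minutes(5)
oracle_outage = outage_minutes("2004-01-03 12:15", "2004-01-03 12:50")

print(f"five-nines budget: {budget:.2f} min/year")
print(f"Oracle Production outage: {oracle_outage:.0f} min")
print(f"years of budget consumed: {oracle_outage / budget:.1f}")
```

Under these assumptions the yearly budget at five nines is about 5.26 minutes, so a single 35-minute outage uses up more than six years of it.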
Make Services Boring (sort-of)
Build it to mask failures
Test failover regularly (active-active and active-passive)
Keep service profiles for monitoring and resource trends up-to-date
Create enterprise system-wide view
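As a minimal sketch of what "build it to mask failures" can look like from the client side, the snippet below tries each replica in turn and only reports an error when every one of them has failed. The host names, `call_with_failover`, and `fake_lookup` are hypothetical and stand in for whatever replicated service is being called; this is not how any system in the presentation is known to work.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised only when no replica could answer the request."""


def call_with_failover(replicas: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each replica in order; a single failure is hidden from the caller."""
    errors = []
    for host in replicas:
        try:
            return request(host)
        except ConnectionError as exc:
            errors.append((host, exc))  # remember the failure, move on
    raise AllReplicasFailed(errors)


def fake_lookup(host: str) -> str:
    """Stand-in for a real directory query; the first host is 'down'."""
    if host == "dir1.example.edu":
        raise ConnectionError("simulated failure")
    return f"answer from {host}"


result = call_with_failover(["dir1.example.edu", "dir2.example.edu"], fake_lookup)
print(result)  # answer from dir2.example.edu
```

Running the same call with the "down" host listed first, as here, is also a cheap way to exercise the failover path regularly instead of discovering during an outage that it does not work.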
We take a great deal of pride in providing service; it should be excellent service
Many of our centrally offered services have been wildly successful! And that means that [client] expectations/needs are extremely high. Too high?
There are too many constraints to support services that aren't vital!
In operations groups, there has been a history of distrust of many failover protocols/services (HSRP,
Build in HA - We now have Architectural Reviews of all systems/services which evaluate this. Vendors are really starting to get HA. We don't buy one of anything; we choose software. Build to mask apparent outages, ... build to mask rate of noticeable service outage.
Monitoring and metrics - We have a major effort to correlate service events from multiple monitoring frameworks. Understand system resources/trends, nominal vs. degraded service, interdependencies.
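One simple way to correlate events from several monitoring frameworks is to sort them by time and group those that occur close together. The sketch below assumes a five-minute window, and the `Event` fields, source names, and sample events are all made up for illustration; the presentation does not describe how the correlation is actually done.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    source: str   # which monitoring framework reported it (hypothetical names)
    service: str
    at: datetime
    message: str


def correlate(events, window=timedelta(minutes=5)):
    """Group events so each one is within `window` of the previous one in its group."""
    groups = []
    for ev in sorted(events, key=lambda e: e.at):
        if groups and ev.at - groups[-1][-1].at <= window:
            groups[-1].append(ev)   # close in time: same probable incident
        else:
            groups.append([ev])     # gap too large: start a new group
    return groups


# Made-up sample data, not taken from any real log.
events = [
    Event("app-probe", "financials", datetime(2005, 1, 3, 10, 57), "login failing"),
    Event("ping-monitor", "database", datetime(2005, 1, 3, 10, 55), "host unreachable"),
    Event("syslog", "dns", datetime(2005, 1, 3, 14, 0), "zone reload"),
]

for group in correlate(events):
    print([f"{e.source}:{e.service}" for e in group])
```

Grouping by time alone is crude; a fuller version would also use the known interdependencies between services to decide which events belong together.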
Learnings - We now have Incident Reports that are standard and widely distributed; weekly meetings that review causes and resolutions. (Traver: it's okay to make a mistake, just don't keep making the SAME bloody mistake)
Risk Management - Put the money where something's most likely to fail; look for aggregate risk; risk is ADDITIVE!
We have lots of services that are well designed for masking failures:
LDAP
AD
DNS
KDC
Oracle RAC
Using load balancing/FO for many services
Things fail, but it doesn't have to be noticeable
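One way to read "risk is ADDITIVE": when a service depends on several components in series, their availabilities multiply, so their small unavailabilities roughly add up. Replicating a component works in the other direction. The sketch below assumes independent failures, and the availability figures are hypothetical, chosen only to show the arithmetic.

```python
from math import prod


def serial_availability(parts):
    """Service needs every component: availabilities multiply."""
    return prod(parts)


def redundant_availability(a, n):
    """Service needs any one of n identical replicas (independent failures assumed)."""
    return 1 - (1 - a) ** n


# Three hypothetical dependencies, each at "three nines".
chain = [0.999, 0.999, 0.999]
serial = serial_availability(chain)
print(f"serial chain: {serial:.6f}")
print(f"summed unavailability: {sum(1 - a for a in chain):.6f}")

# Two replicas of a hypothetical 99%-available server.
redundant = redundant_availability(0.99, 2)
print(f"two replicas: {redundant:.4f}")
```

With these numbers the chain comes out at roughly 0.997, so three three-nines dependencies give a service that is worse than any of them, while two replicas of a 99% server reach about 0.9999 if their failures really are independent.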