Five Nines - To Dream the Impractical Dream?


Post on 19-Jan-2016








  • Five Nines - To Dream the Impractical Dream?

    Presentation to the CSG

    Bruce Vincent

  • Agenda

    What is driving high-availability?

    How do we judge which services need HA?

    How to achieve HA services?

    What's working and what's not

  • What's driving high-availability?

    Frankly, [we] are, and should be.

    Central IT services have gone from popular to essential

    Interdependencies of services

    The hassle of outages!

    Choices of providers

  • Failure without loss of service

  • Failure with loss of service

  • How do we judge which services need high-availability?

    If a service isn't that important, why are we running it?

    Turn it around: why doesn't it need to be fault-tolerant and scalable?

  • How to achieve HA services?

    Build in fault-tolerance and scalability by design

    Monitoring and metrics

    Learn from outages service

    Manage Risk - Balance efforts

    Service - Don't focus on SLA legalese

  • Major Failure with No Outage

    Incident Summary: Active Directory Server rebooted after determining that it was in an impaired state, cause under investigation
    Incident Started: 2004-10-12 08:59
    Incident Stopped: 2004-10-12 09:05
    Systems Affected: Godzilla. Server functions are replicated on other servers, no end-user outage
    Incident Detail: Server rebooted and has resumed operations. Investigating logged error messages.

  • As Opposed To

    Incident Summary: A failed memory board on the Production Oracle machine caused Production to be unavailable. The machine was rebooted to clear memory errors.
    Incident Started: 2004-01-03 12:15
    Incident Stopped: 2004-01-03 12:50
    Systems Affected: Oracle Applications Production
    Incident Detail: A failed memory board on the Production machine caused Production to be unavailable. The machine was rebooted to clear memory errors.

  • DOOH!

    At Tue Jan 4 10:31:30 PST 2005, the following responses were received:

    Incident Summary: Ofdbprd1 was down because of a failed CPU/memory board.
    Incident Started: 2004-01-03 10:55
    Incident Stopped: 2004-01-03 12:10
    Systems Affected: Ofdbprd1 (Oracle Financials transaction database server).
    Incident Detail: The failed CPU/memory board was removed from the configuration and the system was brought up on the remaining boards. The failed board has been replaced and will be returned to service on Jan 6.
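To put the three incident reports above next to the "five nines" of the title, here is a minimal Python sketch that converts an availability target into a yearly downtime budget and computes each incident's duration from its reported start and stop times. The function names are illustrative, and the 365-day year is a simplifying assumption; only the timestamps come from the reports.

```python
from datetime import datetime

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of outage per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def incident_minutes(started: str, stopped: str) -> float:
    """Length of an incident in minutes, from 'YYYY-MM-DD HH:MM' strings."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(stopped, fmt) - datetime.strptime(started, fmt)
    return delta.total_seconds() / 60.0


# Start/stop times copied from the incident reports; the labels are shorthand.
incidents = {
    "Godzilla (AD, no end-user outage)": ("2004-10-12 08:59", "2004-10-12 09:05"),
    "Oracle Applications Production": ("2004-01-03 12:15", "2004-01-03 12:50"),
    "Ofdbprd1": ("2004-01-03 10:55", "2004-01-03 12:10"),
}

durations = {name: incident_minutes(*span) for name, span in incidents.items()}
five_nines_budget = downtime_budget_minutes(0.99999)

print(f"Five-nines budget: {five_nines_budget:.2f} minutes/year")
for name, minutes in durations.items():
    print(f"{name}: {minutes:.0f} minutes")
```

Under these assumptions the five-nines budget is a little over five minutes a year, so either Oracle outage on its own would use up several years of budget, while the replicated Active Directory failure costs nothing because users never saw it.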

  • Make Services Boring (sort-of)

    Build it to mask failures

    Control/communicate changes

    Test failover regularly (A-A, A-P)

    Keep service profiles for monitoring and resource trends up-to-date

    Create enterprise system-wide view

    We take a great deal of pride in providing service; it should be excellent service

    Many of our centrally offered services have been wildly successful! And that means that [client] expectations/needs are extremely high. Too high?

    There are too many constraints to support services that aren't vital!

    In operations groups, there has been a history of distrust of many failover protocols/services (HSRP,

    Build in HA - We now have Architectural Reviews of all systems/services which evaluate this. Vendors are really starting to get HA. We don't buy one of anything; we choose software. Build to mask apparent outages, to mask rate of noticeable service outage.

    Monitoring and metrics - We have a major effort to correlate service events from multiple monitoring frameworks. Understand system resources/trends, nominal vs. degraded service, interdependencies

    Learnings - We now have Incident Reports that are standard and widely distributed, and weekly meetings that review cause and resolutions. (Traver: it's okay to make a mistake, just don't keep making the SAME bloody mistake)

    Manage Risk - Put the money where something's most likely to fail. Look for aggregate risk; risk is ADDITIVE!

    We have lots of services that are well designed for masking failures: LDAP, AD, DNS, KDC, Oracle RAC

    Using load balancing/FO for many services

    Things fail but it doesn't have to be noticeable
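The "risk is ADDITIVE" note and the point about masking failures can both be illustrated with a little arithmetic. This is a rough Python sketch, assuming component failures are independent (real outages often are not); the availability figures are made up for illustration and do not come from the talk.

```python
from math import prod


def serial_availability(components):
    """Availability of a service that needs every listed component up.

    Assumes independent failures, so the availabilities multiply.
    """
    return prod(components)


def redundant_availability(single, replicas):
    """Availability when any one of `replicas` independent copies is enough."""
    return 1.0 - (1.0 - single) ** replicas


# Hypothetical chain: a service that depends on five components at 99.9% each.
chain = [0.999] * 5
chain_availability = serial_availability(chain)
chain_unavailability = 1.0 - chain_availability
summed_unavailability = sum(1.0 - a for a in chain)

# Hypothetical pair: two replicas at 99% each, either of which can serve.
pair_availability = redundant_availability(0.99, 2)

print(f"Chain unavailability:   {chain_unavailability:.6f}")
print(f"Sum of unavailability:  {summed_unavailability:.6f}")
print(f"Two 99% replicas:       {pair_availability:.4f}")
```

For small failure rates the chain's unavailability is very close to the sum of its components' unavailabilities, which is the sense in which risk adds up across dependencies. Replication works in the other direction: two modest 99% replicas together reach roughly 99.99%, provided failover actually works when it is needed.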