introduction to ibm ha

8/13/2019 Introduction to IBM HA

http://slidepdf.com/reader/full/introduction-to-ibm-ha 1/22

© Copyright IBM Corporation 2004

Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

Welcome to:

3.0.23.0.3

Introduction to High-Availability




Unit Objectives

After completing this unit, you should be able to:

Understand what high availability is

Understand why you might need high availability

Outline the various options for implementing high availability

Compare and contrast the high availability options

State the benefits of using highly available clusters

Understand the key considerations when designing and implementing a high availability cluster

Be familiar with the basics of risk analysis




So, What Is High Availability?

High Availability is...

The masking or elimination of both planned and unplanned downtime.

The elimination of single points of failure (SPOFs).

Fault resilience, but NOT fault tolerance.

[Diagram: Client, WAN, Production node and Standby node, with workload fallover from Production to Standby]



So Why Is Planned Downtime Important?

High availability solutions should reduce both planned and unplanned downtime.

Planned downtime:

Hardware upgrades

Repairs

Software updates

Backups

Testing

Development

Unplanned downtime:

Administrator error

Application failure

Hardware faults

Environmental disasters

Share of downtime by cause (pie chart):

Hardware failure (1%)

Other unplanned downtime (14%)

Planned downtime (85%)



Continuous Availability Is the Goal

Continuous Availability: elimination of downtime

Continuous Operations: masking or elimination of planned downtime

High Availability: masking or elimination of unplanned downtime



Eliminating Single Points of Failure

Cluster Object      Eliminated as a single point of failure by . . .

Node                Using multiple nodes

Power Source        Using multiple circuits or uninterruptible power supplies

Network Adapter     Using redundant network adapters

Network             Using multiple networks to connect nodes

TCP/IP Subsystem    Using serial networks to connect adjoining nodes and clients

Disk Adapter        Using redundant disk adapters

Disk                Using redundant hardware and disk mirroring and/or striping

Application         Assigning a node for application takeover; configuring an application monitor

A fundamental design goal of (successful) cluster design is the elimination of single points of failure (SPOFs).
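The table above can be read as a checklist: any cluster object that exists only once is a candidate SPOF. The following is a minimal sketch of that check, not part of the course material. The inventory and the function name `find_spofs` are hypothetical, and counting instances is a simplification, because two components that share a hidden dependency (for example, the same power circuit) are not truly redundant.

```python
from typing import Dict, List


def find_spofs(inventory: Dict[str, int]) -> List[str]:
    """Return the component types that have fewer than two instances."""
    return sorted(name for name, count in inventory.items() if count < 2)


# Hypothetical inventory for a two-node cluster (illustrative numbers only).
inventory = {
    "node": 2,
    "power source": 2,
    "network adapter": 2,
    "network": 1,       # only one network connects the nodes
    "disk adapter": 2,
    "disk": 2,
}

print(find_spofs(inventory))
```

With this inventory the only object reported is the single network, which would then be either eliminated or accepted through risk analysis.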



Availability - from Simple to Complex

Stand-alone

Enhanced

High Availability Cluster

Fault Tolerant



The Stand-alone System

The stand-alone system may offer limited availability benefits:

Journaled Filesystem

Dynamic CPU Deallocation

Service Processor

Redundant Power

Redundant Cooling

ECC Memory

Hot Swap Adapters

Dynamic Kernel

Disk Mirroring

Example single points of failure:

Disk Adapter / Data Paths

No Hot Swap Storage

Power for Storage Arrays

Cooling for Storage Arrays

Hot Spare Storage

Node / Operating System

Network

Network Adapter

Application

Site Failure (SAN distance)

Site Failure (via mirroring)



The Enhanced System

The enhanced system may offer increased availability benefits:

Journaled Filesystem

Dynamic CPU Deallocation

Service Processor

Redundant Power

Redundant Cooling

ECC Memory

Hot Swap Adapters

Dynamic Kernel

Disk Mirroring

Redundant Disk Adapters / Multiple Paths

Hot Swap Storage

Redundant Power for Storage Arrays

Redundant Cooling for Storage Arrays

Hot Spare Storage

Example single points of failure:

Node / Operating System

Network Adapter

Network

Application

Site Failure (SAN distance)

Site Failure (via mirroring)



High-Availability Clusters (HACMP)

Clustering technologies offer high availability:

Journaled Filesystem

Dynamic CPU Deallocation

Service Processor

Redundant Power

Redundant Cooling

ECC Memory

Hot Swap Adapters

Dynamic Kernel

Redundant Data Paths

Data Mirroring

Hot Swap Storage

Redundant Power for Storage Arrays

Redundant Cooling for Storage Arrays

Hot Spare Storage

Dual Disk Adapters

Redundant Nodes (operating system)

Redundant Network Adapters

Redundant Networks

Application Monitoring

Site Failure (SAN distance)

Example single points of failure:

Site Failure (via mirroring)



Fault-Tolerant Computing

Fault-tolerant solutions should not fail:

Lock Step CPUs

Hardened Operating System

Hot Swap Storage

Continuous Restart

Example single points of failure:

Site Failure (SAN distance)

Site Failure (via mirroring)



Availability Solutions

Solutions, from simple to complex:

Stand-alone

    Availability benefits: Journaled Filesystem, Dynamic CPU Deallocation, Service Processor, Redundant Power, Redundant Cooling, ECC Memory, Hot Swap Adapters, Dynamic Kernel

    Downtime: Couple of days

    Data availability: Good as your last full backup

    Relative cost*: 1

Enhanced Standalone

    Availability benefits: Redundant Data Paths, Data Mirroring, Hot Swap Storage, Redundant Power for Storage Arrays, Redundant Cooling for Storage Arrays, Hot Spare Storage

    Downtime: Couple of hours

    Data availability: Last transaction

    Relative cost*: 1.5

High Availability Clusters

    Availability benefits: Redundant Servers, Redundant Networks, Redundant Network Adapters, Heartbeat Monitoring, Failure Detection, Failure Diagnosis, Automated Fallover, Automated Reintegration

    Downtime: Depends, but typically 3 mins

    Data availability: Last transaction

    Relative cost*: 2-3

Fault-tolerant Computers

    Availability benefits: Lock Step CPUs, Hardened Operating System, Redundant Memory, Continuous Restart

    Downtime: In theory, none!

    Data availability: No loss of data

    Relative cost*: 10+

* All other parameters being equal.





So, What About Site Failure?

[Diagram: data replication between two sites, Toronto and London]

Near distance (using SAN): supported by HACMP 5.2

Far distance (requires data mirroring): invest in a geographic clustering solution (for example, HACMP XD*)

Distance unlimited

Data replication across a geography

Application, disk and network independent

Automated site failover and reintegration

A single cluster across two sites

*The HACMP XD feature of HACMP contains IBM's HAGEO product and PPRC support.





Why Might I Need High Availability?

60% of all large companies now operate round the clock (7x24)

Losses on failure:

US$330,000 per hour (industry average)

Peak losses: US$130,000 per minute (telephone network)

Loss of customer loyalty

Loss of customer confidence

And, if there is no disaster recovery:

50% of affected companies will never reopen

90% of affected companies are out of business in less than two years

Note: High Availability is NOT a Disaster Recovery solution.
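To put the hourly figure in perspective, the sketch below converts an outage duration into an estimated revenue loss. It is only an illustration: it assumes losses accrue linearly at the industry-average rate quoted above, and the function name `outage_loss` and the sample durations are hypothetical choices, not figures from a real incident.

```python
HOURLY_LOSS_USD = 330_000  # industry-average figure quoted on this foil


def outage_loss(minutes: float, hourly_loss: float = HOURLY_LOSS_USD) -> float:
    """Estimate the revenue lost during an outage, assuming a linear loss rate."""
    return hourly_loss * minutes / 60.0


# Illustrative outage durations: a few minutes, a couple of hours, a couple of days.
for label, minutes in [("3 minutes", 3), ("2 hours", 120), ("2 days", 2 * 24 * 60)]:
    print(f"{label}: {outage_loss(minutes):,.0f} USD")
```

Under this linear assumption a 3-minute outage costs 16,500 USD, while a two-day outage costs 15,840,000 USD. That gap is the economic argument for paying more for a faster-recovering solution.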

[Chart: Loss of Revenue ($M), scale 0 to 200]




Benefits of High-Availability Solutions

High-availability solutions offer the following benefits:

Standard components (no specialized hardware)

Can be built from existing hardware (no need to invest in new kit)

Work with just about any application

Work with a wide range of disk and network types

No specialized operating system or microcode

Excellent availability at low cost

[Diagram: standard components are combined to form a high availability solution]

Other Considerations for High-Availability

High-availability solutions require the following:

Thorough design and detailed planning

Elimination of single points of failure

Selection of appropriate hardware

Correct implementation

Disciplined system administration practices

Documented operational procedures

Comprehensive testing

[Diagram: High availability, Continuous operation and Continuous availability, surrounded by the contributing areas: Systems Management, People, Data, Hardware, Software, Environment, Networking]

A Philosophical View of High Availability

The goal of an HA cluster is to make a service highly available.

Users aren't interested in highly available hardware.

Users aren't even interested in highly available software.

Users are interested in the availability of services.

Therefore, use the hardware and the software to make the services highly available.

Cluster design decisions should be judged on the basis of whether or not they:

Contribute to availability (for example, eliminate a SPOF)

Detract from availability (for example, gratuitous complexity)

Since it is impractical if not impossible to truly eliminate all SPOFs, be prepared to use risk analysis techniques to determine which SPOFs are tolerated and which must be eliminated.

Classic Risk Analysis

1. Identify relevant policies

What existing risk tolerance policies are available?

2. Study the current environment

Understand what strengths (for example, server room is on a properly sized UPS) and weaknesses (for example, no disk mirroring) exist today

3. Perform requirements analysis

Just how much availability is required?

What is the acceptable likelihood of a long outage?

4. Hypothesize vulnerabilities

What can possibly go wrong?

5. Identify and quantify risks

The statistical probability of something going wrong over the life of the project (or the likely number of times something will go wrong over the life of the project) multiplied by the cost of an occurrence

6. Evaluate countermeasures

What does it take to reduce the risk (by reducing the likelihood or consequences of an occurrence) to an acceptable level?

7. Make decisions, create a budget and plan the cluster
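Steps 5 and 6 reduce to simple arithmetic: multiply the expected number of occurrences over the project life by the cost of one occurrence, then compare the result with the price of the countermeasure. The sketch below shows one way to express that. The class and function names and the sample numbers are hypothetical, and it assumes the countermeasure removes the risk entirely and ignores secondary costs such as added complexity.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    occurrences_per_year: float   # likely number of occurrences per year
    cost_per_occurrence: float    # cost of a single occurrence

    def exposure(self, project_years: float) -> float:
        """Expected total cost of this risk over the life of the project."""
        return self.occurrences_per_year * project_years * self.cost_per_occurrence


def worth_eliminating(risk: Risk, project_years: float, countermeasure_cost: float) -> bool:
    """True when the countermeasure costs less than the exposure it removes."""
    return countermeasure_cost < risk.exposure(project_years)


# Hypothetical example: a failure expected once every two years, 30,000 per occurrence.
risk = Risk("unmirrored disk failure", occurrences_per_year=0.5, cost_per_occurrence=30_000)
print(risk.exposure(project_years=3))                                          # exposure over three years
print(worth_eliminating(risk, project_years=3, countermeasure_cost=20_000))    # cheaper than the exposure
print(worth_eliminating(risk, project_years=3, countermeasure_cost=50_000))    # dearer than the exposure
```

Here the three-year exposure is 45,000, so a 20,000 countermeasure is justified and a 50,000 one is not. In practice the decision in step 7 also weighs the organization's risk tolerance policies from step 1, not just the arithmetic.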

What Do We Plan to Achieve This Week?

[Diagram: two systems, running applications A and B]

Your mission this week is to build a two-node highly available cluster using two previously separate pSeries systems, each of which has an application which needs to be made highly available.

Checkpoint

1. Which of the following is a characteristic of high availability?

a. High availability always requires specially designed hardware components.

b. High availability solutions always require manual intervention to ensure recovery following failover.

c. High availability solutions never require customization.

d. High availability solutions offer excellent price performance when compared with fault-tolerant solutions.

2. True or False? High availability solutions never fail.

3. True or False? Thorough design and detailed planning are required for all high availability solutions.

4. True or False? The cluster shown on the foil titled "What Do We Plan to Achieve This Week?" has no obvious single points of failure.

5. A proposed cluster with a two-year life (for planning purposes) has a vulnerability which is likely to occur twice per year at a cost of $10,000 per occurrence. It costs $25,000 in additional hardware costs to eliminate the vulnerability. Should the vulnerability be eliminated?

a. yes

b. no

Checkpoint Answers

1. Which of the following is a characteristic of high availability?

a. High availability always requires specially designed hardware components.

b. High availability solutions always require manual intervention to ensure recovery following failover.

c. High availability solutions never require customization.

d. High availability solutions offer excellent price performance when compared with fault-tolerant solutions.

Answer: d

2. True or False? High availability solutions never fail.

Answer: False

3. True or False? Thorough design and detailed planning are required for all high availability solutions.

Answer: True

4. True or False? The cluster shown on the foil titled "What Do We Plan to Achieve This Week?" has no obvious single points of failure.

Answer: False (the local area network is a SPOF)

5. A proposed cluster with a two-year life (for planning purposes) has a vulnerability which is likely to occur twice per year at a cost of $10,000 per occurrence. It will cost $25,000 in additional hardware costs to eliminate the vulnerability. Should the vulnerability be eliminated?

a. yes ($25,000 is less than $10,000 times four)

b. no

Answer: a

Unit Summary

Having completed this unit, you should be able to:

Understand what high availability is

Understand why you might need high availability

Outline the various options for implementing high availability

Compare and contrast the high-availability options

State the benefits of using highly available clusters

Understand the key considerations when designing and implementing a high-availability cluster

Be familiar with the basics of risk analysis
