An Empirical Examination of Current High-Availability Clustering Solutions’ Performance

Jeffrey Absher
DePaul University

Research Symposium Presentation
November 2003

See the actual paper for bibliographic and procedural information and the appropriate academic references.

HA and Related Technology

Distributed OS
Load Balancing
Disaster Recovery
Fault Tolerance
HA clustering

HA’s defining traits

Single points of failure (SPOF) avoided by using redundancy
Single image to the outside world using a single virtual IP address and hostname
Automated fault management and recovery
Multiple access paths from each cluster node to each resource group (set of HA services)
Simple abstraction for applications and administrators
Undisrupted (or minimally disrupted) services during failover

“If a computer breaks down, the functions performed by that computer will be handled by some other computer in the cluster.”
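The failover behavior described in the quotation can be illustrated with a minimal sketch. It is not how any of the tested products work internally; the function name `choose_active_node` and the health-check callback are hypothetical, and only the node names LAB1 and LAB2 come from the test topology.

```python
from typing import Callable, Optional, Sequence


def choose_active_node(
    nodes: Sequence[str],
    is_healthy: Callable[[str], bool],
    current: Optional[str] = None,
) -> Optional[str]:
    """Pick the node that should own the resource group.

    Hypothetical policy: keep the current owner while it is healthy,
    otherwise fail over to the first healthy node in priority order.
    Returns None when no node is healthy (total outage).
    """
    if current is not None and is_healthy(current):
        return current
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# Simulated health states; a real cluster would use heartbeats instead.
health = {"LAB1": True, "LAB2": True}
nodes = ["LAB1", "LAB2"]

owner_initial = choose_active_node(nodes, lambda n: health[n])

health["LAB1"] = False  # the primary node fails
owner_after_fault = choose_active_node(nodes, lambda n: health[n], owner_initial)

health["LAB2"] = False  # both nodes are down
owner_all_down = choose_active_node(nodes, lambda n: health[n], owner_after_fault)
```

Clients keep addressing the single virtual IP address throughout; only the node that answers for it changes.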

A cluster and tester topology

[Diagram: RedHat cluster configuration. Labeled elements: cluster nodes LAB1 and LAB2, a testing client, a 100 Mbps Ethernet, a 16 Mbps Token Ring, a serial link, the 9.16.6.0/24 network, and the 10.20.30.0/24 network.]

LAB1: 9.16.6.36, Netfinity 8651, 2x Pentium 200, 512 MB RAM, RedHat AS 2.1
LAB2: 9.16.6.41, Netfinity 8651, 2x Pentium 200, 512 MB RAM, RedHat AS 2.1
TESTER1 (testing client): 9.16.6.36, Pentium PC, 64 MB RAM, Windows NT 4.0 SP6

| Event/Failure | What does it simulate? |
|---|---|
| Baseline | No events |
| Kill process on primary server | A simple fault that causes an abend of the HA process but does not take out the server. |
| Kill process on primary server and hold the process down for 30 seconds | A core dump that takes a long time, or a more complex fault. |
| Kill process on primary, hold down for 30 seconds, and fail to start on second node | A core dump or more complex fault, as well as a misconfiguration on the secondary server. |
| Kill the cluster/watchdog process on the primary server | A bug in the cluster programming that causes an abend, or a mistaken shutdown of the cluster processes. |
| Short power failure on primary node | A single-node power failure, technician error, a loose power cable, etc. |
| Simultaneous power failure on both nodes; primary/secondary recovers first | A datacenter power failure with the two possible recovery orders. |
| For AIX and Linux, loss of serial communication for 60 seconds. For Windows, the Virtual Shared Disk processes were killed and disabled for 60 seconds. | A loose serial cable or a technician error such as a cable disconnect, a port misconfiguration, or a mistaken command such as echo hello > /dev/tty0. |
| Primary/secondary server public network loss for 60 seconds | A loose network cable or a technician error such as a cable disconnect, a card misconfiguration, or a mistaken command such as ifconfig en0 down. |
| Public/private network down for 60 seconds | A power failure on the public hub or MAU, a network storm, or a technician’s error such as a VLAN misconfiguration. |
| IP address clash on public network for 60 seconds | Another machine on the same VLAN is accidentally brought online with an incorrect IP address. |

AIX Trials

[Chart: uptime (1 = 100%) by failure type, y-axis from -0.2 to 1.2. Series: Cluster 200s as % of baseline; Cluster 200s + 206s as % of baseline; Nocluster 200s as % of baseline; Nocluster 200s + 206s as % of baseline.]

Win2K Trials

[Chart: uptime (1 = 100%) by failure type, y-axis from -0.2 to 1.2.]




RedHat Trials

[Chart: uptime (1 = 100%) by failure type, y-axis from -0.2 to 1.2.]
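The trial charts express uptime as a fraction of the no-fault baseline run, counting either "200s" alone or "200s + 206s". A minimal sketch of that ratio follows, assuming that 200 and 206 refer to HTTP status codes (OK and Partial Content) tallied by the testing client. The function name and all counts are hypothetical and are not data from the study.

```python
from typing import Iterable, Mapping


def uptime_fraction(
    trial_counts: Mapping[int, int],
    baseline_counts: Mapping[int, int],
    codes: Iterable[int] = (200,),
) -> float:
    """Successful responses in a trial as a fraction of the baseline run.

    `codes` selects which status codes count as success; the default
    counts only 200s, and (200, 206) also counts partial responses.
    """
    codes = tuple(codes)
    trial_ok = sum(trial_counts.get(code, 0) for code in codes)
    baseline_ok = sum(baseline_counts.get(code, 0) for code in codes)
    if baseline_ok == 0:
        raise ValueError("baseline run has no successful responses")
    return trial_ok / baseline_ok


# Hypothetical response tallies keyed by status code.
baseline = {200: 900, 206: 100}
trial = {200: 720, 206: 30, 500: 12}

strict = uptime_fraction(trial, baseline)               # 720 / 900
lenient = uptime_fraction(trial, baseline, (200, 206))  # 750 / 1000
```

A value of 1 means the trial served as many successful responses as the baseline. Lower values indicate requests lost during the fault.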




Inter-OS Comparison

| | AIX | Win2K | Linux |
|---|---|---|---|
| Configuration | most difficult | reasonable | simplest |
| Scripting required? | some | none | much |
| Features | many | many | few |
| OS integration | medium | high | low/none |
| Installation | interdependent | independent | independent |
| Trials with HA resulting in a longer outage | 4/14 | 2/14 | 3/14 |
| Trials requiring manual intervention | 0 | 1 | 1 |
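For easier comparison, the "longer outage" row can be converted to percentages. The sketch below uses only the fractions reported in the comparison; the variable names are illustrative.

```python
# Trials in which the HA cluster produced a longer outage than the
# non-clustered setup, as (count, total trials) per platform.
longer_outage = {"AIX": (4, 14), "Win2K": (2, 14), "Linux": (3, 14)}

# Share of trials per platform, as a fraction between 0 and 1.
longer_outage_share = {
    os_name: worse / total for os_name, (worse, total) in longer_outage.items()
}

for os_name, share in longer_outage_share.items():
    print(f"{os_name}: {share:.1%}")
# AIX: 28.6%
# Win2K: 14.3%
# Linux: 21.4%
```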

Subjective Observations

HA clustering is difficult to configure properly, and the available documentation is lacking.
Multiple machines must be configured simultaneously; often, packages and software must be installed and configured in a specific order.
For what should be a loosely coupled system, there are many interdependencies.
Youn et al. suggest that the design of “administration of clusters…needs improvement.” I agree.
Vogels et al. state, “Users find it difficult to configure clusters with the desired management … properties. It is difficult to configure applications to be automatically launched in an appropriate order. Lacking solutions to these problems, clusters will remain awkward and time-consuming tools.” I agree.

Objective Conclusions Based on Empirical Evidence

HA is not a perfect solution for every environment and may be a bad solution for some, depending on the expected faults.
High failover time for some systems contributes to lower-than-expected performance of HA systems when compared to non-HA systems.
Failover times need to be significantly smaller than the time required for a reboot or even a restart of a slow-to-start process.
Primary-node negotiation time at boot contributes to poor performance during power outages.
There were cases where clustering was shown to actually decrease the uptime of a service or site.
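The relationship between failover time and uptime can be made concrete with a small worked example. This is a simplified model that assumes a single fault per measurement window and that the outage equals either the failover time (clustered) or the restart time (non-clustered). All durations are hypothetical and are not measurements from the study.

```python
def uptime_for_outage(window_s: float, outage_s: float) -> float:
    """Fraction of a measurement window during which the service was up."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    # Clamp so an outage can be neither negative nor longer than the window.
    outage_s = min(max(outage_s, 0.0), window_s)
    return 1.0 - outage_s / window_s


WINDOW_S = 600.0    # hypothetical length of one trial
RESTART_S = 20.0    # hypothetical restart time without clustering
FAILOVER_S = 45.0   # hypothetical failover time with clustering

no_cluster_uptime = uptime_for_outage(WINDOW_S, RESTART_S)
cluster_uptime = uptime_for_outage(WINDOW_S, FAILOVER_S)

# Under this model, clustering improves uptime only when failover
# is faster than the recovery it replaces.
clustering_helps = cluster_uptime > no_cluster_uptime
```

With these numbers the clustered configuration is up for 92.5% of the window, less than the non-clustered one, which is the pattern the last conclusion describes.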
