TRANSCRIPT
Failure Data Analysis of a Large-Scale Heterogeneous Server Environment
Authors: Ramendra K. Sahoo, Anand Sivasubramaniam, Mark S. Squillante, Yanyong Zhang
Presenter: Sajala Rajendran
Abstract
Fault occurrence is rising with the increased complexity of hardware and software.
This paper analyzes the empirical and statistical properties of system errors and failures observed on a network of 395 nodes.
Results show that the system errors and failures exhibit time-varying behavior, containing long stationary intervals that show strong correlation structures and periodic patterns.
Outline
Technological Factors
Solutions
Related Work
System Environment
System-Wide Errors and Failures
Per-Node Errors and Failures
Conclusion
Technological Factors
Lowering operating voltages to reduce power consumption makes circuits more susceptible to alpha particles and cosmic rays, which in turn cause:
◦ Bit-flips: random events that corrupt the value stored in a memory cell rather than the cell itself
High workloads imposed on these systems lead to thermal instability, which results in breakdowns.
More complex software and applications make the systems prone to bugs such as memory leaks, which may lead to crashes.
Parallel/distributed systems, where nodes depend on one another, are susceptible to one another's failures and errors.
Solutions
Provide sufficient redundancy, e.g. duplicate file servers, which can mask problems when failures occur; however, the additional hardware/software is more expensive.
Anticipate failures and take proactive measures in advance, e.g. "software rejuvenation" aims to prevent unexpected/unplanned outages due to software aging.
Take action after a failure has actually occurred, e.g. replace nodes/disks when they fail.
Observations
Each of the above solutions has its own pros and cons. It is therefore very important to understand the properties of the errors and failures that can occur, which helps in developing schemes to improve system performance and availability.
Related Work
Dong Tang, Ravishankar Iyer, and Sujatha studied a VAXcluster system consisting of 7 machines and 4 storage controllers.
◦ More than 46% of the failures were due to shared resources, and 98.8% of errors were recovered.
◦ A semi-Markov failure model showed that the failure distributions on the machines are correlated.
Heath, Martin, and Nguyen collected failure data from three clustered servers ranging from 18 to 89 workstations.
◦ Times between failures are independent, and nodes that just failed are more likely to fail again.
◦ Time between failures follows a Weibull distribution.
System Environment
Event logs obtained from 395 nodes in a machine room (single-CPU servers and 2/4/8/12-way SMPs) over 487 days.
Workload: long-running scientific and commercial applications.
Node Breakdown (table not reproduced in the transcript)
Error Logs
The kernel and applications log errors to /dev/error.
A daemon process (errdaemon) monitors this file and compares each entry against a database of error record templates.
Each error log entry contains the following information:
◦ Node number (Node)
◦ Error identifier (ID)
◦ Time stamp (Time)
◦ Error type (Type)
◦ Error class (Class)
◦ Description of the problem (Description)
Classification of Errors
Error Types
◦ TEMP: Error condition recovered after a number of unsuccessful attempts
◦ PERF: Performance degradation
◦ UNKN: Unknown error
◦ PEND: Loss of availability of a device or component is forthcoming
◦ PERM: Permanent error
Error Classes
◦ Class O: Informational messages only
◦ Class S: Software-related errors
◦ Class U: Undetermined errors
◦ Class H: Hardware-related errors
Event Processing and Filtering
6,533,152 events were collected from the nodes and need to be filtered.
Events appearing within a short time interval on a node are the result of the same error, as shown by Tang et al.
Using this result, the filtering algorithm works as follows:
◦ Each new event type is recorded as a new event ID, and each new node as a new node ID.
◦ The event ID and node ID of an event at any time T are compared with the event IDs and node IDs of all events since time (T - Tth).
◦ Tth is the threshold time, chosen to be 5 minutes.
◦ Filtering eliminated 91.82% of the total raw events.
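The filtering steps above can be sketched in Python. This is an illustrative reconstruction, not the authors' code: the Event fields mirror the log entry fields described earlier, and the choice to extend the suppression window whenever a duplicate is seen (so that a dense burst collapses to one event) is an assumption about how the (T - Tth) comparison behaves.

```python
from dataclasses import dataclass

TTH_MINUTES = 5  # threshold Tth from the paper


@dataclass
class Event:
    node: int       # node ID
    event_id: str   # event (error) ID
    time: float     # minutes since the start of the log


def filter_events(events):
    """Keep an event only if no event with the same (node, event ID)
    occurred within the last TTH_MINUTES; duplicates within the
    window are treated as manifestations of the same error."""
    kept = []
    last_seen = {}  # (node, event_id) -> time of last occurrence
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.node, ev.event_id)
        prev = last_seen.get(key)
        if prev is None or ev.time - prev > TTH_MINUTES:
            kept.append(ev)
        # Update even for dropped events, so a continuous burst
        # keeps extending the window (an assumption of this sketch).
        last_seen[key] = ev.time
    return kept
```

With the 5-minute threshold, repeats of the same error on the same node are coalesced while the same error ID on a different node is kept, matching the per-node clustering result cited from Tang.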
Failures – PERM-type errors + a small subset of the PEND-type entries
Errors – remaining PEND-type entries + TEMP, PERF, and UNKN-type entries
Background Information
Hazard Rate Function
◦ Defined as h(x) = f(x) / (1 - F(x))
◦ h(x): conditional probability density that an item fails at time x, given that it survived until time x
◦ f(x): marginal density function
◦ F(x): marginal cumulative distribution function
Cross-Correlation Function
◦ Plot of the similarity between one waveform and a time-shifted version of another, as a function of the time shift
Autocorrelation Function
◦ Cross-correlation of a signal with itself
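A minimal sketch of how the empirical hazard rate h(x) = f(x) / (1 - F(x)) could be estimated from a sample of inter-event times, using a simple histogram-based density estimate. This is an illustration of the definition, not the estimator the authors used.

```python
import numpy as np


def empirical_hazard(samples, bins=20):
    """Estimate h(x) = f(x) / (1 - F(x)) on histogram bins.

    Returns the bin edges and the hazard estimate per bin; bins with
    no surviving items yield NaN."""
    samples = np.sort(np.asarray(samples, dtype=float))
    counts, edges = np.histogram(samples, bins=bins)
    widths = np.diff(edges)
    n = len(samples)
    f = counts / (n * widths)                    # density estimate per bin
    # Number of items still "alive" at the left edge of each bin:
    survivors = n - np.cumsum(counts) + counts
    with np.errstate(divide="ignore", invalid="ignore"):
        h = np.where(survivors > 0, f * n / survivors, np.nan)
    return edges, h
```

For an exponential (memoryless) sample the estimate should be roughly flat; the paper's point is that the observed hazard rates are not flat, rising once the time since the last event passes a threshold.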
System-Wide Errors and Failures
Rate process – number of events per minute
Increment process – time between successive events
Observations
Failures are a small fraction of all events.
The average time between failures is larger than the average time between all other error types.
The variability of the inter-failure times is large, with a coefficient of variation (CV) exceeding 2.5.
The variability of the inter-error times is much larger, with a CV of 14.
◦ CV = standard deviation / mean
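For reference, the CV is a dimensionless variability measure: an exponential (memoryless) inter-event process has CV ≈ 1, so CVs of 2.5 and 14 indicate far heavier-tailed inter-event times. A trivial helper (illustrative only):

```python
import statistics


def coefficient_of_variation(samples):
    """CV = standard deviation / mean of a sample of inter-event times."""
    return statistics.stdev(samples) / statistics.mean(samples)
```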
Empirical Distribution
The heavy-tail properties of the distributions explain the high variance of the failure and error processes.
Hazard Rate Function
The hazard rates of failures and errors increase once the time since the last failure or error grows beyond a particular point.
Auto-SLEX
Partitions the error and failure rate and increment processes into stationary intervals.
Result:
◦ The error/failure rate and increment processes are nonstationary overall but contain long stationary intervals.
◦ E.g. two stationary intervals of length 8192 minutes in the error+failure rate process,
◦ and 5 stationary intervals of length 4096 minutes.
Autocorrelation Function
ACF plotted for a stationary interval of the error+failure rate process of length 4096 minutes.
◦ Results show significant correlation structures.
To gain a deeper understanding, the ACF is then plotted separately for (TEMP, PERF, UNKN), termed less serious errors, and (PEND + failures), termed more serious errors.
◦ The magnitude of the correlation is much smaller and dies out with increasing lag.
These results suggest that there are interactions between the statistical properties of the two types of errors.
Cross-Correlation Function
CCF plotted for a stationary interval of 4096 minutes.
Positive lags show the relationship between less serious errors and more serious errors+failures k minutes apart.
Negative lags show the relationship at lags of -k minutes.
Results: significant cross-correlation structures between the two partitions, and considerable periodic behavior.
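The sample ACF and CCF underlying these plots can be computed as follows. These are standard textbook estimators, not the authors' code; the lag-k CCF here correlates one series at time t with the other at time t + k, matching the positive/negative lag convention above.

```python
import numpy as np


def acf(x, max_lag):
    """Sample autocorrelation of a (stationary) series at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])


def ccf(x, y, max_lag):
    """Sample cross-correlation for lags -max_lag..max_lag.

    Lag k > 0 pairs x at time t with y at time t + k; lag k < 0 is
    the mirror pairing."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    lags = list(range(-max_lag, max_lag + 1))
    vals = [np.dot(x[:len(x) - k], y[k:]) / denom if k >= 0
            else np.dot(x[-k:], y[:len(y) + k]) / denom
            for k in lags]
    return np.array(lags), np.array(vals)
```

Applying ccf to the less-serious-error rate series and the more-serious-error+failure rate series over a stationary interval would reproduce the kind of periodic cross-correlation structure the slide describes.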
Per-Node Errors and Failures
Hazard rate functions for an individual node are quite similar to those of the entire system.
Almost 70% of the failures occur in less than 4% of the nodes.
Observations
High failure rates were seen on 5 nodes:
◦ 3 file servers
◦ 2 database servers
Previous studies show that:
◦ high workload leads to failures
◦ most hardware failures occur in the I/O subsystem
The failures occurring on these nodes, and their time-varying behavior, are influenced by the load imposed on them.
Conclusion
Understanding failure behavior helps to refine existing techniques and to develop new mechanisms.
Event logs were collected from 395 systems in a machine room and the results were analyzed.
Error and failure processes exhibit time varying behavior and different forms of strong correlation structures.
A small fraction of the nodes incur most of the failures.
The nodes with the most failures show a strong temporal correlation with the time of day at the hourly level.
Thank You !!!