Failure Data Analysis of a Large-Scale Heterogeneous Server Environment Authors : Ramendra K. Sahoo Anand Sivasubramaniam Mark S. Squillante Yanyong Zhang Presenter : Sajala Rajendran


Failure Data Analysis of a Large-Scale Heterogeneous Server Environment

Authors : Ramendra K. Sahoo, Anand Sivasubramaniam, Mark S. Squillante, Yanyong Zhang

Presenter : Sajala Rajendran


Abstract

Faults occur due to the increased complexity of hardware and software

This paper analyzes the empirical and statistical properties of system errors and failures from a network of 395 nodes.

Results show that system errors and failures exhibit time-varying behavior, containing long stationary intervals that show strong correlation structures and periodic patterns


Outline

Technological Factors
Solutions
Related Work
System Environment
System-Wide Errors and Failures
Per-Node Errors and Failures
Conclusion


Technological Factors

Lowering operating voltages to reduce power consumption makes circuits more susceptible to alpha particles and cosmic rays, which in turn cause:
◦ Bit-flips : random events that corrupt the value stored in a memory cell rather than the cell itself

High workloads imposed on these systems lead to thermal instability, which results in breakdowns.


More complex software and applications make the systems prone to bugs such as memory leaks, which may lead to crashes.

In parallel/distributed systems, nodes depend on one another, so each node is susceptible to the failures and errors of the others.


Solutions

Provide sufficient redundancy, e.g. duplicate file servers, to avoid problems when failures occur; the additional software/hardware makes this more expensive.

Anticipate failures and take proactive measures in advance. E.g. “software rejuvenation” aims to prevent unexpected/unplanned outages due to software aging.

Take action once a failure has actually occurred. E.g. when nodes/disks fail, replace them.


Observations

Each of the above solutions has its own pros and cons. It is therefore important to understand the properties of the errors and failures that can occur, since this helps in developing schemes to improve system performance and availability.


Related Work

Dong Tang, Ravishankar and Sujatha used a VAX cluster system consisting of 7 machines and 4 storage controllers.
◦ More than 46% of the failures were due to shared resources, and 98.8% of errors were recovered.
◦ A semi-Markov failure model was used to point out that the failure distributions on different machines are correlated

Heath, Martin and Nguyen collected failure data from three clustered servers ranging from 18 to 89 workstations.
◦ Times between failures are independent, and nodes that just failed are more likely to fail again
◦ Weibull distribution
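One way to read the last two bullets together: a Weibull distribution with a shape parameter below 1 has a hazard rate that falls over time, which matches "nodes that just failed are more likely to fail again". The sketch below only illustrates that property; the function name and the shape/scale values are illustrative assumptions, not figures from the cited study.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative values only (assumed, not taken from the study):
# with shape < 1 the hazard is highest right after a failure and then decays.
soon_after = weibull_hazard(1.0, shape=0.5, scale=1.0)
much_later = weibull_hazard(4.0, shape=0.5, scale=1.0)
```

With shape equal to 1 the hazard is constant (the exponential case), so a decreasing hazard is specific to shape values below 1.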


System Environment

Event logs obtained from 395 nodes in a machine room (single-CPU servers, 2/4/8/12-way SMPs) over 487 days

Workload – Long-running scientific and commercial applications


Node Breakdown


Error Logs

The kernel/applications log errors in /dev/error

Daemon process (errdaemon) monitors the above file and compares each entry against a database of error record templates.

Each entry of the error log contains the following information:
◦ Node number (Node)
◦ Error Identifier (ID)
◦ Time Stamp (Time)
◦ Error Type (Type)
◦ Error Class (Class)
◦ Description of the problem (Description)


Classification of Errors

Error Types
◦ TEMP: Error condition recovered after a number of unsuccessful attempts
◦ PERF: Performance degradation
◦ UNKN: Unknown error
◦ PEND: Loss of availability of a device or component is forthcoming
◦ PERM: Permanent error


Error Classes
◦ Class O : Informational error only
◦ Class S : Software related errors
◦ Class U : Undetermined errors
◦ Class H : Hardware related errors


Event Processing and Filtering

6,533,152 events collected from the nodes

These events need to be filtered

Events appearing within a short time interval on a node are the result of the same error, as shown by Tang in his article

Using this result, the filtering algorithm works as follows:


Contd.…

A new event type is recorded as a new event ID, and a new node number as a new node ID

The event ID and node ID at any time T are compared with the event ID and node ID of all the events since time ( T - Tth )

( Tth ) is the threshold time, chosen to be 5 minutes

This filtering eliminated 91.82% of the total raw events
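The filtering steps can be sketched as follows. This is a minimal sketch, not the authors' implementation. It assumes that an event matching an earlier event with the same node ID and event ID inside the window is dropped as a repeat, and that the window slides with every occurrence, including dropped ones. The `Event` type, its field names, and timestamps in seconds are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    time: float    # timestamp in seconds (assumed unit)
    node_id: int   # hypothetical field name
    event_id: str  # hypothetical field name


def filter_events(events: Iterable[Event], threshold: float = 300.0) -> List[Event]:
    """Keep an event only if no event with the same (node, event) ID
    was seen in the preceding `threshold` seconds (5 minutes by default)."""
    last_seen: Dict[Tuple[int, str], float] = {}
    kept: List[Event] = []
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.node_id, ev.event_id)
        prev = last_seen.get(key)
        if prev is None or ev.time - prev > threshold:
            kept.append(ev)
        # Assumption: dropped repeats also extend the window.
        last_seen[key] = ev.time
    return kept


raw = [
    Event(0.0, 1, "E1"),
    Event(50.0, 2, "E1"),    # different node: kept
    Event(100.0, 1, "E1"),   # repeat within 5 minutes: dropped
    Event(350.0, 1, "E1"),   # within 5 minutes of the previous repeat: dropped
    Event(1000.0, 1, "E1"),  # more than 5 minutes of silence: kept
]
filtered = filter_events(raw)
```

As a rough check on scale, removing 91.82% of the 6,533,152 raw events leaves about 534,000 events for analysis.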


Failures – PERM type errors + small subset of PEND type error entries

Errors – Remaining PEND type entries + TEMP + PERF + UNKN type entries


Background Information

Hazard Rate Function
◦ Defined as h(x) = f(x) / (1 – F(x))
◦ h(x) – conditional probability density that an item fails at time x, given that it survived until time x
◦ f(x) – Marginal density function
◦ F(x) – Marginal cumulative distribution function

Cross Correlation Function
◦ Plot of similarity between one waveform and a time-shifted version of the other, as a function of the time shift

Autocorrelation Function
◦ Cross-correlation of a signal with itself
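The hazard rate definition can be estimated from observed inter-event times with a simple binned estimator. This is a sketch under stated assumptions: the function name and the fixed bin width are illustrative choices, not the estimator used in the paper. In each bin, f(x) is approximated by the share of samples falling in the bin divided by the bin width, and 1 – F(x) by the share of samples that have not yet occurred at the start of the bin.

```python
from typing import List, Sequence


def empirical_hazard(samples: Sequence[float], bin_width: float) -> List[float]:
    """Binned estimate of h(x) = f(x) / (1 - F(x)) from non-negative samples."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    if not samples or min(samples) < 0:
        raise ValueError("samples must be non-empty and non-negative")
    n_bins = int(max(samples) // bin_width) + 1
    counts = [0] * n_bins
    for x in samples:
        counts[int(x // bin_width)] += 1
    hazard = []
    at_risk = len(samples)  # samples that have "survived" to the start of the bin
    for c in counts:
        # (c / n / bin_width) / (at_risk / n) simplifies to c / (at_risk * bin_width)
        hazard.append(c / (at_risk * bin_width))
        at_risk -= c
    return hazard


# Four illustrative inter-event times, one-unit bins.
h = empirical_hazard([0.5, 1.5, 1.6, 2.5], bin_width=1.0)
```

The estimate becomes noisy in the last bins, where few samples remain at risk.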


System-Wide Errors and Failures


Rate process – Number of events per minute

Increment process – Time between events
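Both processes can be derived from the same list of event timestamps. The sketch below assumes timestamps are given in minutes from the start of the observation window; the function names and the `horizon_min` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def rate_process(timestamps_min: Sequence[float], horizon_min: int) -> List[int]:
    """Number of events in each one-minute slot [i, i + 1) for i < horizon_min."""
    counts = [0] * horizon_min
    for t in timestamps_min:
        i = math.floor(t)
        if 0 <= i < horizon_min:
            counts[i] += 1
    return counts


def increment_process(timestamps_min: Sequence[float]) -> List[float]:
    """Time between consecutive events, in minutes."""
    ts = sorted(timestamps_min)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


stamps = [0.25, 0.75, 2.5, 5.0]
rates = rate_process(stamps, horizon_min=6)
gaps = increment_process(stamps)
```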


Observations

Failures are a small fraction of all events

Average time between failures is larger than average time between all other error types.

The variability of the interfailure times is large, with a coefficient of variation (CV) exceeding 2.5

Variability of the inter-error times is much larger, with a CV of 14
◦ CV = Standard deviation / mean
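As a reference point, exponentially distributed inter-event times have a CV of exactly 1, so values of 2.5 and 14 indicate far burstier behavior than a Poisson process would produce. A minimal sketch of the computation follows; using the population standard deviation is an assumption, and the sample values are illustrative only.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV = standard deviation / mean (population standard deviation assumed)."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean


# Illustrative data: mean 5, standard deviation 2.
cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
```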


Empirical Distribution

Tail properties explain the high variance of the failure and error processes


Hazard Rate Function

The rate of failures and errors increases once the time since the last failure or error grows beyond a particular point


Auto-SLEX

Partitions the error and failure rate and increment processes into stationary intervals

Result:
◦ Error/failure rate and increment processes are nonstationary and have long intervals that are stationary
◦ E.g. two stationary intervals of length 8192 minutes in the error+failure rate process
◦ 5 stationary intervals of length 4096 minutes


Autocorrelation Function

Plot for a stationary interval of the error+failure rate process of length 4096 minutes.
◦ Results showed significant correlation structures

To gain a deeper understanding, the ACF is plotted separately for (TEMP, PERF, UNKN), termed less serious errors, and for (PEND + failures), termed more serious errors.
◦ The magnitude of the correlation is much smaller and dies out with increasing lag.

These results suggest that interactions exist between the statistical properties of the two types of errors


Cross Correlation Function

Plot for a stationary interval of 4096 minutes

Positive valued lags show the relationship between less serious errors and more serious errors+failures at lags of k minutes apart

Negative valued lags show the relationship at lags of –k minutes apart

Results : significant cross correlation structures between the two partitions and considerable periodic behavior.
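For reference, sample cross-correlation and autocorrelation at a given lag can be computed as below. This is a sketch of the standard sample estimators, not the authors' code. The function names are hypothetical, and the sign convention (a positive lag pairs x at time t with y at time t + lag) is an assumption that may differ from the one used in the plots.

```python
import math
from typing import Sequence


def cross_correlation(x: Sequence[float], y: Sequence[float], lag: int) -> float:
    """Sample cross-correlation between x[t] and y[t + lag]."""
    n = len(x)
    if n != len(y) or n == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    if abs(lag) >= n:
        raise ValueError("lag must be smaller than the series length")
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    if lag >= 0:
        pairs = zip(x[: n - lag], y[lag:])
    else:
        pairs = zip(x[-lag:], y[: n + lag])
    return sum((a - mx) * (b - my) for a, b in pairs) / (sx * sy)


def autocorrelation(x: Sequence[float], lag: int) -> float:
    """Cross-correlation of a series with itself."""
    return cross_correlation(x, x, lag)


# Illustrative per-minute counts: y repeats x one minute later.
x = [0, 1, 0, 0, 0, 0]
y = [0, 0, 1, 0, 0, 0]
lead = cross_correlation(x, y, 1)
lagged = cross_correlation(x, y, -1)
```

In this toy example the correlation peaks at lag +1, which is how a lead/lag relationship between the two error partitions would show up in a cross-correlation plot.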


Per-Node Errors and Failures


Hazard rate functions for an individual node are quite similar to those of the entire system


Almost 70% of the failures occur in fewer than 4% of the nodes
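A concentration figure of this kind can be computed from the per-failure node IDs. The sketch below is illustrative: the function name and toy data are assumptions. It returns the smallest number of nodes whose failures add up to a given share of all failures; dividing by the total node count (395 here, including nodes with no failures) gives the node fraction. For scale, 4% of 395 nodes is 15.8, so "fewer than 4%" means at most 15 nodes.

```python
from collections import Counter
from typing import Hashable, Iterable


def nodes_covering(failure_node_ids: Iterable[Hashable], share: float) -> int:
    """Smallest number of nodes that together account for at least
    `share` (between 0 and 1) of all recorded failures."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    counts = Counter(failure_node_ids)
    total = sum(counts.values())
    covered = 0
    # Walk nodes from most to fewest failures, accumulating their counts.
    for k, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered >= share * total:
            return k
    return len(counts)


# Toy data: node 1 has 8 failures, nodes 2 and 3 have one each.
toy = [1] * 8 + [2] + [3]
top = nodes_covering(toy, 0.7)
```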


Observations

High failure rates on 5 nodes
◦ 3 file servers
◦ 2 database servers

Previous studies show that
◦ High workload leads to failures
◦ Most hardware failures occur in the I/O subsystem


Failures occurring in these nodes and their time varying behavior are influenced by the load imposed on them.


Conclusion

Understanding failure behavior helps to modify existing techniques and also develop new mechanisms.

Event logs were collected from 395 systems in a machine room and results were analyzed

Error and failure processes exhibit time varying behavior and different forms of strong correlation structures.

A small fraction of the nodes incur most of the failures.

The nodes having the most failures have a strong temporal correlation with time of day at the hourly level.


Thank You !!!