TRANSCRIPT
Failure Data Analysis of a Large-Scale Heterogeneous Server Environment
Authors: Ramendra K. Sahoo, Anand Sivasubramaniam, Mark S. Squillante, Yanyong Zhang
Presenter: Sajala Rajendran
Abstract
Fault occurrence is rising with the increased complexity of hardware and software.
This paper analyzes the empirical and statistical properties of system errors and failures observed on a network of 395 nodes.
Results show that the system errors and failures exhibit time-varying behavior, containing long stationary intervals that show strong correlation structures and periodic patterns.
Outline
Technological Factors
Solutions
Related Work
System Environment
System-Wide Errors and Failures
Per-Node Errors and Failures
Conclusion
Technological Factors
Lowering operating voltages to reduce power consumption makes circuits more susceptible to alpha particles and cosmic rays, which in turn cause:
◦ Bit-flips: random events that corrupt the value stored in a memory cell rather than the cell itself
High workloads imposed on these systems lead to thermal instability, which results in breakdowns.
More complex software and applications make the systems prone to bugs such as memory leaks, which may lead to crashes.
Parallel/distributed systems, where nodes depend on one another, are susceptible to one another's failures and errors.
Solutions
Provide sufficient redundancy, e.g. duplicate file servers, which can mask problems when failures occur; however, the additional hardware/software is more expensive.
Anticipate failures and take proactive measures in advance, e.g. "software rejuvenation" aims to prevent unexpected/unplanned outages due to software aging.
Take action after a failure has actually occurred, e.g. replace nodes/disks when they fail.
Observations
Each of the above solutions has its own pros and cons. It is therefore very important to understand the properties of the errors and failures that can occur, which helps in developing schemes to improve system performance and availability.
Related Work
Dong Tang, Ravishankar Iyer, and Sujatha studied a VAXcluster system consisting of 7 machines and 4 storage controllers.
◦ More than 46% of the failures were due to shared resources, and 98.8% of errors were recovered.
◦ A semi-Markov failure model showed that the failure distributions on the machines are correlated.
Heath, Martin, and Nguyen collected failure data from three clustered servers ranging from 18 to 89 workstations.
◦ Times between failures are independent, and nodes that just failed are more likely to fail again.
◦ Time between failures follows a Weibull distribution.
System Environment
Event logs obtained from 395 nodes in a machine room (single-CPU servers and 2/4/8/12-way SMPs) over 487 days.
Workload: long-running scientific and commercial applications.
Node Breakdown (table not reproduced in the transcript)
Error Logs
The kernel and applications log errors to /dev/error.
A daemon process (errdaemon) monitors this file and compares each entry against a database of error record templates.
Each error log entry contains the following information:
◦ Node number (Node)
◦ Error identifier (ID)
◦ Time stamp (Time)
◦ Error type (Type)
◦ Error class (Class)
◦ Description of the problem (Description)
Classification of Errors
Error Types
◦ TEMP: Error condition recovered after a number of unsuccessful attempts
◦ PERF: Performance degradation
◦ UNKN: Unknown error
◦ PEND: Loss of availability of a device or component is forthcoming
◦ PERM: Permanent error
Error Classes
◦ Class O: Informational messages only
◦ Class S: Software-related errors
◦ Class U: Undetermined errors
◦ Class H: Hardware-related errors
Event Processing and Filtering
6,533,152 events were collected from the nodes and need to be filtered.
Events appearing within a short time interval on a node are the result of the same error, as shown by Tang et al.
Using this result, the filtering algorithm works as follows:
◦ Each new event type is recorded as a new event ID, and each new node as a new node ID.
◦ The event ID and node ID of an event at any time T are compared with the event IDs and node IDs of all events since time (T - Tth).
◦ Tth is the threshold time, chosen to be 5 minutes.
◦ Filtering eliminated 91.82% of the total raw events.
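The filtering steps above can be sketched in Python. This is an illustrative reconstruction, not the authors' code: the Event fields mirror the log entry fields described earlier, and the choice to extend the suppression window whenever a duplicate is seen (so that a dense burst collapses to one event) is an assumption about how the (T - Tth) comparison behaves.

```python
from dataclasses import dataclass

TTH_MINUTES = 5  # threshold Tth from the paper


@dataclass
class Event:
    node: int       # node ID
    event_id: str   # event (error) ID
    time: float     # minutes since the start of the log


def filter_events(events):
    """Keep an event only if no event with the same (node, event ID)
    occurred within the last TTH_MINUTES; duplicates within the
    window are treated as manifestations of the same error."""
    kept = []
    last_seen = {}  # (node, event_id) -> time of last occurrence
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.node, ev.event_id)
        prev = last_seen.get(key)
        if prev is None or ev.time - prev > TTH_MINUTES:
            kept.append(ev)
        # Update even for dropped events, so a continuous burst
        # keeps extending the window (an assumption of this sketch).
        last_seen[key] = ev.time
    return kept
```

With the 5-minute threshold, repeats of the same error on the same node are coalesced while the same error ID on a different node is kept, matching the per-node clustering result cited from Tang.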
Failures – PERM-type errors + a small subset of the PEND-type entries
Errors – remaining PEND-type entries + TEMP, PERF, and UNKN-type entries
Background Information
Hazard Rate Function
◦ Defined as h(x) = f(x) / (1 - F(x))
◦ h(x): conditional probability density that an item fails at time x, given that it survived until time x
◦ f(x): marginal density function
◦ F(x): marginal cumulative distribution function
Cross-Correlation Function
◦ Plot of the similarity between one waveform and a time-shifted version of another, as a function of the time shift
Autocorrelation Function
◦ Cross-correlation of a signal with itself
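A minimal sketch of how the empirical hazard rate h(x) = f(x) / (1 - F(x)) could be estimated from a sample of inter-event times, using a simple histogram-based density estimate. This is an illustration of the definition, not the estimator the authors used.

```python
import numpy as np


def empirical_hazard(samples, bins=20):
    """Estimate h(x) = f(x) / (1 - F(x)) on histogram bins.

    Returns the bin edges and the hazard estimate per bin; bins with
    no surviving items yield NaN."""
    samples = np.sort(np.asarray(samples, dtype=float))
    counts, edges = np.histogram(samples, bins=bins)
    widths = np.diff(edges)
    n = len(samples)
    f = counts / (n * widths)                    # density estimate per bin
    # Number of items still "alive" at the left edge of each bin:
    survivors = n - np.cumsum(counts) + counts
    with np.errstate(divide="ignore", invalid="ignore"):
        h = np.where(survivors > 0, f * n / survivors, np.nan)
    return edges, h
```

For an exponential (memoryless) sample the estimate should be roughly flat; the paper's point is that the observed hazard rates are not flat, rising once the time since the last event passes a threshold.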
System-Wide Errors and Failures
Rate process – number of events per minute
Increment process – time between successive events
Observations
Failures are a small fraction of all events.
The average time between failures is larger than the average time between all other error types.
The variability of the inter-failure times is large, with a coefficient of variation (CV) exceeding 2.5.
The variability of the inter-error times is much larger, with a CV of 14.
◦ CV = standard deviation / mean
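For reference, the CV is a dimensionless variability measure: an exponential (memoryless) inter-event process has CV ≈ 1, so CVs of 2.5 and 14 indicate far heavier-tailed inter-event times. A trivial helper (illustrative only):

```python
import statistics


def coefficient_of_variation(samples):
    """CV = standard deviation / mean of a sample of inter-event times."""
    return statistics.stdev(samples) / statistics.mean(samples)
```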
Empirical Distribution
The heavy-tail properties of the distributions explain the high variance of the failure and error processes.
Hazard Rate Function
The hazard rates of failures and errors increase once the time since the last failure or error grows beyond a particular point.
Auto-SLEX
Partitions the error and failure rate and increment processes into stationary intervals.
Result:
◦ The error/failure rate and increment processes are nonstationary overall but contain long stationary intervals.
◦ E.g. two stationary intervals of length 8192 minutes in the error+failure rate process,
◦ and 5 stationary intervals of length 4096 minutes.
Autocorrelation Function
ACF plotted for a stationary interval of the error+failure rate process of length 4096 minutes.
◦ Results show significant correlation structures.
To gain a deeper understanding, the ACF is then plotted separately for (TEMP, PERF, UNKN), termed less serious errors, and (PEND + failures), termed more serious errors.
◦ The magnitude of the correlation is much smaller and dies out with increasing lag.
These results suggest that there are interactions between the statistical properties of the two types of errors.
Cross-Correlation Function
CCF plotted for a stationary interval of 4096 minutes.
Positive lags show the relationship between less serious errors and more serious errors+failures k minutes apart.
Negative lags show the relationship at lags of -k minutes.
Results: significant cross-correlation structures between the two partitions, and considerable periodic behavior.
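The sample ACF and CCF underlying these plots can be computed as follows. These are standard textbook estimators, not the authors' code; the lag-k CCF here correlates one series at time t with the other at time t + k, matching the positive/negative lag convention above.

```python
import numpy as np


def acf(x, max_lag):
    """Sample autocorrelation of a (stationary) series at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])


def ccf(x, y, max_lag):
    """Sample cross-correlation for lags -max_lag..max_lag.

    Lag k > 0 pairs x at time t with y at time t + k; lag k < 0 is
    the mirror pairing."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))
    lags = list(range(-max_lag, max_lag + 1))
    vals = [np.dot(x[:len(x) - k], y[k:]) / denom if k >= 0
            else np.dot(x[-k:], y[:len(y) + k]) / denom
            for k in lags]
    return np.array(lags), np.array(vals)
```

Applying ccf to the less-serious-error rate series and the more-serious-error+failure rate series over a stationary interval would reproduce the kind of periodic cross-correlation structure the slide describes.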
Per-Node Errors and Failures
Hazard rate functions for an individual node are quite similar to those of the entire system.
Almost 70% of the failures occur in less than 4% of the nodes.
Observations
High failure rates were seen on 5 nodes:
◦ 3 file servers
◦ 2 database servers
Previous studies show that:
◦ high workload leads to failures
◦ most hardware failures occur in the I/O subsystem
The failures occurring on these nodes, and their time-varying behavior, are influenced by the load imposed on them.
Conclusion
Understanding failure behavior helps to refine existing techniques and to develop new mechanisms.
Event logs were collected from 395 systems in a machine room and the results were analyzed.
Error and failure processes exhibit time varying behavior and different forms of strong correlation structures.
A small fraction of the nodes incur most of the failures.
The nodes with the most failures show a strong temporal correlation with the time of day at the hourly level.
Thank You !!!