The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems Derrick Kondo 1 , Bahman Javadi 1 , Alexandru Iosup 2 , Dick Epema 2 1 INRIA, France 2 TU Delft, The Netherlands

The Failure Trace Archive: Enabling Comparative Analysis of Diverse Distributed Systems

Derrick Kondo (1), Bahman Javadi (1), Alexandru Iosup (2), Dick Epema (2)

(1) INRIA, France; (2) TU Delft, The Netherlands

Motivation

• Push toward experimental computer science

• Hard to evaluate and compare algorithms and models for fault-tolerance

• Lack of public trace data sets

• Lack of standard trace format

• Lack of parsing and analytical tools

• Failures in distributed systems have increasingly high negative impact and complex dynamics

Failure Trace Archive (FTA)

• Availability traces of distributed systems, differing in scale, volatility, and usage

• Standard event-based format for failure traces

• Scripts and tools for parsing and analyzing traces in svn repository

http://fta.inria.fr

Related Work

Resource                  Data Sets              Format   Parsing Tools   Analysis Tools
Grid Observatory          Emphasis on EGEE       ✗        ✗               ✗
Computer Failure Repo.    12 (mainly clusters)   ✗        ✗               ✗
Repo. of Avail. Traces    5 (mainly P2P)         ✓        ✓               ✗
Desktop Grid Archive      4 Desktop Grids        ✓        ✗               ✗
FTA (1)                   22                     ✓        ✓               ✓

(1) FTA includes data sets of the former three resources, in addition to providing several new data sets

Enabled Studies

• Comparing models/algorithms using the identical data sets

• Evaluation of generality/specificity of model/algorithm across different types of systems

• Evaluation of the generality of a system trace

• Analysis of evolution of failures over time

• And many more...

Contributions

• Description of FTA, trace format and analysis toolbox

• High-level statistical characterization of failures in each data set

• Show importance of public data sets and methods via characterization of ambiguous data sets

Background Definitions

• Failure: observed deviation from correct system state

• Availability (unavailability) interval: continuous period during which the system is in a correct (incorrect) state

• Error: system state (not externally visible) that leads to failure

• Fault: root cause of an error

FTA Schema

Schema diagram entities: platform, node, component, event_trace, creator, node_perf, event_state

Code lists: component_type codes, event_type codes, event_end reason codes

• Resource (versus job or user) centric

• Available as raw, tabbed, or relational database (MySQL)

• Event-based

• Codes for different components, events, and errors

• Extensibility

• Associated metadata

• Balance between completeness and sparseness
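
As a rough illustration of what a resource-centric, event-based record could look like, the Python sketch below models one row of an event trace. Only the name event_trace comes from the schema above. The field names (node_id, start_time, end_time, event_type, end_reason) and the two code values are hypothetical placeholders, not the actual FTA columns or codes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical code values; the real FTA code lists are not reproduced here.
EVENT_TYPE_UNAVAILABLE = 0
EVENT_TYPE_AVAILABLE = 1


@dataclass(frozen=True)
class EventTraceRow:
    """One event for one resource (node): resource-centric and event-based."""
    node_id: int
    start_time: float                  # seconds since trace start (assumed unit)
    end_time: float
    event_type: int                    # one of the hypothetical codes above
    end_reason: Optional[int] = None   # sparse field: left empty when unknown

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


def availability_durations(rows):
    """Return the lengths of all availability intervals in a list of rows."""
    return [r.duration for r in rows if r.event_type == EVENT_TYPE_AVAILABLE]


rows = [
    EventTraceRow(node_id=1, start_time=0.0, end_time=3600.0,
                  event_type=EVENT_TYPE_AVAILABLE),
    EventTraceRow(node_id=1, start_time=3600.0, end_time=4200.0,
                  event_type=EVENT_TYPE_UNAVAILABLE),
    EventTraceRow(node_id=1, start_time=4200.0, end_time=9000.0,
                  event_type=EVENT_TYPE_AVAILABLE),
]
print(availability_durations(rows))
```

Leaving end_reason optional is one way to picture the balance between completeness and sparseness: a trace that does not record why an event ended can still be stored.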

Data Quality Assessment

• Syntactic: standard format library that checks data types and number of fields (automated)

• Semantic: time moves forward and is non-overlapping, state is valid (automated)

• Visual: look at the distribution for outliers (manual)
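
The automated semantic check can be sketched in a few lines of Python. This is a minimal illustration, not the FTA library itself: the tuple layout (start, end, state) and the set of valid state codes are assumptions made for the example.

```python
VALID_STATES = {0, 1}  # hypothetical codes: 0 = unavailable, 1 = available


def semantic_errors(events):
    """Check one node's events, given as (start, end, state) tuples in file order.

    Returns a list of human-readable problems; an empty list means the
    trace passed these checks.
    """
    problems = []
    previous_end = None
    for index, (start, end, state) in enumerate(events):
        if state not in VALID_STATES:
            problems.append(f"event {index}: invalid state {state!r}")
        if end < start:
            problems.append(f"event {index}: ends before it starts")
        if previous_end is not None and start < previous_end:
            problems.append(f"event {index}: overlaps or precedes the previous event")
        # Track the furthest end time seen so far.
        previous_end = end if previous_end is None else max(previous_end, end)
    return problems


good = [(0, 10, 1), (10, 12, 0), (12, 40, 1)]
bad = [(0, 10, 1), (8, 12, 0), (12, 11, 7)]
print(semantic_errors(good))
print(semantic_errors(bad))
```

The good trace yields an empty list, while the bad trace reports one overlap, one invalid state, and one interval that ends before it starts.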

Data Sets

• Usage (P2P, supercomputers, grids, desktop PCs)

• Type (CPU, network, IO)

• Scale (50-240,000 hosts)

• Volatility (minutes to days)

• Resolution (wrt failure detection)

Currently 21 Data Sets

http://fta.inria.fr

Statistical Analysis

FTA Toolbox

[Diagram: workflow stages initialize, query, process, finalize; input: MySQL trace database; output formats: text, html, wiki, latex]

• Makes it easy to run a set of statistical measures across all the data sets

• Provides library of functions that can be reused and incorporated

• Implemented in Matlab

• svn checkout svn://scm.gforge.inria.fr/svn/fta/toolbox
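
The four toolbox stages can be pictured with the Python sketch below. It is not the Matlab toolbox and does not reproduce its functions. An in-memory SQLite database stands in for the MySQL trace database, the column names and the demoA/demoB data sets are invented for the example, and only the table name event_trace is taken from the schema.

```python
import sqlite3
import statistics


def initialize():
    """Open the trace database (in-memory SQLite stands in for MySQL here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE event_trace (dataset TEXT, node_id INTEGER, "
        "start_time REAL, end_time REAL, available INTEGER)"
    )
    conn.executemany(
        "INSERT INTO event_trace VALUES (?, ?, ?, ?, ?)",
        [
            ("demoA", 1, 0, 100, 1), ("demoA", 1, 100, 120, 0),
            ("demoA", 1, 120, 420, 1),
            ("demoB", 7, 0, 50, 1), ("demoB", 7, 50, 90, 0),
            ("demoB", 7, 90, 140, 1),
        ],
    )
    return conn


def query(conn, dataset):
    """Fetch availability interval lengths for one data set."""
    cursor = conn.execute(
        "SELECT end_time - start_time FROM event_trace "
        "WHERE dataset = ? AND available = 1 ORDER BY start_time",
        (dataset,),
    )
    return [row[0] for row in cursor]


def process(lengths):
    """Compute a small set of statistical measures."""
    return {"count": len(lengths), "mean": statistics.mean(lengths),
            "max": max(lengths)}


def finalize(results):
    """Render the results as plain text (one of several possible formats)."""
    lines = [f"{name}: n={m['count']} mean={m['mean']:.1f} max={m['max']:.1f}"
             for name, m in sorted(results.items())]
    return "\n".join(lines)


conn = initialize()
results = {name: process(query(conn, name)) for name in ("demoA", "demoB")}
report = finalize(results)
print(report)
```

Because every data set goes through the same query and process functions, the same measures are computed everywhere, which is what makes the results comparable.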

Failure Modelling

• Approach

• Model availability and unavailability intervals, each with a single probability distribution

• Assume availability and unavailability intervals are independently and identically distributed

• Descriptive, not prescriptive
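
To make the modelling assumptions concrete, the Python sketch below generates a synthetic trace in which availability and unavailability lengths are each drawn independently from one distribution. The choice of Weibull for availability and log-normal for unavailability, and every parameter value, are illustrative assumptions and are not fitted to any FTA data set.

```python
import random


def synthetic_trace(horizon, rng, avail_shape=0.7, avail_scale=3600.0,
                    unavail_mu=6.0, unavail_sigma=1.0):
    """Generate alternating (start, end, state) intervals up to `horizon` seconds.

    Availability lengths follow Weibull(scale, shape); unavailability lengths
    follow log-normal(mu, sigma). Every draw is independent (i.i.d. assumption).
    """
    intervals = []
    now = 0.0
    available = True
    while now < horizon:
        if available:
            length = rng.weibullvariate(avail_scale, avail_shape)
        else:
            length = rng.lognormvariate(unavail_mu, unavail_sigma)
        intervals.append((now, now + length, "avail" if available else "unavail"))
        now += length
        available = not available
    return intervals


trace = synthetic_trace(horizon=7 * 24 * 3600.0, rng=random.Random(42))
up = sum(end - start for start, end, state in trace if state == "avail")
total = trace[-1][1]
print(f"{len(trace)} intervals, fraction of time available: {up / total:.2f}")
```

Such a generator describes how a system behaved on average. It does not predict when a particular host will fail.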

Distributions of Availability and Unavailability Intervals

Qualitative Description

Model Fitting

• For each candidate probability distribution

• Compute parameters that maximize the distribution’s likelihood

• Measure goodness of fit using Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests

• Compute p-value using 30 samples. Take average of 1000 p-values
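
The fitting procedure can be sketched in standard-library Python. Several simplifications are assumed here: the exponential is the only candidate (its maximum-likelihood estimate has a closed form), only the KS test is shown and the AD test is omitted, the p-value uses the asymptotic Kolmogorov series with a common small-sample correction, and the input data is synthetic.

```python
import math
import random


def fit_exponential(data):
    """Maximum-likelihood estimate of the exponential rate: 1 / sample mean."""
    return len(data) / sum(data)


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    ordered = sorted(sample)
    n = len(ordered)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(ordered)
    )


def ks_p_value(d, n):
    """Approximate KS p-value from the asymptotic Kolmogorov series."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:  # the series converges poorly here and the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def average_p_value(data, cdf, rng, sample_size=30, repeats=1000):
    """Average the KS p-value over many small random subsamples of the data."""
    p_values = []
    for _ in range(repeats):
        subsample = rng.sample(data, sample_size)
        p_values.append(ks_p_value(ks_statistic(subsample, cdf), sample_size))
    return sum(p_values) / repeats


rng = random.Random(0)
data = [rng.expovariate(1 / 500.0) for _ in range(5000)]  # synthetic lengths
rate = fit_exponential(data)
avg_p = average_p_value(data, lambda x: 1.0 - math.exp(-rate * x), rng)
print(f"fitted rate={rate:.5f}, average p-value={avg_p:.2f}")
```

Testing small subsamples of 30 and averaging 1000 p-values is a way to avoid the test rejecting every candidate simply because the full data set is very large.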

P-Values for KS & AD Goodness-of-fit Tests

[Tables of p-values: Availability, Unavailability]

p-value < 0.05 or 0.10 ⇒ reject H0 that data came from fitted distribution

• (Un)availability generally not heavy-tailed

• Exponential usually not a good fit

• Gamma a good fit. Amenable for Markov Models

• Weibull and Log-Normal provide best fit

Parameters of Distributions

Availability Unavailability

μ: mean, σ: std dev., k: shape, λ: scale

k < 1, ∴ decreasing hazard rate
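
The step from k < 1 to a decreasing hazard rate follows from the Weibull hazard function. Assuming the standard Weibull parameterization with shape k and scale λ (the symbols listed above), a short derivation is:

```latex
\begin{align*}
f(t) &= \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} e^{-(t/\lambda)^{k}}, \qquad
S(t) = 1 - F(t) = e^{-(t/\lambda)^{k}}, \\
h(t) &= \frac{f(t)}{S(t)} = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1}, \\
\frac{dh}{dt} &= \frac{k(k-1)}{\lambda^{2}}\left(\frac{t}{\lambda}\right)^{k-2} < 0
\quad \text{for } 0 < k < 1,\ t > 0 .
\end{align*}
```

For availability intervals this reads as: the longer a resource has already been available, the lower its instantaneous failure rate.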

Can different interpretations of trace data sets affect the model?

Ambiguous Data Sets

Data Set     Ambiguity                                Interpretation
G5K06        Monitored state is an error or failure   error
G5K06B       Monitored state is an error or failure   failure
LANL0516     Overlapping intervals                    union
LANL0516B    Overlapping intervals                    intersection
ND07CPU      Definition of idleness                   w/o user and CPU load for 15 mins
ND07CPUB     Definition of idleness                   CPU load < 10%
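
The union and intersection interpretations of overlapping intervals can be illustrated with the Python sketch below. It shows one plausible reading, assumed for the example: intervals that overlap (directly or through a chain) are grouped, then each group is collapsed either to its full span or to the part shared by all of its members. The actual LANL parsing scripts may define this differently.

```python
def clusters(intervals):
    """Group (start, end) intervals that overlap, directly or through a chain."""
    groups = []
    reach = None  # furthest end seen in the current group
    for start, end in sorted(intervals):
        if groups and start < reach:
            groups[-1].append((start, end))
            reach = max(reach, end)
        else:
            groups.append([(start, end)])
            reach = end
    return groups


def as_union(intervals):
    """Collapse each group to the span covered by any of its intervals."""
    return [(min(s for s, _ in g), max(e for _, e in g))
            for g in clusters(intervals)]


def as_intersection(intervals):
    """Collapse each group to the span shared by all of its intervals."""
    result = []
    for g in clusters(intervals):
        start, end = max(s for s, _ in g), min(e for _, e in g)
        if start < end:  # drop groups with no common part
            result.append((start, end))
    return result


reported = [(0, 10), (5, 20), (30, 40)]
print(as_union(reported))         # [(0, 20), (30, 40)]
print(as_intersection(reported))  # [(5, 10), (30, 40)]
```

In this small example the first unavailability period lasts 20 time units under the union reading and 5 under the intersection reading, so interval statistics computed from the same raw data can differ considerably.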

QQ Plots for Ambiguous Data Sets

[Figure: three QQ plots. Panel 1: quantiles of g5k06 fit vs. quantiles of g5k06B fit (axes 0 to 1000). Panel 2: quantiles of lanl0516 fit vs. quantiles of lanl0516B fit (axes 0 to 200). Panel 3: quantiles of nd07cpu fit vs. quantiles of nd07cpuB fit (axes 0 to 300).]
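
The points of a QQ plot like those above can be computed with the Python sketch below. It pairs quantiles of two samples rather than of two fitted distributions, and the Weibull parameters and the factor 1.5 are made up for the example. Points on the diagonal y = x would mean the two interpretations lead to the same distribution.

```python
import random
import statistics


def qq_points(sample_a, sample_b, n=100):
    """Pair up the n-1 quantile cut points of two samples for a QQ plot."""
    qa = statistics.quantiles(sample_a, n=n)
    qb = statistics.quantiles(sample_b, n=n)
    return list(zip(qa, qb))


def max_relative_gap(points):
    """Largest relative distance of any QQ point from the diagonal y = x."""
    return max(abs(a - b) / max(abs(a), abs(b)) for a, b in points if a or b)


rng = random.Random(7)
base = [rng.weibullvariate(100.0, 0.7) for _ in range(2000)]
stretched = [1.5 * value for value in base]  # second, hypothetical interpretation

print(max_relative_gap(qq_points(base, base)))
print(round(max_relative_gap(qq_points(base, stretched)), 3))
```

Comparing a sample with itself gives a gap of zero, while the stretched sample sits on a line of slope 1.5, away from the diagonal.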

Distribution Parameters for Ambiguous Data Sets

[Table of fitted distribution parameters. μ: mean, σ: std dev., k: shape, λ: scale]

• Mean of G5K06B 1.5 times greater than G5K06

• Gamma scale parameter often significantly different

How to identify interpretation?

• Parsing script is the exact interpretation

• Meaning explained in comments

• Publicly accessible in svn

• Format supports different interpretations of availability

• Can have multiple event_traces corresponding to different definitions of availability

• So each interpretation can be uniquely identified

How to resolve differences of interpretation?

• Determine which interpretation affects the application. (E.g. G5K06)

• Determine most common interpretation, or interpretation that is the lowest common denominator (E.g. ND07CPU)

• Exclude period of ambiguity or post-process it so that it is consistent with rest of data set (E.g. LANL05)

Future Directions

• Call to arms: trace data exists in many production environments, but not always accessible

• Include more production systems

• Types of failures

• Causes of failures

• State before failures

• Automated trace collection

• Failure models and algorithms

• Integration of job and resource failures

Acknowledgements

• All contributors of trace data to the FTA

• INRIA ALEAE project directed by Emmanuel Jeannot

• Feedback from Cecile Germain, Eric Heien, Artur Andrzejak, anonymous reviewers

Summary

• FTA: Data Sets, Format, Tools

• http://fta.inria.fr

• High-level modelling and statistical characterization of 9 data sets

• Slight differences in interpretation make significant difference in model

• Got data? Questions? Please email [email protected] or any other FTA team member

Thank you