fault tolerance techniques for high-performance computing...
TRANSCRIPT
Proba models Buddy Proba models 2
Fault tolerance techniques for high-performance computing
Part 3
Anne Benoit
ENS Lyon
http://graal.ens-lyon.fr/~abenoit
CR02 - 2016/2017
[email protected] CR02 Fault tolerance (3) 1/ 51
Outline
1 Probabilistic models
2 In-memory checkpointing
- Double checkpointing algorithm
- Analysis
- Triple checkpointing algorithm
- Experiments
3 Probabilistic models for advanced methods
- Failure prediction
- Replication
Motivation
Checkpoint transfer and storage ⇒ critical issues of rollback/recovery protocols
Stable storage: high cost
Distributed in-memory storage:
- Store checkpoints in local memory ⇒ no centralized storage (much better scalability)
- Replicate checkpoints ⇒ application survives single failure
- Still, risk of fatal failure in some (unlikely) scenarios
Double checkpoint algorithm
Platform nodes partitioned into pairs
Each node in a pair exchanges its checkpoint with its buddy
Each node saves two checkpoints:
- one locally: storing its own data
- one remotely: receiving and storing its buddy’s data
Two algorithms:
• blocking version by Zheng, Shi and Kale
• non-blocking version by Ni, Meneses and Kale
Non-blocking checkpoint algorithm
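[Figure: timeline of one period P on node p and its buddy p', split into three phases of lengths δ, θ and σ, with markers "Local checkpoint done", "Remote checkpoint done" and "Period done"; overhead φ during θ, full speed 1 during σ.]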
Checkpoints taken periodically, with period P = δ + θ + σ
Phase 1, length δ: local checkpoint, blocking mode. No work
Phase 2, length θ: remote checkpoint. Overhead φ
Phase 3, length σ: application at full speed 1
Work in failure-free period:
W = (θ − φ) + σ = P − δ − φ
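As a quick sanity check, the two expressions for W can be compared numerically. This is only a minimal sketch: the function name `failure_free_work` and the sample values are illustrative assumptions, not part of the slides.

```python
def failure_free_work(delta, theta, sigma, phi):
    """Work units completed in one failure-free period P = delta + theta + sigma.

    Phase 1 (delta): blocking local checkpoint, no work.
    Phase 2 (theta): remote checkpoint with overhead phi, so theta - phi work.
    Phase 3 (sigma): application at full speed, so sigma work.
    """
    if not 0 <= phi <= theta:
        raise ValueError("overhead phi must satisfy 0 <= phi <= theta")
    return (theta - phi) + sigma


# Illustrative values only (hypothetical, in seconds).
delta, theta, sigma, phi = 2, 44, 300, 0
P = delta + theta + sigma
W = failure_free_work(delta, theta, sigma, phi)
print(W, P - delta - phi)  # the two expressions for W agree
```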
Cost of overlap
Overlap computations and checkpoint file exchanges
Large θ ⇒ more flexibility to hide cost of file exchange ⇒ smaller overhead φ
θ = θmin: fastest communication, fully blocking ⇒ φ = θmin
θ = θmax: full overlap with computation ⇒ φ = 0
Linear interpolation θ(φ) = θmin + α(θmin − φ)
φ = 0 for θ = θmax = (1 + α)θmin
α: rate of overhead decrease w.r.t. communication length
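The linear interpolation between the two extreme cases can be sketched as follows. The function name `theta_of_phi` is an illustrative assumption; the sample values θmin = 4 and α = 10 are borrowed from the Base scenario of the Experiments part, where R = θmin = 4.

```python
def theta_of_phi(phi, theta_min, alpha):
    """Length of the remote-checkpoint phase for a given overhead phi.

    Linear interpolation: theta(phi) = theta_min + alpha * (theta_min - phi).
    phi = theta_min gives theta_min (fully blocking exchange);
    phi = 0 gives theta_max = (1 + alpha) * theta_min (full overlap).
    """
    if not 0 <= phi <= theta_min:
        raise ValueError("phi must satisfy 0 <= phi <= theta_min")
    return theta_min + alpha * (theta_min - phi)


for phi in (4, 2, 0):
    print(phi, theta_of_phi(phi, theta_min=4, alpha=10))
```

A larger α means that each second of overhead saved costs more additional communication time.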
Assessing the risk
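[Figure: timeline of nodes p and p' over a period P (phases δ, θ, σ) with a failure striking p after tlost, followed by downtime D and recovery R on the node that replaces p. Labels: "Checkpoint of p", "Checkpoint of p'", "Risk Period", "Node to replace p".]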
After failure: downtime D and recovery from buddy node
Two checkpoint files lost, must be re-sent to faulty processor:
1 Checkpoint of faulty node, needed for recovery ⇒ sent as fast as possible, in time R = θmin
2 Checkpoint of buddy node, needed in case buddy fails later on ⇒ ??
Application at risk until complete reception of both messages
Checkpoint of buddy node
Scenario DoubleNBL
File sent at same speed as in regular mode, in time θ(φ)
Overhead φ
Favors performance, at the price of higher risk
Scenario DoubleBoF
File sent as fast as possible, in time θmin = R
Overhead R
Favors risk reduction, at the price of higher overhead
Computing the waste
Waste = fraction of time where nodes do not perform useful computations
Tbase: base time without any overhead due to resilience
Tff: time for fault-free execution
Period P ⇒ W = P − δ − φ work units
$T_{\text{ff}} = \frac{P}{W}\, T_{\text{base}}$
$\left(1 - \frac{\delta + \phi}{P}\right) T_{\text{ff}} = T_{\text{base}}$
T: expectation of total execution time
→ single application
→ platform life (many jobs running concurrently)
On average, failures occur every µ seconds
→ platform MTBF µ = µind/p
For each failure, F seconds are lost:
$T = T_{\text{ff}} + \frac{T}{\mu} F$
$\left(1 - \frac{F}{\mu}\right)\left(1 - \frac{\delta + \phi}{P}\right) T = T_{\text{base}}$
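$(1 - \text{Waste})\, T = T_{\text{base}}$
$\text{Waste} = 1 - \left(1 - \frac{F}{\mu}\right)\left(1 - \frac{\delta + \phi}{P}\right)$
Two sources of overhead:
$\text{Waste}_{\text{ff}} = \frac{\delta + \phi}{P}$: checkpointing in a fault-free execution
$\text{Waste}_{\text{fail}} = \frac{F}{\mu}$: failures striking during execution
$\text{Waste} = \text{Waste}_{\text{fail}} + \text{Waste}_{\text{ff}} - \text{Waste}_{\text{fail}} \text{Waste}_{\text{ff}}$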
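The waste formula translates directly into a few lines of code. The sketch below is illustrative: the function name `waste` and the numeric values are assumptions chosen so that both sources of overhead equal 10%.

```python
def waste(F, mu, delta, phi, P):
    """Total waste for period P, platform MTBF mu and time F lost per failure.

    waste_ff   = (delta + phi) / P : checkpointing overhead without failures
    waste_fail = F / mu            : time lost to failures
    The two combine as 1 - (1 - waste_fail) * (1 - waste_ff).
    """
    waste_ff = (delta + phi) / P
    waste_fail = F / mu
    return waste_fail + waste_ff - waste_fail * waste_ff


# Hypothetical values: 10% fault-free overhead and 10% failure overhead.
print(waste(F=10, mu=100, delta=2, phi=2, P=40))
```

Because of the product term, the total here is slightly below the 20% that a plain sum of the two overheads would give.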
Time lost due to failures
Scenario DoubleNBL
$F_{\text{nbl}} = D + R + \frac{\delta}{P} RE_1 + \frac{\theta}{P} RE_2 + \frac{\sigma}{P} RE_3$
Failure during third part of period
No work during D + R
Then re-execution of Wlost = (θ − φ) + tlost
- First θ seconds: overhead φ (receiving buddy checkpoint)
- Then full speed
E(tlost) = σ/2 (failures strike uniformly)
$RE_3 = \theta + \frac{\sigma}{2}$
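The slides only detail the third case. As a hedged reconstruction, the values of $RE_1$ and $RE_2$ below are an assumption obtained by applying the same reasoning to the first two phases (they are not stated in the slides); they are consistent with the closed form $F_{\text{nbl}} = D + R + \theta + P/2$ used for waste minimization.

```latex
\begin{align*}
RE_1 &= \theta + \sigma + \frac{\delta}{2}
  && \text{(assumed: failure during the local checkpoint)} \\
RE_2 &= \theta + \sigma + \delta + \frac{\theta}{2}
  && \text{(assumed: failure during the remote checkpoint)} \\
RE_3 &= \theta + \frac{\sigma}{2}
  && \text{(from the slide)} \\
\frac{\delta}{P} RE_1 + \frac{\theta}{P} RE_2 + \frac{\sigma}{P} RE_3
  &= \theta + \frac{(\delta + \theta + \sigma)^2}{2P}
   = \theta + \frac{P}{2}
  && \text{since } P = \delta + \theta + \sigma
\end{align*}
```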
Waste minimization
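Scenario DoubleNBL: $F_{\text{nbl}} = D + R + \theta + \frac{P}{2}$
$T_O^{\text{nbl}} = \sqrt{2(\delta + \phi)(\mu - R - D - \theta)}$
Scenario DoubleBoF: $F_{\text{bof}} = F_{\text{nbl}} + R - \phi$
$T_O^{\text{bof}} = \sqrt{2(\delta + \phi)(\mu - 2R - D - \theta + \phi)}$
Not the same δ as in Young/Daly for coordinated checkpointing on global remote storage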
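Setting the derivative of the waste with respect to P to zero, with F = A + P/2, gives P² = 2(δ + φ)(µ − A), which is where the two square roots come from. The sketch below evaluates both closed forms; the function names and the numeric values are illustrative assumptions.

```python
import math


def optimal_period_nbl(delta, phi, theta, mu, D, R):
    """Waste-minimizing period for DoubleNBL, where F = D + R + theta + P/2."""
    slack = mu - R - D - theta
    if slack <= 0:
        raise ValueError("MTBF too small for the formula to apply")
    return math.sqrt(2 * (delta + phi) * slack)


def optimal_period_bof(delta, phi, theta, mu, D, R):
    """Waste-minimizing period for DoubleBoF, where F_bof = F_nbl + R - phi."""
    slack = mu - 2 * R - D - theta + phi
    if slack <= 0:
        raise ValueError("MTBF too small for the formula to apply")
    return math.sqrt(2 * (delta + phi) * slack)


# Hypothetical values (seconds).
params = dict(delta=1, phi=1, theta=4, mu=58, D=2, R=2)
print(optimal_period_nbl(**params), optimal_period_bof(**params))
```

DoubleBoF loses more time per failure, so its optimal period is never longer than that of DoubleNBL for the same parameters (recall φ ≤ θmin = R).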
Risk
Application at risk until complete reception of both messages:
Risk = D + R + θ for DoubleNBL
Risk = D + 2R for DoubleBoF
Analysis:
Failures strike with uniform distribution over time
λ = 1/(nµ): instantaneous processor failure rate
Success probability $P_{\text{double}} = (1 - 2\lambda^2 T\, \text{Risk})^{n/2}$
Consider a pair made of one processor and its buddy:
Probability of first processor failing: λT
Probability of one failure in the pair: 1 − (1 − λT)² ≈ 2λT
Probability of second failure within risk period: λ Risk
Probability of fatal failure in the pair: (2λT)(λ Risk)
Probability of application fatal failure: $1 - (1 - 2\lambda^2 T\, \text{Risk})^{n/2}$
Success probability $P_{\text{double}} = (1 - 2\lambda^2 T\, \text{Risk})^{n/2}$
Compare to $P_{\text{base}} = (1 - \lambda T_{\text{base}})^{n}$
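The success probabilities can be evaluated as in the following sketch. All function names are illustrative assumptions; the risk durations follow the two expressions Risk = D + R + θ (DoubleNBL) and Risk = D + 2R (DoubleBoF), and µ is the platform MTBF, so that λ = 1/(nµ).

```python
def risk_nbl(D, R, theta):
    """Risk period for DoubleNBL: D + R + theta."""
    return D + R + theta


def risk_bof(D, R):
    """Risk period for DoubleBoF: D + 2R."""
    return D + 2 * R


def success_probability_double(n, mu, T, risk):
    """P_double = (1 - 2 * lam^2 * T * risk)^(n/2) with lam = 1 / (n * mu)."""
    lam = 1.0 / (n * mu)
    return (1.0 - 2.0 * lam ** 2 * T * risk) ** (n / 2)


def success_probability_base(n, mu, T_base):
    """P_base = (1 - lam * T_base)^n: no fault tolerance at all."""
    lam = 1.0 / (n * mu)
    return (1.0 - lam * T_base) ** n


# Tiny hypothetical platform: one pair of nodes, time unit = platform MTBF.
print(success_probability_double(n=2, mu=1.0, T=1.0, risk=0.1))
print(success_probability_base(n=2, mu=1.0, T_base=1.0))
```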
Principle
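[Figure: timeline of one period P on nodes p, p' and p", with phases labelled θ and σ and markers "Remote checkpoint done on preferred buddy", "Remote checkpoint done on secondary buddy" and "Period done"; overhead φ during the checkpoint exchanges, full speed 1 afterwards.]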
Processors organized in triples
Each processor has a preferred buddy and a secondary buddy
Rotation of buddies
Waste in fault-free execution tends to zero
Application failure = three successive failures within a triple ⇒ Smaller risk even for large θ
Only need non-blocking version Triple
Memory requirement
Copy-on-write for local checkpoint file
Same memory usage as double checkpointing algorithm
Analysis
Waste
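Wastefail same as for DoubleNBL
$\text{Waste}_{\text{ff}} = \frac{2\phi}{P}$ instead of $\text{Waste}_{\text{ff}} = \frac{\delta + \phi}{P}$ for DoubleNBL
Risk
Risk = D + R + 2θ
Success probability $P_{\text{triple}} = (1 - 6\lambda^3 T\, \text{Risk}^2)^{n/3}$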
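A minimal sketch of the Triple success probability, under the same conventions as for the double algorithm (λ = 1/(nµ), µ the platform MTBF). The function name and the sample values are illustrative assumptions.

```python
def success_probability_triple(n, mu, T, D, R, theta):
    """P_triple = (1 - 6 * lam^3 * T * Risk^2)^(n/3), Risk = D + R + 2 * theta."""
    lam = 1.0 / (n * mu)
    risk = D + R + 2 * theta
    return (1.0 - 6.0 * lam ** 3 * T * risk ** 2) ** (n / 3)


# Tiny hypothetical platform: a single triple, time unit = platform MTBF.
print(success_probability_triple(n=3, mu=1.0, T=1.0, D=0.0, R=0.0, theta=0.25))
```

The fatal-failure term is cubic in λ instead of quadratic, which is why a longer risk period remains affordable.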
Scenarios
Scenario | D  | δ  | φ          | R  | α  | n
Base     | 0  | 2  | 0 ≤ φ ≤ 4  | 4  | 10 | 324 × 32
Exa      | 60 | 30 | 0 ≤ φ ≤ 60 | 60 | 10 | 10^6
Exa corresponds to the Exa-Slim scenario
Waste for scenario Base
[Figure: three surface plots over φ/R (0 to 1) and M (s) (1 min to 1 day). Panels: DoubleBoF and DoubleNBL, vertical axis "Waste Difference (w/ Triple)" (0 to 0.25); Triple, vertical axis "Waste" (0 to 1).]
Waste as a function of φ/R and µ
Waste for scenario Base (µ = 7h)
[Figure: waste ratio (0.2 to 1.2) as a function of φ/R (0 to 1). Curves: DoubleBoF/DoubleNBL and Triple/DoubleNBL.]
Success probability for scenario Base
[Figure: two surface plots of the success probability ratio (0 to 1) over M (minutes) and platform exploitation (days). Panels: Ratio DoubleNBL/DoubleBoF and Ratio DoubleBoF/Triple.]
Relative success probability, function of µ and platform life T (θ = (α + 1)R)
Waste for scenario Exa
[Figure: three surface plots over φ/R (0 to 1) and M (1 min to 1 day). Panels: DoubleBoF and DoubleNBL, vertical axis "Waste Difference (w/ Triple)" (0 to 0.25); Triple, vertical axis "Waste" (0 to 1).]
Waste as a function of φ/R and µ
Waste for scenario Exa (µ = 7h)
[Figure: waste ratio (0.7 to 1.15) as a function of φ/R (0 to 1). Curves: DoubleBoF/DoubleNBL and Triple/DoubleNBL.]
Success probability for scenario Exa
[Figure: two surface plots of the success probability ratio over M (minutes) and platform exploitation (weeks). Panels: Ratio DoubleNBL/DoubleBoF and Ratio DoubleBoF/Triple.]
Relative success probability, function of µ and platform life T (θ = (α + 1)R)
Conclusion
Triple checkpointing
Save checkpoint on two remote processes instead of one, without much more memory or storage requirements
Excellent success probability, almost no failure-free overhead
Assessment of performance and risk factors using a unified model
Realistic scenarios show the superiority of Triple
Future work
Study real-life applications and propose refined values for α for a set of widely-used benchmarks
Very small MTBF values on future exascale platforms ⇒ combine distributed in-memory strategies with uncoordinated or hierarchical checkpointing protocols
Framework
Predictor
Exact prediction dates (at least C seconds in advance)
Recall r : fraction of faults that are predicted
Precision p: fraction of fault predictions that are correct
Events
true positive: predicted faults
false positive: fault predictions that did not materialize as actual faults
false negative: unpredicted faults
Fault rates
µ: mean time between failures (MTBF)
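µP: mean time between predicted events (both true positive and false positive)
µNP: mean time between unpredicted faults (false negative)
µe: mean time between events (including three event types)
$r = \frac{\text{True}_P}{\text{True}_P + \text{False}_N}$ and $p = \frac{\text{True}_P}{\text{True}_P + \text{False}_P}$
$\frac{1 - r}{\mu} = \frac{1}{\mu_{NP}}$ and $\frac{r}{\mu} = \frac{p}{\mu_P}$
$\frac{1}{\mu_e} = \frac{1}{\mu_P} + \frac{1}{\mu_{NP}}$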
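The three relations determine µP, µNP and µe from the MTBF µ, the recall r and the precision p. A minimal sketch, in which the function name `event_rates` is an illustrative assumption and the degenerate cases r = 0 and r = 1 are excluded for simplicity:

```python
def event_rates(mu, r, p):
    """Return (mu_P, mu_NP, mu_e) from MTBF mu, recall r and precision p.

    (1 - r) / mu = 1 / mu_NP   : unpredicted faults
    r / mu       = p / mu_P    : true positives, counted in two ways
    1 / mu_e     = 1 / mu_P + 1 / mu_NP
    """
    if not (0 < r < 1 and 0 < p <= 1):
        raise ValueError("this sketch assumes 0 < r < 1 and 0 < p <= 1")
    mu_np = mu / (1 - r)
    mu_p = p * mu / r
    mu_e = 1 / (1 / mu_p + 1 / mu_np)
    return mu_p, mu_np, mu_e


# Hypothetical predictor: MTBF of one hour, recall 0.8, precision 2/3.
print(event_rates(mu=3600.0, r=0.8, p=2 / 3))
```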
Example
[Figure: timeline of length t with three rows. Actual faults: five errors. Predictor: six predictions. Overlap: events marked "F+P", "pred." or "Error".]
Predictor predicts six faults in time t
Five actual faults. One fault not predicted
µ = t/5, µP = t/6, and µNP = t
Recall r = 4/5 (green arrows over red arrows)
Precision p = 4/6 (green arrows over blue arrows)
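As a worked check, the values of this example satisfy the relations between fault rates. The value of $\mu_e$ is not given on the slide; it follows from counting seven events (six predictions plus one unpredicted fault) in time t.

```latex
\begin{align*}
\frac{1 - r}{\mu} &= \frac{1/5}{t/5} = \frac{1}{t} = \frac{1}{\mu_{NP}} \\
\frac{r}{\mu} &= \frac{4/5}{t/5} = \frac{4}{t} = \frac{4/6}{t/6} = \frac{p}{\mu_P} \\
\frac{1}{\mu_e} &= \frac{1}{\mu_P} + \frac{1}{\mu_{NP}} = \frac{6}{t} + \frac{1}{t} = \frac{7}{t}
\end{align*}
```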
Algorithm
1 While no fault prediction is available:
• checkpoints taken periodically with period T
2 When a fault is predicted at time t:
• take a checkpoint ALAP (completion right at time t)
• after the checkpoint, complete the execution of the period
Computing the waste
1 Fault-free execution: Waste[FF] = C/T
[Figure: timeline alternating time spent working (processing the first chunk, processing the second chunk) and time spent checkpointing.]
2 Unpredicted faults: $\frac{1}{\mu_{NP}}\left[D + R + \frac{T}{2}\right]$
[Figure: timeline of periods of T − C work, each followed by a checkpoint C; an error strikes after Tlost and is followed by downtime D and recovery R.]
3 Predictions: $\frac{1}{\mu_P}\left[p(C + D + R) + (1 - p)C\right]$
[Figure: two timelines in which a predicted failure interrupts a period after Wreg work and triggers a proactive checkpoint Cp. With actual fault (true positive): the error is followed by D and R. No actual fault (false positive): execution resumes for the remaining T − Wreg − C.]
$\text{Waste}[fail] = \frac{1}{\mu}\left[(1 - r)\frac{T}{2} + D + R + \frac{r}{p} C\right] \Rightarrow T_{\text{opt}} \approx \sqrt{\frac{2\mu C}{1 - r}}$
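The closed forms for the waste with a predictor can be evaluated as follows. This is a sketch under stated assumptions: the function names and sample values are illustrative, and the two waste terms are returned separately rather than combined.

```python
import math


def waste_with_prediction(T, C, D, R, mu, r, p):
    """Return (Waste[FF], Waste[fail]) for period T with a fault predictor.

    Waste[FF]   = C / T
    Waste[fail] = ((1 - r) * T / 2 + D + R + (r / p) * C) / mu
    """
    waste_ff = C / T
    waste_fail = ((1 - r) * T / 2 + D + R + (r / p) * C) / mu
    return waste_ff, waste_fail


def optimal_period(mu, C, r):
    """First-order optimal period T_opt = sqrt(2 * mu * C / (1 - r)), r < 1."""
    if not 0 <= r < 1:
        raise ValueError("recall r must satisfy 0 <= r < 1")
    return math.sqrt(2 * mu * C / (1 - r))


# Hypothetical values (seconds): MTBF 100, checkpoint 2, recall 0.75, precision 0.5.
T = optimal_period(mu=100, C=2, r=0.75)
print(T, waste_with_prediction(T, C=2, D=0, R=0, mu=100, r=0.75, p=0.5))
```

With r = 0 (no prediction) the period reduces to Young's formula $\sqrt{2\mu C}$; a better recall lengthens the optimal period.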
Refinements
Use different value Cp for proactive checkpoints
Avoid checkpointing too frequently for false positives
⇒ Only trust predictions with some fixed probability q
⇒ Ignore predictions with probability 1 − q
Conclusion: trust predictor always or never (q = 0 or q = 1)
Trust prediction depending upon position in current period
⇒ Increase q when progressing
⇒ Break-even point Cp/p
With prediction windows
[Figure: three timelines. Regular mode: periods of TR − C work, each followed by a checkpoint C; an error after Tlost is followed by D and R. Prediction without failure: after Wreg work in regular mode, proactive mode covers the prediction window I with periods of TP − Cp work and checkpoints Cp, then regular mode resumes for TR − C − Wreg. Prediction with failure: same, with an error inside the window followed by D and R.]
Gets too complicated!
Replication
Systematic replication: efficiency < 50%
Can replication+checkpointing be more efficient than checkpointing alone?
Study by Ferreira et al. [SC’2011]: yes
Model by Ferreira et al. [SC’2011]
Parallel application comprising N processes
Platform with ptotal = 2N processors
Each process replicated → N replica-groups
When a replica is hit by a failure, it is not restarted
Application fails when both replicas in one replica-group have been hit by failures
Example
[Figure: timeline of four replica pairs (Pair1 to Pair4), each made of processors p1 and p2.]
The birthday problem
Classical formulation
What is the probability, in a set of m people, that two of them have the same birthday?
Relevant formulation
What is the average number of people required to find a pair with the same birthday?
$\text{Birthday}(N) = 1 + \int_0^{+\infty} e^{-x} (1 + x/N)^{N-1}\, dx$
The analogy
Two people with same birthday ≡ Two failures hitting same replica-group
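The integral can be evaluated exactly without numerical quadrature. Expanding $(1 + x/N)^{N-1}$ with the binomial theorem and using $\int_0^{+\infty} e^{-x} x^k\, dx = k!$ turns it into the finite sum $\sum_{k=0}^{N-1} \prod_{i=1}^{k} (N - i)/N$. The sketch below uses that sum; the function name `birthday` is an illustrative assumption.

```python
def birthday(N):
    """Average number of people needed to see a shared birthday among N days.

    Evaluates 1 + sum_{k=0}^{N-1} prod_{i=1}^{k} (N - i) / N, which equals
    1 + integral_0^inf exp(-x) * (1 + x / N)^(N - 1) dx.
    """
    total = 0.0
    term = 1.0  # empty product for k = 0
    for k in range(N):
        total += term
        term *= (N - (k + 1)) / N  # extend the product to k + 1 factors
    return 1.0 + total


print(birthday(2))    # 2 days: 2.5 people on average
print(birthday(365))  # about 24.6 people
```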
Differences with birthday problem
[Figure: N replica-groups numbered 1, 2, ..., i, ..., N]
N processes; each replicated twice
Uniform distribution of failures
First failure: each replica-group has probability 1/N to be hit
Second failure: can failed PE be hit?
Second failure cannot hit failed PE
Failure uniformly distributed over 2N − 1 PEs
Probability that replica-group i is hit by failure: 1/(2N − 1)
Probability that replica-group ≠ i is hit by failure: 2/(2N − 1)
Failure not uniformly distributed over replica-groups: this is not the birthday problem
Second failure can hit failed PE
Suppose failure hits replica-group i
If failure hits failed PE: application survives
If failure hits running PE: application killed
Not all failures hitting the same replica-group are equal: this is not the birthday problem
Correct analogy
[Figure: bins numbered 1, 2, 3, 4, ..., n, with balls thrown into them]
N bins, red and blue balls
Mean Number of Failures to Interruption (bring down application)
MNFTI = expected number of balls to throw until one bin gets one ball of each color
Exponential failures
Theorem: MNFTI = E(NFTI | 0) where
$E(NFTI \mid n_f) = \begin{cases} 2 & \text{if } n_f = N, \\ \frac{2N}{2N - n_f} + \frac{2N - 2n_f}{2N - n_f}\, E(NFTI \mid n_f + 1) & \text{otherwise.} \end{cases}$
E(NFTI | nf): expectation of number of failures to kill application, knowing that
• application is still running
• failures have already hit nf different replica-groups
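The recursion of the theorem can be evaluated bottom-up, starting from nf = N and working back to nf = 0. A minimal sketch, in which the function name `mnfti` is an illustrative assumption:

```python
def mnfti(N):
    """Mean Number of Failures To Interruption for N replica-groups of 2 PEs.

    Evaluates E(NFTI | nf) from nf = N down to nf = 0 using
    E(nf) = 2N / (2N - nf) + (2N - 2nf) / (2N - nf) * E(nf + 1), E(N) = 2.
    """
    expected = 2.0  # E(NFTI | N): each further failure kills with probability 1/2
    for nf in range(N - 1, -1, -1):
        expected = 2 * N / (2 * N - nf) + (2 * N - 2 * nf) / (2 * N - nf) * expected
    return expected


for N in (1, 2, 16, 1024):
    print(N, mnfti(N))
```

For N = 1 the result is 3: the first failure always hits a running PE, and each later failure then kills the application with probability 1/2.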
How do we obtain this result?