Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors
1 of 141
Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors
Viacheslav Izosimov, Petru Eles, Zebo Peng
Embedded Systems Lab (ESLAB), Linköping University, Sweden
Ilia Polian
Institute for Computer Science, Albert-Ludwigs-University of Freiburg, Germany
Paul Pop
Dept. of Informatics and Mathematical Modeling, Technical University of Denmark (DTU), Denmark
2 of 142
Motivation
Hard real-time safety-critical applications:
  Time-constrained
  Cost-constrained
  Quality-of-service
  Fault-tolerant
  etc.
Focus on transient and intermittent faults
3 of 143
Transient and Intermittent Faults
Radiation
Electromagnetic interference (EMI)
Lightning storms
Internal EMI
Crosstalk
Power supply fluctuations
Software errors (Heisenbugs)
Errors caused by transient (intermittent) faults have to be tolerated before they crash the system
4 of 144
Hardening
Improving the hardware architecture to reduce the error rate:
  Hardware redundancy (selective duplication of gates/units/nodes, dedicated additional hardware modules/flip-flops)
  Re-designing the hardware to reduce susceptibility to transient faults
  Using higher voltages / lower frequencies / larger transistor sizes
  Protecting with shields
Leads to lower performance:
  Use of technologies a few generations back
  Increase of the critical path and silicon area
Very expensive:
  Extra design effort / more expensive technologies
  More silicon / increase in the number of gates or computation units
  Low production volumes
Still may not guarantee the required reliability levels at affordable cost!
5 of 145
Software-level Fault Tolerance
Software fault tolerance: reliability increases with time redundancy
Leads to lower performance:
  Fault tolerance overheads
  Overheads due to error detection, voting, agreement
Low hardware cost
Often cannot guarantee the required reliability levels and, at the same time, meet deadlines!
6 of 146
Motivation
Fault tolerance against transient faults may lead to significant performance or cost overhead!
Neither hardening nor pure software-level fault tolerance can guarantee the required level of reliability…
A trade-off between hardware and software fault tolerance has to be addressed to provide a reliable and low-cost system!
7 of 147
Outline
Motivation
Architecture
Application example
Fault tolerance: hardening & re-execution
Hardening/re-execution trade-off
Problem formulation & design strategy
Experimental results
Conclusions
8 of 148
Architecture
Processes: re-execution
Computation nodes: hardening
Messages: fault-tolerant predictable protocol
(Figure: application with processes P1 to P5 and messages m1 and m2; computation nodes affected by transient faults.)
Error rates are given for each hardening version (h-version) of each computation node
The reliability goal is ρ = 1 − γ, where γ is the maximum probability of a system failure due to transient faults on any computation node within a time unit τ
9 of 149
Application Example
Hardening versions (h-versions) of computation node N1 for process P1, with reliability goal ρ = 1 − 10^-5:

h-version   t (ms)   p          Cost
h = 1       80       4·10^-2    10
h = 2       100      4·10^-4    20
h = 3       160      4·10^-6    40

t – worst-case execution time
p – process failure probability
Cost – h-version cost

With more hardening:
  Reliability increases: process failure probabilities decrease
  Worst-case execution times increase: hardening performance degradation (HPD)
  Cost increases
10 of 1410
System Failure Probability (SFP) Analysis
We have proposed a system failure probability (SFP) analysis to connect error rates and the reliability goal to the number of re-executions in software
The analysis is non-trivial, exact and safe
Example: P1 on N1 with h = 1 (t = 80, p = 4·10^-2), ρ = 1 − 10^-5, T = 360 ms
SFP result: main execution + k = 6 re-executions
11 of 1411
System Failure Probability (SFP) Analysis
Example: P1 on N1 with h = 2 (t = 100, p = 4·10^-4), ρ = 1 − 10^-5, T = 360 ms
SFP result: main execution + k = 2 re-executions
12 of 1412
System Failure Probability (SFP) Analysis
Example: P1 on N1 with h = 3 (t = 160, p = 4·10^-6), ρ = 1 − 10^-5, T = 360 ms
SFP result: main execution + k = 1 re-execution
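The three results for P1 (k = 6, 2 and 1 for h = 1, 2 and 3) can be reproduced with a minimal sketch. It assumes a single process on the node, in which case the probability of more than k faults reduces to p^(k+1), and it assumes 10000 periods T per reliability time unit, which is the exponent used in the deck's computation example. The function name and its parameters are illustrative and not part of the original material.

```python
import math

def min_reexecutions(p, gamma=1e-5, iterations=10_000, k_max=50):
    """Smallest k such that (1 - p**(k + 1))**iterations >= 1 - gamma.

    Assumes one process per node, so Pr(more than k faults) = p**(k + 1).
    """
    for k in range(k_max + 1):
        node_failure = p ** (k + 1)
        # Compare in the log domain so that (1 - tiny) is not rounded to 1.0.
        if iterations * math.log1p(-node_failure) >= math.log1p(-gamma):
            return k
    raise ValueError("reliability goal not reachable within k_max re-executions")

# Process failure probabilities of P1 on the three h-versions of N1.
for h, p in [(1, 4e-2), (2, 4e-4), (3, 4e-6)]:
    print(f"h = {h}: k = {min_reexecutions(p)}")
```

Under these assumptions the loop returns 6, 2 and 1 re-executions, matching the values on the slides.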
13 of 1413
Application Example
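Process P1 on computation node N1: ρ = 1 − 10^-5, D = 360 ms, recovery overhead μ = 20 ms

h-version   t (ms)   p          Cost
h = 1       80       4·10^-2    10
h = 2       100      4·10^-4    20
h = 3       160      4·10^-6    40

Worst-case schedules on N1 (P1/i is the i-th execution of P1):
h = 3: P1/1, P1/2
h = 2: P1/1, P1/2, P1/3
h = 1: P1/1, P1/2, P1/3, P1/4, P1/5, P1/6, P1/7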
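The trade-off on this slide can be checked with a small sketch. It assumes that the worst-case schedule length of a single process is t + k·(t + μ), that is, one main execution plus k re-executions that each pay the recovery overhead; this timing model and all names below are assumptions for illustration. The k values are the ones reported by the SFP analysis slides (6, 2 and 1), with μ = 20 ms and D = 360 ms.

```python
# (h, worst-case execution time t in ms, re-executions k, cost) for P1 on N1.
H_VERSIONS = [(1, 80, 6, 10), (2, 100, 2, 20), (3, 160, 1, 40)]

def worst_case_length(t, k, mu):
    """Assumed model: main execution plus k re-executions, each with overhead mu."""
    return t + k * (t + mu)

def cheapest_schedulable(versions, deadline, mu):
    """Return the h-version with the lowest cost that still meets the deadline."""
    feasible = [(cost, h) for h, t, k, cost in versions
                if worst_case_length(t, k, mu) <= deadline]
    return min(feasible)[1] if feasible else None

for h, t, k, cost in H_VERSIONS:
    print(f"h = {h}: length = {worst_case_length(t, k, mu=20)} ms, cost = {cost}")
print("selected h-version:", cheapest_schedulable(H_VERSIONS, deadline=360, mu=20))
```

Under this assumed model, h = 1 needs 680 ms and misses D = 360 ms, while h = 2 and h = 3 both need 340 ms, so h = 2 is the cheaper feasible choice.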
14 of 1414
Application Example
Application: processes P1 to P4 with messages m1 to m4, computation nodes N1 and N2
ρ = 1 − 10^-5, D = 360 ms, recovery overhead μ = 15 ms

Computation node N1 (t in ms):

           h = 1 (Cost 16)     h = 2 (Cost 32)     h = 3 (Cost 64)
Process    t     p             t     p             t      p
P1         60    1.2·10^-3     75    1.2·10^-5     90     1.2·10^-10
P2         75    1.3·10^-3     90    1.3·10^-5     105    1.3·10^-10
P3         60    1.4·10^-3     75    1.4·10^-5     90     1.4·10^-10
P4         75    1.6·10^-3     90    1.6·10^-5     105    1.6·10^-10

Computation node N2 (t in ms):

           h = 1 (Cost 20)     h = 2 (Cost 40)     h = 3 (Cost 80)
Process    t     p             t     p             t      p
P1         50    1·10^-3       60    1·10^-5       75     1·10^-10
P2         65    1.2·10^-3     75    1.2·10^-5     90     1.2·10^-10
P3         50    1.2·10^-3     60    1.2·10^-5     75     1.2·10^-10
P4         65    1.3·10^-3     75    1.3·10^-5     90     1.3·10^-10
15 of 1415
Application Example
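Alternative implementations (Pi/j is the j-th execution of Pi):
a) N1 with h = 2: P1, P3, P2/1, P4/1, P2/2, P4/2; cost Ca = 32
b) N2 with h = 2: P1, P3, P2/1, P4/1, P2/2, P4/2; cost Cb = 40
c) N1 with h = 3: P1, P3, P2, P4; cost Cc = 64
d) N2 with h = 3: P1, P3, P2, P4; cost Cd = 80
e) N1 and N2, both with h = 2: P1, P2/1, P2/2 on N1; P3/1, P3/2, P4 on N2; m2 and m3 on the bus; cost Ce = 72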
16 of 1416
Problem Formulation (Input)
Input:
  Application as a set of directed acyclic graphs
  Reliability goal
  Deadline D, period T
  Recovery overhead
  Bus-based hardware architecture:
    1. Process worst-case execution times for all h-versions of computation nodes
    2. Process failure probabilities for all h-versions
    3. Costs of all h-versions
    4. Worst-case message sizes, transformed into the worst-case transmission times on the bus
17 of 1417
Problem Formulation (Output)
Output:
  Selection of h-versions of computation nodes
  Mapping of all processes
  Maximum number of re-executions (by using our SFP analysis)
  Schedule (static cyclic) of all processes and messages
The final solution has to:
  1. Be schedulable
  2. Meet the reliability goal
  3. Minimize the overall system cost
18 of 1418
Design Optimization Strategy
Re-execution Optimization (based on SFP)
  Input: reliability goal, period T, process failure probabilities, mapping + hardening setup
  Objective: satisfy reliability
  Output: number of re-executions
19 of 1419
Design Optimization Strategy
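Hardening Optimization + Scheduling
  Input: reliability goal, period T, process failure probabilities, mapping
  Objective: meet deadline
  Output: hardening setup, passed to the re-execution optimization
Re-execution Optimization (based on SFP)
  Objective: satisfy reliability
  Output: number of re-executions, returned to the hardening optimization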
20 of 1420
Design Optimization Strategy
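Mapping Optimization + Scheduling
  Input: reliability goal, period T, architecture (set of nodes) selection
  Objective: meet deadline
  Output: mapping, passed to the hardening optimization
Hardening Optimization + Scheduling
  Objective: meet deadline
  Output: hardening setup, passed to the re-execution optimization
Re-execution Optimization (based on SFP)
  Objective: satisfy reliability
  Output: number of re-executions, returned to the hardening optimization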
21 of 1421
Design Optimization Strategy
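Architecture Optimization
  Objective: best cost
  Output: architecture selection, passed to the mapping optimization
Mapping Optimization + Scheduling
  Objective: meet deadline
  Output: mapping, passed to the hardening optimization
Hardening Optimization + Scheduling
  Objective: meet deadline
  Output: hardening setup, passed to the re-execution optimization
Re-execution Optimization (based on SFP)
  Objective: satisfy reliability
  Output: number of re-executions, returned to the hardening optimization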
DATE’05
22 of 1422
Selected Experimental Results
% accepted architectures as a function of soft error rate (SER)
(Figure: % accepted architectures, 0 to 100, for SER values 10^-12, 10^-11 and 10^-10; series MAX, MIN and OPT.)
MAX – hardware optimization
MIN – software optimization
OPT – combined architecture
Accepted architecture:
  Satisfying maximum accepted cost
  Satisfying reliability goal
  Schedulable
Hardening performance degradation (HPD): 5% (performance difference between the least hardened and the most hardened versions)
Maximum cost: 20
23 of 1423
Selected Experimental Results
% accepted architectures as a function of soft error rate (SER)
(Figure: % accepted architectures, 0 to 100, for SER values 10^-12, 10^-11 and 10^-10; series MAX, MIN and OPT.)
MAX – hardware optimization
MIN – software optimization
OPT – combined architecture
Hardening performance degradation (HPD): 100%
Maximum cost: 20
24 of 1424
Conclusions
Design optimization strategy for minimization of overall system cost by trading off between hardening and re-execution:
  Hardware + software fault tolerance techniques
  System failure probability (SFP) analysis
  A set of design optimization heuristics
Combining hardware and software fault tolerance techniques is essential for obtaining cost-efficient implementations of fault-tolerant embedded systems
25 of 1425
System Failure Probability (SFP) Analysis
Given:
  Application as a set of directed acyclic graphs, period T
  Reliability goal
  Architecture composed of a set of h-versions of computation nodes
  Mapping of processes on the nodes
  Process failure probabilities for all h-versions
  The number of re-executions kj on each node Nj
Output:
  True, if the system reliability is above or equal to the reliability goal
  False, if the system reliability is below the reliability goal
26 of 1426
System Failure Probability (SFP) Analysis
Probability of a system failure during period T due to transient faults, that is, the probability that any node Nj experiences more than kj transient faults during period T
(τ is the time unit of the reliability goal ρ)
27 of 1427
System Failure Probability (SFP) Analysis
Probability that node Nj experiences more than kj transient faults
28 of 1428
System Failure Probability (SFP) Analysis
No-fault probability on node Nj
Probability that all the combinations of exactly f faults are tolerated on node Nj
Probability that all the combinations of f ≤ kj faults are tolerated on node Nj
29 of 1429
System Failure Probability (SFP) Analysis
No-fault probability on node Nj
The product of the no-fault probabilities of all the processes mapped on node Nj
Probability of failure of process Pi on node Nj with hardening level h
30 of 1430
System Failure Probability (SFP) Analysis
Probability of recovery from f faults in a particular fault scenario s* on node Nj
Probability that all the combinations of exactly f faults are tolerated on node Nj
S* is a multiset!
31 of 1431
System Failure Probability (SFP) Analysis
No-fault probability on node Nj
Probability that all the combinations of exactly f faults are tolerated on node Nj
Node failure probability:
32 of 1432
System Failure Probability (SFP) Analysis
System failure probability during period T:
33 of 1433
System Failure Probability (SFP) Analysis
The evaluation criteria:
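The formulas of the SFP analysis slides did not survive in this transcript. The block below is a hedged reconstruction from the slide annotations and from the deck's computation example, not a copy of the original equations. The notation is an assumption: Pr(0; N_j^h) corresponds to the example's Pr(NF; ...), p_{ijh} is the failure probability of process P_i on node N_j with hardening level h, s* ranges over multisets of f processes mapped on N_j, and τ/T is the number of periods per reliability time unit. The block needs the amsmath package.

```latex
\begin{align*}
\Pr(0;\,N_j^h) &= \prod_{P_i \text{ on } N_j} \bigl(1 - p_{ijh}\bigr) \\
\Pr(f;\,N_j^h) &= \Pr(0;\,N_j^h) \sum_{s^* :\, |s^*| = f} \;\; \prod_{P_a \in s^*} p_{ajh} \\
\Pr(f > k_j;\,N_j^h) &= 1 - \Pr(0;\,N_j^h) - \sum_{f=1}^{k_j} \Pr(f;\,N_j^h) \\
\Pr\Bigl(\bigcup_{j} \bigl(f > k_j;\,N_j^h\bigr)\Bigr) &= 1 - \prod_{j} \Bigl(1 - \Pr(f > k_j;\,N_j^h)\Bigr) \\
\Bigl(1 - \Pr\Bigl(\bigcup_{j} \bigl(f > k_j;\,N_j^h\bigr)\Bigr)\Bigr)^{\tau / T} &\ge \rho
\end{align*}
```

For two nodes the fourth line expands to a + b − a·b, which is the form used in the computation example, and the last line is the True/False criterion with τ/T = 10000 for T = 360 ms.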
34 of 1434
System Failure Probability (SFP) Analysis
Computation example:
(Figure: N1 with h = 2 runs P1, P2/1 and P2/2; N2 with h = 2 runs P3/1, P3/2 and P4; messages m2 and m3 are sent over the bus.)
35 of 1435
System Failure Probability (SFP) Analysis
(Repeats the h-version tables of N1 and N2 from the Application Example, together with the mapping used in the computation.)
Values used in the computation (h = 2 on both nodes):
  N1: P1 with p = 1.2·10^-5, P2 with p = 1.3·10^-5
  N2: P3 with p = 1.2·10^-5, P4 with p = 1.3·10^-5
36 of 1436
System Failure Probability (SFP) Analysis
1) No re-execution:
Probability of no faulty processes for both nodes N1^2 and N2^2:
Pr(NF; N1^2) = (1 − 1.2·10^-5)·(1 − 1.3·10^-5) = 0.99997500015
Pr(NF; N2^2) = (1 − 1.2·10^-5)·(1 − 1.3·10^-5) = 0.99997500015
Probability of more than zero faults:
Pr([f > 0]F; N1^2) = 1 − 0.99997500015 = 0.000024999844
Pr([f > 0]F; N2^2) = 1 − 0.99997500015 = 0.000024999844
The system failure probability during period T without any re-executions:
Pr([f > 0]F; N1^2 ∪ [f > 0]F; N2^2) = 0.000024999844 + 0.000024999844 − 0.000024999844 · 0.000024999844 = 0.00004999907
With T = 360 ms: (1 − 0.00004999907)^10000 ≈ 0.60653 < ρ = 1 − 10^-5
FALSE!
37 of 1437
System Failure Probability (SFP) Analysis
2) One re-execution on each node:
Probability of exactly one fault to be tolerated with re-execution on each node:
Pr(1F; N1^2) = 0.99997500015·(1.2·10^-5 + 1.3·10^-5) = 0.00002499937
Pr(1F; N2^2) = 0.99997500015·(1.2·10^-5 + 1.3·10^-5) = 0.00002499937
Probability of more than one fault:
Pr([f > 1]F; N1^2) = 1 − 0.99997500015 − 0.00002499937 = 4.8·10^-10
Pr([f > 1]F; N2^2) = 1 − 0.99997500015 − 0.00002499937 = 4.8·10^-10
The system failure probability during period T with one re-execution on each node:
Pr([f > 1]F; N1^2 ∪ [f > 1]F; N2^2) = 9.6·10^-10
With T = 360 ms: (1 − 9.6·10^-10)^10000 ≈ 0.9999904 > ρ = 1 − 10^-5
TRUE!
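The two cases of the computation example can be re-checked with a short sketch. It enumerates fault scenarios as multisets, since the same process may fail more than once, and it assumes 10000 periods per reliability time unit, as in the example. Function names and parameters are illustrative. The sketch works on unrounded floating-point values, so it yields about 4.7·10^-10 per node where the slide, which appears to round intermediate results, reports 4.8·10^-10; the verdicts are the same.

```python
import math
from itertools import combinations_with_replacement

def node_failure(probs, k):
    """Probability of more than k faults on a node whose processes fail with `probs`."""
    no_fault = math.prod(1.0 - p for p in probs)
    tolerated = no_fault  # the f = 0 case
    for f in range(1, k + 1):
        # Fault scenarios are multisets of size f over the processes on the node.
        scenarios = combinations_with_replacement(probs, f)
        tolerated += no_fault * sum(math.prod(s) for s in scenarios)
    return 1.0 - tolerated

def sfp_ok(nodes, ks, gamma=1e-5, iterations=10_000):
    """True if the reliability over `iterations` periods is at least 1 - gamma."""
    system_ok = math.prod(1.0 - node_failure(p, k) for p, k in zip(nodes, ks))
    # Log-domain comparison of system_ok**iterations >= 1 - gamma.
    return iterations * math.log(system_ok) >= math.log1p(-gamma)

# N1 (P1, P2) and N2 (P3, P4), both with hardening level h = 2.
nodes = [[1.2e-5, 1.3e-5], [1.2e-5, 1.3e-5]]
print("no re-execution:", sfp_ok(nodes, ks=[0, 0]))
print("one re-execution per node:", sfp_ok(nodes, ks=[1, 1]))
```

The first call reports that the reliability goal is missed and the second that it is met, in line with the FALSE and TRUE verdicts of the example.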
38 of 1438
System Failure Probability (SFP) Analysis
(Figure: N1 with h = 2 runs P1, P2/1 and P2/2; N2 with h = 2 runs P3/1, P3/2 and P4; messages m2 and m3 are sent over the bus.)
SFPA ( ) True
39 of 1439
Questions?