analysis and optimization of fault-tolerant embedded systems with hardened processors

Analysis and Optimization of Fault-Tolerant Embedded Systems with Hardened Processors

Viacheslav Izosimov, Petru Eles, Zebo Peng
Embedded Systems Lab (ESLAB), Linköping University, Sweden

Ilia Polian
Institute for Computer Science, Albert-Ludwigs-University of Freiburg, Germany

Paul Pop
Dept. of Informatics and Mathematical Modeling, Technical University of Denmark (DTU), Denmark

Motivation

Hard real-time safety-critical applications
  Time-constrained
  Cost-constrained
  Quality-of-service
  Fault-tolerant
  etc.

Focus on transient and intermittent faults

Transient and Intermittent Faults

  Radiation
  Electromagnetic interference (EMI)
  Lightning storms
  Internal EMI
  Crosstalk
  Power supply fluctuations
  Software errors (Heisenbugs)

Errors caused by transient (intermittent) faults have to be tolerated before they crash the system

Hardening

Improving the hardware architecture to reduce the error rate
  Hardware redundancy (selective duplication of gates/units/nodes, dedicated additional hardware modules/flip-flops)
  Re-designing the hardware to reduce susceptibility to transient faults
  Using higher voltages / lower frequencies / larger transistor sizes
  Protecting with shields

Leads to lower performance
  Use of technologies a few generations back
  Increase of the critical path and silicon area

Very expensive
  Extra design effort / more expensive technologies
  More silicon / increase in the number of gates or computation units
  Low production volumes

Still may not guarantee the required reliability levels at affordable cost!

Software-level Fault Tolerance

Software fault tolerance
  Reliability increase with time redundancy

Leads to lower performance
  Fault tolerance overheads
  Overheads due to error detection, voting, agreement

Low hardware cost

Often cannot guarantee the required reliability levels and, at the same time, meet deadlines!

[Figure: process P1 executed repeatedly (time redundancy)]

Motivation

Fault tolerance against transient faults may lead to significant performance or cost overhead!

Neither hardening nor pure software-level fault tolerance can guarantee the required level of reliability…

A trade-off between hardware and software fault tolerance has to be addressed to provide a reliable and low-cost system!


Outline

Motivation

Architecture

Application example

Fault tolerance: hardening & re-execution

Hardening/re-execution trade-off

Problem formulation & design strategy

Experimental results

Conclusions

Architecture

Processes: Re-execution
Computation nodes: Hardening
Messages: Fault-tolerant predictable protocol

[Figure: processes P1 to P5 with messages m1 and m2; computation nodes affected by transient faults]

The error rates are given for each hardening version (h-version) of each computation node

The reliability goal ρ = 1 − γ, where γ is the maximum probability of a system failure due to transient faults on any computation node within a time unit

Application Example

Process P1 on computation node N1, reliability goal ρ = 1 − 10^-5

Hardening versions of computation node N1:

  h-version   t (ms)   p          Cost
  h = 1       80       4·10^-2    10
  h = 2       100      4·10^-4    20
  h = 3       160      4·10^-6    40

t – worst-case execution time
p – process failure probability
Cost – h-version cost

With more hardening:
  Reliability increases: process failure probabilities decrease
  Worst-case execution times increase: hardening performance degradation (HPD)
  Cost increases!

System Failure Probability (SFP) Analysis

We have proposed a system failure probability (SFP) analysis to connect error rates and the reliability goal to the number of re-executions in software

SFP analysis of P1 on N1, with T = 360 ms and reliability goal ρ = 1 − 10^-5:

  h-version   t (ms)   p          Result of SFP analysis
  h = 1       80       4·10^-2    main execution + k = 6 re-executions
  h = 2       100      4·10^-4    main execution + k = 2 re-executions
  h = 3       160      4·10^-6    main execution + k = 1 re-execution

Non-trivial, exact, safe
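The re-execution counts for the three h-versions can be reproduced for this single-process case. The following is a minimal sketch, not the authors' implementation. It assumes that f faults are tolerated with probability (1 − p)·p^f, and that the reliability goal covers 10000 periods of T = 360 ms, which is the exponent used in the worked computation on the later slides. The name `min_reexecutions` and its parameters are hypothetical.

```python
def min_reexecutions(p, periods=10_000, gamma=1e-5, k_max=50):
    """Smallest k for which one process with failure probability p meets
    the reliability goal 1 - gamma over `periods` periods (assumed model)."""
    for k in range(k_max + 1):
        # Probability that 0..k faults occur and are all tolerated.
        pr_tolerated = sum((1.0 - p) * p**f for f in range(k + 1))
        pr_node_failure = 1.0 - pr_tolerated  # more than k faults
        if (1.0 - pr_node_failure) ** periods >= 1.0 - gamma:
            return k
    raise ValueError("reliability goal not reachable within k_max re-executions")


# Failure probabilities of P1 on the three h-versions of N1.
for h, p in {1: 4e-2, 2: 4e-4, 3: 4e-6}.items():
    print(f"h = {h}: k = {min_reexecutions(p)}")
```

Under these assumptions the sketch yields k = 6, 2 and 1 for h = 1, 2 and 3, the same counts as on the slides.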

Application Example

Recovery overhead μ = 20 ms, deadline D = 360 ms, reliability goal ρ = 1 − 10^-5

  h-version   t (ms)   p          Cost   Executions of P1 on N1
  h = 1       80       4·10^-2    10     P1/1 P1/2 P1/3 P1/4 P1/5 P1/6 P1/7
  h = 2       100      4·10^-4    20     P1/1 P1/2 P1/3
  h = 3       160      4·10^-6    40     P1/1 P1/2
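The trade-off in this example can be checked with a few lines of arithmetic. This is a hedged sketch: it assumes the worst-case schedule length of a single process is the main execution plus k re-executions, each preceded by one recovery overhead (20 ms here). The function names are hypothetical; the data comes from the slide.

```python
MU = 20   # recovery overhead in ms
D = 360   # deadline in ms

# h-version -> (worst-case execution time in ms, re-executions k, cost)
H_VERSIONS = {1: (80, 6, 10), 2: (100, 2, 20), 3: (160, 1, 40)}


def worst_case_length(t, k, mu=MU):
    """Main execution plus k re-executions, each with a recovery overhead
    (assumed worst-case model for a single process)."""
    return t + k * (t + mu)


def cheapest_schedulable(h_versions, deadline=D):
    """Return the cheapest h-version whose worst case fits the deadline."""
    feasible = {h: cost for h, (t, k, cost) in h_versions.items()
                if worst_case_length(t, k) <= deadline}
    return min(feasible, key=feasible.get) if feasible else None


for h, (t, k, cost) in H_VERSIONS.items():
    print(f"h = {h}: worst case {worst_case_length(t, k)} ms, cost {cost}")
print("cheapest schedulable h-version:", cheapest_schedulable(H_VERSIONS))
```

Under this model h = 1 needs 680 ms and misses the 360 ms deadline, while h = 2 and h = 3 both need 340 ms, so h = 2 is the cheapest schedulable choice at cost 20.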

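Application Example

[Figure: application graph with processes P1 to P4 and messages m1 to m4; computation nodes N1 and N2]

Deadline D = 360 ms, recovery overhead μ = 15 ms, reliability goal ρ = 1 − 10^-5

t – worst-case execution time, p – process failure probability

Computation node N1:

          h = 1              h = 2              h = 3
          t     p            t     p            t     p
  P1      60    1.2·10^-3    75    1.2·10^-5    90    1.2·10^-10
  P2      75    1.3·10^-3    90    1.3·10^-5    105   1.3·10^-10
  P3      60    1.4·10^-3    75    1.4·10^-5    90    1.4·10^-10
  P4      75    1.6·10^-3    90    1.6·10^-5    105   1.6·10^-10
  Cost    16                 32                 64

Computation node N2:

          h = 1              h = 2              h = 3
          t     p            t     p            t     p
  P1      50    1·10^-3      60    1·10^-5      75    1·10^-10
  P2      65    1.2·10^-3    75    1.2·10^-5    90    1.2·10^-10
  P3      50    1.2·10^-3    60    1.2·10^-5    75    1.2·10^-10
  P4      65    1.3·10^-3    75    1.3·10^-5    90    1.3·10^-10
  Cost    20                 40                 80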

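Application Example

Architecture alternatives and their costs (Pi/j denotes the j-th execution of process Pi):

  Alternative   Architecture              Executions shown                                      Cost
  a             N1, h = 2                 P1, P3, P2/1, P2/2, P4/1, P4/2                        Ca = 32
  b             N2, h = 2                 P1, P3, P2/1, P2/2, P4/1, P4/2                        Cb = 40
  c             N1, h = 3                 P1, P3, P2, P4                                        Cc = 64
  d             N2, h = 3                 P1, P3, P2, P4                                        Cd = 80
  e             N1, h = 2 and N2, h = 2   N1: P1, P2/1, P2/2; N2: P3/1, P3/2, P4; bus: m2, m3   Ce = 72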

Problem Formulation (Input)

Input:
  Application as a set of directed acyclic graphs
  Reliability goal ρ
  Deadline D, period T
  Recovery overhead μ
  Bus-based hardware architecture
    1. Process worst-case execution times for all h-versions of computation nodes
    2. Process failure probabilities for all h-versions
    3. Costs of all h-versions
    4. Worst-case message sizes, transformed into the worst-case transmission times on the bus

Problem Formulation (Output)

Output:
  Selection of h-versions of computation nodes
  Mapping of all processes
  Maximum number of re-executions (by using our SFP analysis)
  Schedule (static cyclic) of all processes and messages

The final solution has to
  1. Be schedulable
  2. Meet the reliability goal
  3. Minimize the overall system cost

Design Optimization Strategy

[Figure: nested optimization steps, built up over four slides from the innermost step outwards]

Re-execution Optimization (based on SFP)
  Input: reliability goal, period T, process failure probabilities, mapping + hardening setup
  Goal: satisfy reliability
  Output: number of re-executions

Hardening Optimization + Scheduling
  Input: mapping
  Goal: meet deadline
  Output: hardening setup

Mapping Optimization + Scheduling
  Input: architecture selection (set of nodes)
  Goal: meet deadline
  Output: mapping

Architecture Optimization
  Goal: best cost
  Output: architecture selection

Reference shown on the slide: DATE'05

Selected Experimental Results

[Figure: % accepted architectures (0 to 100) as a function of soft error rate (SER: 10^-12, 10^-11, 10^-10) for MAX, MIN and OPT; HPD = 5%, maximum cost = 20]

MAX – hardware optimization
MIN – software optimization
OPT – combined architecture

Accepted architecture:
  Satisfying maximum accepted cost
  Satisfying reliability goal
  Schedulable

Hardening performance degradation (HPD): performance difference between the least hardened and the most hardened versions

Selected Experimental Results

[Figure: % accepted architectures (0 to 100) as a function of soft error rate (SER: 10^-12, 10^-11, 10^-10) for MAX, MIN and OPT; HPD = 100%, maximum cost = 20]

Conclusions

Design optimization strategy for minimizing the overall system cost by trading off hardening against re-execution
  Hardware + software fault tolerance techniques
  System failure probability (SFP) analysis
  A set of design optimization heuristics

Combining hardware and software fault tolerance techniques is essential for obtaining cost-efficient implementations of fault-tolerant embedded systems

Given:
  Application as a set of directed acyclic graphs, period T
  Reliability goal
  Architecture composed of a set of h-versions of computation nodes
  Mapping of processes on the nodes
  Process failure probabilities for all h-versions
  The number of re-executions kj on each node Nj

Output:
  True, if the system reliability is above or equal to the reliability goal
  False, if the system reliability is below the reliability goal

Probability of a system failure during period T due to transient faults, that is, the probability that any node Nj experiences more than kj transient faults during period T

(τ is the time unit of the reliability goal ρ)

Probability that node Nj experiences more than kj transient faults

No-fault probability on node Nj

Probability that all the combinations of exactly f faults are tolerated on node Nj

Probability that all the combinations of f ≤ kj faults are tolerated on node Nj

No-fault probability on node Nj: the product of the no-fault probabilities of all the processes mapped on node Nj

Probability of failure of process Pi on node Nj with hardening level h
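The formula these captions annotate did not survive extraction. A reconstruction consistent with the captions and with the worked computation at the end of the deck is sketched below. The symbol p_{ijh} for the failure probability of process P_i on node N_j with hardening level h is assumed notation.

```latex
\Pr(0;\, N_j^h) \;=\; \prod_{P_i \text{ mapped on } N_j^h} \bigl(1 - p_{ijh}\bigr)
```

For two processes with failure probabilities 1.2·10^-5 and 1.3·10^-5 this gives the value 0.99997500015 used in the computation example.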

Probability of recovery from f faults in a particular fault scenario S* on node Nj

Probability that all the combinations of exactly f faults are tolerated on node Nj

S* is a multiset!
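The corresponding formula is missing from the extracted text. One reconstruction that agrees with the captions and with the worked computation at the end of the deck is given here as a sketch. It assumes S* ranges over all multisets of f processes mapped on N_j, so a process may be counted more than once, and the subscripted p symbols are assumed notation.

```latex
\Pr(f;\, N_j^h) \;=\; \Pr(0;\, N_j^h) \cdot \sum_{S^{*}:\ |S^{*}| = f} \;\; \prod_{s \in S^{*}} p_{sjh}
```

For f = 1 this reduces to the no-fault probability multiplied by the sum of the process failure probabilities, which is the form used for one re-execution in the computation example.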

Node failure probability: one minus the no-fault probability on node Nj, minus the probabilities that all the combinations of exactly f faults (f = 1 … kj) are tolerated on node Nj
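Written out, and assuming the notation Pr(0; N_j^h) for the no-fault probability and Pr(f; N_j^h) for the probability that exactly f faults are tolerated, the node failure probability described here would read as follows. This is a reconstruction, since the original formula was lost in extraction.

```latex
\Pr(f > k_j;\, N_j^h) \;=\; 1 - \Pr(0;\, N_j^h) - \sum_{f=1}^{k_j} \Pr(f;\, N_j^h)
```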



System failure probability during period T:
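The formula itself is not present in the extracted text. The worked computation at the end of the deck combines two node failure probabilities a and b as a + b − a·b, which suggests the following form, given here as a hedged reconstruction in assumed notation.

```latex
\Pr\Bigl(\bigcup_{j} \,(f > k_j;\, N_j^h)\Bigr) \;=\; 1 - \prod_{j} \Bigl(1 - \Pr(f > k_j;\, N_j^h)\Bigr)
```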

The evaluation criterion:
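The criterion can be reconstructed from the worked computation at the end of the deck, where the per-period reliability is raised to the power 10000 and compared with 1 − 10^-5. A sketch in assumed notation, with τ the time unit of the reliability goal ρ and T the period:

```latex
\Bigl(1 - \Pr\Bigl(\bigcup_{j} \,(f > k_j;\, N_j^h)\Bigr)\Bigr)^{\tau / T} \;\ge\; \rho
```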

Computation example:

[Figure: N1 (h = 2) executes P1, P2/1 and P2/2; N2 (h = 2) executes P3/1, P3/2 and P4; the bus carries m2 and m3]

t – worst-case execution time, p – process failure probability

Computation node N1:

          h = 1              h = 2              h = 3
          t     p            t     p            t     p
  P1      60    1.2·10^-3    75    1.2·10^-5    90    1.2·10^-10
  P2      75    1.3·10^-3    90    1.3·10^-5    105   1.3·10^-10
  P3      60    1.4·10^-3    75    1.4·10^-5    90    1.4·10^-10
  P4      75    1.6·10^-3    90    1.6·10^-5    105   1.6·10^-10
  Cost    16                 32                 64

Computation node N2:

          h = 1              h = 2              h = 3
          t     p            t     p            t     p
  P1      50    1·10^-3      60    1·10^-5      75    1·10^-10
  P2      65    1.2·10^-3    75    1.2·10^-5    90    1.2·10^-10
  P3      50    1.2·10^-3    60    1.2·10^-5    75    1.2·10^-10
  P4      65    1.3·10^-3    75    1.3·10^-5    90    1.3·10^-10
  Cost    20                 40                 80

[Figure: N1 (h = 2) executes P1, P2/1 and P2/2; N2 (h = 2) executes P3/1, P3/2 and P4; the bus carries m2 and m3]

1) No re-execution:

Probability of no faulty processes for both nodes N1^2 and N2^2 (Nj^h denotes node Nj in h-version h):

Pr(NF; N1^2) = (1 − 1.2·10^-5)·(1 − 1.3·10^-5) = 0.99997500015
Pr(NF; N2^2) = (1 − 1.2·10^-5)·(1 − 1.3·10^-5) = 0.99997500015

Probability of more than zero faults:

Pr([f > 0]F; N1^2) = 1 − 0.99997500015 = 0.000024999844
Pr([f > 0]F; N2^2) = 1 − 0.99997500015 = 0.000024999844

The system failure probability during period T without any re-executions:

Pr([f > 0]F; N1^2 ∪ [f > 0]F; N2^2) = 0.000024999844 + 0.000024999844 − 0.000024999844 · 0.000024999844 = 0.00004999907

T = 360 ms: (1 − 0.00004999907)^10000 ≈ 0.60653 < ρ = 1 − 10^-5

FALSE!

2) One re-execution on each node:

Probability of exactly one fault being tolerated with re-execution on each node:

Pr(1F; N1^2) = 0.99997500015·(1.2·10^-5 + 1.3·10^-5) = 0.00002499937
Pr(1F; N2^2) = 0.99997500015·(1.2·10^-5 + 1.3·10^-5) = 0.00002499937

Probability of more than one fault:

Pr([f > 1]F; N1^2) = 1 − 0.99997500015 − 0.00002499937 = 4.8·10^-10
Pr([f > 1]F; N2^2) = 1 − 0.99997500015 − 0.00002499937 = 4.8·10^-10

The system failure probability during period T with one re-execution on each node:

Pr([f > 1]F; N1^2 ∪ [f > 1]F; N2^2) = 9.6·10^-10

T = 360 ms: (1 − 9.6·10^-10)^10000 ≈ 0.9999904 > ρ = 1 − 10^-5

TRUE!
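Both cases of this computation example can be replayed in code. The following is a minimal sketch, not the authors' tool. It assumes the multiset form of the analysis (each fault scenario is a multiset of processes on a node), treats nodes as independent, and works in plain floating point without the rounding visible on the slides. All function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod


def node_failure_probability(p, k):
    """Probability that a node experiences more than k faults in one period.
    `p` lists the failure probabilities of the processes mapped on the node."""
    pr_no_fault = prod(1.0 - pi for pi in p)
    pr_tolerated = pr_no_fault
    for f in range(1, k + 1):
        # All multisets of f faulty process executions on this node.
        scenarios = combinations_with_replacement(p, f)
        pr_tolerated += pr_no_fault * sum(prod(s) for s in scenarios)
    return 1.0 - pr_tolerated


def system_failure_probability(node_probs):
    """Probability that at least one node fails (nodes assumed independent)."""
    return 1.0 - prod(1.0 - q for q in node_probs)


def meets_goal(sfp, periods=10_000, gamma=1e-5):
    """Check the per-period reliability against the goal 1 - gamma."""
    return (1.0 - sfp) ** periods >= 1.0 - gamma


# Failure probabilities of the processes mapped on N1 and N2, both in h-version 2.
NODES = {"N1": [1.2e-5, 1.3e-5], "N2": [1.2e-5, 1.3e-5]}

for k in (0, 1):
    sfp = system_failure_probability(
        [node_failure_probability(p, k) for p in NODES.values()])
    print(f"k = {k}: goal met = {meets_goal(sfp)}")
```

With k = 0 the check fails and with k = 1 it passes, in line with the FALSE and TRUE verdicts on the slides. The unrounded node failure probability for k = 1 comes out slightly below the 4.8·10^-10 shown, which is consistent with the slide values being rounded on the safe side.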

[Figure: N1 (h = 2) executes P1, P2/1 and P2/2; N2 (h = 2) executes P3/1, P3/2 and P4; the bus carries m2 and m3]

SFPA for this configuration: True


Questions?
