Feedback Based Real-Time Fault Tolerance Issues and Possible Solutions


Post on 02-Feb-2016


Page 1

Feedback Based Real-Time Fault Tolerance
Issues and Possible Solutions

Xue Liu, Hui Ding, Kihwal Lee, Marco Caccamo, Lui Sha

Page 2

Major Issues in Software Reliability

• Software is becoming more and more complex
  – More features → larger code size
  – Rapid evolution → introduction of new code

E.g. Apache:

Year    Code size
1998    0.8 MLOC
2002    10 MLOC
2004    27 MLOC

E.g. Windows XP: 40-50 MLOC

Gray’s estimate: 1 bug / KLOC
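Applying Gray's estimate to the code sizes above gives a rough sense of scale. The Python sketch below is illustrative arithmetic only: the function name `expected_bugs` is made up for this example, and the 1 bug / KLOC rate is the rule of thumb quoted on the slide, not a measured defect count.

```python
BUGS_PER_KLOC = 1.0  # Gray's estimate as quoted on the slide


def expected_bugs(mloc: float, bugs_per_kloc: float = BUGS_PER_KLOC) -> float:
    """Estimate latent bugs for a code base given in millions of lines (MLOC)."""
    kloc = mloc * 1000.0  # 1 MLOC = 1000 KLOC
    return kloc * bugs_per_kloc


# Code sizes taken from the slide; the labels are only for printing.
for name, mloc in [("Apache 1998", 0.8), ("Apache 2004", 27.0), ("Windows XP, low end", 40.0)]:
    print(f"{name}: about {expected_bugs(mloc):,.0f} latent bugs")
```

Under this rule of thumb, a code base in the tens of MLOC carries tens of thousands of latent bugs, which is the motivation for tolerating faults rather than assuming they can all be removed.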

Page 3

Growing Software Complexity

Growing software complexity → poorly managed or maintained systems; software bugs and errors.

• Managed by human operators
  – Shortage of skilled operators due to the growing complexity
  – Costly
  – To err is human
• Faults

Sources of computing system downtime (cited from Candea, Stanford ’03):

Category           Source of downtime (percentage)
Hardware           20%
Software           40%
Human operators    40%

Complexity adds difficulty to management and breeds bugs.

- Control the complexity in computer systems!
- Build systems that are robust against software bugs

Page 4

Feedback Control Reflection

• Successful track record in controlling electro/mechanical systems
• Observation 1: Computing systems have been crucial in the success of feedback control
  – Digital designs & implementations etc.
• Observation 2: Feedback control has appealing properties
  – Tolerance of errors (model/sensing/actuation etc.) in the physical process
    • Utilizes runtime feedback for error correction

[Diagram labels: Computing Systems; Feedback Control; Fault tolerance]

Reflection: Can feedback control help to solve the fault tolerance problem in computing systems?

Page 5

Q: Feedback control can help to tolerate errors in mechanical systems; can feedback control help to tolerate software errors also?

[Diagram labels: Feedback Control; Tolerant of Errors in Mechanical Systems; Tolerant of Errors in Software Systems]

Targeted applications: Real-time control systems

Idea 1: Feedback Control of Software Execution

Mechanical systems: Sense (feedback) -> Control (error correction) -> Actuation
Software systems: Sense (feedback) -> Control (error correction) -> Execution

Idea 2: Using Simplicity to Control Complexity

• A simple and reliable core which gives acceptable performance;
• The system under complex control software remains in states that are recoverable by the simple core (achieving fault tolerance).

Page 6

A Typical Feedback Control Loop for Mechanical Systems

• Sense: System output, identify if error exists
• Control: Decision
• Actuation: Execution

[Block diagram: Reference Input → summing junction (−) → Controller (Decision) → Actuator (Execution) → Mechanical System (Plant); the Sensor (Sensing/error identification) feeds the plant output back to the summing junction]
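The sense, control, actuate cycle can be sketched in a few lines of Python. This is a toy illustration, not anything from the slides: the plant is assumed to simply accumulate the actuator output, the controller is a plain proportional law, and the names `run_feedback_loop`, `kp` and `actuator_gain` are invented for the example. The `actuator_gain` parameter models an actuation error (the actuator delivering less than was commanded) to show how runtime feedback corrects for it.

```python
def run_feedback_loop(reference: float, steps: int = 50, kp: float = 0.5,
                      actuator_gain: float = 1.0) -> float:
    """Drive a toy accumulating plant toward `reference` and return its final state."""
    state = 0.0
    for _ in range(steps):
        measured = state                   # Sense: read the system output
        error = reference - measured       # Sense: identify whether an error exists
        command = kp * error               # Control: decide on a correction
        state += actuator_gain * command   # Actuation: execute (possibly imperfectly)
    return state


ideal = run_feedback_loop(1.0)
degraded = run_feedback_loop(1.0, actuator_gain=0.8)  # actuator delivers only 80%
print(f"ideal actuator: {ideal:.6f}, degraded actuator: {degraded:.6f}")
```

Both runs settle at the reference: the loop never needs to know the actuator is off by 20%, because each new measurement exposes the remaining error.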

Page 7

Related Work – Simplex Architecture

[Data flow block diagram labels: Simple high assurance control subsystem (HAC); Complex high performance control subsystem (HPC); Decision; Plant]

• A simple reliable core (HAC)
• Diversity in the form of 2 alternatives (HAC, HPC)
• Feedback control of the software execution:
  Sense (feedback) -> Decision (control/error correction) -> Execution (actuation)

Page 8

Drawbacks of Simplex

• P1: The analytically redundant high assurance controller (HAC) runs in parallel with the complex controller (HPC)
  – Lowers system performance, increases operating costs
  – Limits the application of Simplex to safety-critical domains only
• P2: HAC and HPC must run at the same period

Our new proposal: On-demand Real-Time Guard (ORTGA)

HAC only runs when a fault occurs!

Design Goals of ORTGA

1. Functionality similar to Simplex
2. Much less resource usage
3. Flexibility

Page 9

ORTGA Architecture: Key Ideas

(1) Reduce resource usage of Simplex

Solution:
• “On-demand” execution of HAC
  – Only when the control under HPC is detected as faulty is the HAC switched in to take over the plant

(2) Flexibility

Solution:
• The periods of HAC and HPC are multiples of a subperiod
• HAC and HPC can have different periods
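The two key ideas can be sketched together in Python. This is a minimal sketch under stated assumptions, not the ORTGA middleware: the `Controller` class, `run_on_demand`, the toy plant, and the `fault_tick` argument (standing in for whatever fault detector is used) are all hypothetical. Time advances in subperiod ticks, each controller's period is a whole number of ticks, and the HAC is never invoked until a fault in the HPC has been flagged.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Controller:
    name: str
    period_ticks: int                  # period, as a multiple of the subperiod
    law: Callable[[float], float]      # control law: state -> command
    invocations: int = 0               # how often this controller used the CPU

    def compute(self, state: float) -> float:
        self.invocations += 1
        return self.law(state)


def run_on_demand(hpc: Controller, hac: Controller, total_ticks: int,
                  fault_tick: Optional[int] = None, state: float = 1.0) -> float:
    """Run the HPC; switch the HAC in only once a fault has been detected."""
    active, started, command = hpc, 0, 0.0
    for tick in range(total_ticks):
        # On-demand execution: the HAC takes over only after a detected fault.
        if active is hpc and fault_tick is not None and tick >= fault_tick:
            active, started = hac, tick
        # Each controller is released every `period_ticks` subperiods.
        if (tick - started) % active.period_ticks == 0:
            command = active.compute(state)
        state += 0.1 * command         # toy plant: accumulates the held command
    return state


hpc = Controller("HPC", period_ticks=2, law=lambda x: -1.0 * x)
hac = Controller("HAC", period_ticks=4, law=lambda x: -0.5 * x)
run_on_demand(hpc, hac, total_ticks=20, fault_tick=12)
print(hpc.invocations, hac.invocations)
```

In a fault-free run (`fault_tick=None`) the HAC's invocation count stays at zero, which is where the resource saving over running both controllers in parallel comes from.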

Page 10

Background: Maximum Stability Region

• The largest region of the state space in which the system is still stable under the current controller

[Figure labels: Maximum Stability Region (Recovery Region); Stability Region; Lyapunov Functions; State Constraints]

Page 11

How to determine the Maximum Stability Region?

• In the operation of a plant, there is a set of state constraints representing the safety, device physical limitations, environmental and other operation requirements.
• They can be represented as a normalized polytope, $C^T X \le 1$, in the N-dimensional state space.

We must be able – take the control away from a faulty

[Figure: Operation Constraints and Admissible States; labels: State constraints; Admissible States]

Page 12

Maximum Stability Region

• A stability region is closed with respect to the operations of the simple controller. It is a Lyapunov function inside the polytope.
• The maximum recovery region can be found using linear matrix inequalities (LMI).

[Figure: State Constraints and the switching rule (Lyapunov function); labels: State constraints; Recovery Region; Lyapunov function]

$\min \log \det Q^{-1}$

subject to

$\dot{X} = A X$
$A^T Q + Q A < 0$
$C^T X < 1$

Switching rule:

$X^T Q X < 1$
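Given a matrix Q, the Lyapunov condition and the switching rule are cheap to evaluate at run time. The Python sketch below checks both for a 2-dimensional example. It does not solve the LMI optimization itself: the helper names and the matrices `A_EXAMPLE` and `Q_EXAMPLE` are hypothetical values chosen for illustration (Q here solves $A^T Q + Q A = -I$ for this A) and are not taken from the slides.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def quad_form(q: Matrix, x: Sequence[float]) -> float:
    """Return x^T Q x."""
    n = len(x)
    return sum(x[i] * q[i][j] * x[j] for i in range(n) for j in range(n))


def lyapunov_lhs(a: Matrix, q: Matrix) -> Matrix:
    """Return A^T Q + Q A for square matrices of equal size."""
    n = len(a)
    return [[sum(a[k][i] * q[k][j] + q[i][k] * a[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def is_negative_definite_2x2(m: Matrix) -> bool:
    """Sylvester's criterion applied to -M, for a symmetric 2x2 matrix M."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return m[0][0] < 0 and det > 0


def in_recovery_region(q: Matrix, x: Sequence[float]) -> bool:
    """Switching rule: the state is recoverable while x^T Q x < 1."""
    return quad_form(q, x) < 1.0


# Hypothetical closed-loop dynamics under the simple controller (eigenvalues -1, -2).
A_EXAMPLE: Matrix = [[0.0, 1.0], [-2.0, -3.0]]
# Hypothetical Q satisfying A^T Q + Q A = -I for the matrix above.
Q_EXAMPLE: Matrix = [[1.25, 0.25], [0.25, 0.25]]

print(is_negative_definite_2x2(lyapunov_lhs(A_EXAMPLE, Q_EXAMPLE)))
print(in_recovery_region(Q_EXAMPLE, (0.5, 0.5)), in_recovery_region(Q_EXAMPLE, (1.0, 1.0)))
```

A decision module would call `in_recovery_region` on each sensed state; only the quadratic form is evaluated on line, while Q is computed off line.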

Page 13

Research Issues of ORTGA

• How to detect faults in HPC
  – Timing faults:
    • Application level support: a monitor detects missed heartbeat messages
    • OS support: the scheduler detects task deadline misses
  – Other faults:
    • A wide range of traditional fault detection techniques can be used.
• When to recover if a fault in HPC is detected?
  – Recover early?
    • Too early: false alarms
  – Recover late?
    • Too late: could not recover in time
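Application-level detection of timing faults can be sketched as a small heartbeat monitor. The Python below is an assumption-laden illustration rather than the ORTGA monitor: the class name `HeartbeatMonitor`, its methods, and the use of caller-supplied timestamps are all invented for the example. It only counts missed heartbeats; deciding how many misses justify a recovery is the separate "when to recover" question.

```python
import math


class HeartbeatMonitor:
    """Counts heartbeat periods that have elapsed without a message from the HPC."""

    def __init__(self, period: float, start_time: float = 0.0) -> None:
        self.period = period          # expected interval between heartbeats
        self.last_beat = start_time   # time of the most recent heartbeat

    def beat(self, now: float) -> None:
        """Record a heartbeat message received at time `now`."""
        self.last_beat = now

    def missed_beats(self, now: float) -> int:
        """Whole periods elapsed since the last heartbeat, i.e. beats now overdue."""
        elapsed = now - self.last_beat
        return max(0, math.floor(elapsed / self.period))


monitor = HeartbeatMonitor(period=2.0)
monitor.beat(2.0)
monitor.beat(4.0)
for t in (5.0, 6.5, 8.5):
    print(t, monitor.missed_beats(t))
```

Passing time in explicitly keeps the sketch deterministic; a real monitor would read a clock and run as its own periodic task.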

Page 14

When to recover

• Why not recover too early?
  – Control tasks have been shown to tolerate several deadline misses
  – Sometimes the system just has some delay (overload, communication delay etc.)
  – These are not “real” faults
  – Try to minimize recoveries due to false alarms
• Why not recover too late?
  – If you recover too late, there is no time left to make the system stable!

Page 15

Right Time To Recover (RTTR)

• An example of a “desirable” late but timely recovery (under RM, rate-monotonic scheduling)

[Figure: three schedules on a time axis from 0 to 8: (a) normal schedule of tasks 1 and 2; (b) recover task 2 immediately; (c) recover task 2 late]

Assumption: The fault is detected at t=2.0, before the task deadline D=8

Observation: Sometimes, a late but timely recovery makes the system more schedulable

Find RTTR instead of minimizing MTTR!

Page 16

A possible solution to determine RTTR

• Idea
  – Recover as late as possible,
  – But not too late
• If the state under HPC is about to leave the HAC-established stability region, recover!
• Otherwise, wait (maybe HPC is still OK)

[Figure: "When to recover?" timeline with heartbeats HB1 (t1) and HB2 (t2), the monitor finding HB3 missing (t3), the prediction ts, the recovery time tr, the recovered threads, and the stability region S of the controlled plant]
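The "as late as possible, but not too late" rule can be sketched as a prediction test. The Python below is one possible reading, with every specific an assumption: the plant is a toy position/velocity model integrated with Euler steps, the last command is assumed to be held while the HPC is silent, the stability region is the set where $x^T Q x < 1$ for a hypothetical Q, and `time_to_next_check` and `recovery_latency` are invented parameters. The monitor recovers now only if waiting until the next check, plus the time the switch-over itself takes, would let the state leave the region.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def quad_form(q: Matrix, x: Sequence[float]) -> float:
    """Return x^T Q x."""
    n = len(x)
    return sum(x[i] * q[i][j] * x[j] for i in range(n) for j in range(n))


def predict_open_loop(state: Tuple[float, float], held_command: float,
                      horizon: float, dt: float = 0.01) -> Tuple[float, float]:
    """Euler-integrate a toy position/velocity plant with the command held constant."""
    pos, vel = state
    for _ in range(round(horizon / dt)):
        pos += vel * dt
        vel += held_command * dt
    return pos, vel


def should_recover_now(state: Tuple[float, float], held_command: float,
                       time_to_next_check: float, recovery_latency: float,
                       q: Matrix) -> bool:
    """Recover if waiting one more check would put the state outside x^T Q x < 1."""
    horizon = time_to_next_check + recovery_latency
    predicted = predict_open_loop(state, held_command, horizon)
    return quad_form(q, predicted) >= 1.0


Q_UNIT: Matrix = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical region: the unit disc
print(should_recover_now((0.1, 0.2), 0.0, 0.5, 0.1, Q_UNIT))  # slow drift
print(should_recover_now((0.5, 0.8), 0.0, 0.5, 0.1, Q_UNIT))  # fast drift
```

A missed heartbeat with a slowly drifting state is treated as a possible false alarm and waited out; the same miss with a fast-moving state triggers recovery while the HAC can still stabilize the plant.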

Page 17

Performance Gain of ORTGA

Reduce Resource Usage: On-demand Execution of HAC

HPC’s timing parameters: {Cp, Tp}; HAC’s timing parameters: {Ca, Ta};

A total savings of:

Relative saving:
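A rough way to quantify the saving, under assumptions that are not taken from the slides: C is read as a worst-case execution time and T as a period, so each controller's CPU utilization is C/T; Simplex is assumed to run both controllers all the time, and ORTGA only the HPC during fault-free operation. The function name `ortga_saving` is hypothetical and the numbers in the usage line are made up.

```python
from typing import Tuple


def ortga_saving(cp: float, tp: float, ca: float, ta: float) -> Tuple[float, float]:
    """Return (absolute, relative) CPU utilization saved in fault-free operation.

    Assumes Simplex uses Cp/Tp + Ca/Ta and on-demand execution uses only Cp/Tp.
    """
    hpc_util = cp / tp
    hac_util = ca / ta
    simplex_util = hpc_util + hac_util
    absolute = hac_util                  # the HAC's share is what is no longer spent
    relative = absolute / simplex_util   # fraction of the Simplex load saved
    return absolute, relative


absolute, relative = ortga_saving(cp=4.0, tp=20.0, ca=1.0, ta=10.0)
print(f"absolute saving: {absolute:.3f}, relative saving: {relative:.1%}")
```

With these made-up parameters the HAC accounts for a third of the load under parallel execution, so running it on demand frees that share whenever no fault is present.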

Page 18

Ongoing Work: A proof-of-concept System

Double Inverted Pendulum System

- Double Quanser inverted pendulum with custom-made tracks

- PC/104 sized, i486 compatible system

- Customized Linux 2.6 kernel and root image in flash memory

- ORTGA middleware layer

Page 19

Conclusions

• Feedback Based Real-Time Fault Tolerance
  – Leverage feedback control of software execution
• ORTGA Architecture
  – On-demand execution of the reliable core (HAC) only when a fault occurs
  – Significantly reduces resource usage
• Issues and possible solutions
  – How to detect faults
  – When to recover to maintain system stability
  – How to find the RTTR (instead of minimizing MTTR)

Page 20

Backup Slides

Page 21

Software Fault Model in RT Control Systems

• Timing fault: a task misses its deadlines
• Capability abuse:
  – Corrupting others’ code or data
  – Unauthorized acquisition of process/resource management capability
• Semantic fault: incorrect results that can lead to:
  – Poor control performance
  – Instability in the plant

Timing fault → GRMS
Semantic fault → Analytic redundancy (simple & complex controllers)
Capability abuse → Privilege management