5. software redundancy reliable system design 2010 by: amir m. rahmani

5. Software Redundancy

Reliable System Design 2010 by: Amir M. Rahmani

matlab1.ir

Software

There are many kinds of software.

System software
• Operating system (Windows, Linux, Solaris, etc.)
• Device driver (for printer, graphics card, etc.)
• Compiler (gcc)
• Library (DLLs)
• Distributed system (software shared memory, Napster, etc.)

User-level software
• E.g., simulator, word processor, spreadsheet, game, etc.

Software Faults/Errors

Operating system software (including device drivers)
• Deadlock (may be able to escape with Ctrl-Alt-Delete)
• Crash and reboot
• Incorrect I/O

User-level software
• Deadlock (can escape with Control-C)
• Incorrect algorithm
• Array bounds violation
• Memory leak (C, C++; far less common in Java, where garbage collection reclaims unreachable objects)
  - Allocating memory, but not deallocating it
• Dereferencing a NULL pointer (undefined behavior in C, C++; Java raises a NullPointerException instead)
• Incorrect synchronization in multithreaded code
  - Allowing more than one thread in a critical section at a time
  - Blocking while holding a lock
• Inability to handle unanticipated inputs
• Exception that triggers the OS to kill the process
  - Segmentation fault
  - Bus error

Specification vs. Implementation

There are many, many techniques.

The problem is in two parts:
• 1- Correct specification, erroneous implementation
• 2- Erroneous specification, correct implementation

Both parts concern us as software engineers.

Both parts need attention: why a system fails is not important to the users.

Different techniques are needed for the two parts.

Static (Pre-Release) Fault Detection in Software

As with hardware, we can try to find faults before shipping the product
• Design reviews
• Formal verification (analysis of design)
• Testing (analysis of implementation)

We can try to add redundancy to mask potential faults
• N-version programming

We can try to proactively “scrub” the software to remove latent errors (due to aging) before failures occur
• Software rejuvenation: stop the running software occasionally, “clean” its internal state (e.g., garbage collection, flushing operating system kernel tables, and reinitializing internal data structures), and restart it. E.g., periodically reboot the system to flush out remaining latent problems due to aging.

Static Fault Detection with Formal Verification

Formal verification is a systematic, mathematical way to prove that a system (software or hardware) is correct or incorrect

Correctness is based on a specification

Examples of mathematical objects often used to model systems are:
• finite state machines, labelled transition systems, Petri nets, timed automata, process algebra

Two broad approaches for formal verification
• Theorem proving
• Model checking

http://en.wikipedia.org/wiki/Finite_state_machine

http://en.wikipedia.org/wiki/Labelled_transition_system

http://en.wikipedia.org/wiki/Petri_net

http://en.wikipedia.org/w/index.php?title=Timed_automata&action=edit&redlink=1

http://en.wikipedia.org/wiki/Process_algebra


Formal Verification: Theorem Proving

Theorem proving consists of using a formal version of mathematical reasoning (logical inference) about the system.

Example: theorem proving software such as
• HOL (Higher Order Logic) theorem prover
• ACL2 (A Computational Logic for Applicative Common Lisp)

Develop logical/mathematical equations that describe:
• System to be verified
• Specification of correctness for the system

In the rules of this logic/mathematics, prove that the system is equivalent to its specification

Theorem proving is difficult for very large complex systems, but can work on small sub-systems

http://en.wikipedia.org/wiki/HOL_theorem_prover

http://en.wikipedia.org/wiki/ACL2


Formal Verification: Model Checking

Model checking consists of a systematically exhaustive exploration of the mathematical model (possible for finite models)
• Example: describe the system as a finite state machine (FSM)

Develop logical/mathematical equations that describe required properties of the FSM

Example properties:
• Never ends up in state X
• Can reach every desired state in the FSM

Software has been developed to perform model checking; logically, this is an exhaustive search
• Example: Murφ model checker (from Stanford)

Similar to theorem proving, model checking is difficult for large, complicated systems
• The state space tends to grow exponentially with the number of components or variables (state explosion)
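The two example properties can be checked by brute force on a small FSM. The following is a minimal sketch in Python, not how Murφ works internally; the traffic-light machine, the state names, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import deque


def reachable_states(initial, transitions):
    """Exhaustively explore a finite state machine, breadth first."""
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def check_properties(initial, transitions, forbidden, desired):
    """Return (never reaches a forbidden state, reaches every desired state)."""
    seen = reachable_states(initial, transitions)
    return seen.isdisjoint(forbidden), set(desired) <= seen


# Hypothetical traffic-light controller (an assumption, not from the slides).
LIGHT = {
    "red": ["green"],
    "green": ["yellow"],
    "yellow": ["red"],
    "all_green": ["red"],  # unsafe state; nothing transitions into it
}
```

Calling check_properties("red", LIGHT, {"all_green"}, {"red", "green", "yellow"}) returns (True, True): the unsafe state is never reached and every desired state is. The set of visited states is what becomes unmanageable for large systems.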

Verification and Validation

Validation: "Are we trying to make the right thing?", i.e., is the product specified to the user's actual needs?

Verification: "Have we made what we were trying to make?", i.e., does the product conform to the specifications?

The overall checking process is often referred to as V & V

http://en.wikipedia.org/wiki/Verification_and_validation

Software Tools for Static Analysis

There are tools that can analyze software to determine if it has bugs
• In most cases the analysis is performed on some version of the source code, and in other cases on some form of the object code.

Can check to see if:
• All code is reachable
• Deadlock is possible

Advantage of static analysis tools
• Checks all possible control flow paths through the application, so it can detect any possible specified problem, even if it would only occur very rarely in practice

Disadvantages
• Must have access to the entire code base, e.g., can’t deal with dynamically loaded libraries
• Difficult to assess the probability of an error occurring in practice

http://en.wikipedia.org/wiki/Source_code

http://en.wikipedia.org/wiki/Object_code


Dynamic Fault Detection in Software

Must add code to check software as it is running
• Unless you’re willing to wait for it to crash

Added code = redundancy!

Most common form of error detection: assertions
• E.g., assert (Grade >= 0 && Grade <= 20)

Challenges
• Knowing which invariants to check
• Knowing when to check these invariants
• Dealing with black box code (e.g., libraries)
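The grade assertion above can be written in Python as follows. This is a small sketch; the function record_grade and its arguments are hypothetical names introduced for illustration.

```python
def record_grade(grades, student, grade):
    """Store a grade after checking the 0..20 invariant from the slide."""
    # The assertion is the redundant code: it adds no functionality,
    # it only detects an erroneous value at run time.
    assert 0 <= grade <= 20, f"grade out of range: {grade}"
    grades[student] = grade
    return grades
```

Note that Python skips assert statements when run with the -O option, so checks that must always execute are usually written as explicit if tests that raise an exception.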


Automatic Dynamic Fault Detection with Meta-Compilation

Recent research from Berkeley explores how to have the compiler automatically integrate error checking into code

User can specify general high-level invariants

Compiler automatically integrates invariant checking into the code

Example
• 1- 99% of lock_acquire() calls have a corresponding lock_release(), so the other 1% is probably wrong
• 2- if (ptr == NULL) { printf("%d", ptr->data); } // what’s wrong here?

Other Forms of Dynamic Fault Detection

Java has automatic array bounds checking, and it won’t let you write beyond the bounds of the array

Operating system will not let an application process access memory that doesn’t belong to it. This is what is happening when you see “segmentation fault”!

FTP software uses a checksum to make sure that the data that was received is the same as the data that was sent

Other examples?
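The checksum idea can be sketched as follows. This is only an illustration of the principle, not the mechanism any particular FTP implementation uses; CRC-32 from Python's standard zlib module and the names send and receive are assumptions made here.

```python
import zlib


def send(payload: bytes):
    """Sender side: attach a CRC-32 checksum to the data."""
    return payload, zlib.crc32(payload)


def receive(payload: bytes, checksum: int) -> bool:
    """Receiver side: recompute the checksum and compare with the one sent."""
    return zlib.crc32(payload) == checksum
```

If a byte is corrupted in transit, the recomputed checksum no longer matches and the receiver can ask for retransmission.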


Self-Checking Code

Can we write software that checks that its output is correct?
• Example: if we divide A/B = C, we can check the result by multiplying B*C. If B*C != A, then the division was incorrect.
  - Detects hardware faults (famous Pentium bug)
  - Detects software faults (assuming more complicated operation than just division, which is a single instruction)

Key idea: checking a computation is always at least as easy as performing it (result from computational complexity theory)

Other examples? Finding paper.
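The division check can be sketched in Python as below. Exact rational arithmetic (fractions.Fraction) is used here as an assumption, so that the comparison B*C == A is not disturbed by floating-point rounding; the name checked_divide and the injectable divide parameter are hypothetical.

```python
from fractions import Fraction


def exact_divide(a, b):
    """Division under test, done with exact rational arithmetic."""
    return Fraction(a) / Fraction(b)


def checked_divide(a, b, divide=exact_divide):
    """Divide, then verify the result by multiplying back."""
    c = divide(a, b)
    if b * c != a:  # the self-check: B*C must reproduce A
        raise ArithmeticError(f"division check failed: {b} * {c} != {a}")
    return c
```

With floating-point operands the check would need a tolerance instead of exact equality, which makes the choice of acceptable error part of the design.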


Hardware for Software Fault-Tolerance

Difficult for HW to know that SW is in error, because HW doesn’t know what SW is trying to do

Example
• It’s unlikely that a program really wants to divide by zero
• Any others?
  - Watchdog timer

Current work at Duke is exploring hardware support for detecting starvation

Software for Hardware Fault-Tolerance

Many examples of using software to tolerate HW faults

In fact, all schemes for tolerating software errors will detect hardware errors that manifest themselves in the same way (i.e., they have the same error model)

• E.g., self-checking software will detect a hardware fault if it leads to an incorrect result
• Example: if we divide A/B = C, we can check the result by multiplying B*C. If B*C != A, then the division was incorrect.

What is Software Fault Tolerance?

The term "software fault-tolerance" can mean two things:

1. "the tolerance of software faults", or

2. "the tolerance of faults by the use of software"

Definition 1 is more commonly used.

The term "software redundancy" corresponds to definition 2.

Remember: All software faults are design faults (specification and implementation mistakes)!

Cause-and-Effect Relationship


Software Redundancy

Software redundancy techniques can be divided into two major classes:
• With diversity
  - Design or data diversity
  - Aim is to tolerate design faults
• Without diversity
  - Implements error detection, recovery, etc.
  - Aim is to handle errors of any origin (physical faults, design faults, operator faults)

Design Diversity

Design diversity is used to tolerate design faults in hardware and software

Two techniques for tolerating software design faults:

• N-version programming
• Recovery blocks

N-version programming

Heterogeneous redundancy
• TMR is homogeneous redundancy
• Question: why would TMR not work here?

Uses majority voting on results produced by N program versions

Program versions are developed by different teams of programmers

Assumes that programs fail independently

Looks like masking hardware redundancy

Uses Forward Error Recovery
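The voting step can be sketched as follows. This is a simplified illustration under stated assumptions: results are hashable and compared for exact equality, a version that crashes simply casts no vote, and the three "versions" (which compute 1 + 2 + ... + n) are hypothetical stand-ins for independently developed programs.

```python
from collections import Counter


class NoMajorityError(Exception):
    """Raised when no result is produced by a majority of the versions."""


def n_version_execute(versions, *args):
    """Run every version on the same input and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            continue  # a crashed version casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value  # the faulty minority is masked
    raise NoMajorityError("versions disagree; no majority result")


def sum_formula(n):
    return n * (n + 1) // 2


def sum_loop(n):
    return sum(range(1, n + 1))


def sum_buggy(n):
    return n * n // 2  # deliberate design fault
```

Here n_version_execute([sum_formula, sum_loop, sum_buggy], 10) returns 55, because two of the three versions agree. If two versions shared the same fault, the voter would return the wrong value, which is why the independence assumption matters.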


Achieving Version Independence-Diversity

Different design teams for each version

Diverse specifications

Versions with differing capabilities

Teams working on different modules are forbidden to directly communicate

Diverse programming languages, development tools, compilers, hardware, operating systems, etc.

Questions regarding ambiguities in specifications or any other issue have to be addressed to some central authority, who makes any necessary corrections and updates all teams

…

Causes of Version Correlation

Common specifications: errors in specifications will propagate to software

Inherent difficulty of problem: algorithms may be more difficult to implement for some inputs, causing faults triggered by same inputs

Common algorithms: algorithm itself may contain instabilities in certain regions of input space - different versions have instabilities in same region

Cultural factors: programmers make similar mistakes in interpreting ambiguous specifications

Common software and hardware platforms: if same hardware, operating system, and compiler are used - their faults can trigger a correlated failure


N-version programming depends on

Initial specification: The majority of software faults stem from inadequate specification. A specification error will manifest itself in all N versions of the implementation

Independence of effort: Experiments produce conflicting results. Where part of a specification is complex, this leads to a lack of understanding of the requirements. If these requirements also refer to rarely occurring input data, common design errors may not be caught during system testing

Adequate budget: The predominant cost is software. A 3-version system will triple the budget requirement and cause problems of maintenance. Would a more reliable system be produced if the resources potentially available for constructing N versions were instead used to produce a single version?

Evaluation of N-version programming

Few experimental studies of the effectiveness of N-version programming

Published results only for work in universities.

Program: Anti-missile application
• 27 versions produced by students at University of Virginia and University of California at Irvine. Published in 1985.
• Some had no prior industrial experience, while others had over ten years
• All students were given the same specification
• All versions were written in Pascal
• 200 test cases to validate each program
• 1 million test cases to test independence (simulation of production)
• 93 correlated faults were identified by standard statistical hypothesis-testing methods
• No correlation observed between quality of programs produced and experience of programmer

Recovery Block

N versions, with only one running: if it fails, execution is switched to a backup

Uses one primary software module and one or several secondary (back-up) software modules

Assumes that program failures can be detected by acceptance tests

Executes only the primary module under error-free conditions

Looks like dynamic hardware redundancy

Recovery Block Mechanism

[Flowchart]
1. Establish recovery point
2. Any alternatives left?
   - No: fail the recovery block
   - Yes: execute next alternative
3. Evaluate acceptance test
   - Pass: discard recovery point
   - Fail: restore recovery point and return to step 2

Recovery Block Format

Acceptance test is provided to check if answers are reasonable

Format:

ensure acceptance test
by primary module
else by first alternative
else by second alternative
….
else error

Example: Solution to Differential Equation

Explicit Kutta Method: fast but inaccurate when equations are stiff

Implicit Kutta Method: more expensive but can deal with stiff equations

ensure Rounding_err_has_acceptable_tolerance
by Explicit Kutta Method
else by Implicit Kutta Method
else error

• The above will cope with all equations
• It will also potentially tolerate design errors in the Explicit Kutta Method if the acceptance test is flexible enough
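The ensure / by / else by / else error structure can be sketched in Python as below. This is one possible reading under stated assumptions: the program state is a dict, the recovery point is a deep copy of it, and an exception in a module is treated like a failed acceptance test. The sorting modules stand in for the two Kutta methods and, like all names here, are hypothetical.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when every alternative fails its acceptance test (else error)."""


def recovery_block(state, acceptance_test, alternatives):
    """Try the primary module, then each alternative, until one is accepted."""
    checkpoint = copy.deepcopy(state)            # establish recovery point
    for module in alternatives:                  # any alternatives left?
        try:
            result = module(state)               # execute next alternative
            if acceptance_test(state, result):   # evaluate acceptance test
                return result                    # pass: checkpoint is discarded
        except Exception:
            pass                                 # a crash counts as a failure
        state.clear()
        state.update(copy.deepcopy(checkpoint))  # restore recovery point
    raise RecoveryBlockError("all alternatives failed")


def buggy_sort(state):
    state["calls"] = state.get("calls", 0) + 1   # side effect to be rolled back
    return sorted(state["data"])[:-1]            # design fault: drops an element


def simple_sort(state):
    return sorted(state["data"])


def looks_sorted(state, result):
    same_length = len(result) == len(state["data"])
    return same_length and all(a <= b for a, b in zip(result, result[1:]))
```

With state = {"data": [3, 1, 2]}, the call recovery_block(state, looks_sorted, [buggy_sort, simple_sort]) returns [1, 2, 3]: the primary module's result is rejected, its side effect on the state is undone, and the alternative is accepted.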


Construction of Acceptance Tests

An acceptance test is a software-implemented check designed to detect errors in the results produced by a primary or a secondary module

The design of the acceptance test is crucial to the efficacy of the Recovery Block scheme

Acceptance tests often rely on application-specific information

All the previously discussed error detection techniques can be used to form the acceptance tests

There is a trade-off between providing comprehensive acceptance tests and keeping overhead to a minimum, so that fault-free execution is not affected

Note that the term used is acceptance not correctness; this allows a component to provide a degraded service

However, care must be taken as a faulty acceptance test may lead to residual errors going undetected

Success of Recovery Block approach depends on failure independence of different versions (modules) and quality of acceptance test


Examples of how acceptance tests can be constructed

Satisfaction of requirements (Structural checks)
• Inversion of mathematical functions; e.g. squaring the result of a square-root operation to see if it equals the original operand.
• Checking sort functions; result should have elements in descending order
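Both checks can be sketched as small predicates. The tolerance value and the extra test that the sorted result contains the same elements as the input are assumptions added here; the function names are hypothetical.

```python
import math
from collections import Counter


def sqrt_is_acceptable(operand, root, rel_tol=1e-9):
    """Invert the function: squaring the root should give back the operand."""
    return root >= 0 and math.isclose(root * root, operand, rel_tol=rel_tol)


def sort_is_acceptable(original, result):
    """Result must be in descending order and hold the same elements."""
    descending = all(a >= b for a, b in zip(result, result[1:]))
    return descending and Counter(original) == Counter(result)
```

The square-root check uses a relative tolerance rather than exact equality, because squaring a floating-point root rarely reproduces the operand bit for bit.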


Examples of how acceptance tests can be constructed

Reasonableness checks
• Checking physical constraints; e.g. speed, pressure, etc.
• Checking sequence of application states

Structural checks
• Structural checks are based on known properties of data structures
  - the number of elements in a list can be counted, or links and pointers can be verified

Evaluation of Recovery Blocks

Naval command and control system (8000 statements in the Coral language)

117 abnormal events
• Correct recovery: 78 %
• Incorrect recovery, program failure: 3 %
• Incorrect recovery, no program failure: 15 %
• Unnecessary recovery: 3 %

• Anderson, T., et al., "Software Fault Tolerance: An Evaluation," IEEE Trans. on Software Engineering, vol. SE-11, no. 12, Dec 1985, pp. 1502-1510.

N-Version vs. Recovery Block

N-version programming
• Applied at the program level
• Runs N programs at the same time
• Looks like static hardware redundancy
• Vote comparison (error masking)
• Assumes that independence among program versions is achieved by random differences in programming style among programmers

Recovery block
• Applied at the module (subprogram) level
• Runs only the primary module under error-free conditions
• Looks like dynamic hardware redundancy
• Error detection: acceptance test
• Independence is achieved by intentionally designing the primary and secondary modules to be as different as possible (different algorithms)

Data Diversity

This technique is cheaper to implement than the design diversity technique.

Popular techniques which are based on the data diversity concept for fault tolerance in software are:
• Retry blocks
• N-copy programming

Retry Blocks

A retry block is a modification of the recovery block structure that uses data diversity instead of design diversity: it runs on the original data and, if needed, on re-expressed data (such as the complement of the data).

Rather than the multiple alternate algorithms used in a recovery block, a retry block uses only one algorithm.

A retry block's acceptance test has the same form and purpose as a recovery block's acceptance test.
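A retry block can be sketched as follows. The single algorithm, its latent fault, and the re-expression (swapping the two sides, which leaves the true hypotenuse unchanged) are all hypothetical examples chosen for illustration.

```python
import math


class RetryBlockError(Exception):
    """Raised when the original and all re-expressed inputs fail."""


def retry_block(algorithm, acceptance_test, data, re_expressions):
    """Run ONE algorithm; after a failure, retry it on re-expressed data."""
    candidates = [lambda d: d] + list(re_expressions)  # original data first
    for re_express in candidates:
        try:
            result = algorithm(re_express(data))
            if acceptance_test(data, result):
                return result
        except Exception:
            pass  # treat a crash as a failed acceptance test
    raise RetryBlockError("no input expression passed the acceptance test")


def fragile_hypot(sides):
    """Hypothetical routine with a latent fault: divides by the first side."""
    a, b = sides
    return abs(a) * math.sqrt(1 + (b / a) ** 2)


def plausible_hypotenuse(sides, result):
    a, b = abs(sides[0]), abs(sides[1])
    return max(a, b) <= result <= a + b
```

For the input (0, 3) the first attempt divides by zero, and the retry on the swapped input (3, 0) returns 3.0, which the acceptance test accepts. The same code runs both times; only the expression of the data changes.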


N-Copy Programming

N-copy programming is similar to N-version programming but uses data diversity instead of design diversity.

N copies of a program execute in parallel, each on a set of data produced by re-expression.

The system selects the output to be used by an enhanced voting scheme.
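A minimal N-copy sketch is shown below. The copies run sequentially here rather than in parallel, and a plain majority vote stands in for the enhanced voting scheme; both are simplifications. The averaging routine, its fault, and the three re-expressions (reorderings that do not change the true average) are hypothetical.

```python
from collections import Counter


def fragile_average(values):
    """Hypothetical routine with a latent fault: fails if values[0] == 0."""
    first = values[0]
    return first * sum(v / first for v in values) / len(values)


def n_copy_execute(program, data, re_expressions):
    """Run one program on N re-expressed copies of the data and vote."""
    results = []
    for re_express in re_expressions:
        try:
            results.append(program(re_express(data)))
        except Exception:
            continue  # a failed copy casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(re_expressions) // 2:
            return value
    raise RuntimeError("no majority among the copies")


RE_EXPRESSIONS = [
    lambda v: list(v),        # copy 1: original order
    lambda v: v[::-1],        # copy 2: reversed
    lambda v: v[1:] + v[:1],  # copy 3: rotated by one position
]
```

For the input [0, 4, 8] the first copy fails, while the other two both produce 4.0 and win the vote. Voting on exact equality is fragile for floating-point results in general, which is one reason a more elaborate voter is needed in practice.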


Airbus A330

National origin: Multi-national
Manufacturer: Airbus
First flight: 2 November 1992
Status: In production, in service
Primary users: Cathay Pacific, Delta Air Lines, Qatar Airways, Emirates
Produced: 1993–present
Number built: 1,016 as of 10 October 2013
Unit cost: A330-300, €215 million (2011)

http://en.wikipedia.org/wiki/Airbus_A330 (3 Nov. 2013)

A340

National origin: Multi-national
Manufacturer: Airbus
First flight: 25 October 1991
Status: Out of production, in service
Primary users: Lufthansa, Iberia, South African Airways, Virgin Atlantic Airways
Produced: 1993-2011
Number built: 375
Unit cost: A340-600, US$275.4 million

Design Diversity in Airbus A330/A340

Two types of computers
• 3 primary computers
• 2 secondary computers

Each computer is internally duplicated and consists of two channels
• Command channel
• Monitor channel

Architecture for A330/A340

[Figure: architecture diagram showing the flight control primary computers, flight control secondary computers, and flight control data concentrators]

Design Diversity in Airbus A330/A340

Implementation of primary computers
• Supplier: Aerospatiale (HW & SW)
• Hardware: Two Intel 80386 (one for each channel)
• Software: assembler for command channel, PL/M for monitor channel

Implementation of secondary computers
• Supplier: Sextant Avionique (HW), Aerospatiale (SW)
• Hardware: Two Intel 80186 (one for each channel)
• Software: assembler for command channel, Pascal for monitor channel

Exception Handling

An exception indicates that something happened during execution that needs attention

Control is transferred to an exception handler, a routine which takes appropriate action

Example: when executing y = a*b, if overflow occurs, the result is incorrect, so an exception is signaled

Effective exception handling can make a significant improvement to system fault tolerance

Over half of code lines in many programs are devoted to exception handling

Exception handling is a Forward Error Recovery mechanism, as there is no roll back to a previous state; instead control is passed to the handler so that recovery procedures can be initiated

However, the exception handling facility can be used to provide Backward Error Recovery

Example: Domain and Range Failure

Exceptions can be used to deal with
• domain or range failure
• out-of-ordinary event (not failure) needing special attention
• timing failure

A domain failure happens when illegal input is used

Example: if X, Y are real numbers and X = √Y is attempted with Y = -1, a domain failure occurs

A range failure occurs when the program produces an output or carries out an operation that is seen to be incorrect in some way

Examples include:
• Encountering an end-of-file while reading data from a file
• Producing a result that violates an acceptance test
• Trying to print a line that is too long
• Generating an arithmetic overflow or underflow
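Domain and range failures can be signaled and handled as in the following sketch. The exception classes, function names, and the 32-bit limit are hypothetical; the limit is emulated because Python integers do not overflow on their own.

```python
import math


class DomainFailure(ValueError):
    """Illegal input, e.g. the square root of a negative real number."""


class RangeFailure(ArithmeticError):
    """An output or operation that is seen to be incorrect, e.g. overflow."""


def safe_sqrt(y):
    if y < 0:
        raise DomainFailure(f"sqrt is undefined for {y}")
    return math.sqrt(y)


def checked_multiply(a, b, limit=2**31 - 1):
    """Multiply, signaling an exception if the result leaves the 32-bit range."""
    result = a * b
    if abs(result) > limit:
        raise RangeFailure(f"{a} * {b} overflows the emulated 32-bit range")
    return result


def sqrt_or_default(y, default=0.0):
    """Forward error recovery: the handler substitutes a safe default."""
    try:
        return safe_sqrt(y)
    except DomainFailure:
        return default  # no roll back; execution continues from the handler
```

The handler in sqrt_or_default does not restore any earlier state; it repairs the situation and carries on, which is what makes it forward recovery.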


Timing Failure

Timing checks: Timing checks are an effective form of software check for detecting errors, even when programs run in a dual redundant execution mode, provided that the specification of a component includes timing constraints.

Watch-dog timer
• Is used to guard against program hang-ups.
• Also used in communications between CPU and main store.
• Also used in periodic "hello" exchanges (network surveillance) and in I/O operations
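A software watchdog can be sketched as below. The class name, the kick and expired methods, and the injectable clock are assumptions made for illustration; a hardware watchdog would reset the processor instead of returning a flag.

```python
import time


class Watchdog:
    """Flags a hang if the monitored program stops checking in."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout          # maximum allowed silence, in seconds
        self.clock = clock              # injectable so it can be tested
        self.last_kick = clock()

    def kick(self):
        """Called periodically by the monitored program to show it is alive."""
        self.last_kick = self.clock()

    def expired(self):
        """Called by the supervisor; True means the program missed its deadline."""
        return self.clock() - self.last_kick > self.timeout
```

The monitored program calls kick() inside its main loop, and a separate supervisor polls expired() and triggers recovery (for example a restart) when it returns True.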
