Emulation of Transient Software Faults for Dependability Assessment: A Case Study

Roberto Natella, Domenico Cotroneo
{roberto.natella, cotroneo}@unina.it

The MobiLab Group, Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II
Via Claudio 21, 80125 - Napoli, Italy
www.mobilab.unina.it

EDCC-8, The 8th European Dependable Computing Conference
28 April 2010, Valencia, Spain

::. Outline

Context and problem statement

Software Fault Injection

Bohrbugs and Mandelbugs

Case study from the ATC domain

Evaluation of state-of-the-art fault injection

An experiment involving concurrency faults

Conclusions

::. Rationale (1/2)

Software faults represent an important cause of system failures

Despite efforts on verification activities, fault avoidance, and fault removal, software systems are often delivered with residual software faults

Critical systems adopt Fault Tolerance Mechanisms (FTMs) to avoid failures at run-time

::. Rationale (2/2)

FTMs: A few examples

Spatial redundancy
• CORBA FT, TANDEM90 Process Pairs

Temporal redundancy
• Checkpointing and rollback

Software Fault Injection (SFI) is a valuable approach for the verification and the improvement of FTMs

To correctly emulate software faults, we need to understand their features

::. Software Faults

Bohrbugs
• Faults whose activation is reproducible, i.e., it is straightforward to identify their activation pattern
• Typically detected and then fixed during the testing phase

Mandelbugs
• Faults whose activation is transient and not systematically reproducible
• Their activation conditions depend on complex combinations of user inputs, the internal state, and the external environment

::. Problem statement

Mandelbugs represent the major cause of failure in mission-critical systems: up to 82% in well-tested software [2] [5] [6]

Mandelbugs are typically tolerated by the adoption of several redundancy schemes

Are existing SFI techniques able to emulate Mandelbugs adequately?

How should Mandelbugs be emulated?

::. Software Fault Injection (SFI)

To date, the representativeness of injected faults has not yet been investigated with respect to:
• Fault manifestation
• Their effectiveness in testing FTMs (i.e., whether they emulate the faults that occur most often and that FTMs should tolerate)

::. Contributions

We aim to investigate this issue with a simple experimental campaign, but in a complex, real-world software system

• We evaluated G-SWFIT with respect to Mandelbugs
• We compared the results with an experiment specifically designed to emulate Mandelbugs

Case study: a fault-tolerant system from the Air Traffic Control (ATC) domain
• It is a Flight Data Processor (FDPS) based on a CORBA-compliant middleware

::. Case study (1/2)

::. Case study (2/2)

We modeled the FDPS as a finite state machine (FSM) to support the analysis of faults

A state consists of the following internal variables:
1) The number of FDP requests queued by the Façade
2) The number of requests under processing
3) The number of requests queued by Processing Servers (PSs)

CR, FR, PSC, …, are the messages exchanged in the FDPS
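
As an illustration, a state of this kind can be sketched as a triple of counters. This is only a minimal sketch, not the authors' model: the class name `FdpsState`, the field names, and the assumption that colon-separated labels such as 0:0:0 list the three variables in the order given above are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FdpsState:
    """Hypothetical FSM state made of three request counters (names are assumptions)."""
    facade_queued: int   # 1) FDP requests queued by the Façade
    in_processing: int   # 2) requests under processing
    ps_queued: int       # 3) requests queued by Processing Servers

    def label(self) -> str:
        # Render the state in a colon-separated form such as "0:0:0".
        return f"{self.facade_queued}:{self.in_processing}:{self.ps_queued}"

    def matches(self, pattern: str) -> bool:
        # "*" in a pattern matches any value of that counter, e.g. "*:3:1".
        parts = pattern.split(":")
        values = self.label().split(":")
        return len(parts) == 3 and all(p in ("*", v) for p, v in zip(parts, values))


initial = FdpsState(0, 0, 0)
busy = FdpsState(2, 3, 1)
print(initial.label())         # 0:0:0
print(busy.matches("*:3:1"))   # True
```

Keeping the state immutable and hashable makes it usable as a key when recording in which state a fault was activated or a failure occurred.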

::. Experimental campaign using G-SWFIT (1/4)

We implemented G-SWFIT fault operators in an open-source fault injection tool

The tool analyzes a C/C++ source code file to produce a set of faulty source files

Freely available at: http://www.mobilab.unina.it/SFI.htm

[Figure: tool pipeline. C/C++ source files → C pre-processor → C/C++ frontend → abstract syntax tree → fault injector → patch files (with faults)]
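
The idea of mutating a syntax tree and emitting one patch per injected fault can be sketched as follows. This is not the tool itself: Python's `ast` and `difflib` modules stand in for the C/C++ frontend, the sample function is made up, and the operator shown (dropping one statement that is a bare function call) is chosen purely for illustration and is not claimed to be one of the G-SWFIT operators.

```python
import ast
import difflib

# Made-up code under test (an assumption for this sketch).
SOURCE = """\
def handle(request, queue):
    validate(request)
    queue.append(request)
    return len(queue)
"""


class DropCallStatement(ast.NodeTransformer):
    """Illustrative operator: delete the n-th statement that is a bare call."""

    def __init__(self, target_index: int):
        self.target_index = target_index
        self.seen = 0

    def visit_Expr(self, node: ast.Expr):
        if isinstance(node.value, ast.Call):
            index = self.seen
            self.seen += 1
            if index == self.target_index:
                return None  # removing the node emulates an omitted call
        return node


def count_call_statements(tree: ast.AST) -> int:
    # Each bare call statement is one location where the operator can be applied.
    return sum(
        isinstance(n, ast.Expr) and isinstance(n.value, ast.Call)
        for n in ast.walk(tree)
    )


def to_lines(tree: ast.AST) -> list[str]:
    return (ast.unparse(tree) + "\n").splitlines(keepends=True)


def generate_patches(source: str) -> list[str]:
    """Return one unified diff per injectable location."""
    original = to_lines(ast.parse(source))
    patches = []
    for i in range(count_call_statements(ast.parse(source))):
        faulty_tree = DropCallStatement(i).visit(ast.parse(source))
        diff = difflib.unified_diff(original, to_lines(faulty_tree), "original", "faulty")
        patches.append("".join(diff))
    return patches


patches = generate_patches(SOURCE)
print(len(patches))
print(patches[0])
```

Each patch contains exactly one fault, so every faulty version of the program can be built and exercised in a separate experiment.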

::. Experimental campaign using G-SWFIT (2/4)

533 faults were injected into the Façade source code

1599 experiments (3 different workloads)

For each experiment, we collected:
• Information about a failure (e.g., Façade crash, switch to the backup, missed FDP requests)
• The state in which a failure occurred
• The state in which the fault was activated

::. Experimental campaign using G-SWFIT (3/4)

G-SWFIT is useful to test important system states (e.g., the checkpointing mechanism)

However, the injected faults did not emulate Mandelbugs well because:
• A large share of faults (56%) manifest themselves during Façade initialization or during the first request (state 0:0:0), whereas Mandelbugs usually manifest themselves during the operational phase of a system

[Figure: faults activated and failures]

::. Experimental campaign using G-SWFIT (4/4)

However, the injected faults did not emulate Mandelbugs well because (continued):
• In most of the cases (93%) in which the backup Façade is activated, the backup also fails (i.e., fault activation is simple to reproduce, as with Bohrbugs)
• Some important states (potentially affected by Mandelbugs) are untested (e.g., when one or more requests are queued by the PSs)

State coverage: 65%

::. Concurrency fault emulation (1/3)

To emulate Mandelbugs, we analyzed the scientific literature on software faults

We identified the following fault triggers:
• Concurrency
• Timing of external events
• Wrong memory state
• Faulty error handlers
• Complex input sequences
• Software aging

::. Concurrency fault emulation (2/3)

Features of the most frequent concurrency faults (from a field data study [29]):
• They are atomicity-violation faults (49%)
• Only 1 shared variable is involved (66%)
• At most 2 threads are needed to trigger the fault (90%)

Our fault model:
• 2 threads access a shared variable without acquiring a lock (race condition)

::. Concurrency fault emulation (3/3)

We propose a fault emulation technique in two phases:

Fault injection:
• Collects information about critical regions and their memory accesses
• Removes lock operations before and after a pair of conflicting critical regions

Trigger injection:
• Submits an input sequence to drive the system to a target state
• Schedules 2 threads such that memory accesses interfere with each other
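
The two phases can be illustrated on a toy shared counter. This is a sketch under stated assumptions, not the authors' implementation: the `Counter` class, the `inject_fault` flag, and the `after_read` hook are hypothetical, the fault is emulated by swapping the lock for a no-op context manager, and the trigger is emulated with `threading.Event` objects that force thread 2 to run between thread 1's read and write.

```python
import threading
from contextlib import nullcontext


class Counter:
    """Shared variable updated inside a critical region (hypothetical example)."""

    def __init__(self, inject_fault: bool):
        self.value = 0
        # Fault injection: the lock is replaced by a no-op context manager.
        self.guard = nullcontext() if inject_fault else threading.Lock()

    def increment(self, after_read=None):
        with self.guard:
            current = self.value        # read
            if after_read is not None:
                after_read()            # hook used only by the trigger
            self.value = current + 1    # write


def run(inject_fault: bool) -> int:
    counter = Counter(inject_fault)
    t1_has_read = threading.Event()
    t2_done = threading.Event()

    def pause_thread_1():
        # Trigger injection: hold thread 1 between its read and its write.
        t1_has_read.set()
        t2_done.wait(timeout=0.5)

    def thread_1():
        counter.increment(after_read=pause_thread_1)

    def thread_2():
        t1_has_read.wait()
        counter.increment()
        t2_done.set()

    threads = [threading.Thread(target=thread_1), threading.Thread(target=thread_2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print(run(inject_fault=False))  # 2: the lock serializes the two updates
print(run(inject_fault=True))   # 1: thread 2's update is lost
```

With the lock in place, the same forced schedule is harmless because thread 2 blocks until thread 1 leaves its critical region. The failure appears only when the fault and the trigger are combined, which is the property that makes the emulated fault transient.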

::. Preliminary system characterization

Focusing on fault triggering, we have to profile the system to recognize (and then to drive) its operating state

An input (sent by the tester) is associated with:
• A sequence of messages sent and received by the Façade
• A sequence of lock operations and memory accesses

::. How to trigger a fault?

An algorithm identifies (i) which inputs to send and (ii) in which state to send inputs to trigger a fault

The algorithm exploits preliminary information to match messages with shared memory accesses
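
The slides do not detail the algorithm. As a rough illustration of step (ii), a generic breadth-first search over an FSM can return the shortest input sequence that drives the system from its initial state to a target state. Everything below is an assumption made for the sketch: the transition table, the state labels, and the input names `a`, `b`, `c` are placeholders and do not correspond to the real FDPS messages.

```python
from collections import deque

# Toy transition table: state -> {input: next_state}. All entries are placeholders.
TRANSITIONS = {
    "0:0:0": {"a": "1:0:0"},
    "1:0:0": {"b": "0:1:0", "a": "2:0:0"},
    "2:0:0": {"b": "1:1:0"},
    "0:1:0": {"c": "0:0:0", "a": "1:1:0"},
    "1:1:0": {"c": "1:0:0", "b": "0:1:1"},
    "0:1:1": {"c": "0:1:0"},
}


def shortest_input_sequence(start, target, transitions):
    """Breadth-first search for the shortest list of inputs from start to target."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, inputs = queue.popleft()
        if state == target:
            return inputs
        for symbol, next_state in transitions.get(state, {}).items():
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, inputs + [symbol]))
    return None  # the target state is not reachable


print(shortest_input_sequence("0:0:0", "0:1:1", TRANSITIONS))
```

Once the target state is reached, the remaining step is to send the pair of inputs whose shared memory accesses conflict.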

::. An example of concurrency fault (1/2)

• The CRQ input is sent
• Lock operation omitted
• Thread 1 is blocked
• The CR input is sent
• Thread 2 writes an inconsistent value
• Thread 1 reads the (faulty) value

An algorithm processes the FSM to find when to send inputs (next slide)

::. Experimental campaign using concurrency faults

4 injected concurrency faults led to a failure of the primary Façade but not of the backup one

We covered 13 out of 14 states (93%) in which faults were injected

Cumulative state coverage: 95%
• In particular, states *:3:1 were tested (i.e., one or more requests queued by PSs)
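
The 13 out of 14 figure can be reproduced with a simple coverage ratio. This is a minimal sketch: the function name `state_coverage` and the placeholder identifiers `s0` to `s13` are assumptions, since the slides do not list the individual states.

```python
def state_coverage(visited, targeted):
    """Fraction of targeted states that were actually reached."""
    targeted = set(targeted)
    if not targeted:
        raise ValueError("no targeted states")
    return len(set(visited) & targeted) / len(targeted)


# Placeholder identifiers stand in for the 14 states; s13 is never reached.
targeted = [f"s{i}" for i in range(14)]
visited = targeted[:13]
print(f"{state_coverage(visited, targeted):.0%}")  # 93%
```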

::. Lessons Learned

Are existing SFI techniques able to emulate Mandelbugs adequately?
• No, G-SWFIT should be complemented by taking Mandelbugs into account

How should Mandelbugs be emulated?
• Our solution is to identify the most common fault triggers, and to try to emulate them in addition to modifying the source code

::.

Thank you! Any questions?

::.

Backup slides

::. G-SWFIT fault operators

G-SWFIT fault operators were derived from a field data study [14]

Fault activation and manifestation were neglected due to lack of data
