TRANSCRIPT
ReSIST courseware — Jean-Claude Laprie — Fundamentals of Dependability
Fundamentals of Dependability
Jean-Claude Laprie
Dependability: ability to deliver service that can justifiably be trusted
Service delivered by a system: its behavior as it is perceived by its user(s)
User: another system that interacts with the former
Function of a system: what the system is intended to do
(Functional) Specification: description of the system function
Correct service: when the delivered service implements the system function
(Service) Failure: event that occurs when the delivered service deviates from correct service, either because the system does not comply with the specification, or because the specification did not adequately describe its function
Failure modes: the ways in which a system can fail, ranked according to failure severities
Part of system state that may cause a subsequent service failure: error
Adjudged or hypothesized cause of an error: fault
Dependability: ability to avoid failures that are unacceptably frequent or severe
Failures unacceptably frequent or severe: dependability failure
Dependability
- Availability: readiness for usage
- Reliability: continuity of service
- Safety: absence of catastrophic consequences on the user(s) and the environment
- Confidentiality: absence of unauthorized disclosure of information
- Integrity: absence of improper system alterations
- Maintainability: ability to undergo repairs and evolutions
Security: absence of unauthorized access to, or handling of, system state (the combination of confidentiality, integrity and availability with respect to authorized actions)
Attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability
Means: Fault Prevention, Fault Tolerance, Fault Removal, Fault Forecasting
Threats: ... -> fault -> (activation) -> error -> (propagation) -> failure -> (causation) -> fault -> ...
Dependability attributes
- Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability: primary attributes
- Secondary attributes
  - Specialization
    - Robustness: dependability with respect to external faults
    - Survivability: dependability in the presence of active fault(s)
    - Resilience: dependability when facing functional, environmental, or technological changes
  - Distinguishing among various types of (meta-)information
    - Accountability: availability and integrity of the identity of the person who performed an operation
    - Authenticity: integrity of a message's content and origin, and possibly of some other information, such as the time of emission
    - Non-repudiability: availability and integrity of the identity of the sender of a message (non-repudiation of the origin), or of the receiver (non-repudiation of reception)
Dependability threats
- Fault: adjudged or hypothesized cause of an error
- Error: part of system state that may cause a subsequent failure
- Failure: deviation of the delivered service from correct service, i.e., from implementing the system function; either the system does not comply with the specification, or the specification does not adequately describe the function
Chain: ... -> fault -> (activation) -> error -> (propagation) -> failure -> (causation) -> fault -> ...
[Figure: error propagation. A dormant internal fault or vulnerability is activated by the computation process, or an external fault occurs; the resulting error propagates from error to error within the component and, upon reaching the service interface, causes a failure: the transition from correct service to incorrect service (outage). Service restoration is the transition back to correct service.]
Propagation, Causation
System: set of components interconnected in order to interact
Component: another system
Causation: the failure of a component is a fault for the system that contains it, and for the components that interact with it
Means for Dependability
- Fault Prevention: preventing the occurrence or introduction of faults
- Fault Tolerance: delivering correct service in spite of faults
- Fault Removal: reducing the presence of faults
- Fault Forecasting: estimating the present number, the future incidence, and the likely consequences of faults
Groupings:
- Fault prevention + fault tolerance: dependability provision
- Fault removal + fault forecasting: dependability assessment
- Fault prevention + fault removal: fault avoidance
- Fault tolerance + fault forecasting: fault acceptance
Dependability definitions
- Original definition: ability to deliver service that can justifiably be trusted
  - Makes it possible to generalize availability, reliability, safety, confidentiality, integrity, maintainability, which are then attributes of dependability
- Alternate definition: ability to avoid service failures that are unacceptably frequent or severe
  - A system can, and usually does, fail. Is it however still dependable? When does it become undependable?
  - -> criterion for deciding whether or not, in spite of service failures, a system is still to be regarded as dependable (stated explicitly here, and only implicitly in the original definition)
- Dependence of system A on system B is the extent to which system A's dependability is (or would be) affected by that of system B
- Trust: accepted dependence
Dependability and similar notions
- Dependability
  - Goal: 1) ability to deliver service that can justifiably be trusted; 2) ability of a system to avoid service failures that are unacceptably frequent or severe
  - Threats present: 1) development faults (e.g., software flaws, hardware errata, malicious logic); 2) physical faults (e.g., production defects, physical deterioration); 3) interaction faults (e.g., physical interference, input mistakes, attacks, including viruses, worms, intrusions)
- High Confidence
  - Goal: consequences of the system behavior are well understood and predictable
  - Threats present: 1) hostile attacks (from hackers or insiders); 2) environmental disruptions (accidental disruptions, either man-made or natural); 3) human and operator errors (e.g., software flaws, mistakes by human operators)
  - Reference: 'Information technology frontiers for a new millennium', Blue Book 2000, NSTC
- Survivability
  - Goal: capability of a system to fulfill its mission in a timely manner
  - Threats present: 1) attacks (e.g., intrusions, probes, denials of service); 2) failures (internally generated events due to, e.g., software design errors, hardware degradation, human errors, corrupted data); 3) accidents (externally generated events such as natural disasters)
  - Reference: R. Ellison et al., 'Survivable network systems', SEI Report, 1999
- Trustworthiness
  - Goal: assurance that a system will perform as expected
  - Threats present: internal and external threats; naturally occurring hazards and malicious attacks from a sophisticated and well-funded adversary
  - Reference: F. Schneider, ed., 'Trust in cyberspace', National Academy Press, 1999
Dependability
- Attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability (availability, confidentiality and integrity together constituting Security)
- Threats: Faults, Errors, Failures
- Means: Fault Prevention, Fault Tolerance, Fault Removal, Fault Forecasting
Dependability
- Attributes: Availability/Reliability, Safety, Confidentiality, Integrity, Maintainability
- Threats
  - Faults: Development, Physical, Interaction
  - Errors
  - Failures
- Means
  - Fault prevention
  - Fault tolerance: Error detection; System recovery (Error handling, Fault handling)
  - Fault removal
    - Development: Verification (Static verification, Dynamic verification), Diagnosis, Correction, Non-regression verification
    - Operational life: Corrective or preventive maintenance
  - Fault forecasting: Ordinal or qualitative evaluation; Probabilistic or quantitative evaluation (Modeling, Operational testing)
Outline
- Threats: faults, errors, failures
- State of the art from statistics
- Fault removal
- Fault forecasting
- Fault tolerance
- Development of dependable systems
- From dependability to resilience
- Conclusion
- References
Dependability threats: faults, errors, failures
Error of a programmer
-> Fault: impaired instructions or data
-> Activation: faulty component and inputs
-> Error
-> Propagation: delivered service deviates (in value or timing) from implementing the function
-> Failure
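The programmer-error chain above can be illustrated with a deliberately buggy function. This is a minimal sketch; the function and its bug are invented for illustration:

```python
def average_positive(xs):
    """Intended function: average of the positive numbers in xs.
    The programmer's slip below is a dormant FAULT in the code."""
    total, count = 0, 0
    for x in xs:
        if x > 0:
            total += x
        count += 1          # FAULT: counts every element, not only positive ones
    return total / count    # an ERROR in internal state reaches the interface here

# The fault stays dormant for all-positive inputs (correct service):
print(average_positive([2, 4, 6]))      # 4.0 -- correct service

# ACTIVATION by an input containing a non-positive value produces an ERROR,
# which PROPAGATES to the service interface as a FAILURE (wrong value delivered):
print(average_positive([2, 4, 6, 0]))   # 3.0 instead of 4.0 -- service failure
```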
Short-circuit in an integrated circuit
Failure (of the circuit)
-> (Causation) Fault: stuck-at connection, modification of the circuit function
-> Activation: faulty component and inputs
-> Error
-> Propagation: delivered service deviates (in value or timing) from implementing the function
-> Failure
Operator error: inappropriate human-system interaction
-> Fault
-> Error
-> Propagation: delivered service deviates (in value or timing) from implementing the function
-> Failure
Electromagnetic perturbation
-> Fault: impaired memory data
-> Activation: faulty component and inputs
-> Error
-> Propagation: delivered service deviates (in value or timing) from implementing the function
-> Failure
Faults, errors, failures: elementary classes
Faults
- Phase of creation or occurrence: development faults; operational faults
- System boundaries: internal faults; external faults
- Phenomenological cause: natural faults; human-made faults
- Intent: malicious faults; non-malicious faults
- Capability: accidental faults; incompetence faults; deliberate faults
- Persistence: permanent faults; transient faults
Failures
- Domain: content failures; timing failures
- Detectability: signaled failures; unsignaled failures
- Consistency: consistent failures; inconsistent (Byzantine) failures
- Consequences: from minor failures to catastrophic failures
Chain: ... -> fault -> (activation) -> error -> (propagation) -> failure -> (causation) -> fault -> ...
Failures
- Domain
  - Value failures [value of delivered service deviates from implementing the system function]
  - Timing failures [timing of service delivery (instant or duration) deviates from implementing the system function]
- Detectability
  - Signaled failures [delivered service is detected as incorrect, and signaled as such]
  - Unsignaled failures [incorrect service delivery is not detected]
- Consistency
  - Consistent failures [incorrect service identically perceived by all users]
  - Inconsistent, or Byzantine, failures [some or all users perceive the incorrect service differently]
- Consequences
  - Minor failures [harmful consequences are of similar cost to the benefits provided by correct service delivery]
  - Catastrophic failures [cost of harmful consequences is orders of magnitude, or even incommensurably, higher than the benefits provided by correct service delivery]
Failure domain
- Value (correct timing): value failures
- Timing (correct value): service delivered too early: early timing failures; service delivered too late: late timing failures
- Value and timing: suspended service: halt failures; erratic service: erratic failures
Failure of detecting mechanisms
- Non-signaling of incorrect service: unsignaled failure
- Signaling incorrect service in the absence of failure: false alarm
System all of whose failures are, to an acceptable extent:
- halt failures: fail-halt system
- minor failures: fail-safe system
Faults
- Phase of creation or occurrence
  - Development faults [occur during (a) system development, (b) maintenance during the use phase, and (c) generation of procedures to operate or to maintain the system]
  - Operational faults [occur during service delivery of the use phase]
- System boundaries
  - Internal faults [originate inside the system boundary]
  - External faults [originate outside the system boundary and propagate errors into the system by interaction or interference]
- Phenomenological cause
  - Natural faults [caused by natural physico-chemical phenomena without human participation]
  - Human-made faults [result from human actions]
- Intent
  - Malicious faults [introduced by a human with the malicious objective of causing harm to the system]
  - Non-malicious faults [introduced without a malicious objective]
- Capability
  - Deliberate faults [result of a decision]
  - Accidental faults [introduced inadvertently]
  - Incompetence faults [result from lack of professional competence by the authorized human(s), or from inadequacy of the development organization]
- Persistence
  - Permanent faults [presence is assumed to be continuous in time]
  - Transient faults [presence is bounded in time]
Combined fault classes
[Figure: combining the elementary classes yields three major, partially overlapping groups: development faults, physical faults, and interaction faults. Each branch is characterized by phase of creation or occurrence (development / operational), system boundaries (internal / external), phenomenological cause (natural / human-made), intent (malicious / non-malicious), capability (accidental / incompetence / deliberate), and persistence (permanent / transient). Illustrative fault types: design flaws, malicious logic, production defects, physical deterioration, physical interference, intrusion attempts, viruses and worms, input mistakes, overload, configuration mistakes.]
Human-made faults
- Intent: non-malicious vs. malicious
- Non-malicious faults: accidental (mistakes) or deliberate (bad decisions); capability: deliberate vs. incompetence; committed in interaction (operators, maintainers) and in development (designers), by individuals and organizations; they span development, physical and interaction faults, in hardware, software and system
- Malicious faults: malicious logic faults (logic bombs, Trojan horses, trapdoors, viruses, worms, zombies); intrusion attempts
Failure pathology
- A dormant fault is activated by the computation process; a vulnerability is activated by an external fault
- An active fault produces error(s)
- An error is latent or detected; an error can disappear before being detected
- Error propagation creates other errors
- Component failure: the failure of a component is a fault as viewed by (a) the components interacting with the failed component, and (b) the system containing the failed component
Chain: fault -> (activation, or occurrence for an external fault) -> error -> (propagation) -> failure -> (causation: interaction, composition) -> fault
... failure -> fault -> (activation or occurrence) -> error -> (propagation) -> failure -> (causation: interaction or composition) -> fault -> ...
- A failure occurs when an error alters the delivered service; the facility for stopping the recursion is context-dependent
- Activation reproducibility: solid (hard) faults vs. elusive (soft) faults
- Interaction faults presuppose the prior presence of a vulnerability: an internal fault that enables an external fault to harm the system
- Permanent faults (development, physical, interaction); transient faults (physical, interaction); elusive (permanent) faults and transient faults are grouped together as intermittent faults, by contrast with solid faults
State of the art from statistics
June 1980: false alerts at the North American Air Defense (NORAD)
June 1985 - January 1987: excessive radiotherapy doses (Therac-25)
August 1986 - 1987: the "wily hacker" penetrates several tens of sensitive computing facilities
November 1988: Internet worm
15 January 1990: 9-hour outage of the long-distance phone network in the USA
February 1991: Scud missed by a Patriot (Dhahran, Gulf War)
November 1992: crash of the communication system of the London ambulance service
26 and 27 June 1993: authorization denial of credit card operations in France
4 June 1996: failure of Ariane 5 maiden flight
13 April 1998: crash of the AT&T data network
February 2000: distributed denials of service on large Web sites
May 2000: virus I Love You
July 2001: worm Code Red
August 2003: propagation of the electricity blackout in the USA and Canada
October 2006: 83,000 e-mail addresses, credit card info, banking transaction files stolen in the UK
[Table: each of the above incidents classified by fault class (physical, development, interaction), by dependability attribute affected (availability/reliability, safety, confidentiality), and by extent of the failure (localized or distributed).]
Costs of failures
- Average outage costs (millions of $ of revenue lost per hour, by industry sector): Energy 2.8; Manufacturing 1.6; Financial institutions 1.4; Insurance 1.2; Retail 1.1; Banking 1
- Maintenance costs: space shuttle on-board software, 100 M$ / year
- Cost of software project cancellation (failure of the development process)
  - FAA AAS: estimate 1983: 1 G$; estimate 1988 (contract awarded): 4 G$; estimate 1994: 7 G$; schedule slip (estimate 1994): 6-8 yrs
  - USA [Standish Group, 2002, 13,522 projects]: success 34%, challenged 51%, cancelled 15%; loss ~38 G$ (out of a total of 225 G$)
- Yearly cost of failures: estimates of insurance companies (2000)
  - Accidental faults: UK 1.25 G£; France (private sector) 0.9 G€
  - Malicious faults: France 1 G€; USA 4 G$
  - Global estimate: USA 80 G$; EU 60 G€
Accidental faults: number of failures by fault class [consequences and outage durations depend upon the application]

Fault class        | Controlled systems (e.g., civil airplanes, phone network, Internet front-end servers) | Dedicated computing systems (e.g., transaction processing, electronic switching, Internet back-end servers)
Development        | 15-20% (rank 2) | ~60% (rank 1)
Human interactions | 40-50% (rank 1) | ~20% (rank 2)
Physical external  | 15-20% (rank 2) | ~10% (rank 3)
Physical internal  | 15-20% (rank 2) | ~10% (rank 3)
Permanent faults and production yield for Intel microprocessors
1 FIT = 10^-9 / h; DPM: defects per million
[Charts: FIT and DPM over 1972-2000 (log scale); number of transistors per integrated circuit (x 10^6) over 1980-2000, for processors from the 8086, 286DX, 386DX, 486DX, Pentium, Pentium MMX, K7, Pentium 2/3/4, Pentium M, Athlon and PowerPC 4 up to the Itanium 2. First microprocessor: the 4004, 1971, 2300 transistors.]
Transient operational physical faults: their proportion increases with integration
- Reduced geometric dimensions -> lower and lower energies involved -> environmental perturbations become faults more easily, directly (alpha particles, outer-space particles) or indirectly
- Limitation of electrical fields -> lower supply voltage -> reduced noise immunity
Cosmic rays
- Low-energy particles are deflected; primaries disappear at ~25 km
- ~100/cm²·s at 12,000 m; ~1600/m²·s cascade to sea level; ~1/cm²·s sea-level flux
[Figure: cascade of secondary particles: pions, muons, gammas, electrons/positrons, neutrons, protons.]
Andrew network of CMU, 13 SUN stations, 21 station-years:

                    | Number of manifestations | Mean time to manifestation (h)
Permanent faults    | 29                       | 6552
Intermittent faults | 1056                     | 183
System failures     | 298                      | 689

Tandem experimental data: examination of anomaly log files (several tens of systems, 6 months), 132 recorded software faults:
- 1 solid fault ("Bohrbug")
- 131 elusive faults ("Heisenbugs")
[From J. Gray, 'Dependability in the Internet era']
Drivers: complexity; economic pressure

Availability | Outage duration / yr
0.9          | 36 d 12 h
0.99         | 3 d 16 h
0.999        | 8 h 46 min
0.9999       | 52 min 34 s
0.99999      | 5 min 15 s
0.999999     | 32 s

[Chart: availability of computer systems, telephone systems, cellphones, and the Internet, 1950-2010, spanning 9% to 99.999%.]
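The availability-to-outage correspondence in the table above can be recomputed directly. A minimal sketch, assuming a 365.25-day year:

```python
def yearly_outage_seconds(availability, seconds_per_year=365.25 * 24 * 3600):
    """Cumulative outage per year implied by a steady-state availability."""
    return (1 - availability) * seconds_per_year

for a in (0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999):
    s = yearly_outage_seconds(a)
    d, s = divmod(s, 86400)   # whole days
    h, s = divmod(s, 3600)    # whole hours
    m, s = divmod(s, 60)      # whole minutes
    print(f"{a}: {int(d)}d {int(h)}h {int(m)}min {int(s)}s")
```

Each extra nine divides the yearly outage by ten: 0.9 gives roughly 36 days, 0.999999 roughly half a minute, matching the table.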
Mean time to system crash due to hardware failure
[Chart: MTTF (yrs), 1985-2000, from ~0 to ~30 yrs, for high-end IBM servers: ECL-TCM machines (308X/3090, 9020) and CMOS machines (G1-G5).]
[From L. Spainhower and T. A. Gregg, 'IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective', IBM J. Research and Development, 1999]

Census of Tandem systems [From J. Gray, 'A Census of Tandem System Availability Between 1985 and 1990', IEEE Trans. on Reliability, Oct. 1990]:

           | Number | Duration (yrs)
Clients    | 2000   | 7000
Systems    | 9000   | 30000
Processors | 25500  | 80000
Disks      | 74000  | 200000
Reported outages: 438; system MTBF: 21 years
[Chart: MTBF (yrs) by cause: hardware, software, operations, maintenance, environment, total; scale 0-450 yrs.]
Availability for 100 h MTBF:

MTTR    | Availability
8 hours | 98.5%
1 hour  | 99.0%
10 mins | 99.83%
1 min   | 99.98%

Website uptime statistics (Netcraft), top 50 most requested sites
[Chart: outage figures at Dec 2003, Nov 2004, July 2005, July 2006, Sep 2007 and June 2008; average of averages and average of maxima; scale 0-300.]
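The table above rests on the standard steady-state relation A = MTBF / (MTBF + MTTR). A minimal sketch (the MTBF of 1000 h in the loop is an assumption for illustration, not the slide's figure):

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: uptime fraction MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Shrinking the repair time improves availability just as stretching the
# time between failures does; e.g., with an assumed MTBF of 1000 h:
for mttr in (8.0, 1.0, 10 / 60, 1 / 60):
    print(f"MTTR {mttr:7.3f} h -> availability {availability(1000.0, mttr):.6f}")
```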
Three large websites [from D. Oppenheimer, A. Ganapathi, D. A. Patterson, 'Why do Internet services fail, and what can be done about it?', USITS '03]

Service characteristic      | Online (mature)          | ReadMostly (mature)   | Content (bleeding edge)
Hits per day                | ~100 million             | ~100 million          | ~7 million
# of machines               | ~500, 2 sites            | >2000, 4 sites        | ~500, ~15 sites
Front-end node architecture | Solaris on SPARC and x86 | Open-source OS on x86 | Open-source OS on x86
Period of data studied      | 7 months                 | 6 months              | 3 months
Component failures          | 296                      | N/A                   | 205
Service failures            | 40                       | 21                    | 56
MTTF                        | 126 hours                | 206 hours             | 39 hours

Service failure cause by location:
  Front-end                 | 77%                      | 0%                    | 66%
  Back-end                  | 3%                       | 10%                   | 11%
  Network                   | 18%                      | 81%                   | 18%
  Unknown                   | 2%                       | 9%                    | 4%

Average TTR by part of service (hrs):
  Front-end                 | 9.4 (16 serv. fail.)     | N/A                   | 2.5 (10 serv. fail.)
  Back-end                  | 7.3 (5 serv. fail.)      | 0.2 (1 serv. fail.)   | 14 (3 serv. fail.)
  Network                   | 7.8 (4 serv. fail.)      | 1.2 (16 serv. fail.)  | 1.2 (2 serv. fail.)

Average availability        | 93.5%                    | 97.2%                 | 97.8%
Component failure to service failure
[Charts: for the Online and Content services, counts of component failures vs. resulting service failures, broken down by location (node operator, node hardware, node software, network operator, network hardware, network software, network unknown), together with the corresponding coverage (%).]
Malicious faults
SEI/CERT statistics: vulnerabilities reported
[Chart: yearly reported vulnerabilities, 1995-2005, rising from a few hundred to several thousand per year; scale 0-7000.]
Slammer/Sapphire worm
The fastest computer worm in history. As it began spreading throughout the Internet, it doubled in size every 8.5 seconds. It infected more than 90 percent of vulnerable hosts within 10 minutes. The worm began to infect hosts slightly before 05:30 UTC on Saturday, January 25, 2003. Sapphire exploited a buffer overflow vulnerability in computers on the Internet running Microsoft's SQL Server or MSDE 2000 (Microsoft SQL Server Desktop Engine). This weakness in an underlying indexing service was discovered in July 2002; Microsoft released a patch for the vulnerability before it was announced. The worm infected at least 75,000 hosts, perhaps considerably more, and caused network outages and such unforeseen consequences as canceled airline flights, interference with elections, and ATM failures.
[Figure: the geographic spread of Sapphire in the 30 minutes after release.]
[From: http://www.caida.org/publications/papers/2003/sapphire/sapphire.html]
Loss of availability: top ten incidents
Global Information Security Survey 2004 — Ernst & Young
Percentage of respondents that indicated the following incidents resulted in an unexpected or unscheduled outage of their critical business (scale 0%-80%):
- Hardware failures
- Major virus, Trojan horse, or Internet worms
- Telecommunications failure
- Software failure
- Third-party failure, e.g., service provider
- System capacity issues
- Operational errors, e.g., wrong software loaded
- Infrastructure failure, e.g., fire, blackout
- Former or current employee misconduct
- Distributed Denial of Service (DDoS) attacks
Non-malicious: 76%; malicious: 24%
Yearly survey on computer damages in France — CLUSIF (2000, 2001, 2002)
Categories surveyed: internal failures, loss of essential services, natural events, physical accidents, utilization errors, design errors, virus infection, targeted logic attacks, disclosures and frauds, image damage, thefts, physical sabotage.
- Occurrences (with 3-year trends: stable, increase, or decrease): non-malicious causes 71%; malicious causes 29% [chart: occurrence percentages per category, scale 0-25%]
- Risk perception: non-malicious causes 29%; malicious causes 71%
- Occurrence impact, rated high / average / low, for internal failures, loss of essential services, natural events, utilization errors, design errors, virus infection, and thefts [chart: impact distribution per category, scale 0-100%]
Fault Removal
Fault removal: reducing the presence of faults
- Verification: checking whether the system satisfies verification conditions (general, or specific to the system)
- Diagnosis
- Correction
- Non-regression verification
Verification
- Static verification (without executing the system)
  - On the system itself: reviews and inspections; static analysis; theorem proving
  - On a model of the system behavior: model checking
- Dynamic verification (effective execution)
  - Symbolic inputs, on a system model: symbolic execution
  - Valued inputs, on the system: test
Verification applies to the specification, the design, and the implementation.
Reviews and inspections vs. testing
- Schneider Electric data, nuclear control software: [chart comparing, for component static analysis, unit tests, integration tests, and qualification tests, the number of uncovered faults with the verification effort in man-days; scale 0-300.]
- HP data, uncovered faults per hour:
  Inspections: 1.06
  Structural testing: 0.32
  Functional testing: 0.28
  Operational testing: 0.21
Space shuttle on-board software (500,000 lines of code)
[Chart: faults per KSLOC across versions, 5/83 to 6/91, scale 0-30, split into product faults (operation), process faults (testing), and early detection (pre-build).]
Cost of software verification
Average figures (for critical systems, ~70%):
- Full life cycle: requirements 3%, specification 3%, design 5%, implementation 7%, verification 15%, maintenance 67%; maintenance splits into corrective 20% [D,I,V], adaptive 25% [S,D,I,V], perfective 55% [R,S,D,I,V]
- Development alone: requirements 7%, specification 8%, design 17%, implementation 22%, verification 45%
- Counting the verification performed within maintenance, verification amounts to ~46% of the total
Relative cost of fault removal: all the higher as removal is late

Phase                        | 1976 (Boehm) | 1995 (Baziuk)
Requirements & specification | 1            | 1
Design                       | 5 - 7.5      |
Implementation               | 8 - 15       |
Integration                  | 10 - 25      | 90 - 440
Qualification                | 25 - 75      | 440
Operation/Maintenance        | 50 - 500     | 470 - 880

IBM data: [chart of fault percentages per phase (implementation, unit testing, integration testing, qualification testing, operational life) for faults introduced by phase, faults corrected by phase, and cost of fault correction by phase.]
Percentage of erroneous corrections with respect to the number of removed faults
- Functional specs review: 1%
- Detailed specs review: 2%
- Inspection of component logic: 2%
- Inspection of component code: 3%
- Unit testing: 4%
- Functional testing: 6%
- System testing: 10%
Typical distribution of software faults uncovered during development (percentage of uncovered faults, scale 0-40%):
- Incorrect or poorly expressed requirements
- Incorrect or poorly expressed specification
- Design-implementation faults affecting several components
- Design-implementation faults affecting one component
- Typographical faults
- Regression faults
- Others
Persistent software faults classified by manifestation frequency (200 faults examined)

Category                                                  | Project A (500 K ins.) | Project B (100 k ins.) | Total
1. Omitted logic (existing code too simple)               | 24 | 36 | 60
2. Non re-initialization of data                          | 6  | 17 | 23
3. Regression fault                                       | 12 | 5  | 17
4. Documentation fault (correct software)                 | 6  | 10 | 16
5. Inadequate specification                               | 1  | 10 | 11
6. Faulty binary correction                               | 11 | 0  | 11
7. Erroneous comment                                      | 11 | 0  | 11
8. IF instruction too simple                              | 2  | 9  | 11
9. Faulty data reference                                  | 4  | 6  | 10
10. Data alignment fault (left bits wrt right bits, etc.) | 3  | 4  | 7
11. Timing fault causing data loss                        | 3  | 3  | 6
12. Non-initialization of data                            | 1  | 4  | 5
13. Other categories of lesser importance (each total less than or equal to 4)
Fault density
- Created faults: 10 to 300 / KLOC
- Residual faults: 0.01 to 10 / KLOC

Example: increments of AT&T ESS-5

Size of increment | Density of faults uncovered during development (x) | Density of faults uncovered in the field (y) | Learning factor (y/x) | Type of software
42069   | 28.5  | 7.2  | 0.25 | Preventive maintenance
5422    | 67.3  | 21.0 | 0.31 | Billing
9313    | 79.3  | 27.7 | 0.35 | On-line upgrading
14467   | 26.5  | 7.2  | 0.27 | System growth
165042  | 101.6 | 5.3  | 0.05 | Hardware fault recovery
16504   | 84.1  | 2    | 0.02 | Software fault recovery
38737   | 149.4 | 5.8  | 0.04 | System integrity
Design faults ('errata') of Intel processors
[Chart: for each processor and date of first update — Pentium (Mar 95), Pentium Pro (Nov 95), Pentium II (May 97), Mobile Pentium II (Aug 98), Celeron (Apr 98), Xeon (May 98) — the total number of design faults and the number still present in Jan 99; scale 0-140.]
Fault Forecasting
Fault forecasting: estimation of the presence, creation, and consequences of faults
- Ordinal or qualitative evaluation: identification, analysis of consequences, and classification of failures
- Probabilistic or quantitative evaluation: probabilistic evaluation of the extent to which some dependability attributes are satisfied
  - Modeling: behavior model of the system with respect to failures, maintenance actions, and solicitations
  - Operational testing: evaluation testing according to an operational input profile
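Operational testing can be sketched as statistical estimation under an operational input profile: inputs are drawn with the probabilities they have in operation, and the failure probability per execution is estimated from the observed failures. All names and the profile below are invented for illustration:

```python
import random

def operational_test(system, profile, n_runs, oracle, seed=0):
    """Draw inputs according to the operational profile, run the system,
    and estimate its probability of failure per execution."""
    rng = random.Random(seed)
    inputs, weights = zip(*profile.items())
    failures = 0
    for _ in range(n_runs):
        x = rng.choices(inputs, weights=weights)[0]
        if system(x) != oracle(x):        # failure: deviation from the oracle
            failures += 1
    return failures / n_runs

# Hypothetical system under test: fails only on the rare input class "C".
profile = {"A": 0.7, "B": 0.25, "C": 0.05}     # assumed operational profile
oracle = lambda x: x.lower()
buggy = lambda x: x.lower() if x != "C" else "wrong"
print(operational_test(buggy, profile, 10_000, oracle))   # close to 0.05
```

The estimate reflects how often the failure region is exercised in operation, which is exactly what distinguishes evaluation testing from fault-removal testing.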
Evaluation methods
- Ordinal evaluation: failure mode and effect (and criticality) analysis [FME(C)A]
- Ordinal or probabilistic evaluation: reliability diagrams; fault trees
- Probabilistic evaluation: state diagrams: Markov chains, Petri nets
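As a minimal illustration of probabilistic evaluation, a two-state Markov model (Up and Down, with failure rate lambda and repair rate mu) has steady-state availability mu / (lambda + mu). A sketch with assumed rates:

```python
def steady_state_availability(lam, mu):
    """Two-state Markov chain Up <-> Down: steady-state probability of Up."""
    return mu / (lam + mu)

def availability_at(t, lam, mu, steps=200_000):
    """Transient availability p_up(t), by Euler integration of
    dp_up/dt = -lam * p_up + mu * (1 - p_up), starting from p_up(0) = 1."""
    p_up, dt = 1.0, t / steps
    for _ in range(steps):
        p_up += (-lam * p_up + mu * (1.0 - p_up)) * dt
    return p_up

lam, mu = 1e-3, 0.5      # assumed rates: MTTF = 1000 h, MTTR = 2 h
print(steady_state_availability(lam, mu))   # close to 0.998
print(availability_at(100.0, lam, mu))      # converges toward the same value
```

Richer models (repair queues, coverage of the detection mechanisms, multiple components) follow the same pattern with more states; Petri nets are a convenient front-end for generating such chains.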
Failure intensity (average number of failures per unit of time) as a function of time:
- Reliability growth: improved ability to deliver correct service [stochastic increase of times to failure]; decreasing failure intensity; non-stationary stochastic processes
- Stable reliability: preserved ability to deliver correct service [stochastic equality of times to failure]; constant failure intensity; stationary stochastic processes
- Reliability decrease: degraded ability to deliver correct service [stochastic decrease of times to failure]; increasing failure intensity
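One common way to decide which of the three regimes a set of failure data exhibits is the Laplace trend test. A sketch (the failure instants below are invented for illustration):

```python
from math import sqrt

def laplace_factor(failure_times, t_end):
    """Laplace trend factor for failure instants observed over (0, t_end].
    u < 0 suggests decreasing failure intensity (reliability growth),
    u > 0 increasing intensity (reliability decrease), u near 0 stability."""
    n = len(failure_times)
    return (sum(failure_times) / n - t_end / 2) / (t_end * sqrt(1.0 / (12 * n)))

# Failures clustered early in the window: intensity is decreasing.
print(laplace_factor([5, 12, 25, 45, 80], 100.0))    # negative: growth
# Failures clustered late in the window: intensity is increasing.
print(laplace_factor([20, 55, 75, 88, 95], 100.0))   # positive: decrease
```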
IBM data
[Chart: average number of unique faults per 3 months vs. relative month where the failure was first reported (months 1-35, scale 0-18), for releases before version X and for versions X through X+4 (curves labeled X+1 to X+5).]
AT&T data: hardware replacement rate trend (field data)
[Chart: packs per 1k lines per month (/50), scale 0 to 0.4, over years 1-3.]
Study of the FAA Airworthiness Directives (AD), 1 Jan. 1980 – 21 Sept. 1994
- Confirmed avionics ADs: 33 (hardware: 20, software: 13)
- Equipment: Rockwell/Collins, Bendix/King, Honeywell/Sperry, Tracor Aerospace
Estimation of software reliability

[Figure: estimated failure rates (h⁻¹), between 10⁻⁶ and 10⁻¹⁰, for DME (88), DME (94), TCAS II (91), TCAS (94), Omega, and ATC Transponder, with the average marked; annotation: no software problem reported]
Fault Tolerance
Delivering a service that implements the system function in spite of faults

- Error detection: identification of the presence of errors
- Error handling: removal of errors from the system state, if possible before failure occurrence
- Fault handling: preventing fault(s) from being activated again
- System recovery: transformation of the erroneous state into a state free from detected errors and from faults that can be activated again
Error detection
- Concurrent detection, during service delivery
- Preemptive detection: service delivery suspended, search for latent errors and dormant faults
- Addition of error detection mechanisms in a component → self-checking component

Error handling
- Rollback: brings the system back to a state saved prior to error occurrence; the saved state is a recovery point
- Rollforward: a new state, free from detected errors, is found
- Compensation: the erroneous state contains enough redundancy to enable error masking
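Rollback can be sketched in a few lines — a minimal illustration, assuming a dict-valued state (the `RecoverableSystem` class and the balance example are hypothetical, not from the slides):

```python
import copy

class RecoverableSystem:
    """Rollback error handling: a recovery point is a saved copy of the
    state; on error detection, the system is brought back to it."""

    def __init__(self, state):
        self.state = state
        self._recovery_point = None

    def save_recovery_point(self):
        self._recovery_point = copy.deepcopy(self.state)

    def rollback(self):
        # Replace the erroneous state with the saved recovery point.
        self.state = copy.deepcopy(self._recovery_point)

sys_ = RecoverableSystem({"balance": 100})
sys_.save_recovery_point()
sys_.state["balance"] = -1   # an error corrupts the state
sys_.rollback()              # rollback to the recovery point
```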
Fault handling
- Diagnosis: identifies and records the error cause(s), in terms of localisation and category
- Isolation: performs physical or logical exclusion of the faulty component(s) from further contribution to service delivery, i.e., makes the fault(s) dormant
- Reconfiguration: either switches in spare components or reassigns tasks among non-failed components
- Reinitialization: checks, updates and records the new configuration, and updates system tables and records

Intermittent faults: isolation and reconfiguration are not necessary — error handling ensures non-recurrence of the error, and fault diagnosis identifies whether the fault is absent or intermittent.
Fault tolerance strategies:
- Error detection and system recovery ("detection – recovery"): (1) error detection, (2) error handling by rollback, rollforward, or compensation, (3) fault handling.
- Fault masking and system recovery ("masking"): compensation applied systematically, even in the absence of errors, followed by fault handling.
Effectiveness of error processing: coverage factor

c = P{ service is delivered | failed component }

A component failure is covered if the resulting system error is successfully processed.

Duplex system, active redundancy; component failure rate λ, repair rate µ, non-coverage c̄ = 1 − c.

- Reliability: R(t) ≈ exp{ −2 (c̄ + λ/µ) λ t }
- MTTF ≈ 1 / [ 2 λ (c̄ + λ/µ) ]
- Unavailability: Ā = 1 − A ≈ 2 (c̄ + λ/µ) (λ/µ); non-redundant system: Ā_NR = λ/µ
- Equivalent failure rate: λ_eq = 2 (c̄ + λ/µ) λ, where
  - 2 c̄ λ accounts for a non-covered failure,
  - 2 (λ/µ) λ accounts for two successive failures, the first failure covered.

The coverage factor is influential when 1 − c ≥ λ/µ.
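The duplex-system approximations above can be evaluated numerically — a small sketch, with illustrative values of λ, µ and c (not taken from the slides):

```python
import math

def duplex_metrics(lam, mu, c, t):
    """Approximate dependability measures of a duplex system with
    coverage c, component failure rate lam and repair rate mu."""
    cbar = 1.0 - c                              # non-coverage
    lam_eq = 2.0 * (cbar + lam / mu) * lam      # equivalent failure rate
    return {
        "reliability": math.exp(-lam_eq * t),   # R(t) ≈ exp(-λeq·t)
        "mttf": 1.0 / lam_eq,                   # MTTF ≈ 1/λeq
        "unavailability": lam_eq / mu,          # Ā ≈ 2(c̄ + λ/µ)(λ/µ)
        "lambda_eq": lam_eq,
    }

# λ = 10⁻⁴/h, µ = 10⁻¹/h (so λ/µ = 10⁻³), c = 0.99:
m = duplex_metrics(lam=1e-4, mu=1e-1, c=0.99, t=1000.0)
```

With these values c̄ = 0.01 dominates λ/µ = 10⁻³, illustrating why coverage matters once 1 − c exceeds λ/µ.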
[Figure: R(t) vs. λt (10⁻² to 10³) for c = .95, .99, .995, .999, 1, with λ/µ = 10⁻³]
[Figure: unavailability Ā and equivalent failure rate λ_eq vs. λ/µ (10⁻⁴ to 10⁻²) for c = .9, .95, .99, .995, .999, 1, compared with the non-redundant system]

Unavailability expressed as downtime per year:
- Ā = 10⁻² → 3.6 days / yr
- Ā = 10⁻³ → 8.8 h / yr
- Ā = 10⁻⁴ → 53 min / yr
- Ā = 10⁻⁵ → 5 min / yr
- Ā = 10⁻⁶ → 30 s / yr
- Ā = 10⁻⁷ → 3 s / yr
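The downtime equivalences above follow directly from a year of 8760 hours — a one-line conversion:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_per_year_hours(unavailability):
    """Convert a steady-state unavailability Ā into hours of downtime
    per year."""
    return unavailability * HOURS_PER_YEAR

# Ā = 10⁻³ → 8.76 h/yr (≈ 8.8 h/yr); Ā = 10⁻⁴ → ≈ 53 min/yr
h3 = downtime_per_year_hours(1e-3)
min4 = downtime_per_year_hours(1e-4) * 60
```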
Tandem

[Figure: Tandem architecture — processor modules (CPU, memory, I/O channels, Dynabus interface) connected by the duplicated Dynabus, with dual-ported disk, tape, and VDU controllers]
Experimental data
- Systems: 9000
- Processors: 25500
- Disks
- System MTBF: 21 years

[Figure: number of reported failures vs. length of fault chains having led to failure — 71, 183, 53, 21, 1, 1, 1 failures for increasing chain lengths (1 to 8)]
[Figure: MTBF (years, 0–450) by cause — hardware, software, operations, maintenance, environment, and total]
Distributed system: a set of processors (processing and storage) communicating by messages via a communication network.

Hierarchy of fault-tolerant protocols according to the "uses the service of" relation:
- Group membership
- Atomic multicast
- Clock synchronization

Common mechanism: consensus

[Figure: processors connected by a communication network]
Fail-silent processors

Implies self-checking processors. Consequences:
1) Only correct-value messages are sent → minimum number of processors for consensus in the presence of t faults: n ≥ t+2
2) Error detection: halting detected by interrogating and waiting
3) Resource replication for tolerating t faults ("halts"): ≥ t+1
4) Communication network saturation by "babbling" is impossible
5) Network architecture: [figure]
Arbitrary-failure processors

Induced complications:
1) Possibility of non-consistent ("Byzantine") behavior → minimum number of processors for consensus in the presence of t faults: n ≥ 3t+1, with t+1 rounds of message exchange
2) Error detection ≠ halt detection
3) Resource replication for tolerating t faults: ≥ 2t+1
4) Communication network saturation by "babbling" is possible
5) Possibility of "lying" about the source address
6) Network architecture: [figure]
Interactive consistency example (n = 4, t = 1; P4 faulty, its values noted X):
- Each processor first broadcasts its private value (P1: 1, P2: 2, P3: 3; the faulty P4 sends inconsistent values, e.g., 4 to one processor and 9 to the others).
- Each processor then relays the values it received from the other processors (second exchange); the faulty P4 may again relay arbitrary values.
- After a majority vote on the relayed copies, the three correct processors agree on the same vector (1 2 3 9), whatever P4 did.
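The two-exchange, majority-vote scheme above can be simulated — a sketch of interactive consistency for n = 4, t = 1, where the processor names and the faulty behaviour (random lies) are illustrative assumptions:

```python
import random
from collections import Counter

def majority(values, default="NIL"):
    """Strict majority; a default value when no majority exists, so that
    correct processors still agree."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def interactive_consistency(private, faulty, rng):
    """private: processor -> private value; faulty: set of faulty names."""
    procs = list(private)
    # Exchange 1: each q broadcasts its private value; a faulty q may send
    # a different arbitrary value to each receiver.
    got = {p: {q: (rng.randint(0, 9) if q in faulty else private[q])
               for q in procs if q != p} for p in procs}
    # Exchange 2 + majority vote: p decides q's value from its own copy and
    # the copies relayed by the other processors (faulty relays may lie).
    decided = {}
    for p in procs:
        if p in faulty:
            continue
        vector = {p: private[p]}
        for q in procs:
            if q == p:
                continue
            reports = [got[p][q]]
            for r in procs:
                if r not in (p, q):
                    reports.append(rng.randint(0, 9) if r in faulty
                                   else got[r][q])
            vector[q] = majority(reports)
        decided[p] = vector
    return decided

rng = random.Random(1)
decided = interactive_consistency({"P1": 1, "P2": 2, "P3": 3, "P4": 9},
                                  {"P4"}, rng)
```

The correct processors always decide identical vectors, and each correct processor's own value is decided correctly by the others.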
Practical aspects
- Performance overhead due to the volume of exchanged data:
  - t = 1 → 9 values received by each processor
  - t = 2 → 156 values received by each processor
- Use of signatures: signatures attached to the exchanged data such that the probability of an undetected corruption is negligible
  - 2t+1 processors and communication channels
  - t+1 exchanges
  - reduced volume of exchanged data:
    - t = 1 → 4 values received by each processor
    - t = 2 → 40 values received by each processor
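One counting model consistent with all four figures above: in exchange k each value has been relayed along chains of k distinct processors, so a processor receives (n−1) + (n−1)(n−2) + … values. A small check (the function is an illustration, not from the slides):

```python
def values_received(n, rounds):
    """Values received per processor over `rounds` exchanges among n
    processors: sum of falling factorials (n-1)(n-2)...(n-k)."""
    total, product = 0, 1
    for k in range(rounds):
        product *= (n - 1 - k)
        total += product
    return total

# Oral messages need n = 3t+1 and t+1 exchanges:
oral_t1 = values_received(4, 2)    # t=1
oral_t2 = values_received(7, 3)    # t=2
# With signatures, n = 2t+1 suffices, shrinking the volume:
signed_t1 = values_received(3, 2)  # t=1
signed_t2 = values_received(5, 3)  # t=2
```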
Other utilization: data with likely variations (e.g., from sensors) — median vote.

[Figure: median vote example — each processor holds three sensor values and selects the median, e.g., (176, 178, 233) → 178; with a deviating source, processors may first hold different triples, (110, 176, 178) → 176, (176, 177, 178) → 177, (176, 178, 233) → 178, and after a further exchange all correct processors hold (176, 177, 178) → 177]
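Median vote is a one-liner: with three inputs, the median discards a single erroneous outlier (the sensor values below are the slide's example figures):

```python
def median_vote(a, b, c):
    """Median of three sensor-like values: tolerates one arbitrary
    outlier without requiring exact agreement."""
    return sorted((a, b, c))[1]

r1 = median_vote(176, 178, 233)  # the outlier 233 is discarded → 178
r2 = median_vote(176, 177, 178)  # → 177
```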
Development fault tolerance

Soft faults:
- Prevention of error propagation → fail-fast: error detection (defensive programming) and exception handling
- Service continuity → error detection and recovery points

Solid faults → design diversity:
- Recovery blocks: two-fold diversity + acceptance test
- N-version programming: three-fold diversity + vote
- N-self-checking programming: four-fold diversity + switching
- Double programming: two-fold diversity + comparison
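A recovery block (two-fold diversity + acceptance test) can be sketched as follows — the square-root variants and tolerance are illustrative assumptions, not from the slides:

```python
def recovery_block(inputs, primary, alternate, acceptance_test):
    """Run the primary variant; if the acceptance test rejects its result,
    fall back (from the input state) to the alternate variant."""
    result = primary(inputs)
    if acceptance_test(inputs, result):
        return result
    result = alternate(inputs)
    if acceptance_test(inputs, result):
        return result
    raise RuntimeError("both variants rejected by the acceptance test")

# Two diverse "variants" of square root, one of them faulty:
faulty_sqrt = lambda x: x / 2        # wrong for most inputs
correct_sqrt = lambda x: x ** 0.5
accept = lambda x, r: abs(r * r - x) < 1e-6 * max(1.0, x)

out = recovery_block(9.0, faulty_sqrt, correct_sqrt, accept)
```

The acceptance test plays the role the slides assign it: a cheap, independent check that decides whether the variant's result may be delivered.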
Examples of recovery points: Tandem process pairs

Statistics on 169 software failures:
- 1 processor halted: 138
- Several processors halted: 31

Efficiency: 82% (138/169) — does not account for under-reporting.

Causes of multiple processor halts (31):
- 2nd halt: processor executing the backup of the failed primary: 24
  - same fault in primary and backup: 17
  - different fault: 4
  - unclear: 3
- Not process-pair related: 4
- Unclear: 3
Design diversity

- Aim: failure independence
- Obstacles: common specification, common difficulties, inter-variant synchronizations and decisions
- Operational use:
  - Civil avionics: generalized, at differing levels
  - Railway signalling: partial
  - Nuclear control: partial
- Dependability improvement:
  - Gain, although less than for physical fault tolerance
  - Contribution to the verification of the specification
CRA - NCSU - RTI - UCLA/UCSB - UIUC - UVA experiment
- Sensor management in a redundant inertial platform
- 20 programs (1600 to 5000 lines)
- 920,746 tests on a flight simulator
- Average failure probability, single version: 5.3 × 10⁻³
- Average failure probability, 3-version systems: 4.2 × 10⁻⁴
- Reliability improvement: factor of 13
Architecture of Airbus computers

[Figure: self-checking computer made of a control channel and a monitoring channel, each with its own processor, memory, I/O and 28V power supply; protection against lightning, EMI, and over/under voltage]
Airbus A-320

[Figure: A-320 flight control computers — ELAC1/2, SEC1/2/3, FAC1/2, SFCC1/2 — and the controlled surfaces: ailerons, elevators, horizontal plane, spoilers, rudder, slats and flaps]
Verification and evaluation of fault tolerance

Faults (deficiencies) in the algorithms and mechanisms of fault tolerance are addressed by:
- Fault removal — test: dynamic verification (including fault injection) and static verification (including model checking)
- Fault forecasting — evaluation of fault tolerance coverage, and of its influence, by modelling and test → improvement
Fault injection

[Figure: fault injection — faults and activity (inputs) applied to the target system; outputs observed as correct/incorrect, together with error detection, error and fault handling]

- Physical injection, by hardware: radiations, interferences, pins
- Informational injection, by software (memory, executive, processor) or in simulation
- Target system: prototype or actual system, or simulation model
- Key issue: representativity of the injected faults
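Software-implemented fault injection ("by software — memory") can be sketched as a bit-flip campaign: corrupt one bit of a data word per run and classify the activity's output as correct or incorrect. The target computation (a checksum) and the data are illustrative assumptions:

```python
def flip_bit(word, bit):
    """Inject a single-bit fault into a data word."""
    return word ^ (1 << bit)

def checksum(words):
    """The 'activity' whose output we observe."""
    return sum(words) & 0xFFFF

golden = [0x1234, 0x00FF, 0xABCD]
reference = checksum(golden)      # golden (fault-free) run

# Campaign: one run per (word, bit) fault location.
outcomes = []
for i in range(len(golden)):
    for bit in range(16):
        corrupted = list(golden)
        corrupted[i] = flip_bit(corrupted[i], bit)
        # Observe output: correct (True) or incorrect (False)
        outcomes.append(checksum(corrupted) == reference)

error_rate = outcomes.count(False) / len(outcomes)
```

For this target every single-bit flip propagates to the output (flipping bit b changes the sum by ±2^b), so the campaign reports an error rate of 1.0; a real campaign would also record which errors were detected and handled.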
Delta-4 system (LAAS):
[Figure: faults → errors → failures chain, with percentages of detected errors, tolerated errors, and failures (94%, 85%, 99%; 3%, 1%, 12%)]

Robustness testing of POSIX calls (CMU):
[Figure: failure probability (0 to 0.25) for Digital Unix 4.0, FreeBSD 2.2.5, HP-UX B.10.20, AIX 4.1, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, QNX 4.2.4, SunOS 5.5]

Test of Chorus ClassiX r3 (LAAS):
[Figure: distribution of responses (%) — error status, exception, system hang, application hang, incorrect result, no observation — for internal and external faults, kernel and debug configurations, average and min-max]
Development of Dependable Systems
Development steps:
- Requirements and specification
- Design
- Realisation (including selection and adaptation of pre-existing components)
- Integration
- Qualification

Means for dependability applied throughout: fault prevention, fault tolerance, fault removal, fault forecasting.
V development model

[Figure: V development model — creation branch: specification, architectural design, detailed design, component realisation; verification branch: integration, qualification]
Effort distribution

Years       | Requirements, specification, architectural design | Detailed design, component realisation | Integration, qualification
1960-1970   | 10%                                               | 80%                                    | 10%
1980        | 20%                                               | 60%                                    | 20%
1990        | 40%                                               | 30%                                    | 30%
Standards: techniques and approaches depend on system criticality

Risk: combination of the failure likelihood and of the possible consequences of failures

- IEC 61508 (1998): combination of a damage probability and of its severity
- ISO/IEC 15026 (1998): a function of the probability of occurrence of a given threat and the potential adverse consequences of that threat's occurrence
- MIL-STD-882D (February 2000): an expression of the impact and possibility of a mishap in terms of potential mishap severity and probability of occurrence
[Matrix: probability of occurrence (improbable, infrequent, occasional, probable, frequent) × severity of consequences (minor, significant, major, catastrophic)]

Probabilities of occurrence and severities of consequences depend on the application domain (transportation, energy production, telecommunications, etc.)
Examples of occurrence probabilities

Probability of occurrence | Continuous time (h⁻¹) | Discrete time (per solicitation)
Frequent                  | > 10⁻⁴                | > 10⁻¹
Probable                  | 10⁻⁴ – 10⁻⁵           | 10⁻¹ – 10⁻²
Occasional                | 10⁻⁵ – 10⁻⁶           | 10⁻² – 10⁻³
Infrequent                | 10⁻⁶ – 10⁻⁸           | 10⁻³ – 10⁻⁴
Improbable                | 10⁻⁸ – 10⁻¹⁰          | 10⁻⁴ – 10⁻⁶
Example of damage severity (civil avionics)

- No effect: failure conditions which do not affect the operational capability of the aircraft or increase pilot workload.
- Minor: failure conditions which would not significantly reduce aircraft safety, and which involve crew actions that are well within their capabilities. Minor failure conditions include, for example, a slight increase in crew workload, such as routine flight plan changes, or some inconvenience to occupants.
- Major: failure conditions which would reduce the capability of the aircraft or the ability of the crew to cope with adverse operating conditions to the extent that there would be, for example, a significant reduction in safety margins or functional capabilities, a significant increase in crew workload or in conditions impairing crew efficiency, or some discomfort to occupants.
- Hazardous / severe-major: failure conditions which would reduce the capability of the aircraft or the ability of the crew to cope with adverse operating conditions to the extent that there would be:
  - a large reduction in safety margins or functional capabilities;
  - physical distress or higher workload such that the flight crew could not be relied on to perform their tasks accurately or completely; or
  - serious or fatal injury to a relatively small number of the occupants.
- Catastrophic: failure conditions which would prevent continued safe flight and landing.
Risk classes

[Matrix: probability of occurrence (improbable to frequent) × severity (minor to catastrophic), partitioned into risk classes — unacceptable, undesirable, tolerable, negligible]
Risk classes → criticality

[Matrix: probability of occurrence (improbable to frequent) × severity (minor to catastrophic), mapped to criticality levels — trivial, low, intermediate, high]
Utilisation of techniques and approaches

- M: mandatory
- SR: strongly recommended; if the technique or approach is not used, the reasons have to be made explicit
- R: recommended
- —: no recommendation in favor or against
- NR: explicitly not recommended; if the technique or approach is used, the reasons have to be made explicit
Actor independence, according to criticality:
- Trivial: designer/implementer and verifier can be the same person(s)
- Low: designer/implementer and verifier can belong to the same company
- Intermediate and high: independent verifier(s)
From Dependability to Resilience
Resilience

The adjective "resilient" in dependability and security of computing systems:
- In use for 30+ years
- Recently, escalating use → buzzword
- Used essentially as a synonym of fault tolerant
- Noteworthy exception: the preface of Resilient Computing Systems, T. Anderson (Ed.), Collins, 1985: "The two key attributes here are dependability and robustness. […] A computing system can be said to be robust if it retains its ability to deliver service in conditions which are beyond its normal domain of operation."

In other domains — material science, ecology, child psychiatry and psychology, industrial safety, business, social psychology — resilience denotes adaptation to changes, and getting back after a setback: fault and change tolerance.
Dependability: ability to deliver service that can justifiably be trusted
→ Resilience: persistence of service delivery that can justifiably be trusted, when facing changes

The definition does not exclude the possibility of failure. Alternate definition of dependability: ability to avoid unacceptably frequent or severe service failures
→ Resilience: persistence of the avoidance of unacceptably frequent or severe service failures, when facing changes

In short — resilience: persistence of dependability when facing changes. At stake: maintaining dependability in spite of changes.

Changes are characterized by their nature (functional, environmental, technological), their prospect (foreseen, foreseeable, unforeseen), and their timing (short, medium, long term); the threats themselves evolve.
Changes

- Nature: functional, environmental, technological; plus threat evolution, e.g., an ever-increasing proportion of transient hardware faults, ever-evolving and growing attacks, and mismatches between the modifications that implement the changes and the former status of the system
- Prospect: foreseen, as in new versioning; foreseeable, as in the advent of new hardware platforms; unforeseen, as drastic changes in service requests or new types of threats
- Timing: short term, e.g., seconds to hours, as in dynamically changing systems (spontaneous, or 'ad-hoc', networks of mobile nodes and sensors, etc.); medium term, e.g., hours to months, as in new versioning or reconfigurations; long term, e.g., months to years, as in reorganizations resulting from the merging of systems in company acquisitions, or from the coupling of systems in military coalitions
Context: emerging ubiquitous systems — immense information systems, perhaps involving everything from super-computers, huge server "farms" and giant data centers, to myriads of small mobile computers (billions) and tiny embedded devices (trillions)

Characteristic: perpetual evolution

Development: dynamic componentization via service discovery and assembly
[From M.P. Papazoglou et al., "Service-Oriented Computing: State of the Art and Research Challenges", Computer, Nov. 2007]
Technologies for resilience

- Evolutionary changes → evolvability: adaptation
- Trusted service → assessability: verification and evaluation
- Ubiquitous systems → usability: human and system users
- Complex systems → diversity: taking advantage of existing diversity to avoid single points of failure, and augmenting diversity

[Matrix: evolvability, assessability, usability and diversity vs. the means for dependability — fault prevention, fault tolerance, fault removal, fault forecasting]
Conclusion
- Integration, interconnection, performance → dependency
- Vanishing substitutes for computers
- Funnel factor
- Decreasing natural robustness
- Unavoidability of faults
- Ill-mastered complexity
References (in addition to those mentioned in the slides)

- A. Avizienis, J.-C. Laprie, B. Randell, C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing", IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, January-March 2004
- J.-C. Laprie, "From Dependability to Resilience", 38th IEEE/IFIP Int. Conf. on Dependable Systems and Networks, Anchorage, Alaska, June 2008, Sup. Vol., pp. G8-G9