© Erik Hollnagel, 2010
Resilience Engineering:The changing nature of safety
Erik Hollnagel
E-mail: [email protected]
Professor &Industrial Safety ChairMINES ParisTechSophia Antipolis, France
Professor IINTNU
Trondheim, Norge
© Erik Hollnagel, 2010
Normal accident theory (1984)
“On the whole, we have complex systems because we don’t know how to produce the output through linear systems.”
© Erik Hollnagel, 2010
Mor
als,
soc
ial n
orm
s
Gov
ern m
ent
MTO view: sharp-end, blunt-end
Aut
hor it
y
Com
pan y
Man
age m
ent
Wor
kpla
ce
Activity
Blunt-end failures
happened earlier and
somewhere else
Sharp-end failures happen here and now
© Erik Hollnagel, 2010
Focus on operation (sharp end)
Design Maintenance
Organisation (management)
Scope of work system (1984)
Technology, automation
Upstream
Downstream
Work has clear objectives and takes place in well-defined situations. Systems and technologies are loosely coupled and
tractable.
© Erik Hollnagel, 2010
Scope of work system (2008)
Vertical and horizontal extensions
A vertical extension to cover the entire system, from technology to organisation
Design Maintenance
Organisation (management)
Upstream Technology, automation
Downstream
© Erik Hollnagel, 2010
Scope of work system (2008)
Vertical and horizontal extensions
One horizontal extension, to cover the lifecycle,
from design to maintenance
A vertical extension to cover the entire system, from technology to organisation
Design Maintenance
Organisation (management)
Upstream Technology, automation
Downstream
© Erik Hollnagel, 2010
DownstreamScope of work system (2008)
Vertical and horizontal extensions
One horizontal extension, to cover the lifecycle,
from design to maintenance
A vertical extension to cover the entire system, from technology to organisation
A second horizontal extension, to cover upstream and downstream processes
Design Maintenance
Organisation (management)
Upstream Technology, automation
© Erik Hollnagel, 2010
Tractable and intractable systems
Principles of functioning known (white box)
System does not change while being described
Description are simple with few details
Tractable system(independent, clockwork)
System changes before description is completed
Comprehensibility
Underspecified
Stability
Complicacy
Fully specified Partly specified
Intractable system(interdependent, teamwork)
System changes before description is completed
Underspecified
Principles of functioning unknown (black box)
Elaborate descriptions with many details
© Erik Hollnagel, 2010
Performance variability is necessary
Many socio-technical systems are intractable. The conditions of work therefore never completely match what has been specified or prescribed.
Systems are so complex that work situations always are underspecified – hence partly unpredictable
Few – if any – tasks can successfully be carried out unless procedures and tools are adapted to the situation. Performance variability is both normal and necessary.
?
Individuals, groups, and organisations normally adjust their performance to meet existing conditions, specifically actual resources and requirements.
Because resources (time, manpower, information, etc.) always are finite, such adjustments will always be approximate rather than exact.
Performance variability
Success
Failure
© Erik Hollnagel, 2010
If thoroughness dominates, there may be too little time to carry out the actions.
If efficiency dominates, actions may be badly
prepared or wrong
Neglect pending actionsMiss new events
Miss pre-conditionsLook for expected results
Thoroughness: Time to thinkRecognising situation.Choosing and planning.
Efficiency: Time to doImplementing plans. Executing actions.
Efficiency-Thoroughness Trade-Off
Time & resources needed
Time & resources available
© Erik Hollnagel, 2010
The ETTO principle
The ETTO principle describes the fact that people (and organisations) as part of their activities practically always must make a trade-off between the resources (time and effort) they spend on preparing an activity and the resources (time, effort and materials) they spend on doing it.
ETTOing favours thoroughness over efficiency if safety and quality are the dominant concerns, and efficiency over thoroughness if throughput and output are the dominant concerns.
The ETTO principle means that it is impossible to maximise efficiency and thoroughness at the same time. Neither can an activity expect to succeed, if there is not a minimum of either.
© Erik Hollnagel, 2010
Failures or successes?
Who or what are responsible for the remaining 10-20%?
When something goes right, e.g., 9.999 events
out of 10.000, are humans also responsible in 80-90% of the cases?
When something goes wrong, e.g., 1 event out of 10.000
(10E-4), humans are assumed to be responsible in
80-90% of the cases.
Who or what are responsible for the remaining 10-20%?
Investigation of failures is accepted as important.
Investigation of successes is rarely
undertaken.
© Erik Hollnagel, 2010
Why only look at what goes wrong?
Focus is on what goes wrong. Look for failures and malfunctions. Try to eliminate causes and improve barriers.
Focus is on what goes right. Use that to
understand normal performance, to do
better and to be safer.
Safety = Reduced number of adverse events.
10-4 := 1 failure in 10.000 events
1 - 10-4 := 9.999 non-failures in 10.000 events
Safety and core business help each other.
Learning uses most of the data available
Safety and core business compete for resources. Learning only uses a fraction of the data available
Safety = Ability to succeed under varying
conditions.
© Erik Hollnagel, 2010
Low Low Low Moderate
Risk profile (= lack of safety)
Very critical
Critical
Marginal
Negligible
Cons
eque
nce
ProbabilityCertainLikelyPossibleUnlikelyRare
Low Low Moderate High High
Moderate Moderate High High Extreme
High Extreme Extreme ExtremeHigh
Moderate
© Erik Hollnagel, 2010
Low
Benefit profile (= safety)
Innovative
Effective
Acceptable
Negligible
Cons
eque
nce
ProbabilityRare
Low
Moderate
High
Low
Unlikely
Low
Moderate
High
Possible
Very high
High
Low
Moderate
Moderate
Likely
High
Very high
High
Moderate
Certain
High
Very high
Very high
© Erik Hollnagel, 2010
Normal outcomes(things that go
right)
Serendipity
Very highPredictability
Range of event outcomes
Outcome
Very low
Posi
tive
Nega
tiv e
Neut
ral
Disasters
Near misses
Accidents
Incidents
Good luck
Mishaps
© Erik Hollnagel, 2010
Normal outcomes(things that go
right)
Serendipity
Very high Predictability
Frequency of event outcomes
Outcome
Very low
Posi
tive
Nega
tiv e
Neut
ral
Disasters
Accidents
Good luck
Mishaps
IncidentsNear misses
102
104
106
© Erik Hollnagel, 2010
Normal outcomes(things that go
right)
Serendipity
Very high Predictability
Being safe versus being unsafe
Outcome
Very low
Posi
tive
Nega
tiv e
Neut
ral
Disasters
Accidents
Good luck
Mishaps
IncidentsNear misses
102
104
106
Safe Safe FunctioningFunctioning
(invisible)(invisible)
Unsafe Unsafe FunctioningFunctioning
(visible)(visible)
© Erik Hollnagel, 2010
Consequence: Accidents are prevented by monitoring and damping variability. Safety requires constant ability to anticipate future events.
Non-linear accident model
Assumption:
Hazards-risks:
Functional Resonance Accident Model
C e r t i f i c a t i o nI
P
C
O
R
TF A A
L u b r i c a t i o nI
P
C
O
R
T
M e c h a n i c s
H i g h w o r k l o a d
G r e a s e
M a i n t e n a n c e o v e r s i g h t
I
P
C
O
R
T
I n t e r v a l a p p r o v a l s
H o r i z o n t a l s t a b i l i z e r
m o v e m e n tI
P
C
O
R
TJ a c k s c r e w
u p - d o w n m o v e m e n t
I
P
C
O
R
T
E x p e r t i s e
C o n t r o l l e ds t a b i l i z e r
m o v e m e n t
A ir c r a f t d e s i g n
I
P
C
O
R
T
A i r c r a f t d e s i g n k n o w l e d g e
A i r c r a f t p i t c h c o n t r o l
I
P
C
O
R
T
L i m i t i n g s t a b i l i z e r
m o v e m e n tI
P
C
O
R
T
L i m i t e ds t a b i l i z e r
m o v e m e n t
A i r c r a f t
L u b r i c a t i o n
E n d - p l a y c h e c k i n g
I
P
C
O
R
T
A l l o w a b l ee n d - p l a y
J a c k s c r e w r e p l a c e m e n tI
P
C
O
R
T
E x c e s s i v ee n d - p l a y
H i g h w o r k l o a d
E q u i p m e n t E x p e r t i s e
I n t e r v a l a p p r o v a l s
R e d u n d a n td e s i g n
P r o c e d u r e s
P r o c e d u r e s
* ETTO = Efficiency-Thoroughness Trade-Off
Accidents result from unexpected combinations (resonance) of variability of normal performance.
Emerge from combinations of normal variability (socio-technical system), hence looking for ETTO* and sacrificing decision
© Erik Hollnagel, 2010
Risks as non-linear combinations
CertificationI
P
C
O
R
TFAA
LubricationI
P
C
O
R
T
Mechanics
High workload
Grease
Maintenance oversight
I
P
C
O
R
T
Interval approvals
Horizontal stabilizer
movementI
P
C
O
R
TJackscrew up-down
movementI
P
C
O
R
T
Expertise
Controlledstabilizer
movement
Aircraft design
I
P
C
O
R
T
Aircraft design knowledge
Aircraft pitch control
I
P
C
O
R
T
Limiting stabilizer
movementI
P
C
O
R
T
Limitedstabilizer
movement
Aircraft
Lubrication
End-play checking
I
P
C
O
R
T
Allowableend-play
Jackscrew replacementI
P
C
O
R
T
Excessiveend-play
High workload
Equipment Expertise
Interval approvals
Redundantdesign
Procedures
Procedures
Systems at risk are intractable rather than tractable.
The established assumptions therefore have to be revised
CertificationI
P
C
O
R
TFAA
LubricationI
P
C
O
R
T
Mechanics
High workload
Grease
Maintenance oversight
I
P
C
O
R
T
Interval approvals
Horizontal stabilizer
movementI
P
C
O
R
TJackscrew up-down
movementI
P
C
O
R
T
Expertise
Controlledstabilizer
movement
Aircraft design
I
P
C
O
R
T
Aircraft design knowledge
Aircraft pitch control
I
P
C
O
R
T
Limiting stabilizer
movementI
P
C
O
R
T
Limitedstabilizer
movement
Aircraft
Lubrication
End-play checking
I
P
C
O
R
T
Allowableend-play
Jackscrew replacementI
P
C
O
R
T
Excessiveend-play
High workload
Equipment Expertise
Interval approvals
Redundantdesign
Procedures
Procedures
Unexpected combinations (resonance) of variability of
normal performance.
If accidents happen like
this ...
... then risks can be found
like this ...
Unexpected combinations (resonance) of variability of
normal performance.
© Erik Hollnagel, 2010
Dams Power grids
NPPs
Space missions
Nano-tech
Chemical plants
Marine transport
Military adventures
Assembly lines
Post offices
UncertaintyLow (tractable)
High (intractable)
Mining
Air traffic control
Universities
Public servicesManufacturing
Railways
Healthcare
Coupling
Tig
htLo
ose
Emergency department
Integrated operations
R&D companies
Hazard
Hig
hLo
w
Methods and realityFinancial markets
LSCITS
Simple linear
Focus on technology
FMECA
Fault tree
HAZOP
FMEA
© Erik Hollnagel, 2010
Dams Power grids
NPPs
Space missions
Nano-tech
Chemical plants
Marine transport
Military adventures
Assembly lines
Post offices
UncertaintyLow (tractable)
High (intractable)
Mining
Air traffic control
Universities
Public servicesManufacturing
Railways
Healthcare
Coupling
Tig
htLo
ose
Emergency department
Integrated operations
R&D companies
Hazard
Hig
hLo
w
Methods and realityFinancial markets
LSCITS
Simple linear
Complex linear
Focus on technology
Focus on Human Factors
FMECA
Fault tree
HAZOP
FMEAHERA
AEB
RCA
ATHEANA
HEATHCR
HPES
THERPCSNI
TRACEr
© Erik Hollnagel, 2010
Dams Power grids
NPPs
Space missions
Nano-tech
Chemical plants
Marine transport
Military adventures
Assembly lines
Post offices
UncertaintyLow (tractable)
High (intractable)
Mining
Air traffic control
Universities
Public servicesManufacturing
Railways
Healthcare
Coupling
Tig
htLo
ose
Emergency department
Integrated operations
R&D companies
Hazard
Hig
hLo
w
Hazard and uncertaintyFinancial markets
LSCITS
Simple linear
Complex linear
Complex linear
Focus on technology
Focus on Human Factors
Focus on organisations
Non-linear
Focus on normal functions
FMECA
Fault tree
HAZOP
FMEAHERA
AEB
RCA
ATHEANA
HEATHCR
HPES
THERPCSNI
TRACEr
STEP
MORT
TRIPOD
MTO
Swiss cheese
AcciMap
CREAM STAMP
MERMOS
FRAMRE
© Erik Hollnagel, 2010
All outcomes (positive and negative) are due to
performance variability..
From the negative to the positive
Negative outcomes are caused by failures and
malfunctions.
Safety = Reduced number of adverse
events.
Eliminate failures and malfunctions as far
as possible.
Safety = Ability to respond when
something fails.
Improve ability to respond to adverse
events.
Safety = Ability to succeed under varying
conditions.
Improve resilience.
© Erik Hollnagel, 2010
What is safety?
Particularistic (product view)Risk should be As Low As Reasonably Practicable
(ALARP)
The ability to succeed under varying conditions (respond, monitor, anticipate, learn)
The reduction of unnecessary harm to an acceptable
minimum
Systemic (process view)Safety should be As High As
Reasonably Practicable (AHARP)
© Erik Hollnagel, 2010
Conclusions so far ...
Complex socio-technical systems can only function if performance is adjusted to conditions (ETTO)
Performance variability is the reason why things go right, but also the reason why things sometimes go wrong.
We need to understand how things go right before we can understand how they go wrong.
Resilience engineering is about how we can ensure that systems remain productive and safe in expected and unexpected conditions alike.