The Role of Complexity in System Safety and How to Manage It


Post on 05-Jan-2016

Page 1: The Role of Complexity in System Safety and  How to Manage It

The Role of Complexity in System Safety and How to Manage It

Nancy Leveson

Page 2: The Role of Complexity in System Safety and  How to Manage It

– You’ve carefully thought out all the angles

– You’ve done it a thousand times

– It comes naturally to you

– You know what you’re doing, it’s what you’ve been trained to do your whole life.

– Nothing could possibly go wrong, right?

Page 3: The Role of Complexity in System Safety and  How to Manage It
Page 4: The Role of Complexity in System Safety and  How to Manage It

What is the Problem?

• Traditional safety engineering approaches developed for relatively simple electro-mechanical systems

• New technology (especially software) is allowing almost unlimited complexity in the systems we are building

• Complexity is creating new causes of accidents

• Should build simplest systems possible, but usually unwilling to make the compromises necessary

1. Complexity related to the problem itself

2. Complexity introduced in the design of the solution to the problem

• Need new, more powerful safety engineering approaches to dealing with complexity and new causes of accidents

Page 5: The Role of Complexity in System Safety and  How to Manage It

What is Complexity?

• Complexity is subjective

– Not in system, but in minds of observers or users

– What is complex to one person or at one point in time may not be to another

• Relative

• Changes with time

• Many aspects of complexity: Will focus on aspects most relevant to safety

Page 6: The Role of Complexity in System Safety and  How to Manage It

Relation of Complexity to Safety

• In complex systems, behavior cannot be thoroughly

– Planned

– Understood

– Anticipated

– Guarded against

• Critical factor is intellectual manageability

• Leads to “unknowns” in system behavior

• Need tools to

– Stretch our intellectual limits

– Deal with new causes of accidents

Page 7: The Role of Complexity in System Safety and  How to Manage It

Types of Complexity Relevant to Safety

• Interactive Complexity: arises in interactions among system components

• Non-linear complexity: cause and effect not related in an obvious way

• Dynamic complexity: related to changes over time

• Decompositional complexity: related to how we decompose or modularize our systems

• Others ??

Page 8: The Role of Complexity in System Safety and  How to Manage It

Interactive Complexity

• Level of interactions has reached the point where they can no longer be thoroughly anticipated or tested

• Coupling causes interdependence

– Increases number of interfaces and potential interactions

– Software allows us to build highly coupled and interactively complex systems

• How does this affect safety engineering?

– Component failure vs. component interaction accidents

– Reliability vs. safety
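
As a rough illustration of why interactions outgrow what can be anticipated or tested, the sketch below counts potential interactions among n components. It assumes, purely for illustration, that any pair of components may interact and that any subset of two or more may interact jointly. The function names are not from the slides.

```python
from math import comb

def pairwise_interactions(n: int) -> int:
    """Number of distinct component pairs among n components."""
    return comb(n, 2)

def multiway_interactions(n: int) -> int:
    """Number of component subsets of size two or more."""
    return 2 ** n - n - 1

# Pairwise counts grow quadratically; multi-way counts grow exponentially.
for n in (5, 10, 20, 40):
    print(n, pairwise_interactions(n), multiway_interactions(n))
```

Under these assumptions, each added component adds as many new pairs as there are existing components and roughly doubles the number of multi-way combinations.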

Page 9: The Role of Complexity in System Safety and  How to Manage It

Accident with No Component Failures

Page 10: The Role of Complexity in System Safety and  How to Manage It

Software-Related Accidents

• Are usually caused by flawed requirements

– Incomplete or wrong assumptions about operation of controlled system or required operation of computer

– Unhandled controlled-system states and environmental conditions

• Merely trying to get the software “correct” or to make it reliable will not make it safer under these conditions.

Page 11: The Role of Complexity in System Safety and  How to Manage It

Types of Accidents

• Component Failure Accidents

– Single or multiple component failures

– Usually assume random failure

• Component Interaction Accidents

– Arise in interactions among components

– Related to interactive complexity and tight coupling

– Exacerbated by introduction of computers and software

Page 12: The Role of Complexity in System Safety and  How to Manage It

Safety ≠ Reliability

• Safety and reliability are NOT the same

– Sometimes increasing one can even decrease the other.

– Making all the components highly reliable will not prevent component interaction accidents.

• For relatively simple, electro-mechanical systems with primarily component failure accidents, reliability engineering can increase safety.

• But this is untrue for complex, software-intensive socio-technical systems

• Our current safety engineering techniques assume accidents are caused by component failures

Page 13: The Role of Complexity in System Safety and  How to Manage It

(From Rasmussen)

Page 14: The Role of Complexity in System Safety and  How to Manage It

Accident Causality Models

• Underlie all our efforts to engineer for safety

• Explain why accidents occur

• Determine the way we prevent and investigate accidents

• May not be aware you are using one, but you are

• Imposes patterns on accidents

“All models are wrong, some models are useful”

George Box

Page 15: The Role of Complexity in System Safety and  How to Manage It

Chain-of-Events Model

• Explains accidents in terms of multiple events, sequenced as a forward chain over time.

– Simple, direct relationship between events in chain

• Events almost always involve component failure, human error, or energy-related event

• Forms the basis for most safety-engineering and reliability engineering analysis:

e.g., FTA, PRA, FMECA, event trees, etc.

and design:

e.g., redundancy, overdesign, safety margins, …
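
To make the chain-of-events view concrete, here is a minimal sketch of how a fault tree's top-event probability is computed from basic events. It assumes the basic events are independent. The tree, the event names, and the probabilities are invented for illustration.

```python
from math import prod

def and_gate(probs):
    """All inputs must occur: product of probabilities (independence assumed)."""
    return prod(probs)

def or_gate(probs):
    """At least one input occurs: 1 - P(none occur) (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical tree: top event = (pump fails AND backup pump fails) OR valve sticks
p_pump, p_backup, p_valve = 1e-3, 1e-2, 1e-4
p_top = or_gate([and_gate([p_pump, p_backup]), p_valve])
print(f"P(top event) = {p_top:.3e}")
```

Every input to this calculation is a failure event, so it has no way to represent an accident in which each component behaves as specified.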

Page 16: The Role of Complexity in System Safety and  How to Manage It

Reason’s Swiss Cheese Model

Page 17: The Role of Complexity in System Safety and  How to Manage It
Page 18: The Role of Complexity in System Safety and  How to Manage It

Swiss Cheese Model Limitations

• Focus on “barriers” (from the process industry approach to safety) and omit other ways to design for safety

• Ignores common cause failures of barriers (systemic accident factors)

• Does not include migration to states of high risk: “Mickey Mouse Model”

• Assumes randomness in “lining up holes”

• Assumes some (linear) causality or precedence in the cheese slices

• Human error better modeled as a feedback loop than a “failure” in a chain of events
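
The effect of ignoring common-cause factors can be examined with a small calculation. The sketch below compares the probability that every barrier fails when the barriers fail independently with the probability when a shared systemic factor occasionally degrades all of them at once. The number of barriers, the probabilities, and the two-state "systemic factor" are illustrative assumptions, not figures from the talk.

```python
def p_all_fail_independent(p_fail: float, n_barriers: int) -> float:
    """Probability that all barriers fail if each fails independently."""
    return p_fail ** n_barriers

def p_all_fail_common_cause(p_fail_normal: float, p_fail_degraded: float,
                            p_degraded: float, n_barriers: int) -> float:
    """Mixture model: with probability p_degraded a systemic factor raises
    every barrier's failure probability at the same time."""
    return ((1 - p_degraded) * p_fail_normal ** n_barriers
            + p_degraded * p_fail_degraded ** n_barriers)

independent = p_all_fail_independent(0.01, 4)
common = p_all_fail_common_cause(0.01, 0.5, 0.02, 4)
print(independent, common, common / independent)
```

With these made-up numbers the common-cause case is several orders of magnitude more likely than the independent case, even though the systemic factor is present only 2% of the time.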

Page 19: The Role of Complexity in System Safety and  How to Manage It

Non-Linear Complexity

• Definition: Cause and effect not related in an obvious way

• Systemic factors in accidents, e.g., safety culture

– Our accident models assume linearity (chain of events, Swiss cheese)

– Systemic factors affect events in non-linear ways

• John Stuart Mill (1806-1873): “Cause” is a set of necessary and sufficient conditions

– What about factors (conditions) that are not necessary or sufficient?

e.g., Smoking “causes” lung cancer

– Contrapositive: if A → B, then ~B → ~A

Page 20: The Role of Complexity in System Safety and  How to Manage It

Implications of Non-Linear Complexity for Operator Error

• Role of operators in our systems is changing

– Supervising rather than directly controlling

– Not simply following procedures

– Non-linear complexity makes it harder for operators to make real-time decisions

• Operator errors are not random failures

– All behavior is affected by the context (system) in which it occurs

– Human error a symptom, not a cause

– Human error better modeled as feedback loops

Page 21: The Role of Complexity in System Safety and  How to Manage It

Dynamic Complexity

• Related to changes over time

• Systems are not static, but we assume they are

• Systems migrate toward states of high risk under competitive and financial pressures [Rasmussen]

• Want flexibility but need to design ways to

– Prevent or control dangerous changes

– Detect when they occur during operations
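
One simple way to picture detecting such changes during operations is to watch an indicator for gradual erosion rather than for a single threshold crossing. The sketch below is a minimal illustration only: the "safety margin" series, the window size, and the tolerance are hypothetical, and real leading indicators would be system-specific.

```python
from statistics import mean

def drifting(values, window=3, tolerance=0.0):
    """True if the mean of the last `window` values has fallen below the
    mean of the first `window` values by more than `tolerance`."""
    if len(values) < 2 * window:
        return False  # not enough history to compare
    return mean(values[:window]) - mean(values[-window:]) > tolerance

# Hypothetical safety-margin readings that erode slowly over time.
margin = [10.0, 9.8, 10.1, 9.5, 9.0, 8.4, 7.9, 7.1]
print(drifting(margin, window=3, tolerance=1.0))
```

No single step in the series looks alarming, which is why the comparison is made against the earlier baseline rather than against the previous reading.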

Page 22: The Role of Complexity in System Safety and  How to Manage It

Decompositional Complexity

• Definition: Structural decomposition not consistent with functional decomposition

• Harder for humans to understand and find functional design errors

• For safety, makes it difficult to determine whether system will be safe

– Safety is related to functional behavior of system and its components

– Not a function of the system structure

• No effective way to verify safety of object-oriented system designs

Page 23: The Role of Complexity in System Safety and  How to Manage It

Human Error, Safety, and Complexity

• Role of operators in our systems is changing

– Supervising rather than directly controlling

– Complexity is stretching limits of comprehensibility

– Designing systems in which operator error is inevitable and then blaming accidents on operators rather than designers

• Designers are unable to anticipate and prevent accidents

• Greatest need in safety engineering is to

– Limit complexity in our systems

– Practice restraint in requirements definition

– Do not add extra complexity in design

– Provide tools to stretch our intellectual limits

Page 24: The Role of Complexity in System Safety and  How to Manage It

It’s still hungry … and I’ve been stuffing worms into it all day.

Page 25: The Role of Complexity in System Safety and  How to Manage It

So What Do We Need to Do?

“Engineering a Safer World”

• Expand our accident causation models

• Create new hazard analysis techniques

• Use new system design techniques

– Safety-driven design

– Integrate safety analysis into system engineering

• Improve accident analysis and learning from events

• Improve control of safety during operations

• Improve management decision-making and safety culture

Page 26: The Role of Complexity in System Safety and  How to Manage It

STAMP (System-Theoretic Accident Model and Processes)

• A new, more powerful accident causation model

• Based on systems theory, not reliability theory

• Treats accidents as a control problem (vs. a failure problem)

“prevent failures” ↓

“enforce safety constraints on system behavior”

Page 27: The Role of Complexity in System Safety and  How to Manage It

STAMP (2)

• Safety is an emergent property that arises when system components interact with each other within a larger environment

– A set of constraints related to behavior of system components (physical, human, social) enforces that property

– Accidents occur when interactions violate those constraints (a lack of appropriate constraints on the interactions)

• Accidents are not simply an event or chain of events but involve a complex, dynamic process

• Most major accidents arise from a slow migration of the entire system toward a state of high risk

– Need to control and detect this migration

Page 28: The Role of Complexity in System Safety and  How to Manage It

STAMP (3)

• Treats safety as a dynamic control problem rather than a component failure problem.

– O-ring did not control propellant gas release by sealing gap in field joint of Challenger Space Shuttle

– Software did not adequately control descent speed of Mars Polar Lander

– Temperature in batch reactor not adequately controlled in system design

– Public health system did not adequately control contamination of the milk supply with melamine

– Financial system did not adequately control the use of financial instruments

Page 29: The Role of Complexity in System Safety and  How to Manage It

Example Safety Control Structure

Page 30: The Role of Complexity in System Safety and  How to Manage It

Safety Control in Physical Process

Page 31: The Role of Complexity in System Safety and  How to Manage It

Safety Constraints

• Each component in the control structure has

– Assigned responsibilities, authority, accountability

– Controls that can be used to enforce safety constraints

• Each component’s behavior is influenced by

– Context (environment) in which it operates

– Knowledge about current state of process

Page 32: The Role of Complexity in System Safety and  How to Manage It

Accidents occur when model of process is inconsistent with real state of process and controller provides inadequate control actions

[Figure: control loop in which a Controller, holding a Model of Process, sends Control Actions to the Controlled Process and receives Feedback from it]

Control processes operate between levels of control

Feedback channels are critical

– Design

– Operation

Page 33: The Role of Complexity in System Safety and  How to Manage It

Relationship Between Safety and Process Models (2)

• Accidents occur when models do not match process and

– Required control commands are not given

– Incorrect (unsafe) ones are given

– Correct commands given at wrong time (too early, too late)

– Control stops too soon

Explains software errors, human errors, component interaction accidents …
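
A minimal sketch of the idea on these two slides: a controller chooses its actions from its model of the process, not from the process itself, so a lost feedback channel can lead to inadequate control with no component failing. The tank, its limit, and all class and function names are hypothetical.

```python
class Tank:
    """Controlled process: level rises by 1 per step while the inlet valve is open."""
    def __init__(self):
        self.level = 0
        self.valve_open = False

    def step(self):
        if self.valve_open:
            self.level += 1

class Controller:
    """Decides from its process model (believed level), updated only by feedback."""
    def __init__(self, limit):
        self.limit = limit
        self.believed_level = 0

    def receive_feedback(self, measured_level):
        self.believed_level = measured_level

    def control_action(self):
        # Keep filling while the *model* says there is room.
        return self.believed_level < self.limit

def run(steps, feedback_lost_after, limit=5):
    """Simulate the loop; feedback reaches the controller only before the given step."""
    tank, ctrl = Tank(), Controller(limit)
    for t in range(steps):
        tank.valve_open = ctrl.control_action()
        tank.step()
        if t < feedback_lost_after:
            ctrl.receive_feedback(tank.level)
    return tank.level, ctrl.believed_level

print(run(steps=10, feedback_lost_after=10))  # feedback intact
print(run(steps=10, feedback_lost_after=3))   # feedback lost: model goes stale
```

In the second run the valve, the tank, and the control logic all behave exactly as written. The overfill comes from the process model no longer matching the process, so the required "close valve" command is never given.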

Page 34: The Role of Complexity in System Safety and  How to Manage It

Accident Causality Using STAMP

Page 35: The Role of Complexity in System Safety and  How to Manage It

Uses for STAMP

• More comprehensive accident/incident investigation and root cause analysis

• Basis for new, more powerful hazard analysis techniques (STPA)

• Supports safety-driven design (physical, operational, organizational)

– Can integrate safety into the system engineering process

– Assists in design of human-system interaction and interfaces

Page 36: The Role of Complexity in System Safety and  How to Manage It

Uses for STAMP (2)

• Organizational and cultural risk analysis

– Identifying physical and project risks

– Defining safety metrics and performance audits

– Designing and evaluating potential policy and structural improvements

– Identifying leading indicators of increasing risk (“canary in the coal mine”)

• Improve operations and management control of safety

Page 37: The Role of Complexity in System Safety and  How to Manage It

STPA (System-Theoretic Process Analysis)

• Identifies safety constraints (system and component safety requirements)

• Identifies scenarios leading to violation of safety constraints

– Includes scenarios (cut sets) found by Fault Tree Analysis

– Finds additional scenarios not found by FTA and other failure-oriented analyses

• Can be used on technical design and organizational design

• Evaluated and compared to traditional HA methods

– Found many more potential safety problems
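
One way to picture the scenario-identification step is as a worksheet that crosses each control action with the four kinds of inadequate control listed on the process-model slide. The sketch below only generates the empty worksheet rows for an analyst to assess. The control actions and the row wording are hypothetical, and this is an illustration rather than the STPA procedure itself.

```python
from itertools import product

# The four kinds of inadequate control named on the process-model slide.
GUIDE_PHRASES = (
    "required command not given",
    "incorrect (unsafe) command given",
    "correct command given at wrong time (too early, too late)",
    "control stops too soon",
)

def worksheet(control_actions):
    """Return one (control action, guide phrase) row per combination for review."""
    return list(product(control_actions, GUIDE_PHRASES))

# Hypothetical control actions for a small process-control example.
rows = worksheet(["open relief valve", "close inlet valve"])
for action, phrase in rows:
    print(f"{action}: {phrase} -> hazardous in which contexts?")
```

Each row is a question rather than a failure event, which is how such an analysis can surface scenarios that involve no component failure.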

Page 38: The Role of Complexity in System Safety and  How to Manage It

5 Missing or wrong communication with another controller

Page 39: The Role of Complexity in System Safety and  How to Manage It

Does it work? Is it practical?

Technical

• Safety analysis of new missile defense system (MDA)

• Safety-driven design of new JPL outer planets explorer

• Safety analysis of the JAXA HTV (unmanned cargo spacecraft to ISS)

• Incorporating risk into early trade studies (NASA Constellation)

• Orion (Space Shuttle replacement)

• NextGen (planned changes to air traffic control)

• Accident/incident analysis (aircraft, petrochemical plants, air traffic control, railroad, UAVs …)

• Proton Therapy Machine (medical device)

• Adaptive cruise control (automobiles)

Page 40: The Role of Complexity in System Safety and  How to Manage It

Does it work? Is it practical?

Social and Managerial

• Analysis of the management structure of the space shuttle program (post-Columbia)

• Risk management in the development of NASA’s new manned space program (Constellation)

• NASA Mission Control: re-planning and changing mission control procedures safely

• Food safety

• Safety in pharmaceutical drug development

• Risk analysis of outpatient GI surgery at Beth Israel Deaconess Hospital

• UAVs in civilian airspace

• Analysis and prevention of corporate fraud

Page 41: The Role of Complexity in System Safety and  How to Manage It

Integrating Safety into System Engineering

• Hazard analysis must be integrated into design and decision-making environment. Needs to be available when decisions are made.

• Lots of implications for specifications:

– Relevant information must be easy to find

– Design rationale must be specified

– Must be able to trace from high-level requirements to system design to component requirements to component design and vice versa.

– Must include specification of what NOT to do

– Must be easy to review and find errors
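
The traceability requirement can be pictured as a set of links that can be followed in both directions. The following is a minimal sketch, with hypothetical item identifiers and a simple in-memory link store standing in for a real specification tool.

```python
from collections import defaultdict

class TraceLinks:
    """Bidirectional trace links between specification items."""
    def __init__(self):
        self._down = defaultdict(set)  # item -> items that refine it
        self._up = defaultdict(set)    # item -> items it was derived from

    def link(self, parent, child):
        self._down[parent].add(child)
        self._up[child].add(parent)

    def _closure(self, start, edges):
        # Follow links transitively from `start`, collecting every reachable item.
        seen, stack = set(), [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def downstream(self, item):
        return self._closure(item, self._down)

    def upstream(self, item):
        return self._closure(item, self._up)

trace = TraceLinks()
trace.link("REQ-1", "SYS-DESIGN-A")
trace.link("SYS-DESIGN-A", "COMP-REQ-A1")
trace.link("COMP-REQ-A1", "COMP-DESIGN-A1")
print(sorted(trace.downstream("REQ-1")))
print(sorted(trace.upstream("COMP-DESIGN-A1")))
```

Storing both directions up front means a reviewer can ask "what implements this requirement?" and "why does this design element exist?" with equal ease.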

Page 42: The Role of Complexity in System Safety and  How to Manage It

Intent Specifications

• Based on systems theory principles

• Designed to support

– System Engineering (including maintenance and evolution)

– Human problem solving

– Management of complexity (adds intent abstraction to standard refinement and decomposition)

– Model-Based development

– Specification principles from preceding slide

Leveson, Intent Specifications: An Approach to Building Human-Centered Specifications, IEEE Trans. on Software Engineering, Jan. 2000

Page 43: The Role of Complexity in System Safety and  How to Manage It
Page 44: The Role of Complexity in System Safety and  How to Manage It

Level 3 Modeling Language: SpecTRM-RL

• Combined requirements specification and modeling language. Supports model-based development.

• A state machine with a domain-specific notation on top of it

– Reviewers can learn to read it in 10 minutes

– Executable

– Formally analyzable

– Automated tools for creation and analysis (e.g., incompleteness, inconsistency, simulation)

– Black-box requirements only (no component design)
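
To illustrate what "formally analyzable" can mean for a black-box state-machine model, the sketch below checks a transition table for incompleteness (a state/input pair with no specified response) and inconsistency (a pair with conflicting responses). It uses plain Python, not SpecTRM-RL notation, and the states, inputs, and function name are made up for the example.

```python
from collections import defaultdict
from itertools import product

def analyze(states, inputs, transitions):
    """transitions: iterable of (state, input, next_state) triples.
    Returns (unspecified pairs, pairs with conflicting next states)."""
    targets = defaultdict(set)
    for state, inp, nxt in transitions:
        targets[(state, inp)].add(nxt)
    missing = [pair for pair in product(states, inputs) if pair not in targets]
    conflicting = sorted(pair for pair, nxts in targets.items() if len(nxts) > 1)
    return missing, conflicting

# Hypothetical heater controller requirements.
states = ["idle", "heating"]
inputs = ["start", "stop", "overtemp"]
transitions = [
    ("idle", "start", "heating"),
    ("idle", "stop", "idle"),
    ("heating", "stop", "idle"),
    ("heating", "overtemp", "idle"),
    ("heating", "overtemp", "heating"),  # conflicting response
]
missing, conflicting = analyze(states, inputs, transitions)
print("unspecified:", missing)
print("conflicting:", conflicting)
```

Both checks operate on the requirements alone, before any component design exists.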

Page 45: The Role of Complexity in System Safety and  How to Manage It

SpecTRM-RL

• Black-box requirements only (no component design)

• Separates design from requirements

– Specify only black box, transfer function across component

– Reduces complexity by omitting information not needed at requirements evaluation time

• Separation of concerns is an important way for humans to deal with complexity

– Almost all software-related accidents caused by incomplete or inadequate requirements (not software design errors)

Page 46: The Role of Complexity in System Safety and  How to Manage It
Page 47: The Role of Complexity in System Safety and  How to Manage It
Page 48: The Role of Complexity in System Safety and  How to Manage It

Conclusions

• Traditional safety engineering techniques do not adequately handle complexity

– Interactive, non-linear, dynamic, and design (especially decompositional)

• Need to take a system engineering view of safety rather than the current component reliability view when building complex systems

– Include entire socio-technical system including safety culture and organizational structure

– Support top-down and safety-driven design

– Support specification and human review of requirements

Page 49: The Role of Complexity in System Safety and  How to Manage It

Conclusions

• Need a more realistic handling of human errors and human decision-making

• Need to include behavioral dynamics and changes over time

– Consider processes behind events and not just events

– Understand why controls drift into ineffectiveness over time and manage this drift

Page 50: The Role of Complexity in System Safety and  How to Manage It

Nancy Leveson

“Engineering a Safer World”

(Systems Thinking Applied to Safety)

MIT Press, December 2011

Available for free download from:

http://sunnyday.mit.edu/safer-world