sweden day1

Click here to load reader

Post on 19-Nov-2015

17 views

Category:

Documents

1 download

Embed Size (px)

DESCRIPTION

A New Approach toSystem Safety Engineering

TRANSCRIPT

  • A New Approach to

    System Safety Engineering

    Nancy G. Leveson, MIT

    System Safety Engineering: Back to the Future

    http://sunnyday.mit.edu/book2.html

    Copyright Nancy Leveson, Aug. 2006

  • Outline of Day 1

    Why a new approach is needed

    STAMP A new accident model based on system theory

    (control)

    Uses

    Accident and incident investigation

    Hazard Analysis (STPA) and Design for Safety

    Non-Advocate Safety Assessment

    Cultural and organizational risk analysis

    Copyright Nancy Leveson, Aug. 2006

  • Why need a new approach

    Traditional approaches developed for relatively

    simple electro-mechanical systems

    Accidents in complex, software-intensive systems

    are changing their nature

    We need more effective techniques in these new

    systems

    Copyright Nancy Leveson, Aug. 2006

  • Chain-of-Events Model

    Explains accidents in terms of multiple events, sequenced

    as a forward chain over time.

    Simple, direct relationship between events in chain

    Events almost always involve component failure, human

    error, or energy-related event

    Forms the basis for most safety-engineering and reliability

    engineering analysis:

    e,g, FTA, PRA, FMECA, Event Trees, etc.

    and design:

    e.g., redundancy, overdesign, safety margins, .

    Copyright Nancy Leveson, Aug. 2006

  • Chain-of-events example

    Copyright Nancy Leveson, Aug. 2006

  • Accident with No Component Failures

    Copyright Nancy Leveson, Aug. 2006

  • Types of Accidents

    Component Failure Accidents

    Single or multiple component failures

    Usually assume random failure

    System Accidents

    Arise in interactions among components

    Related to interactive complexity and tight coupling

    Exacerbated by introduction of computers and

    software

    New technology introduces unknowns and unk-unks Copyright Nancy Leveson, Aug. 2006

  • Interactive Complexity

    Critical factor is intellectual manageability

    A simple system has a small number of unknowns in

    its interactions (within system and with environment)

    Interactively complex (intellectually unmanageable)

    when level of interactions reaches point where can no

    longer be thoroughly

    Planned

    Understood

    Anticipated

    Guarded against

    Copyright Nancy Leveson, Aug. 2006

  • Tight Coupling

    Tightly coupled system is one that is highly

    interdependent

    Each part linked to many other parts

    Failure or unplanned behavior in one can rapidly affect status

    of others

    Processes are time-dependent and cannot wait

    Little slack in system

    Sequences are invariant, only one way to reach a goal

    System accidents are caused by unplanned and

    dysfunctional interactions

    Coupling increases number of interfaces and potential

    interactions Copyright Nancy Leveson, Aug. 2006

  • Other Types of Complexity

    Non-linear complexity

    Cause and effect not related in an obvious way

    Dynamic Complexity

    Related to changes over time

    Decompositional

    Structural decomposition not consistent with

    functional decomposition

    Copyright Nancy Leveson, Aug. 2006

  • Limitations of Chain-of-Events Model

    Social and organizational factors in accidents

    System accidents

    Software

    Adaptation

    Systems are continually changing

    Systems and organizations migrate toward accidents

    (states of high risk) under cost and productivity

    pressures in an aggressive, competitive environment

    Copyright Nancy Leveson, Aug. 2006

  • Limitations (2)

    Human error

    Define as deviation from normative procedures, but

    operators always deviate from standard procedures

    Normative vs. effective procedures

    Sometimes violation of rules has prevented accidents

    Cannot effectively model human behavior by

    decomposing it into individual decisions and acts and

    studying it in isolation from

    Physical and social context

    Value system in which takes place

    Dynamic work process

    Less successful actions are natural part of search by

    operator for optimal performance Copyright Nancy Leveson, Aug. 2006

  • Mental Models

    Copyright Nancy Leveson, Aug. 2006

  • Exxon Valdez

    Shortly after midnight, March 24, 1989, tanker Exxon Valdezran aground on Bligh Reef (Alaska)

    11 million gallons of crude oil released

    Over 1500 miles of shoreline polluted

    Exxon and government put responsibility on tanker CaptainHazelwood, who was disciplined and fired

    Was he to blame?

    State-of-the-art iceberg monitoring equipment promised by oilindustry, but never installed. Exxon Valdez traveling outside normalsea lane in order to avoid icebergs thought to be in area

    Radar station in city of Valdez, which was responsible for monitoringthe location of tanker traffic in Prince William Sound, had replaced itsradar with much less powerful equipment. Location of tankers nearBligh reef could not be monitored with this equipment.

    Copyright Nancy Leveson, Aug. 2006

  • Congressional approval of Alaska oil pipeline and tanker

    transport network included an agreement by oil corporations to

    build and use double-hulled tankers. Exxon Valdez did not

    have a double hull.

    Crew fatigue was typical on tankers

    In 1977, average oil tanker operating out of Valdez had a crew of 40

    people. By 1989, crew size had been cut in half.

    Crews routinely worked 12-14 hour shifts, plus extensive overtime

    Exxon Valdez had arrived in port at 11 pm the night before. The crew

    rushed to get the tanker loaded for departure the next evening

    Coast Guard at Valdez assigned to conduct safety inspections

    of tankers. It did not perform these inspections. Its staff had

    been cut by one-third.

    Copyright Nancy Leveson, Aug. 2006

  • Tanker crews relied on the Coast Guard to plot their positioncontinually.

    Coast Guard operating manual required this.

    Practice of tracking ships all the way out to Bligh reef hadbeen discontinued.

    Tanker crews were never informed of the change.

    Spill response teams and equipment were not readilyavailable. Seriously impaired attempts to contain and recoverthe spilled oil.

    Summary:

    Safeguards designed to avoid and mitigate effects of an oilspill were not in place or were not operational

    By focusing exclusively on blame, the opportunity to learnfrom mistakes is lost.

    Postscript:

    Captain Hazelwood was tried for being drunk the night theExxon Valdez went aground. He was found not guilty

    Copyright Nancy Leveson, Aug. 2006

  • Hierarchical models

    Copyright Nancy Leveson, Aug. 2006

  • Hierarchical analysis example

    Copyright Nancy Leveson, Aug. 2006

  • The Role of Software in Accidents

  • The Computer Revolution

    Software is simply the design of a machineabstracted from its physical realization

    Machines that were physically impossible orimpractical to build become feasible

    Design can be changed without retooling ormanufacturing

    Can concentrate on steps to be achieved withoutworrying about how steps will be realized physically

    + =General

    Purpose

    Machine

    SoftwareSpecial

    Purpose

    Machine

    Copyright Nancy Leveson, Aug. 2006

  • Advantages = Disadvantages

    Computer so powerful and useful because has eliminated

    many of physical constraints of previous technology

    Both its blessing and its curse

    No longer have to worry about physical realization of

    our designs

    But no longer have physical laws that limit the

    complexity of our designs.

    Copyright Nancy Leveson, Aug. 2006

  • The Curse of Flexibility

    Software is the resting place of afterthoughts

    No physical constraints

    To enforce discipline in design, construction, and

    modification

    To control complexity

    So flexible that start working with it before fully

    understanding what need to do

    And they looked upon the software and saw that it was

    good, but they just had to add one other feature

    Copyright Nancy Leveson, Aug. 2006

  • Abstraction from Physical Design

    Software engineers are doing physical design

    Most operational software errors related to requirements(particularly incompleteness)

    Software failure modes are different

    Usually does exactly what you tell it to do

    Problems occur from operation, not lack of operation

    Usually doing exactly what software engineers wanted

    Autopilot

    ExpertRequirements Software

    Engineer

    Design

    of

    Autopilot

    Copyright Nancy Leveson, Aug. 2006

  • Safety vs. Reliability

    Safety and reliability are NOT the same

    Sometimes increasing one can even decrease theother.

    Making all the components highly reliable will have noimpact on system accidents.

    For relatively simple, electro-mechanical systemswith primarily component failure accidents, reliabilityengineering can increase safety.

    But accidents in high-tech systems are changingtheir nature, and we must change our approaches tosafety accordingly.

    Copyright Nancy Leveson, Aug. 2006

  • Its only a random

    failure, sir! It will

    never happen again.

  • Reliability Engineering Approach to Safety

    Reliability: The probability an item will perform its requiredfunction in the specified manner over a given time period andunder specified or assumed conditions.

    (Note: Most accidents result from errors in specified requirements or functions and deviations from assumed conditions)

    Concerned primarily with failures and failure rate reduction:

    Redundancy

    Safety factors and margins

    Derating

    Screening

    Timed replacements

    Copyright Nancy Leveson, Aug. 2006

  • Reliability Engineering Approach to Safety

    Assumes accidents are caused by component failure

    Positive:

    Techniques exist to increase component reliability

    Failure rates in hardware are quantifiable

    Negative:

    Omits important factors in accidents

    May decrease safety

    Many accidents occur without any component failure

    Caused by equipment operation outside parameters and timelimits upon which reliability analyses are based.

    Caused by interactions of components all operating accordingto specification.

    Highly reliable components are not necessarily safe Copyright Nancy Leveson, Aug. 2006

  • Software-Related Accidents

    Are usually caused by flawed requirements

    Incomplete or wrong assumptions about operation of

    controlled system or required operation of computer

    Unhandled controlled-system states and

    environmental conditions

    Merely trying to get the software correct or to make

    it reliable will not make it safer under these

    conditions.

    Copyright Nancy Leveson, Aug. 2006

  • Software-Related Accidents (2)

    Software may be highly reliable and correct and

    still be unsafe:

    Correctly implements requirements but specified

    behavior unsafe from a system perspective.

    Requirements do not specify some particular behavior

    required for system safety (incomplete)

    Software has unintended (and unsafe) behavior

    beyond what is specified in requirements.

    Copyright Nancy Leveson, Aug. 2006

  • MPL Requirements Tracing Flaw

    SYSTEM REQUIREMENTS

    1. The touchdown sensors shall

    be sampled at 100-HZ rate.

    2. The sampling process shall be

    initiated prior to lander entry to

    keep processor demand

    constant.

    3. However, the use of the

    touchdown sensor data shall

    not begin until 12 m above the

    surface.

    SOFTWARE REQUIREMENTS

    1. The lander flight software shall

    cyclically check the state of each of

    the three touchdown sensors (one

    per leg) at 100-HZ during EDL.

    2. The lander flight software shall be

    able to cyclically check the

    touchdown event state with or

    without touchdown event

    generation enabled.

    ????

    Copyright Nancy Leveson, Aug. 2006

  • Reliability Approach to Software Safety

    Using standard engineering techniques of

    Preventing failures through redundancy

    Increasing component reliability

    Reuse of designs and learning from experience

    will not work for software and system accidents

    Copyright Nancy Leveson, Aug. 2006

  • Preventing Failures Through

    Redundancy

    Redundancy simply makes complexity worse

    NASA experimental aircraft example

    Any solutions that involve adding complexity will not solve

    problems that stem from intellectual unmanageability and

    interactive complexity

    Majority of software-related accidents caused by

    requirements errors

    Does not work for software even if accident is

    caused by a software implementation error

    Software errors not caused by random wear-out failures

    Copyright Nancy Leveson, Aug. 2006

  • Increasing Software Reliability (Integrity)

    Appearing in many new international standards forsoftware safety (e.g., 61508)

    Safety integrity level (SIL)

    Sometimes give reliability number (e.g., 10-9)

    Can software reliability be measured? What does it mean?

    What does it have to do with safety?

    Safety involves more than simply getting thesoftware correct:

    Example: altitude switch

    1. Signal safety-increasing

    Require any of three altimeter report below threshhold

    1. Signal safety-decreasing

    Require all three altimeter to report below threshhold

    Copyright Nancy Leveson, Aug. 2006

  • Software Component Reuse

    One of most common factors in software-related

    accidents

    Software contains assumptions about its environment

    Accidents occur when these assumptions are incorrect

    Therac-25

    Ariane 5

    U.K. ATC software

    Mars Climate Orbiter

    Most likely to change the features embedded in or

    controlled by the software

    COTS makes safety analysis more difficult

    Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

    Safety and (component or system) reliability

    are different qualities in complex systems!

    Increasing one will not necessarily increase

    the other.

    So what do we do?

  • A Possible Solution

    Enforce discipline and control complexity

    Limits have changed from structural integrity and

    physical constraints of materials to intellectual limits

    Improve communication among engineers

    Build safety in by enforcing constraints on behavior

    Controller contributes to accidents not by failing but by:

    1. Not enforcing safety-related constraints on behavior

    2. Commanding behavior that violates safety constraints

    Copyright Nancy Leveson, Aug. 2006

  • Example

    (Chemical Reactor)

    System Safety Constraint:

    Water must be flowing into reflux condenser

    whenever catalyst is added to reactor

    Software (Controller) Safety Constraint:

    Software must always open water valve

    before catalyst valve

    Copyright Nancy Leveson, Aug. 2006

  • Conclusion

    The primary safety problem in complex, software-

    intensive systems is the lack of appropriate

    constraints on design

    The job of the system safety engineer is to

    Identify the constraints necessary to maintain safety

    Ensure the system (including software) design

    enforces them

    Copyright Nancy Leveson, Aug. 2006

  • Introduction to System

    Safety Engineering

  • A Non-System Safety Example:

    Nuclear Power (Defense in Depth)

    Multiple independent barriers to propagation of

    malfunction

    Emphasis on component reliability and use of lots of

    redundancy

    Handling single failures (no single failure of any

    components will disable any barrier)

    Protection (safety) systems: automatic system shut-

    down

    Emphasis on reliability and availability of shutdown system

    and physical system barriers (using redundancy)

    Copyright Nancy Leveson, Aug. 2006

  • Why is this effective?

    Relatively slow pace of basic design changes

    Use of well-understood and debugged designs

    Ability to learn from experience

    Conservatism in design

    Slow introduction of new technology

    Limited interactive complexity and coupling

    Copyright Nancy Leveson, Aug. 2006

  • System Safety

    Grew out of ballistic missile systems of 1960s

    Emphasizes building in safety rather than adding it on to a

    completed design

    Looks at systems as a whole, not just components

    A top-down systems approach to accident prevention

    Takes a larger view of accident causes than just component

    failures (includes interactions among components)

    Emphasizes hazard analysis and design to eliminate or

    control hazards

    Emphasizes qualitative rather than quantitative approaches

    Copyright Nancy Leveson, Aug. 2006

  • System Safety Overview

    A planned, disciplined, and systematic approach to

    preventing or reducing accidents throughout the life cycle of

    a system.

    Organized common sense (Mueller, 1968)

    Primary concern is the management of hazards

    Hazard Through

    identification analysis

    evaluation design

    elimination management

    control

    MIL-STD-882

    Copyright Nancy Leveson, Aug. 2006

  • System Safety Overview (2)

    Analysis:

    Hazard analysis and control is a continuous, iterativeprocess throughout system development and use.

    Design: Hazard resolution precedence

    1. Eliminate the hazard

    2. Prevent or minimize the occurrence of the hazard

    3. Control the hazard if it occurs

    4. Minimize damage

    Management:

    Audit trails, communication channels, etc.

    Copyright Nancy Leveson, Aug. 2006

  • System Safety in Software-Intensive

    Systems

    While system safety approach was developed for and

    works for complex, technologically advanced systems, new

    methods are required

    Software particularly stretches traditional methods

    Need new hazard analysis and other approaches for

    complex, software-intensive systems

    Rest of day 1 shows new way to implement the basic

    system safety approach

  • STAMP

    A new accident causation

    model using Systems Theory

    (vs. Reliability Theory)

    Copyright Nancy Leveson, Aug. 2006

  • Introduction to Systems Theory

    Ways to cope with complexity

    1. Analytic Reduction

    2. Statistics

    Copyright Nancy Leveson, Aug. 2006

  • Analytic Reduction

    Divide system into distinct parts for analysis

    Physical aspects Separate physical components

    Behavior Events over time

    Examine parts separately

    Assumes such separation possible:

    1. The division into parts will not distort the

    phenomenon

    Each component or subsystem operates independently

    Analysis results not distorted when consider components

    separately

    Copyright Nancy Leveson, Aug. 2006

  • 2. Components act the same when examined singly as

    when playing their part in the whole

    Components or events not subject to feedback loops and

    non-linear interactions

    3. Principles governing the assembling of components

    into the whole are themselves straightforward

    Interactions among subsystems simple enough that can be

    considered separate from behavior of subsystems themselves

    Precise nature of interactions is known

    Interactions can be examined pairwise

    Called Organized Simplicity

    Analytic Reduction (2)

    Copyright Nancy Leveson, Aug. 2006

  • Statistics

    Treat system as a structureless mass with

    interchangeable parts

    Use Law of Large Numbers to describe behavior in

    terms of averages

    Assumes components are sufficiently regular and

    random in their behavior that they can be studied

    statistically

    Called Unorganized Complexity

    Copyright Nancy Leveson, Aug. 2006

  • Complex, Software-Intensive Systems

    Too complex for complete analysis

    Separation into (interacting) subsystems distorts theresults

    The most important properties are emergent

    Too organized for statistics

    Too much underlying structure that distorts the statistics

    Called Organized Complexity

    Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

  • Systems Theory

    Developed for biology (von Bertalanffly) and engineering

    (Norbert Weiner)

    Basis of system engineering and system safety

    ICBM systems of the 1950s

    Developed to handle systems with organized complexity

    Copyright Nancy Leveson, Aug. 2006

  • Systems Theory (2)

    Focuses on systems taken as a whole, not onparts taken separate

    Some properties can only be treated adequately in theirentirety, taking into account all social and technicalaspects

    These properties derive from relationships among theparts of the system

    How they interact and fit together

    Two pairs of ideas

    1. Hierarchy and emergence

    2. Communication and control

    Copyright Nancy Leveson, Aug. 2006

  • Hierarchy and Emergence

    Complex systems can be modeled as a hierarchy of

    organizational levels

    Each level more complex than one below

    Levels characterized by emergent properties

    Irreducible

    Represent constraints on the degree of freedom of

    components at lower level

    Safety is an emergent system property

    It is NOT a component property

    It can only be analyzed in the context of the whole

    Copyright Nancy Leveson, Aug. 2006

  • Example

    Control

    Structure

  • Communication and Control

    Hierarchies characterized by control processes working at

    the interfaces between levels

    A control action imposes constraints upon the activity at a

    lower level of the hierarchy

    Open systems are viewed as interrelated components kept

    in a state of dynamic equilibrium by feedback loops of

    information and control

    Control in open systems implies need for communication

    Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

    Process models must contain:

    - Required relationship among

    process variables

    - Current state (values of

    process variables

    - The ways the process can

    change state

    Controlled Process

    Model of

    Process

    Control

    ActionsFeedback

    Controller

    Control processes operate between levels

  • Relationship Between Safety and

    Process Models

    Accidents occur when models do not match processand

    Incorrect control commands given

    Correct ones not given

    Correct commands given at wrong time (too early, toolate)

    Control stops too soon

    (Note the relationship to system accidents and tosoftwares role in accidents)

    Copyright Nancy Leveson, Aug. 2006

  • Relationship Between Safety and

    Process Models (2)

    How do they become inconsistent?

    Wrong from beginning

    Missing or incorrect feedback

    Not updated correctly

    Time lags not accounted for

    Resulting in

    Uncontrolled disturbances

    Unhandled process states

    Inadvertently commanding system into a hazardous state

    Unhandled or incorrectly handled system component failures

    Copyright Nancy Leveson, Aug. 2006

  • Relationship Between Safety and

    Human Mental Models

    Explains most human/computer interaction problems

    Pilots and others are not understanding the automation

    Or dont get feedback to update mental models or

    disbelieve it

    What did it just do? Why wont it let us do that?

    Why did it do that? What caused the failure?

    What will it do next? What can we do so it does

    How did it get us into this state? not happen again?

    How do I get it to do what I want?

    Copyright Nancy Leveson, Aug. 2006

  • Mental Models

    Copyright Nancy Leveson, Aug. 2006

  • Relationship Between Safety and

    Human Mental Models (2)

    Also explains developer errors. May have incorrect

    model of

    Required system or software behavior for safety

    Development process

    Physical laws

    Etc.

    Copyright Nancy Leveson, Aug. 2006

  • STAMP

    Systems-Theoretic Accident Model and Processes

    Accidents are not simply an event or chain of events butinvolve a complex, dynamic process

    Based on systems and control theory

    Accidents arise from interactions among humans,machines, and the environment (not just componentfailures)

    prevent failures

    _

    enforce safety constraints on system behavior

    Copyright Nancy Leveson, Aug. 2006

  • STAMP (2)

    View accidents as a control problem

    O-ring did not control propellant gas release by sealing gapin field joint

    Software did not adequately control descent speed of Mars

    Polar Lander

    Events are the result of the inadequate control

    Result from lack of enforcement of safety constraints

    Copyright Nancy Leveson, Aug. 2006

  • A Broad View of Control

    Does not imply need for a controller

    Component failures and dysfunctional interactions maybe controlled through design

    (e.g., redundancy, interlocks, fail-safe design)

    or through process

    Manufacturing processes and procedures

    Maintenance processes

    Operations

    Does imply the need to enforce the safety constraintsin some way

    New model includes what do now and more

    Copyright Nancy Leveson, Aug. 2006

  • STAMP (3)

    Safety is an emergent property that arises when

    system components interact with each other within a

    larger environment

    A set of constraints related to behavior of system

    components enforces that property

    Accidents occur when interactions violate those

    constraints (a lack of appropriate constraints on the

    interactions)

    Controllers embody or enforce those constraints

    Copyright Nancy Leveson, Aug. 2006

  • Example Safety Constraints

    Build safety in by enforcing safety constraints onbehavior

    Controllers contribute to accidents not by failing but by:

    1. Not enforcing safety-related constraints on behavior

    2. Commanding behavior that violates safety constraints

    System Safety Constraint:

    Water must be flowing into reflux condenser whenever catalyst

    is added to reactor

    Software Safety Constraint:

    Software must always open water valve before catalyst valve

  • STAMP (4)

    Systems are not treated as a static design

    A socio-technical system is a dynamic process

    continually adapting to achieve its ends and to react

    to changes in itself and its environment

    Migration toward states of high risk

    Preventing accidents requires designing a control

    structure to enforce constraints on system behavior

    and adaptation

    Copyright Nancy Leveson, Aug. 2006

  • Example

    Control

    Structure

  • Intelligent Cruise Control

  • Accident Causality

    Accidents occur when

    Control structure or control actions do not enforce

    safety constraints

    Unhandled environmental disturbances or conditions

    Unhandled or uncontrolled component failures

    Dysfunctional (unsafe) interactions among components

    Control structure degrades over time (asynchronous

    evolution)

    Control actions inadequately coordinated among

    multiple controllers

    Copyright Nancy Leveson, Aug. 2006

  • Dysfunctional Controller Interactions

    Boundary areas

    Overlap areas (side effects of decisions and control

    actions)

    Controller 1

    Controller 2

    Process 1

    Process 2

    Controller 1

    Controller 2

    Process

    Copyright Nancy Leveson, Aug. 2006

  • Uncoordinated Control Agents

    Control Agent

    (ATC)

    InstructionsInstructions

    SAFE STATE

    ATC provides coordinated instructions to both planesSAFE STATE

    TCAS provides coordinated instructions to both planes

    Control Agent

    (TCAS)

    InstructionsInstructions

    UNSAFE STATE

    BOTH TCAS and ATC provide uncoordinated & independent instructions

    Control Agent

    (ATC)

    InstructionsInstructions

    No Coordination

  • Root Cause Analysis Example:

    Exxon Valdez

    Congress

    Coast GuardExxon

    Tanker Captain

    And Crew

    Tanker

    Copyright Nancy Leveson, Aug. 2006

    For each component identify:

    Responsibility (Safety Requirements

    and constraints

    Inadequate control actions

    Context in which decisions made

    Mental model flaws

  • Applying STAMP

    To understand accidents, need to examine safetycontrol structure itself to determine why inadequateto maintain safety constraints and why eventsoccurred

    To prevent accidents, need to create an effectivesafety control structure to enforce the system safetyconstraints

    Not a blame model but a why model

  • Modeling Accidents Using STAMP

    Three types of models are used:

    1. Static safety control structure

    2. Dynamic safety control structure

    Shows how control structure changed over time

    3. Behavioral dynamics

    Dynamic processes behind changes, i.e., why the

    system changed over time

    Copyright Nancy Leveson, Aug. 2006

  • Simplified System Dynamics Model of Columbia Accident

    Copyright Nancy Leveson, Aug. 2006

  • STAMP vs. Traditional Accident Models

    Examines inter-relationships rather than linear

    cause-effect chains.

    Looks at the processes behind the events

    Includes entire socio-technical system

    Includes behavioral dynamics (changes over time)

    Want to not just react to accidents and impose controls for

    a while, but understand why controls drift toward

    ineffectiveness over time and

    Change those factors if possible

    Detect the drift before accidents occur

    Copyright Nancy Leveson, Aug. 2006

  • Uses for STAMP

    Basis for new, more powerful hazard analysis techniques

    (STPA)

    Inform early architectural trade studies

    Identify and prioritize hazards and risks

    Identify system and component safety requirements and constraints

    (to be used in design)

    Perform hazard analyses on physical and social systems

    Safety-driven design (physical, operational, organizational)

    More comprehensive accident/incident investigation and

    root cause analysis

  • Uses for STAMP (2)

    Organizational and cultural risk analysis

    Identifying physical and project risks

    Defining safety metrics and performance audits

    Designing and evaluating potential policy and structural improvements

    Identifying leading indicators of increasing risk (canary in the coal

    mine)

    New holistic approaches to security

  • Does it Work? Is it Practical?

    MDA risk assessment of inadvertent launch (technical)

    Architectural trade studies for the space exploration

    initiative (technical)

    Safetydriven design of a NASA JPL spacecraft

    (technical)

    NASA Space Shuttle Operations (risk analysis of a new

    management structure)

    NASA Exploration Systems (risk management tradeoffs

    among safety, budget, schedule, performance in

    development of replacement for Shuttle)

  • Does it Work? Is it Practical? (2)

    Accident analysis (spacecraft losses, bacterial

    contamination of water supply, aircraft collision, oil

    refinery explosion, train accident, etc.)

    Pharmaceutical safety

    Hospital safety (risks of outpatient surgery at Beth Israel

    MC)

    Corporate fraud (are controls adequate?, Sarbanes-

    Oxley)

    Food safety

    Train safety (Japan)

  • Accident/Incident Investigation

    and Causal Analysis

    Copyright Nancy Leveson, Aug. 2006

  • Using STAMP in Root Cause Analysis

    Identify system hazard violated and the system safety design

    constraints

    Construct the safety control structure as it was designed to

    work

    Component responsibilities (requirements)

    Control actions and feedback loops

    For each component, determine if it fulfilled its responsibilities

    or provided inadequate control.

    If inadequate control, why? (including changes over time)

    Determine the changes that could eliminate the inadequate

    control (lack of enforcement of system safety constraints) in

    the future.

    Copyright Nancy Leveson, Aug. 2006

  • Components surrounding

    Controller in Zurich

  • Links degraded due to

    poor and unsafe practices

  • Links lost due to

    sectorization work

  • Links

    lost due

    to

    unusual

    situations

  • Safety Requirements and Constraints

    Must follow TCAS mandate

    Context in which decisions were made

    Flying over Western Europe ( TCAS is mandatory)

    TU Crew doesnt have radio communication with Boeing Crew

    Flight Crew has no simulator experience with TCAS

    Flight Training is unclear on what to do in case of conflict betweenATC/TCAS

    Flying at night

    Inadequate Decisions and Control Actions

    Reliance on optical contact

    Ignores minority report from spare member of the crew

    Follows controller instructions rather than TCAS

    Mental Model Flaws

    Optical illusion of distance

    Belief that ATC is aware of everything that is happening

    Belief that pilot, not TCAS has the last said in the evasion action

    Understanding of TCAS as a backup system rather than a final resort

    Tupulov Crew

  • Zurich ATC Operations

    Safety Requirements and Constraints

    Maintain safe separation between planes in airspace

    Context in which decisions were made

    Phone system prevented communication from other ATCs

    Inadequate radar coverage

    Insufficient personnel (only one controller)

    Unaware of TCAS and/or impact of TCAS during a RA

    Etc.

    Inadequate Decision and Control Actions

    Failure to communicate with DHL plane

    Failure to adequately monitor situation

    Mental Model Flaws

    Unaware of conflicting TCAS procedures between Russian andEuropean pilots

    Etc.

  • Regulatory Agencies (FAA, CAA,

    Eurocontrol)

    (No significant influence on accident according to the report)

    Safety Requirements (Responsibilities)

    Clearly articulate procedures for compliance with TCAS RAs.

    Clearly articulate right of way rules in airspace.

    Define the role of air traffic controllers and pilots in resolving conflicts inthe presence of TCAS.

    Flawed Control Actions

    AIP Germany regulations not up to date for current version of TCAS.

    Procedural instruction for the actions to be taken by the pilots (from AIPGermany) in case of an RA not worded clearly enough.

    LuftVO Air Traffic Order Pilots are granted a freedom of decisionwhich is not compatible with the system philosophy of TCAS II, Version7 use of term recommendation is inadequate.

    Reasons for Flawed Control Actions, Dysfunctional Interactions

    Overlapping control authority by several nations & organizations.

    Asynchronous evolution between regulatory guidance documents andadopted technology.

  • Filnamn: Uberlingen enl STAMP

    (C) FMV 2007

    utg 9,3 2008-03-04

    Bjrn

    Koberstein

    Uberlingen, Operators: STAMP s Static Control Structure (Action Feedback)

    ICAO

    EuroControl

    Aircraft

    authorities

    (incl FAA)

    Flight

    operatorsAircraft

    manufacturerTCAS

    manufacturer

    Pilots

    Radar display

    Missing

    feedback

    ICAO (a UN agency)-Cooperative aviation regulation

    -Standards and recommendations .

    . (including rules of the air,

    . responsibilities of ATCOs & pilots,

    . conflicts btw ATCO & TCAS)

    Swiss Air Navigation Services .(ATC Zrich management)

    -Air traffic control in Swiss airspace + in

    . .delegated airspaces of adjoining states-ATC Safety policy (not fully implemented,. also SMOP in practice not prevented)

    ATC operator-overloaded

    -insufficiently trained

    -unofficial practice deviating from the

    . regular roster (night shift SMOP)

    EUROCONTROL (European org)-Standardization and guidance

    -Certification, education, information

    Missing

    feedback

    Missing

    feedback

    No direct

    feedback

    Aircraft TCASCoC-safety, quality and risk management

    -the implementation of the safety & risk

    . management procedures were delayed

    . due to their in-house development

    -unaware of the sectorisation work

    Radar

    CoC

    ATC-

    operators

    Missing

    feedback

    ATC Station

    (ATC Zurich)

    Missing

    feedback

    TCAS manufacturer-TCAS 2000 Pilots Guide

    . (including TCAS-ATC conflicts)

    Flight operators-TCAS training (B757-200/TU154M) * (obey RA unless/after conflict contact)

    -TU154M Flight Operations (ATC has. precedence over TCAS)

    JAA (Joint Aviation Authorities)-Guidance and training (TCAS-RAs

    . precedence over ATCO instructions)

    * BFU Investigation Report pp 62, 65

  • STPA

    A new hazard analysis technique based

    on the STAMP model of accident

    causation

    Copyright Nancy Leveson, Aug. 2006

  • STAMP-Based Hazard Analysis (STPA)

    Supports a safety-driven design process where

    Hazard analysis influences and shapes early design

    decisions

    Hazard analysis iterated and refined as design evolves

    Goals (same as any hazard analysis)

    Identification of system hazards and related safety

    constraints necessary to ensure acceptable risk

    Accumulation of information about how hazards can be

    violated, which is used to eliminate, reduce and control

    hazards in system design, development, manufacturing,

    and operations

    Copyright Nancy Leveson, Aug. 2006

  • Safety-Driven Design

    Define initial control structure, refining system safety constraintsand design in parallel.

    Identify potentially hazardous control actions by each ofsystem components that would violate system designconstraints. Restate as component safety design requirementsand constraints.

    Perform hazard analysis using STPA to identify how safety-related requirements and constraints could be violated (thepotential causes of inadequate control and enforcement ofsafety-related constraints).

    Augment the basic design to eliminate, mitigate, or controlpotential unsafe control actions and behaviors.

    Iterate over the process, i.e. perform STPA on the newaugmented design and continue to refine the design until allhazardous scenarios are eliminate, mitigated, or controlled.

    Document design rationale and trace requirements andconstraints to the related design decisions.

    Copyright Nancy Leveson, Aug. 2006

  • Step 1: Identify hazards and translate into high-

    level requirements and constraints on behavior

    TCAS Hazards:

    1. A near mid-air collision (NMAC): Two controlled aircraft violateminimum separation standards)

    2. A controlled maneuver into ground

    3. Loss of control of aircraft

    4. Interference with other safety-related aircraft systems

    5. Interference with the ground-based ATC system

    6. Interference with ATC safety-related advisory

    System Safety Design Constraints:

    TCAS must not cause or contribute to an NMAC

    TCAS must not cause or contribute to a controlled maneuverinto the ground

    Copyright Nancy Leveson, Aug. 2006

  • Step 2: Define basic control structure

    Copyright Nancy Leveson, Aug. 2006

  • Component Responsibilities

    TCAS:

    Receive and update information about its own and other aircraft

    Analyze information received and provide pilot with

    Information about where other aircraft in the vicinity are located

    An escape maneuver to avoid potential NMAC threats

    Pilot

    Maintain separation between own and other aircraft using visual

    scanning

    Monitor TCAS displays and implement TCAS escape maneuvers

    Follow ATC advisories

    Air Traffic Controller

    Maintain separation between aircraft in controlled airspace by

    providing advisories (control action) for pilot to follow

    Copyright Nancy Leveson, Aug. 2006

  • Aircraft components (e.g., transponders, antennas)

    Execute control maneuvers

    Receive and send messages to/from aircraft

    Etc.

    Airline Operations Management

    Provide procedures for using TCAS and following TCAS

    advisories

    Train pilots

    Audit pilot performance

    Air Traffic Control Operations Management

    Provide procedures

    Train controllers,

    Audit performance of controllers

    Audit performance of overall collision avoidance system

    Copyright Nancy Leveson, Aug. 2006

  • Step 3a: Identify potential inadequate control

    actions that could lead to a hazardous state.

    In general:

    1. A required control action is not provided or not

    followed

    2. An incorrect or unsafe control action is provided

    3. A potentially correct or inadequate control action is

    provided too late or too early (at the wrong time)

    4. A correct control action is stopped too soon.

    Copyright Nancy Leveson, Aug. 2006

  • For the NMAC hazard:

    TCAS:

    1. The aircraft are on a near collision course and TCAS does not

    provide an RA

    2. The aircraft are in close proximity and TCAS provides an RA that

    degrades vertical separation.

    3. The aircraft are on a near collision course and TCAS provides an

    RA too late to avoid an NMAC

    4. TCAS removes an RA too soon

    Pilot:

    1. The pilot does not follow the resolution advisory provided by TCAS

    (does not respond to the RA)

    2. The pilot incorrectly executes the TCAS resolution advisory.

    3. The pilot applies the RA but too late to avoid the NMAC

    4. The pilot stops the RA maneuver too soon.

    Copyright Nancy Leveson, Aug. 2006

  • Step 3b: Use identified inadequate control

    actions to refine system safety design

    constraints

    When two aircraft are on a collision course, TCAS must

    always provide an RA to avoid the collision

    TCAS must not provide RAs that degrades vertical separation

    The pilot must always follow the RA provided by TCAS

    Copyright Nancy Leveson, Aug. 2006

  • Step 4: Determine how potentially hazardous

    control actions could occur (scenarios of how

    constraints can be violated). Eliminate from design

    or control in design or operations.

    Step4a: Augment control structure with process models for each

    control component.

    Step4b: For each of inadequate control actions, examine parts of

    control loop to see if could cause it.

    Guided by a set of generic control flaws

    Step 4c: Design controls and mitigation measures

    Step4d: Consider how designed controls could degrade over time.

    Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

  • Generic Control Loop Flaws

    1. Inadequate Enforcement of Constraints (inadequate

    Control Actions)

    - Design of control algorithm (process) does not enforce

    constraints

    Flaws in creation process

    Process changes without appropriate change in control

    algorithm (asynchronous evolution)

    Incorrect modification or adaptation

    - Inadequate coordination among controllers and decision

    makers

    Copyright Nancy Leveson, Aug. 2006

  • - Process models inconsistent, incomplete, or incorrect

    Flaws in creation process

    Flaws in updating (inadequate or missing feedback)

    Not provided in system design

    Communication flaw

    Time lag

    Inadequate sensor operation (incorrect or no information provided)

    Time lags and measurement inaccuracies not accounted

    for

    Expected process inputs are wrong or missing

    Expected control inputs are wrong or missing

    Disturbance model is wrong

    Amplitude, frequency, or period is out of range

    Unidentified disturbance

    Copyright Nancy Leveson, Aug. 2006

  • 2. Inadequate Execution of Control Actions

    Communication flaw

    Inadequate actuator operation

    Time lag

    Copyright Nancy Leveson, Aug. 2006

  • Comparison with Traditional HA

    Techniques

    Top-down (vs bottom-up like FMECA)

    Considers more than just component failure and failure

    events (includes these but more general)

    Guidance in doing analysis (vs. FTA)

    Handles dysfunctional interactions and system accidents,

    software, management, etc.

    Copyright Nancy Leveson, Aug. 2006

  • Comparisons (2)

    Concrete model (not just in head)

    Not physical structure (HAZOP) but control (functional)

    structure

    General model of inadequate control (based on control

    theory)

    HAZOP guidewords based on model of accidents being

    caused by deviations in system variables

    Includes HAZOP model but more general

    Compared with TCAS II Fault Tree (MITRE)

    STPA results more comprehensive

    Included Ueberlingen accident

    Copyright Nancy Leveson, Aug. 2006

  • 1. Identify high-level functional requirements andenvironmental constraints.

    e.g. size of physical space, crowded area

    2. Identify high-level hazards

    a. Violation of minimum separation between mobile base andobjects (including orbiter and humans)

    b. Mobile robot becomes unstable (e.g., could fall over)

    c. Manipulator arm hits something

    d. Fire or explosion

    e. Contact of human with DMES

    f. Inadequate thermal control (e.g., damaged tiles not detected,DMES not applied correctly)

    g. Damage to robot

    Thermal Tile Robot Example

    Copyright Nancy Leveson, Aug. 2006

  • 3. Try to eliminate hazards from system conceptual design.

    If not possible, then identify controls and new design

    constraints.

    For unstable base hazard

    System Safety Constraint:

    Mobile base must not be capable of falling over under worst case operational conditions

    Copyright Nancy Leveson, Aug. 2006

  • First try to eliminate:

    1. Make base heavy

    Could increase damage if hits someone or something.

    Difficult to move out of way manually in emergency

    2. Make base long and wide

    Eliminates hazard but violates environmental constraints

    3. Use lateral stability legs that are deployed when manipulatorarm extended but must be retracted when mobile base moves.

    Two new design constraints:

    Manipulator arm must move only when stabilizer legs are fullydeployed

    Stabilizer legs must not be retracted until manipulator arm isfully stowed.

    Copyright Nancy Leveson, Aug. 2006

  • Define preliminary control structure and refine

    constraints and design in parallel.

    Copyright Nancy Leveson, Aug. 2006

  • Identify potentially hazardous control actions by

    each of system components

    1. A required control action is not provided or not followed

    2. An incorrect or unsafe control action is provided

    3. A potentially correct or inadequate control action is providedtoo late or too early (at the wrong time)

    4. A correct control action is stopped too soon.

    Hazardous control of stabilizer legs:

    Legs not deployed before arm movement enabled

    Legs retracted when manipulator arm extended

    Legs retracted after arm movements are enabled or retractedbefore manipulator arm fully stowed

    Leg extension stopped before they are fully extended

    Copyright Nancy Leveson, Aug. 2006

  • Restate as safety design constraints on components

    1. Controller must ensure stabilizer legs are extended

    whenever arm movement is enabled

    2. Controller must not command a retraction of stabilizer legs

    when manipulator arm extended

    3. Controller must not command deployment of stabilizer legs

    before arm movements are enabled. Controller must not

    command retraction of legs before manipulator arm fully

    stowed

    4. Controller must not stop leg deployment before they are fully

    extended

    Copyright Nancy Leveson, Aug. 2006

  • Do same for all hazardous commands:

    e.g., Arm controller must not enable manipulator armmovement before stabilizer legs are completely extended.

    At this point, may decide to have arm controller and

    leg controller in same component

    Copyright Nancy Leveson, Aug. 2006

  • To produce detailed scenarios for violation of

    safety constraints, augment control structure with

    process models

    Arm MovementEnabled

    Disabled

    Unknown

    Stabilizer LegsExtended

    Retracted

    Unknown

    Manipulator ArmStowed

    Extended

    Unknown

    How could become inconsistent with real state?

    e.g. issue command to extend stabilizer legs but external

    object could block extension or extension motor could fail

    Copyright Nancy Leveson, Aug. 2006

  • Problems often in startup or shutdown:

    e.g., Emergency shutdown while servicing tiles. Stability legsmanually retracted to move robot out of way. When restart,

    assume stabilizer legs still extended and arm movement could be

    commanded. So use unknown state when starting up

    Do not need to know all causes, only safety constraints:

    May decide to turn off arm motors when legs extended or when

    arm extended. Could use interlock or tell computer to power it off.

    Must not move when legs extended? Power down wheel motors

    while legs extended.

    Coordination problems

    Copyright Nancy Leveson, Aug. 2006

  • Some Examples and References

    to Papers on Them

    Copyright Nancy Leveson, Aug. 2006

  • Example: Early System Architecture

    Trades for Space Exploration

    Part of an MIT/Draper Labs contract with NASA

    Wanted to include risk, but little information available

    Not possible to evaluate likelihood when no designinformation available

    Can consider severity by using worst-case analysisassociated with specific hazards.

    Developed three step process:

    Identify system-level hazards and associated severities

    Identify mitigation strategies and associated impact

    Calculate safety/risk metrics for each architecture

    Copyright Nancy Leveson, Aug. 2006

  • Sample

    First identify system hazards and severities

    Copyright Nancy Leveson, Aug. 2006

  • ID# Phase Hazard H M EqG1 General Flamable substance in presence of ignition source (Fire) 4 4 4

    G2 General Flamable substance in presnece of ignition source in confined space (Explosion) 4 4 4

    G3 General Loss of life support (includes power, temperature, oxygen, air pressure, CO2, food, water, etc.) 4 4 4

    G4 General Crew injury or illness 4 4 1

    G5 General Solar or nuclear radiation exceeding safe levels 3 3 2

    G6 General Collision (Micrometeroids, debris, with modules during rendevous or separation maneuver, etc.) 4 4 4

    G7 General Loss of attitude control 4 4 4

    G8 General Engines do not ignite 4 4 2

    PL1 Pre-Launch Damage to Payload 2 3 3

    PL2 Pre-Launch Launch delay (due to weather, pre-launch test failures, etc.) 1 4 1

    L1 Launch Incorrect propulsion/trajectory/control during ascent 4 4 4

    L2 Launch Loss of structural integrity (due to aerodynamic loads, vibrations, etc) 4 4 4

    L3 Launch Incorrect stage separation 4 4 4

    E1 EVA in Space Lost in space 4 4 1

    A1 Assembly Incorrect propulsion/control during rendevous 4 4 4

    A2 Assembly Inability to dock 1 4 3

    A3 Assembly Inability to achieve airlock during docking 1 4 3

    A4 Assembly Inability to undock 4 4 3

    T1 In-Space Transfer Incorrect propulsion/trajectory/control during course change burn 4 4 3

    D1 Descent Inability to undock 4 4 3

    D2 Descent Incorrect propulsion/trajectory/control during descent 4 4 4

    D3 Descent Loss of structural integrity (due to inadequate thermal control, aerodynamic loads, vibrations, etc) 4 4 4

    A1 Ascent Incorrect stage separation (including ascent module disconnecting from descent stage) 4 3 3

    A2 Ascent Incorrect propulsion/trajectory/control during ascent 4 3 3

    A3 Ascent Loss of structural integrity (due to aerodynamic loads, vibrations, etc) 4 3 3

    S1 Surface Operations Crew members stranded on M surface during EVA 4 3 3

    S2 Surface Operations Crew members lost on M surface during EVA 4 3 3

    S3 Surface Operations Equipment damage (including related to lunar dust) 2 3 3

    NP1 Nuclear Power Nuclear fuel released on earth surface 4 4 2

    NP2 Nuclear Power Insufficient power generation (reactor doesn't work) 4 3 3

    NP3 Nuclear Power Insufficient reactor cooling (leading to reactor meltdown) 4 3 3

    RE1 Re-Entry Inability to undock 4 3 3

    RE2 Re-Entry Incorrect propulsion/trajectory/control during descent 4 3 3

    RE3 Re-Entry Loss of structural integrity (due to inadequate thermal control, aerodynamic loads, vibrations, etc) 4 3 4

    RE4 Re-Entry Inclement weather 4 2 2

    Severity

    Identified Hazards and their Severities

    Copyright Nancy Leveson, Aug. 2006

  • For example, not performing a rendezvous in transit reduces hazard

    of being unable to dock

    Copyright Nancy Leveson, Aug. 2006

  • Evaluate Each Architecture and Calculate

    Safety/Risk Metrics

    Create an architecture vector with all parameters for

    that architecture (column C of spreadsheet)

    Compute metric on architecture vector:

    Calculate a Relative Hazard Mitigation Index

    Calculate a Relative Severity Index

    Combine into an Overall Safety/Risk Metric

    Details in http://sunnyday.mit.edu/papers/issc05-

    final.pdf

    Copyright Nancy Leveson, Aug. 2006

  • Sample Results

    Copyright Nancy Leveson, Aug. 2006

  • Ballistic Missile Defense System (BMDS)

    Non-Advocate Safety Assessment using STPA

    A layered defense to defeat all ranges of threats in allphases of flight (boost, mid-course, and terminal)

    Made up of many existing systems (BMDS Element)

    Early warning radars

    Aegis

    Ground-Based Midcourse Defense (GMD)

    Command and Control Battle Management andCommunications (C2BMC)

    Others

    MDA used STPA to evaluate the residual safety risk ofinadvertent launch prior to deployment and test

    Copyright Nancy Leveson, Aug. 2006

  • 8/2/2006 132

    Status

    Track Data

    Fire Control

    Radar

    Operators

    Engage Target

    Operational Mode Change

    Readiness State Change

    Weapons Free / Weapons Hold

    Operational Mode

    Readiness State

    System Status

    Track Data

    Weapon and System Status

    Command Authority

    Doctrine

    Engagement Criteria

    Training

    TTP

    Workarounds

    Early WarningSystem

    Status Request

    Launch Report

    Status Report

    Heartbeat

    Radar Tasking

    Readiness Mode Change

    Status Request

    Acknowledgements

    BIT Results

    Health & Status

    Abort

    Arm

    BIT Command

    Task Load

    Launch

    Operating Mode

    Power

    Safe

    Software Updates

    Flight Computer

    InterceptorSimulator

    Launch Station

    Fire DIsable

    Fire Enable

    Operational Mode Change

    Readiness State Change

    Interceptor Tasking

    Task Cancellation

    Command Responses

    System Status

    Launch Report

    Launcher

    Launch Position

    Stow Position

    Perform BIT

    InterceptorH/W

    Arm

    Safe

    Ignite

    BIT Info

    Safe & Arm Status

    BIT Results

    Launcher Position

    Abort

    Arm

    BIT Command

    Task Load

    Launch

    Operating Mode

    Power

    Safe

    Software Updates

    Acknowledgements

    BIT Results

    Health & Status

    Breakwires

    Safe & Arm Status

    Voltages

    Exercise Results

    Readiness

    Status

    Wargame Results

    Safety Control Structure Diagram for FMIS

    Copyright Nancy Leveson, Aug. 2006

  • Results

    Deployment and testing held up for 6 months because somany scenarios identified for inadvertent launch (the onlyhazard considered so far). In many of these scenarios:

    All components were operating exactly as intended

    Complexity of component interactions led to unanticipatedsystem behavior

    STPA also identified component failures that could causeinadequate control (most analysis techniques consider onlythese failure events)

    As changes are made to the system, the differences areassessed by updating the control structure diagrams andassessment analysis templates.

    Adopted as primary safety approach for BMDS

    Copyright Nancy Leveson, Aug. 2006

  • Safety-driven Design of an Outer Planets

    Explorer Spacecraft for JPL

    Demonstration of approach on the design of a deep

    space exploration mission spacecraft (Europa).

    Defined mission hazards

    Generated mission safety requirements and design

    constraints

    Created spacecraft control structure and system design

    Performed STPA and generated component safety

    requirements and design features to control hazards

    http://sunnyday.mit.edu/papers/IEEE-Aerospace.pdf

    (complete specifications also available)

  • Organizational and Cultural Risk

    Analysis

    Copyright Nancy Leveson, Aug. 2006

  • Cultural and Organizational Risk

    Analysis and Performance Monitoring

    Apply STAMP and STPA at organizational level plus

    system dynamics modeling and analysis

    Goals:

    Evaluating and analyzing risk

    Designing and validating improvements

    Monitoring risk (canary in the coal mine)

    Identifying leading indicators of increasing or

    unacceptable risk

    Copyright Nancy Leveson, Aug. 2006

  • System Dynamics

    Created at MIT in 1950s by Forrester

    Used a lot in Sloan School (management)

    Grounded in non-linear dynamics and feedback control

    Also draws on

    Cognitive and social psychology

    Organization theory

    Economics

    Other social sciences

    Use to understand changes over time (dynamics of a

    system

    Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

  • People who

    know

    People who

    don't knowrate of sharing

    the news

    Probability of Contact

    with those in the know

    Contacts between peoplewho know and people who

    don't

    +

    ++

    +

    100

    75

    50

    25

    00 1 2 3 4 5 6 7 8 9 10 11 1

    Time (Month)

    People who know

    peo

    ple

    People who don't knowRate of sharing the news

    Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

  • Copyright Nancy Leveson, Aug. 2006

  • Risk Analysis Process

    for Independent Technical Authority

    1. Preliminary

    Hazard Analysis

    2. Modeling the ITA

    Safety Control

    Structure

    3. Mapping

    Requirements to

    Responsibilities

    4. Detailed Hazard

    Analysis using STPA

    ? System hazards

    ? System safety requirements

    and constraints

    ? Roles and

    responsibilities

    ? Feedback mechanisms

    ? Gap analysis ? System risks

    (inadequate

    controls)

    5. Categorizing &

    Analyzing Risks

    6. System Dynamics

    Modeling and Analysis

    7. Findings and

    Recommendations

    ? Immediate and

    longer term risks

    ? Sensitivity

    ? Leading

    indicators

    ? Risk Factors

    ? Policy

    ? Structure

    ? Leading indicators

    and measures of

    effectiveness

    Copyright Nancy Leveson, Aug. 2006

  • 1. Preliminary Hazard Analysis

    System Hazard: Poor engineering and management decision-

    making leading to an accident (loss).

    System Safety Requirements and Constraints:

    1. Safety considerations must be first and foremost in technical

    decision-making.

    2. Safety-related technical decision-making must be done by

    eminently qualified experts with broad participation of the full

    workforce.

    3. Safety analyses must be available and used starting in the early

    acquisition, requirements development, and design processes

    and continuing through the system lifecycle.

    4. The Agency must provide avenues for full expression of technical

    conscience and a process for full and adequate resolution of

    technical conflicts as well as conflicts between programmatic and

    technical concerns. Copyright Nancy Leveson, Aug. 2006

  • Each of these was refined, e.g.,

    1. Safety considerations must be first and foremost in technicaldecision-making.

    a. State-of-the art safety standards and requirements for NASA missionsmust be established, implemented, enforced, and maintained thatprotect the astronauts, the workforce, and the public.

    b. Safety-related technical decision-making must be independent fromprogrammatic considerations, including cost and schedule

    c. Safety-related decision-making must be based on correct, complete,and up-to-date information.

    d. Overall (final) decision-making must include transparent considerationof both safety and programmatic concerns.

    e. The Agency must provide for effective assessment and improvementin safety-related decision-making.

    To create a set of system safety requirements and constraintssufficient to eliminate or mitigate the hazard

    Copyright Nancy Leveson, Aug. 2006

  • 2. Model the ITA Control Structure

    Copyright Nancy Leveson, Aug. 2006

  • For each component specified:

    Inputs, outputs

    Overall role and detailed responsibilities (requirements)

    Potential inadequate control actions

    Feedback requirements

    For most added:

    Environmental and behavior-shaping factors (context)

    Mental model requirements

    Controls

    Copyright Nancy Leveson, Aug. 2006

  • Example from System Technical

    Warrant Holder

    1. Establish and maintain technical policy, technical

    standards, requirements, and processes for a

    particular system or systems.

    a. STWH shall ensure program identifies and imposes

    appropriate technical requirements at

    program/project formulation to ensure safe and

    reliable operations.

    b. STWH shall ensure inclusion of the consideration of

    risk, failure, and hazards in technical requirements.

    c. STWH shall approve the set of technical

    requirements and any changes to them

    d. STWH shall approve verification plans for the

    system(s)

    Copyright Nancy Leveson, Aug. 2006

  • 3. Map System Requirements to

    Component Responsibilities

    Took each of system safety requirements and

    traced to component responsibilities

    (requirements)

    Identified omissions, conflicts, potential issues

    Recommended additions and changes

    Added responsibilities when missing in order for

    risk analysis to be complete.

    Copyright Nancy Leveson, Aug. 2006

  • 4. Hazard Analysis using STPA

    General types of risks for ITA:

    1. Unsafe decisions are made by or approved by ITA

    2. Safe decisions are disallowed (overly conservative decision-making that undermines the goals of NASA and long-termsupport for ITA)

    3. Decision-making takes too long, minimizing impact and alsoreducing support for ITA

    4. Good decisions are made by ITA, but do not have adequateimpact on system design, construction, and operation

    Applied to each of component responsibilities

    Identified basic and coordination risks

    Copyright Nancy Leveson, Aug. 2006

  • Example from Risks List

    CE Responsibility: Develop, monitor, and maintain technical

    standards and policy

    Risks:

    1. General technical and safety standards and

    requirements are not created (IC)

    2. Inadequate standards and requirements are created

    (IC)

    3. Standards degrade as changed over time due to

    external pressures to weaken them. Process for

    approving changes is flawed (LT).

    4. Standards not changed or updated over time as the

    environment changes (LT).

    Copyright Nancy Leveson, Aug. 2006

  • 5. Categorize and Analyze Risks

    Large number resulted so:

    Categorized risks as

    Immediate concern

    Longer-term concern

    Standard Process

    Used system dynamics models to identify which risks

    were most important to assess and measure

    Provide most important assessment of current level of risk

    Most likely to detect increasing risk early enough to prevent

    significant losses (leading indicators)

    Copyright Nancy Leveson, Aug. 2006

  • Risk

    Shuttle Agingand

    MaintenanceSystem Safety

    Efforts &Efficacy

    PerceivedSuccess by

    Administration

    System SafetyKnowledge,

    Skills & Staffing

    Launch RateSystem Safety

    ResourceAllocation

    Incident Learning& Corrective

    Action

    ITA

    Copyright Nancy Leveson, Aug. 2006

  • 6. System Dynamics Modeling

    Modified our NASA manned space program model

    to include Independent Technical Authority (ITA)

    Independently tested and validated the nine models,

    then connected them

    Ran analyses:

    Sensitivity analyses to investigate impact of various

    parameters on system dynamics and risk

    System behavior mode investigation

    Metrics evaluations

    Additional scenarios and insights

    Copyright Nancy Leveson, Aug. 2006

  • Example Result

    ITA has potential to significantly reduce risk and to

    sustain an acceptable risk level

    But also found significant risk of unsuccessful

    implementation of ITA that needs to be monitored

    200-run Monte-Carlo sensitivity analysis

    Random variations of +/- 30% of baseline exogenous

    parameter values

    Copyright Nancy Leveson, Aug. 2006

  • Successful vs. Unsuccessful ITA

    ImplementationIndicator of Effectiveness and Credibility of ITA

    1

    0.5

    0Time

    1

    2

    1

    0.5

    0 1

    2

    Time

    System Technical Risk

  • Self-sustaining for short period of time if conditions in

    place for early acceptance.

    Provides foundation for a solid, sustainable ITA

    program implementation under right conditions.

    Successful scenarios:

    After period of high success, effectiveness slowly

    declines

    Complacency

    Safety seen as solved problem

    Resources allocated to more urgent matters

    But risk still at acceptable levels and extended period

    of nearly steady-state equilibrium with risk at low levels

    Successful Scenarios

    Copyright Nancy Leveson, Aug. 2006

  • Unsuccessful Implementation Scenarios

    Effectiveness quickly starts to decline and reaches

    unacceptable levels

    Limited ability of ITA to have sustained effect on system

    Hazardous events start to occur, safety increasingly

    perceived as urgent problem

    More resources allocated to safety but TA and TWHs have

    lost so much credibility they cannot effectively contribute to

    risk mitigation anymore.

    Risk increases dramatically

    ITA and safety staff overwhelmed with safety problems

    Start to approve an increasing number of waivers so can

    continue to fly.

    Copyright Nancy Leveson, Aug. 2006

  • Unsuccessful Scenario Factors

    As effectiveness of ITA decreases, number of problems

    increase

    Investigation requirements increase

    Corners may be cut to compensate

    Results in lower-quality investigation resolutions and

    corrective actions

    TWHs and Trusted Agents become saturated and cannot attend

    to each investigation in timely manner

    Bottleneck created by requiring TWHs to authorize all safety-

    related decisions, making things worse

    Want to detect this reinforcing loop while interventions

    still possible and not overly costly (resources, downtime)

    Copyright Nancy Leveson, Aug. 2006

  • Identification of Lagging vs. Leading

    Indicators

    Number of waivers issued

    good indicator but lags rapid

    increase in risk

    Incidents under investigation

    is a better leading indicator

    System Technical Risk Risk UnitsOutstanding Accumulated Waivers Incidents

    Time

    System Technical Risk Risk UnitsIncidents Under Investigation Incidents

    Time

  • Modeling Exploration Enterprise (ESMD)

    Built a large STAMP plus systems dynamics model

    of the Project Constellation

    Development-oriented vs. operations oriented

    Space Shuttle model

    Demonstrating how it can be used for risk

    management decision-making

    Copyright Nancy Leveson, Aug. 2006

  • Risk Management in NASAs New

    Exploration Systems Mission Directorate

    Created an executable model, using input from the NASAworkforce, to analyze relative effects of managementstrategies on schedule, cost, safety and performance

    Developed scenarios to analyze risks identified by theAgencys workforce

    Performed preliminary analysis on the effects of hiringconstraints, management reserves, independence of safetydecision-making, requirements changes, etc.

    Derived preliminary recommendations to mitigate andmonitor program-level risks

    Copyright Nancy Leveson, Aug. 2006

  • Structure of System Dynamics Model

    Congress and White HouseDecision -Making

    NASA Administration and ESMDDecision -Making

    OSMA OCE

    Exploration Systems Engineering Management

    Technical

    Personnel

    Resources and

    Experience

    System Development and

    Safety Analysis

    Completion

    Efforts and

    Efficacy of Other

    Technical

    Personnel

    Engineering

    Procurement

    NESC

    Safety and Mission

    Assurance

    SMA Status , Efficacy ,

    Knowledge and Skills

    Exploration Systems Program /Project Management

    Task Completion and Schedule Pressure Resource Allocation

    Copyright Nancy Leveson, Aug. 2006

  • Design Work

    Remaining

    Design Work

    Completed

    Pending Technology

    Development Tasks

    Completed Technology

    Development Tasks

    Design Task

    Completion Rate

    Technology Development

    Task Completion Rate

    Technologies used in

    DesignTechnology

    Utilization Rate

    Pending Hazard

    Analyses

    Incoming Program

    Design Work

    Incoming Hazard

    Analysis Tasks

    Incoming Technology

    Development Tasks

    Completed Hazard

    Analyses

    Hazard Analyses

    used in DesignHA Completion

    Rate

    HA Utilization

    Rate

    Hazard Analyses

    unused in Design

    Decisions

    HA Discard Rate

    Abandoned

    Technologies

    Technology

    Abandonment Rate

    Design Task AllocationRate (from P/P

    Management)

    Technology Development Task

    Allocation Rate (from P/P

    Management)

    Capacity for Performing

    System Design Work 0

    Capacity for

    PerformingTechnology

    Development Work 0

    Design SchedulePressure from

    Management

    Fraction of HAs Too

    Late to Influence Design

    Average Hazard

    Analysis Quality

    Average Quality ofHazard Analyses used in

    Design

    Fraction of Design Tasks with

    Associated Hazard Analysis

    Technology Available to

    be used in Design

    Additional Incoming

    Design Work

    Progress Report to

    Management

    Additional Incoming Work

    from Changes (from P/P

    Management)

    Design Work

    Completed with

    Undiscovered Safety

    and Integration

    Problems

    Design Work Completion

    Rate with Safety and

    Integration Flaws

    Total Design Work

    Completion Rate

    Work Discovered with

    Safety and Integration

    Problems

    Flaw Discovery

    Rate

    Design Work with

    Accepted Problems or

    UnsatisfiedRequirements

    Acceptance Rate

    Unplanned Rework

    Decision Rate

    Additional Operations Cost for Safety

    and Integration Workaround

    Efficacy of Safety

    Assurance (SMA)

    Safety Assurance

    Resources

    Time to Discover

    Flaws

    Incentives to

    Report Flaws

    Efficacy of System

    Integration

    Quality of Safety

    Analyses 0

    Maximum System Safety

    Analysis Completion Rate

    System

    Performance

    Apparent Work

    Completed

    Desired Design Task

    Completion Rate

    Safety of

    Operational System

    System Design

    Overwork

    Desired Safety Analysis

    Completion Rate

    Ability to Perform

    Contractor Safety

    Oversight 2

    Fraction of Design TasksCompleted with Safety and

    Integration Flaws

    Engineering - System Development Completion and Safety Analyses

    Safety Rework

    Rework Cycle

    Integrated

    Product

  • NASA ESMD Workforce Planning

    ESMD Employee Gap4,000

    2,950

    1,900

    850

    - 2000 37.5 75 112.5 150

    Time ( Month)

    Transfers from

    Shuttle

    +

    Limits on Hiring

    +

    (A)

    5 0

    Simulation varied:

    Initial experience distribution of ESMD civil servant workforce

    Maximum civil servant hiring rates

    Transfers from Shuttle ops during Shuttle retirement

    Important Issues:

    - Increase in retirements

    - Hiring limits

    - Transfers

  • Example: Schedule Pressure and Safety Priority

    in Developing the Shuttle Replacement

    1. Overly aggressive schedule

    enforcement has little effect

    on completion time (

  • Using Model for Policy Decisions

    The results of the analyses can be used to make policydecisions, for example:

    Reduce limitations (external and internal) that will impede civilservant hiring in the next few years

    Monitor management reserves and use them to alleviate overwork

    Enhance, monitor, and maintain influence of safety analysts ondecision-making

    Rotate Rising Stars in the Agency through the safety organization

    Monitor overwork of SE&I and safety engineering, as they controlthe rework cycle (safety, cost, and schedule impact)

    Continue planning to minimize downstream requirements changes,and allow for on/off ramps (technologies and designs) to reducenegative impact

  • Exploring Limits of Use

    Medical error and medical safety (risk analysis of

    outpatient surgery at Beth Israel Deaconess Hospital)

    Safety in pharmaceutical testing and drug development

    Food safety

    Control of corporate fraud

  • For More Information

    New book draft on STAMP

    http://sunnyday.mit.edu/book2.html

    (link to CER Early Trades paper also here)

    NASA ITA Risk Analysis Final Report

    http://sunnyday.mit.edu/ITA-Risk-Analysis.doc

    NASA ESMD Risk Management Demonstration

    http://sunnyday.mit.edu/ESMD-Final-Report.pdf

  • Summary and Conclusions

    A more powerful approach to hazard analysis and systemsafety engineering

    Based on a new, more comprehensive model of accidentcausation

    Includes what do now but also much more

    Works for the complex, software-intensive systems (andsystems-of-systems) we are building

    Considers the entire socio-technical system

    Can be used early in concept formation and development toguide design for safety

    Has been validated and is being used on real systems

    Potential for very powerful automated tools and assistance

    Copyright Nancy Leveson, Aug. 2006

  • Differences with Traditional Approaches

    More comprehensive view of causality

    A top-down systems approach to preventing losses

    Includes organizational, social, and cultural aspects of

    risk as well as physical system

    Emphasizes non-probabilistic and qualitative approaches

    Combines static (structural) and behavioral models

    Looks at dynamics and changes over time

    Migration toward states of increasing risk

    Includes human decision making and mental models

    Handles much more complex systems than traditional

    safety engineering approaches