issues and ideas in software reliability for fcs joe loyall bbn technologies

Issues and Ideas in Software Reliability for FCS

Joe Loyall

BBN Technologies

5/18/2004 Joe Loyall 2

General Issues Affecting Reliability of FCS

• Size and complexity - Very large, complex systems
– Many interoperating parts, developed by different people, including legacy
– Unreliability of any one part can affect the system, but reliability of any one part may have little effect on the reliability of the entire system

• Large mission requirements that decompose into distributed (and some local) requirements
– Too easy to decompose poorly

• One can verify, validate, and unit test individual pieces
– However, reliability of the whole is not the sum of the reliability of the parts

• Abstracting away the details can help one to understand some of the high-level design
– However, putting back in the details later can put back in the complexity and the bugs

• Some things can’t be put back in later, because they are pervasive
– Trying to insert some things after the fact can greatly increase the fragility of the system
– QoS, security, fault tolerance are examples

• Tying too tightly to a hardware platform can lead to future brittleness; tying too loosely can lead to bugs associated with lack of control

– Motivates the need for a middle layer

• Reliability of the system can be limited by the quality of the least capable programming group

– Motivates the need for strong processes, tools, patterns, etc.
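
A small numerical sketch of the point about parts versus the whole. It assumes (this is not stated in the slides) that parts fail independently and that the system needs every part to work; the part count and reliability values are illustrative only.

```python
from math import prod


def series_reliability(part_reliabilities):
    """Reliability of a system that fails if any one part fails.

    Assumes independent part failures (an assumption of this sketch).
    """
    return prod(part_reliabilities)


# 100 parts, each individually quite reliable (illustrative values).
baseline = series_reliability([0.99] * 100)

# Making one part much better barely helps the whole system.
one_part_improved = series_reliability([0.999] + [0.99] * 99)

# Making one part much worse drags the whole system down.
one_part_degraded = series_reliability([0.5] + [0.99] * 99)

print(f"baseline:          {baseline:.3f}")
print(f"one part improved: {one_part_improved:.3f}")
print(f"one part degraded: {one_part_degraded:.3f}")
```

Under these assumptions, 100 parts at 0.99 each give a system reliability of roughly 0.37. Raising one part to 0.999 barely moves that number, while dropping one part to 0.5 roughly halves it.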


Topic 1: Building Reliable FCS Software with Managed Quality of Service (QoS)

• Managed QoS in DRE systems is crucial
– Providing managed QoS currently complicates application development significantly, especially in distributed environments
– Has traditionally been handled with static provisioning
– Recent research has developed the ability to handle QoS at runtime with control and adaptation

• New advances are needed to develop reliable FCS software
– Can’t move backward to only static provisioning because FCS is too dynamic
– Runtime QoS control, however, is only one part of software reliability

• Need to continue to build upon the advances of recent years…
– Separate programming of QoS and functionality
– Design-time specification and runtime enforcement of QoS
– Predictable end-to-end QoS in dynamic environments
– Component sized units for encapsulation, reuse, and composition

• While moving forward to support the design and implementation of reliable QoS managed FCS software
– Modeling of QoS aspects separately from, but alongside, functional and component modeling
– Programming to well-defined QoS interfaces and standard protocols
– Reusable encapsulated, but configurable, QoS behaviors that can be assembled with reliability
– Models, tools, patterns, and processes
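
One way to picture "separate programming of QoS and functionality" is a wrapper that owns the QoS decision while the functional code stays unaware of it. The sketch below is a hypothetical illustration only: `encode_image`, `BandwidthContract`, `with_qos`, and the thresholds are invented for this example and are not taken from FCS or from any particular middleware.

```python
def encode_image(pixels: bytes, quality: int) -> bytes:
    """Functional code: hypothetical stand-in that keeps `quality` percent of the bytes."""
    keep = max(1, len(pixels) * quality // 100)
    return pixels[:keep]


class BandwidthContract:
    """QoS code: maps an observed condition to an operating region."""

    def __init__(self, regions):
        # regions: (min_kbps, quality) pairs, checked from the highest threshold down
        self.regions = sorted(regions, reverse=True)

    def quality_for(self, measured_kbps: float) -> int:
        for min_kbps, quality in self.regions:
            if measured_kbps >= min_kbps:
                return quality
        # Below every threshold: fall back to the lowest region.
        return self.regions[-1][1]


def with_qos(func, contract, probe):
    """Wrap functional code so the QoS decision is made outside of it."""
    def delegate(pixels: bytes) -> bytes:
        return func(pixels, contract.quality_for(probe()))
    return delegate


# Assumed thresholds, for illustration only.
contract = BandwidthContract([(0, 25), (500, 50), (2000, 100)])
send_ready = with_qos(encode_image, contract, probe=lambda: 600.0)
print(len(send_ready(bytes(100))))  # size of the adapted payload
```

The functional routine can be tested and reused on its own, and the contract can be swapped at configuration time without touching it.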


Area of Focus: SoS QoS

Designing SoS must consider several dimensions of QoS

• QoS for each individual end-to-end string (SDMS/W)

• QoS for multiple end-to-end application strings competing for resources

• Doing this for non-fixed, changing numbers of application strings

• Handling it dynamically, where conditions change over time

Technologies and processes to make it feasible to handle QoS at the System of Systems (SoS) level of abstraction

• Modeling tools that support design of QoS aspects of SoS separate from, but alongside, functional components

• QoS interfaces and patterns of use that enforce managed assembly and disciplined composition of QoS and functional components (ala type checking and IDL)

• Multi-layer QoS design and management
– Mission layer coordinates missions and mission-level policies
– Coordination layer manages QoS for logically or physically related sets of components
– Resource layer manages QoS for individual resources or mechanisms

• Reusable, validated QoS components

• Assembly, deployment, and configuration with validated behavior
– Validated QoS behaviors assembled into validated patterns using enforcing interfaces
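
A minimal sketch of the three layers, applied to application strings competing for one shared resource. Everything named here (`AppString`, `MISSION_PRIORITY`, `coordinate`, `enforce`, the string names and rates) is hypothetical, and strict priority ordering is just one simple policy chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AppString:
    name: str
    demand_kbps: float


# Mission layer (hypothetical policy): higher number = more important to the mission.
MISSION_PRIORITY = {"targeting": 3, "situational_awareness": 2, "logistics": 1}


def coordinate(strings, capacity_kbps, priority=MISSION_PRIORITY):
    """Coordination layer: split shared capacity across strings in strict priority order."""
    grants = {}
    remaining = capacity_kbps
    for s in sorted(strings, key=lambda s: priority.get(s.name, 0), reverse=True):
        grants[s.name] = min(s.demand_kbps, remaining)
        remaining -= grants[s.name]
    return grants


def enforce(grant_kbps, offered_kbps):
    """Resource layer: a single mechanism never carries more than its grant."""
    return min(grant_kbps, offered_kbps)


strings = [
    AppString("logistics", 400),
    AppString("targeting", 600),
    AppString("situational_awareness", 300),
]
normal = coordinate(strings, 1000)    # plenty of capacity
degraded = coordinate(strings, 500)   # conditions changed: capacity halved
print(normal)
print(degraded)
```

Because `coordinate` is recomputed from the current set of strings and the current capacity, the same call covers strings joining or leaving and conditions changing over time.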


Topic 2: Processes and Methods for FCS Software Reliability

• Modeling is important, but not a silver bullet, and can be dangerous
– Models can diverge from implementation over time (is incorrect documentation worse than no documentation?)
– Models frequently are higher level, and more abstract, to capture the top-down design, but introducing the details later introduces bugs and complexity (need proper abstractions and correct/complete decomposition support)
– Modeling can introduce more opportunities for errors
  • Models can be incorrect (need for model validation)
  • Code synthesizers can be incorrect
  • Interaction with legacy or handwritten code can introduce errors

• Well-defined interfaces and “type” enforcement
– Component interfaces and type enforcement have reduced many instances of common errors
– With some attention and research, could the equivalent be developed for QoS, security, fault tolerance, etc.?

• Verification and constraint concepts might provide partial solutions to FCS reliability
– Constraints and verification at many levels (higher abstract design level down through each decomposition) – the only way to scale the idea to the size and complexity of FCS
– “Proof carrying code”-like enforcement constraints for functionality and QoS, for assembly, deployment, configuration, and runtime
  • Can prevent errors in some cases
  • Earlier detection (in the life cycle) of other problems
  • Aid in software correctness over the system’s lifetime
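
As a rough illustration of what a QoS equivalent of interface type checking could look like at assembly time, the sketch below rejects a connection when the provider's declared QoS does not meet the consumer's declared need. The attributes (`max_latency_ms`, `min_rate_hz`) and all names are assumptions made for this example, not an existing standard or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QoSOffer:
    """What a providing component declares it can deliver (hypothetical attributes)."""
    max_latency_ms: float
    min_rate_hz: float


@dataclass(frozen=True)
class QoSRequirement:
    """What a consuming component declares it needs (hypothetical attributes)."""
    max_latency_ms: float
    min_rate_hz: float


def satisfies(offer: QoSOffer, req: QoSRequirement) -> bool:
    """An offer fits if it is at least as fast and at least as frequent as required."""
    return (offer.max_latency_ms <= req.max_latency_ms
            and offer.min_rate_hz >= req.min_rate_hz)


def check_assembly(connections):
    """Return the names of connections whose QoS declarations do not match."""
    return [name for name, offer, req in connections if not satisfies(offer, req)]


connections = [
    ("sensor->tracker", QoSOffer(20, 30), QoSRequirement(50, 10)),
    ("tracker->display", QoSOffer(200, 5), QoSRequirement(100, 10)),
]
violations = check_assembly(connections)
print(violations)
```

A check of this kind catches a mismatch when components are assembled, earlier in the life cycle than a runtime failure would.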


Topic 3: Open-Standards, Open-Source, and Alternative Models

• Open standards and open-source are trends that are unlikely to reverse
– Economic benefits – no single vendor for a technology; longer lived technology bases
– Fewer stove-piped, one-of-a-kind systems
– Pushes the technology up
  • System developers can assume the existence of infrastructure and the programmers that understand it
  • Enables the development of systems with greater capability because they don’t have to be built from the ground up
– However, they make integration more important and more frequent

• Program development models that increase reliability

– There are domains in which software is well-engineered and reliable

– For example, many business applications (which previously were developed by professional programmers) are developed today by domain experts (e.g., accountants) in well-established, reliable tools (e.g., spreadsheet programs)

– Are there parts of FCS SoS building that can likewise, with the proper tool support, be turned over to domain experts and what would be needed to enable it?

• Patterns of use, idioms that lead programmers to producing correct software

• Modeling or other programming tool environments with domain-friendly interfaces

• These tools could be highly constraining to allow production of only well-behaved, reliable software because their focus is narrow and domain-specific


Topic 4: Certification of FCS SoS Software

• Certification is already a difficult issue and the highly distributed, heterogeneous, and dynamic nature of large SoS software makes certification with current processes more difficult

– However systems of greater scale, distribution, interaction, and dynamism are inevitable and need to be certified

– The nature of the systems being certified and the nature of the certification process might need to evolve simultaneously

– Certification of individual components or participants is unlikely to scale well to certification of the entire system

– Can certification of individual behaviors contribute to certification of a system that can change its behavior?
  • We can provide techniques that support the certification of dynamic systems

– Increase the ability to certify dynamic systems by constraining their dynamism
  • Critical subsystems limited to dynamically choosing from a set of certified static choices

– If we can’t certify exactly correct behavior for highly dynamic systems, perhaps we can certify their limits
  • For example, certify that an adaptive system can do no harm; while we might not be able to certify exactly how it can adapt, we can certify how much, or within what limits, it can adapt, or how far its adaptation can affect the rest of the system

– Can we certify the adaptive mechanisms that delimit behavior, recover, protect, or keep software operating within a “safe” subset of possibilities?

• In a highly dynamic, distributed system even if we cannot certify that it is free from defects, perhaps it is sufficient to certify that the system would gracefully handle, recover from, or fix defects

• How do we certify the adaptive mechanisms – useful if we can presume this is simpler than certifying the full system behavior
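
The two constraining ideas above (choosing only from certified static choices, and certifying the limits of adaptation) can be sketched as a small guard that sits between the adaptation logic and the system. The class, mode names, and numeric bounds are hypothetical; the point is only that the guard is far smaller than the adaptive behavior it bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    name: str
    frame_rate_hz: int


class CertifiedModeSelector:
    """Lets adaptation logic pick only from a fixed set of pre-certified modes."""

    def __init__(self, certified_modes, fallback_name):
        self._modes = {m.name: m for m in certified_modes}
        if fallback_name not in self._modes:
            raise ValueError("fallback must itself be a certified mode")
        self._fallback = self._modes[fallback_name]
        self.current = self._fallback

    def request(self, name):
        # Any request outside the certified set is replaced by the fallback.
        self.current = self._modes.get(name, self._fallback)
        return self.current


def clamp_adaptation(requested, low, high):
    """Keep an adaptive parameter inside limits that were certified ahead of time."""
    return max(low, min(high, requested))


selector = CertifiedModeSelector(
    [Mode("full", 30), Mode("degraded", 10), Mode("safe", 1)],
    fallback_name="safe",
)
print(selector.request("degraded"))
print(selector.request("experimental"))  # not certified, so the fallback is used
print(clamp_adaptation(500, low=1, high=30))
```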

Some Additional Technical Ideas Relevant to Reliable FCS Software

Survivability for FCS


Defense Enabling: Dynamism for Survivability

• Survival of critical systems, as much as security, is crucial

• Adaptation is essential to survive organized, malicious attack
– Tolerate and recover from failures induced by the attack
– Compensate (e.g., graceful degradation) if attacker succeeds in preventing use of required resources
– Introduce artificial diversity to increase attacker work factor

• Adaptive response involves dynamic management of system resources and properties
– Integration of system properties (e.g., real-time, security, dependability) and the associated tradeoffs
– Strategies for coordinated, distributed, but secure adaptation and management

• Adaptive response is supported by
– Redundancy (eliminate single point failures)
– Heterogeneity (prevent common mode failures)
– Uncertainty (slow staged attacks)
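
As one illustration of redundancy combined with heterogeneity, the sketch below takes answers from several replicas, assumed here to run on different platforms so that one flaw is unlikely to corrupt all of them, and accepts a value only when a strict majority agrees. Majority voting is a technique chosen for this example, not something the slides prescribe, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(responses):
    """Return the value reported by a strict majority of replicas, or None if there is none."""
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    return value if count > len(responses) / 2 else None


# Three diverse replicas, one of which has been compromised (illustrative values).
print(majority_vote(["track-42", "track-42", "track-99"]))
# No agreement at all: the caller should treat this as a detected fault.
print(majority_vote(["a", "b", "c"]))
```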


Architecting Survivability into FCS (and other SoS)

Reliability requires architecting in multiple dimensions

Even more so, when the goal is to be resilient not only against errors, but also against attacks….

Diversity: Avoid common mode vulnerabilities

Layers of protection
– Both HW and SW
– Design principles, architectural constraints
– High barrier to intrusion

Adaptive response
– Adaptive middleware
– Rapid and coordinated response
– Isolation, recovery, graceful degradation

Redundancy: No single point of failure in critical functionality

Weak assumptions
– Less susceptible to attacker’s manipulation of environment

Detection and correlation
– Embedded sensors
– Mix of IDS and policy violation
– Advanced, distributed correlation

General principles for survivability

• Protect as best as possible

• Improve chances of detection

• Adapt to manage gaps


Use of Modeling for Validation of Integrated Survivability

[Figure: argument graph for JBI survivability. Node labels recovered from the figure are listed below; the connections between nodes did not survive extraction.]

PIP requirements 1 – 4
JBI survivability requirements
Initialized JBI provides essential services
Authorized publish is processed successfully
Confidentiality, Dataflow, Timeliness, Integrity (from functional model execution)
Component Model Assumptions Hold
JBI intrusion detection requirements
PA1: Client-Core Communication I & C
PA2: Alternate Path Availability
QA1: QIS Incorruptibility
QA2: QIS Communication Cutoff
QA3: QIS Input Integrity
QA4: QIS Function Correctness
AA1: AP Function Correctness
AA2: AP Application-layer Integrity
AA3: AP Application-layer Confidentiality
DA1: DC Communications
SA1: IO Integrity in PSQ Server
SA2: Client Confidentiality in PSQ Server
SA3: IO Authenticity
SA4: Network-layer I & C
SeA1: Sensor False Alarm Rate
SeA2: Sensor Detection Delay
SeA3: Sensor Detection Probability
CoA1: Correlator False Alarm Rate
MA1: SM Byzantine Agreement
PsA1: ADF Policy Server Input Correctness
Synchronization
System Connectivity
Physical Topology
Network Topology
Restricted Routing
No Tunneling Attacks
SELinux, Solaris, Windows
Type Enforcement, Hardened Kernel, IKENA StormWatch
Platform Mechanisms
Process Domain Policies
Private Key Confidentiality
No Unauthorized Direct Access
Keys Protected from Theft
DoD Common Access Card (CAC)
PKCS #11
Tamperproof
Keys Not Guessable
Algorithmic Framework
Key Length
Key Lifetime
No Unauthorized Indirect Access
Physical Protection of CAC device
Protection of CAC Authentication Data
No Compromise of Authorized Process Accessing CAC
No Cryptography in Access Proxy
Not Preconfigured
Not Reconfigurable
ADF NIC services protected
ADF Correctness
ADF NIC Physical Security
ADF NIC Firmware Initialization
ADF Key Initialization
ADF Agent Initialization
ADF Protocol Correctness
ADF Host Independence
ADF Agent Correctness
VPG Integrity
VPG Confidentiality
Policy Server Integrity
ADF Policy Correctness
Correctness of Registration Protocol
Correctness of Reattachment Protocol
Hard-wired Configuration
Electrically Isolated
Physically Protected
Connectivity
Physical Integrity
Electrical Integrity
Gate Configuration and Truth Table
Proxy Protocol Configuration
Can Identify Malformed Traffic
Correctness of Rate Control Mechanisms
Correctness of Certificate Exchange
IDS Experimental Evaluation
Correctness of Modified ITUA Protocols
Functional model faithful to design
IDS / Correlation requirements
IO Confidentiality (end-to-end)
Confidentiality of Network Communications
Confidential info is not exposed
Unauthorized activity is properly rejected
Authorized join/leave is processed successfully
Authorized query is processed successfully
Authorized subscribe is processed successfully
JBI is properly initialized
Design Team Review
Attack Model Assumptions Hold
Functional Model Assumptions Hold
Infrastructure Attack Propagation
Data Attack Propagation
Attacks Originate Outside the Platform
No Data Attacks
Initial Targets of Infrastructure Attacks
Isolation of Intruded Process Domains
Targets for Loss of IO Confidentiality
No Compromise or Failure of QIS
DoS Causes Processing Delays
DoS Does Not Corrupt Other Components
DoS Attacks Do Not Propagate from Clients to Core
Design Faithfully Implemented
Absence of Insider Threat
Attack Model Parameter Selection
CERT Vulnerability DB Analysis
Variation over Anticipated Ranges
Correctness of Managed Switch
IO Confidentiality in Transit
IO Confidentiality in Storage
Confidentiality of Application-layer Messages


Requirements decomposition

Executable model of the system (probabilistic or logical)

Model assumptions

Supporting arguments and experimentation
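
A requirements decomposition backed by assumptions and supporting evidence can be treated as a tree in which a claim holds only if everything beneath it holds. The sketch below is a hypothetical illustration: the node names are borrowed from the figure, but the parent/child links and the evidence flags are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    name: str
    children: List["Claim"] = field(default_factory=list)
    evidence: bool = False  # only meaningful for leaves (argument or experiment exists)

    def holds(self) -> bool:
        """A leaf holds if it has evidence; an inner claim holds if all children hold."""
        if not self.children:
            return self.evidence
        return all(child.holds() for child in self.children)

    def open_items(self) -> List[str]:
        """Names of leaf assumptions that still lack supporting evidence."""
        if not self.children:
            return [] if self.evidence else [self.name]
        return [n for child in self.children for n in child.open_items()]


# Illustrative structure only; the real edges are not recoverable from the slide text.
root = Claim("Private Key Confidentiality", [
    Claim("Keys Protected from Theft", evidence=True),
    Claim("Keys Not Guessable", [
        Claim("Key Length", evidence=True),
        Claim("Key Lifetime", evidence=False),
    ]),
])
print(root.holds())
print(root.open_items())
```

Walking the tree this way shows which unsupported assumptions currently block a top-level requirement.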

[Figure: Fraction of successful publishes versus MTTD_A (min). x-axis: MTTD_A (min), ticks from 10 to 100000; y-axis: fraction of successful publishes, 0 to 1; series: 12 hour mission, 24 hour mission, 48 hour mission.]

• Survivability results obtained through modeling
– Critical functionality available with high probability even when under heavy successful attack
– 98% of all functions successful even with vulnerabilities discovered daily, or faster
– Operating system diversity bolsters reliability of critical functionality when under attack
– With the current architecture, attackers are more effective compromising functionality than crashing components

[Figure: Total number of intrusions versus MTTD_A (min). x-axis: MTTD_A (min), ticks from 10 to 10000; y-axis: total number of intrusions, 0 to 600; series: 12 hour mission, 24 hour mission, 48 hour mission.]
