Knowledge Centered Assessment Patterns, 9 Sept. 2002, ERAU/Guidant Lab Software Architecture Team

Knowledge Centered Assessment Patterns
An Effective Tool for Assessing Safety Concerns in Software Architecture

ERAU/Guidant Lab Software Architecture Team
May 10, 2002




1 Abstract
In software based systems, the notion of software failure is magnified when the software in question is a component of a safety critical system. Hence, to ensure a required level of safety, the product must undergo rigorous testing and verification/validation activities. Consequently, organizations most often experience a high cost associated with ensuring that the software system is defect free and does not operate in a way that violates overall system safety. To minimize the cost of quality (COQ) associated with the development of safety critical systems, it becomes imperative that the assessment of intermediate artifacts (e.g., requirements and design documents or models) is done efficiently and effectively to maximize early defect detection and/or defect prevention. However, as a human centered process, the assessment of software architecture for safety critical systems relies heavily on the experience and knowledge of the assessment team to ensure that the proposed architecture is consistent with the software functional and safety requirements. The Knowledge Centered Assessment Pattern (KCAP) is an effective tool that assists the assessment team by providing key information on what architectural elements should be assessed, why they should be assessed, and how they should be assessed. Furthermore, use of KCAP will highlight cases where the software architecture has been properly, over, under or incoherently engineered.


2 Table of Contents

1 ABSTRACT ........ 2
2 TABLE OF CONTENTS ........ 3
3 OVERVIEW ........ 5
  3.1 Intended Audience ........ 5
  3.2 ERAU/Guidant Lab Software Architecture Team ........ 5
  3.3 Acknowledgments ........ 5
  3.4 Abbreviations and Acronyms ........ 5
  3.5 List of Figures ........ 5
  3.6 List of Tables ........ 6
4 RESEARCH OBJECTIVE ........ 7
  4.1 Overview ........ 7
  4.2 Principles ........ 7
  4.3 Current focus ........ 8
5 KNOWLEDGE CENTERED ASSESSMENT PATTERN ........ 12
  5.1 Overview ........ 12
  5.2 Rationale ........ 14
  5.3 Knowledge Centered ........ 15
6 REVIEW OF THE RELATED WORK ........ 17
  6.1 Active Design Review ........ 17
  6.2 Hazard Analysis ........ 17
  6.3 Fault Tree Analysis ........ 17
  6.4 Failure Mode and Effects Analysis ........ 18
  6.5 Checklist ........ 18
7 FUTURE EVOLUTION OF THE RESEARCH ........ 21
  7.1 Research Direction ........ 21
8 CASE STUDIES ........ 23
  8.1 Case Study #1: A Basic Example ........ 23
    8.1.1 Problem Description ........ 23
    8.1.2 Analysis ........ 24
    8.1.3 Assessment Results ........ 26
    8.1.4 Improvements ........ 27
  8.2 Case Study #2: Destruction System for the VS-40X Sounding Rocket ........ 29
    8.2.1 Problem Description ........ 29
    8.2.2 Analysis ........ 31
    8.2.3 Assessment Results ........ 32
    8.2.4 Improvements ........ 32
  8.3 Case Study #3: Industrial Robot ........ 36
    8.3.1 Problem Description ........ 36
    8.3.2 Analysis ........ 38
    8.3.3 Assessment Results ........ 40
    8.3.4 Improvements ........ 41
9 REFERENCES ........ 43


10 APPENDIX A ........ 45
  10.1 Terminology ........ 45
  10.2 Modeling Notations ........ 47
11 APPENDIX B ........ 48
  11.1 Pattern Description ........ 48
  11.2 Knowledge Centered Assessment Patterns ........ 49
    11.2.1 Automatic Failure Detection ........ 49
    11.2.2 Managing Component Interactions ........ 58
    11.2.3 Failure Isolation ........ 67
    11.2.4 Automatic Failure Recovery ........ 75
    11.2.5 Reconfiguration ........ 84
    11.2.6 Testability ........ 94


3 Overview

3.1 Intended Audience
The intended audience is software engineering professionals and anybody interested in learning about Knowledge Centered Assessment Patterns.

3.2 ERAU/Guidant Lab Software Architecture Team
This report is the result of research work performed during the Fall 2001 and Spring 2002 academic semesters at the ERAU/Guidant Lab. This research was conducted under the direction and mentorship of Dr. Soheil Khajenoori. The following ERAU/Guidant Lab team members participated in the project:

- Lorenz Prem,
- Karen Stevens, and
- Ban Seng Keng

3.3 Acknowledgments
The research team wishes to express its appreciation to the Guidant Corporation for the opportunity provided through the ERAU/Guidant Lab for this research project. The continued support and technical insights provided by Nader Kameli of the Guidant Corporation have been invaluable to the team. In addition, the encouragement and interest expressed by Nader Kameli and John Schmidt in the success of the research team have been positive inspirations throughout the project. Last but not least, the research team wishes to echo the statement made by Albert Einstein, "I can see further because I stand on the shoulders of giants"; this work has benefited a great deal from the work of others, especially the work of the individuals listed in the reference section.

3.4 Abbreviations and Acronyms

Term   Definition
ATAM   Architecture Tradeoff Analysis Method
DOD    Department of Defense
ERAU   Embry-Riddle Aeronautical University, Daytona Beach, Florida
FMEA   Failure Modes and Effects Analysis
FTA    Fault Tree Analysis
GDT    Guidant Corporation
IEEE   Institute of Electrical and Electronics Engineers
KCAP   Knowledge Centered Assessment Pattern
KCDP   Knowledge Centered Design Pattern
SC     Safety Critical
SEI    Software Engineering Institute at Carnegie Mellon University

3.5 List of Figures
Figure 1 – Conceptual View of the current Research Focus ........ 9
Figure 2 – The Assessment Process ........ 9
Figure 3 – Multi-Level Safety Concerns ........ 10

Page 6: Knowledge Centered Assessment Patterns An Effective Tool ...pages.erau.edu/~kornecka/guidant/docs/SCAReportv10.pdf · Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant

Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant Lab Software Architecture Team

-6-

Figure 4 – Conceptual View of Knowledge Centered Assessment Pattern ........ 12
Figure 5 – Static View of Logical Redundancy ........ 13
Figure 6 – Knowledge Centered Assessment Patterns and Dimensions of Knowledge ........ 16
Figure 7 – Evolution of the research ........ 21
Figure 8 – Conceptual View of the current Research Focus ........ 22
Figure 9 – Proposed Architecture ........ 23
Figure 10 – Scenario 1 Reconfiguration of the Proposed Architecture ........ 28
Figure 11 – Scenario 3 Reconfiguration of the Proposed Architecture ........ 28
Figure 12 – The proposed software architecture for the VS-40X sounding rocket ........ 30
Figure 13 – The component interactions for the proposed software architecture ........ 31
Figure 14 – Revised software architecture for the VS-40X sounding rocket ........ 33
Figure 15 – The component interactions for the revised software architecture ........ 34
Figure 16 – The component interactions for the revised software architecture ........ 35
Figure 18 – Side view of the robot and press operations ........ 37
Figure 19 – The proposed architecture for the robotics metal process cell ........ 38

3.6 List of Tables
Table 1 – Checklist for Automatic Failure Detection ........ 14
Table 2 – Result of identifying building blocks in the proposed architecture ........ 26
Table 3 – Results of applying Managing Component Interaction Checklist ........ 32
Table 4 – Results of applying Managing Component Interaction Checklist ........ 38

Page 7: Knowledge Centered Assessment Patterns An Effective Tool ...pages.erau.edu/~kornecka/guidant/docs/SCAReportv10.pdf · Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant

Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant Lab Software Architecture Team

-7-

4 Research Objective

4.1 Overview
In general, system safety is impacted by the decisions embodied in the system architecture, design and implementation. These decisions are the most difficult to change as the system progresses through the different stages of development in its life cycle. However, the process of assessing whether the proper decisions have been made throughout system development can be enhanced if:

- The appropriate level of abstraction is provided for modeling the system, and
- Effective tools and techniques are provided to aid the assessment process.

The long term goal of this research project is to design and develop a knowledge-centered framework for effective detection and prevention of defects in the software design of safety critical systems. In our view, the software development process consists of three levels of design abstraction and activities. The design process starts with defining an architecture for the software system (Architectural Design). Design activity then proceeds to the internal design of each architectural element (Component Design). The last level of design focuses on the specifics of each class or function, depending on the methodology used during the previous stage (Class/Function Design). This view of the design process is consistent with the ones reported by Bhandari [9], Chillarege [4], Jones [10] and Mays [14]. A critical and important objective of the research project is to demonstrate both the technical and economical value of the framework in the process of developing software for safety critical systems. We envision that use of the framework will result in:

- Reduced cost of quality, while maintaining or improving product field quality, through
  - early defect detection and resolution
  - defect prevention
- Improved productivity
  - potential improvement in cycle time
  - potential improvement in software processes

Appendix A provides definitions for the terms used throughout this paper.

4.2 Principles
In achieving our research objective we adhere to the following guiding principles:

- Focus on practical applications and solve "real" problems that are of interest to industry, so that the solution can be assessed and validated by the industry

- Identify, reuse and synthesize existing research results and separate findings into a coherent whole, then architect, design and integrate the missing parts

- Focus on safety-critical systems


- Foster, in the research team, the ability to integrate the discipline of learning into their very being. Such learning must be marked by strong self-direction, willingness to take risks, and integration of the learning that life teaches outside of the academic environment [24]: "a journey of exploration that corrects its course as it proceeds"

- Utilize processes and practices of effective knowledge exchange and technology transition between academia and industry

4.3 Current focus
Our research effort currently focuses on developing tools and techniques to assist the verification team and/or safety specialist in assessing a system's software architecture, to ensure that the system safety concerns allocated to software have been effectively incorporated in the proposed system software architecture. Stated more concisely:

Given a software architecture model for a safety critical system, how can one assess whether the safety concerns allocated to software are or are not effectively addressed by the proposed architecture?

Figure 1 represents the conceptual view of the current research focus. As depicted in the figure, the current usage of the framework guides the assessment of a proposed software architectural model for a safety critical system. The framework consists of Assessment Patterns and their associated usage Metrics. The Assessment Patterns currently developed are used during the assessment of software architecture, and each Assessment Pattern is structured to address a particular Safety Concern. An assessment pattern includes sections for a Strategy, a general solution approach that addresses the concern, and the Techniques that implement the strategy. Moreover, each technique is broken down into its basic Building Blocks, and these building blocks form the basis of the Assessment Checklist. The Building Blocks and Assessment Checklist are also part of the Assessment Pattern structure. The assessment patterns are described in more detail in the following sections.
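The relationships just described (pattern, concern, strategy, techniques, building blocks, checklist) can be sketched as a simple data model. The sketch below is an illustrative reading of the framework, not code from the report; all class and method names are assumptions:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BuildingBlock:
    name: str           # e.g. "Comparator"
    description: str

@dataclass
class Technique:
    name: str           # e.g. "Logical Redundancy"
    building_blocks: List[BuildingBlock]

@dataclass
class Strategy:
    description: str                                  # general solution approach
    techniques: List[Technique] = field(default_factory=list)

@dataclass
class AssessmentPattern:
    safety_concern: str                               # the concern this pattern addresses
    strategies: List[Strategy] = field(default_factory=list)

    def checklist(self) -> List[BuildingBlock]:
        # The building blocks of all documented techniques form
        # the basis of the assessment checklist (deduplicated).
        seen, items = set(), []
        for strategy in self.strategies:
            for technique in strategy.techniques:
                for block in technique.building_blocks:
                    if block.name not in seen:
                        seen.add(block.name)
                        items.append(block)
        return items
```

For instance, an Automatic Failure Detection pattern whose Logical Redundancy technique uses Computation and Comparator building blocks would yield a two-item checklist.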



Figure 1 – Conceptual View of the current Research Focus

The Assessment Patterns are used during the process of assessing a proposed software architecture for a safety critical system. Figure 2 represents the key elements of the assessment process.

Figure 2 – The Assessment Process

The Architectural Assessment process is a human centered process that attempts to verify the consistency and accuracy of the proposed software architecture with respect to the product concerns allocated to the software. The assessment could be conducted in a manner similar to the inspection or verification activities common in most software life cycle models. The inputs to the Architectural Assessment process are the proposed software architecture, the safety concern(s) and the Assessment Patterns. The outputs are the Assessment Results, Usage Metrics and recommendations for Improvement. In the following paragraphs we present a general description of the inputs and outputs, deferring detailed examples of enacting the Architectural Assessment process to the follow-up sections of this paper.

The emergence and importance of software architecture as a discipline in the engineering of software systems has been noted by many authors, including Shaw & Garlan [26], Hofmeister [25], Leveson [13], Lutz [28, 29] and Storey [20]. Moreover, research reports and texts have been published on architectural description languages [26, 25] and on architectural development processes and standards [27]. In this work, we assume that the proposed software architecture model, which is an input to the assessment process, is described and developed using the notation provided by Hofmeister [25] in conjunction with the IEEE recommended guideline for software architecture description [27]. This assumption is based on our key requirement that the architectural expression style should be understood by the software engineers who develop the system, the verification engineers who verify it, and the safety



specialists (advocates) who need to ensure that the safety concerns have been effectively incorporated. Hofmeister [25] provides a fairly rich and intuitive set of notations for modeling architectural elements and concerns; Appendix A provides a summary of Hofmeister's [25] notation relevant to this research project.

The safety concerns allocated to system software represent a set of safety requirements that must be met by the system software. These requirements are typically identified and produced during the requirements engineering process using techniques such as Hazard Analysis [13] and/or Failure Mode and Effects Analysis [13]. To manage the complexity associated with system safety concerns and to promote reusability of the assessment patterns across different domains and applications, our research has defined a multi-level hierarchical structure for safety concerns, as represented in Figure 3. At the highest level are generic safety concerns, which represent a set of issues pervasive in the development of any safety critical system, regardless of the specific domain. The following are some examples of such concerns [23]:

- The failure of safety critical system functions must be detected, isolated, and recovered from, such that catastrophic and critical hazardous events are prevented from occurring, and

- The system shall perform Automatic Failure Detection, Isolation, and Recovery for identified safety critical functions.

Figure 3 – Multi-Level Safety Concerns

Please note that it is not the intention of the authors to suggest that all possible concerns must be implemented in every system every time. These are a set of concerns whose



implementation in a safety critical system will be determined by the requirements engineering activity and then subsequently allocated to different system elements, including software. The second level is domain specific safety concerns; these specialize the generic concerns by incorporating domain specific constraints and knowledge. The following example illustrates the concept.

Consider the case of "System shall perform Automatic Failure Detection, Isolation, and Recovery for identified safety critical functions" and, more specifically, let's focus on "Automatic Failure Detection". A number of design solutions have been identified and reported in the literature as possible approaches for implementing this concern. Furthermore, let's assume that "Automatic Failure Detection" has been determined to be critical functionality that must be implemented in a safety critical system. Under this scenario, consider the implementation of "Automatic Failure Detection" in two possible systems, such as a cardiac pacemaker and an aircraft engine control. It is obvious that the generic solution to implement the functionality will be impacted by domain constraints such as size, speed, performance and possibly others.

The domain specific constraints will be used to specialize the generic concerns based on each domain's specific requirements. Subsequently, the domain specific concerns will be specialized into application specific concerns by incorporating application specific knowledge and constraints. At the lowest level are the product specific concerns, derived by specializing the application specific concerns using product specific constraints and knowledge. Currently, our research has focused on the development of assessment patterns related to generic safety concerns. The structure of the Assessment Patterns was described briefly in the previous section; examples are provided in Appendix B of this paper.

Assessment Results represent the recommendations of the architectural assessment team, based on assessing the proposed architecture against the safety concerns allocated to system software. Use of the Assessment Patterns will assist the team to more objectively verify whether the proposed software architecture is properly, over, under or incoherently engineered with respect to the safety concerns.

Usage Metrics include data on the effort expended on the assessment process and the number of defects detected; other data may include the severity and/or impact of the detected defects on system safety.

Improvement will contain feedback from the assessment team on usage of the Assessment Patterns. The feedback will be used to improve the structure and content of the Assessment Patterns.
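The four-level specialization described above (generic, domain, application, product) can be sketched as a chain of concerns, each narrowing its parent with level-specific constraints. This is purely an illustrative reading, not code from the report; the class, its methods and the constraint strings are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SafetyConcern:
    level: str                              # "generic", "domain", "application" or "product"
    statement: str
    parent: Optional["SafetyConcern"] = None

    def specialize(self, level: str, constraints: str) -> "SafetyConcern":
        # A lower-level concern narrows its parent's statement by
        # incorporating level-specific knowledge and constraints.
        return SafetyConcern(level, f"{self.statement} [{constraints}]", parent=self)

# Hypothetical specialization chain for a pacemaker-like domain.
generic = SafetyConcern(
    "generic",
    "Perform Automatic Failure Detection for identified safety critical functions")
domain = generic.specialize(
    "domain", "implantable cardiac device: limited power and memory")
application = domain.specialize("application", "pacing control application")
product = application.specialize("product", "specific product timing constraints")
```

Each concern keeps a link to its parent, so an assessor can trace a product specific concern back to the generic concern it refines.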


5 Knowledge Centered Assessment Pattern
This section presents in detail the concepts associated with the Knowledge Centered Assessment Pattern (KCAP): what it is, why it is important and how its use is envisioned.

5.1 Overview
The KCAP is a structure composed of the following elements:

- Concern - problem statement
- Strategy - general solution
- Techniques - specific solutions
- Building Blocks - elements of specific solutions
- Checklists - guidelines for usage

Each KCAP is developed to address a specific safety concern, for example, Automatic Failure Detection. Figure 4 illustrates the conceptual view of the KCAP.

Figure 4 – Conceptual View of Knowledge Centered Assessment Pattern

As mentioned previously, a safety concern represents a safety requirement that must be met by the system, in this case by the system software. In a KCAP, each safety concern is documented using three attributes: name, description and example. Appendix B contains examples of KCAPs. Each safety concern is uniquely identified by its name. The description represents the problem statement that the KCAP is to address, and the example provides further illustration and insight to improve understanding of the problem. Each safety concern is addressed by one or more strategies. A strategy represents a general solution approach for the safety concern it addresses. For example, for the safety concern of Automatic Failure Detection, a strategy would be:



A common approach to failure detection is to compare a result of a computation with either a known value or redundant computation result(s).

KCAP provides a description for each strategy, along with the limitations and constraints associated with the strategy. Techniques are known solutions that implement the strategy; there can be one or more techniques for each strategy, and in a KCAP, techniques are considered part of the strategy. For example, techniques such as Logical Redundancy, Physical Redundancy and Acceptance Test are identified as known, specific solution approaches that meet the above mentioned strategy. Each technique is documented by its name and a list of other names under which the technique may have been referenced in the literature. Each technique is described in both textual and graphical format using the architectural modeling notation, and KCAP presents both static and high level behavioral views for each technique. In addition, the limitations and constraints associated with each technique are documented in the KCAP; a technique's limitations and constraints are an important piece of knowledge for verifying whether the technique has been selected properly.

Building blocks are the fundamental elements that make up each technique. A building block may be associated with more than one technique; however, each technique is defined by a specific configuration and behavior of its building blocks. KCAP provides a textual description for each building block. For example, consider the static view of the Logical Redundancy technique presented in Figure 5 below.

Figure 5 – Static View of Logical Redundancy

The fundamental elements, or building blocks, of the Logical Redundancy technique are two computational units (Computation Version A and Computation Version B) and one Comparator. Note that connectors, ports, and roles are part of the architecture modeling language and are not fundamental elements of the technique. The building blocks associated with the techniques documented in a KCAP form the basis of the checklist that is used to guide the assessment or verification of the proposed architecture. The checklist contains a list of building blocks, a short description of each building block, the association of the building blocks with the techniques, and a section to


record the results of the assessment or verification. The assessment results are captured in the checklist by marking whether each building block is present (Met), partially present (Partially Met), or absent (Not Met) in the proposed architecture. Table 1 presents the checklist for the safety concern Automatic Failure Detection.

Architectural Verification and Evaluation Checklist for Automatic Failure Detection
(the Met / Partially Met / Not Met columns are left blank, to be filled in during assessment)

 #  Building Block          Description                                                                                      Related Technique
 1  Computation             A unit of functionality with defined inputs and outputs.                                         T1, T2, T3
 2  Multiple CPUs           The system is capable of executing more than one task at a time on different CPUs.               T2
 3  Comparator              Compares output from various units of functionality and merges them into one ‘correct’ result.   T1, T2, T3
 4  N-Version Programming   One unit of functionality is implemented in different ways (e.g., using different algorithms)    T1, T2
                            to provide redundancy.
 5  Oracle                  An entity which validates input as correct within a given domain.                                T3

Table 1 – Checklist for Automatic Failure Detection
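Recording the outcome of each checklist row can be sketched as a small data model. This is only an illustration: the `Status` enum and the `record` function are our own names, not part of the KCAP notation; the row contents are taken from Table 1.

```python
from enum import Enum

class Status(Enum):
    # Mirrors the Met / Partially Met / Not Met columns of the checklist.
    MET = "Met"
    PARTIALLY_MET = "Partially Met"
    NOT_MET = "Not Met"

# Checklist rows from Table 1: building block -> related techniques.
CHECKLIST = {
    "Computation": ["T1", "T2", "T3"],
    "Multiple CPUs": ["T2"],
    "Comparator": ["T1", "T2", "T3"],
    "N-Version Programming": ["T1", "T2"],
    "Oracle": ["T3"],
}

def record(results, block, status):
    """Mark a building block's presence in the proposed architecture."""
    if block not in CHECKLIST:
        raise KeyError(f"unknown building block: {block}")
    results[block] = status
    return results

results = {}
record(results, "Multiple CPUs", Status.NOT_MET)
record(results, "Oracle", Status.MET)
print(results["Oracle"].value)  # Met
```

Restricting `record` to the known checklist rows keeps the assessment tied to the building blocks the pattern actually documents.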

5.2 Rationale

It is important to explain the role that KCAP, and the checklist in particular, plays during the assessment process. In our view, the assessment process is a “guided discovery” process in which KCAP and its associated elements provide the necessary guidance to the assessment team for verifying the proposed architecture. In this context, the checklist is a mechanism that highlights where the assessment team should first devote their focus while verifying a safety concern. The detailed application of the checklists and KCAP will be discussed and presented in the follow-up case studies.

In KCAP, all the strategies associated with a pattern are general solution approaches to the same concern; in turn, all techniques associated with a strategy are specific solutions to that strategy. However, each strategy and technique differs from the others in its implementation approach and constraints. We believe that there are two important benefits associated with KCAP's structure and the relationship between strategies and techniques that it provides. First, in addition to their own specific constraints, techniques associated with a particular strategy are bounded by the higher-level constraints associated with the


strategy. Therefore, the constraints on a strategy can be used to eliminate or include a whole set of techniques from the list of candidate solution techniques. Moreover, since all techniques related to a strategy are functionally equivalent, the constraints and limitations of a strategy and its techniques help determine the most appropriate technique for a given system based on critical factors such as size, cost, and environment.

Second, it is important to note that strategies listed in a particular KCAP can positively or negatively impact strategies listed in other KCAPs. Indeed, interplay among strategies is very likely, as almost every system has to deal with multiple safety concerns. Hence, assessment and verification of the interactions of multiple safety concerns is a critical responsibility of the assessment team. In this case, the constraints and limitations that KCAPs associate with strategies and techniques provide a mechanism to objectively assess and ensure that compatible strategies and techniques are utilized in the proposed architecture. This is critical, as architectural mismatch, over-engineering, and compounded complexity can result from the use of conflicting strategies. On the other hand, when compatible strategies are used, there are opportunities for sharing units of functionality among the associated techniques: a particular building block used in one technique can be used in another technique in a different configuration, and for implementation purposes it may be of interest to produce only one instance of a building block and share it between the techniques. The structure provided by KCAP, and the breaking up of techniques into their building blocks, are our approach to promoting reuse and integration of coherent sets of techniques to handle different concerns in the final product.
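The first benefit, pruning whole sets of techniques via strategy-level constraints, can be sketched as follows. The constraint tags, system-capability sets, and function names here are invented for illustration and are not part of any KCAP.

```python
# Hypothetical constraint tags attached to a strategy and its techniques.
STRATEGIES = {
    "Redundant Computation": {
        "constraints": {"extra_memory"},
        "techniques": {
            "Logical Redundancy": {"extra_cpu_time"},
            "Physical Redundancy": {"multiple_cpus"},
            "Acceptance Test": {"oracle_available"},
        },
    },
}

def feasible_techniques(system_capabilities):
    """Keep techniques whose strategy- and technique-level constraints
    are satisfied by the target system's capabilities."""
    out = []
    for strategy, info in STRATEGIES.items():
        if not info["constraints"] <= system_capabilities:
            continue  # eliminating the strategy drops all of its techniques
        for tech, needs in info["techniques"].items():
            if needs <= system_capabilities:
                out.append(tech)
    return out

# A single-CPU system with spare memory and a trusted oracle:
print(feasible_techniques({"extra_memory", "extra_cpu_time", "oracle_available"}))
# ['Logical Redundancy', 'Acceptance Test']
```

Because the strategy's constraints are checked first, a single unmet strategy-level constraint removes every technique under it, exactly the kind of wholesale elimination described above.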
When multiple safety concerns are present, it is unlikely that the techniques will appear in their isolated form in the proposed software architecture model. Hence, identifying the techniques in the proposed architecture can be a complicated task due to the potential integration of techniques. To remedy this complication, the KCAP checklist is based on searching for and identifying the building blocks in the proposed architecture. Once the building blocks have been identified, the assessment team uses the documented static configuration and behavioral aspects of the techniques to verify their proper representation in the proposed architecture.

As mentioned earlier, the collection of KCAPs is a tool to assist the assessment or verification team with the task of assessing software architecture for a safety critical system. It is the aim of this research to develop tools and techniques that make the assessment process more objective and that provide benefit, in particular, to less experienced teams. However, it should be emphasized that, ultimately, it is the assessment team who must use their collective experience to decide and build confidence that the risk of system failure is minimized by the proposed architecture.

5.3 Knowledge Centered

As knowledge and experience are perhaps the most critical elements of this decision-making process, the software engineering community has been searching for methods to document the knowledge of experienced teams in order to provide objective guidance to less experienced teams [13].


The Knowledge Centered Assessment Pattern represents an approach to documenting the knowledge required for an objective assessment and verification of safety concerns in software for safety critical systems. Being knowledge centered is the most differentiating characteristic of our work compared to the related work reported in the literature. The assessment patterns are designed to provide the know-what, the know-why, and the know-how a person needs with respect to the assessment and verification of safety concerns in a software architecture. Figure 6 represents the structure of the assessment pattern with respect to the three dimensions of what, why, and how.

The KCAP provides guidelines for the assessment or verification team in terms of what elements are required in the proposed architecture, why those elements are required, and how the required elements should be configured. For example, consider the case where the assessment team is to verify whether the proposed architecture has handled the Automatic Failure Detection safety concern. The checklist contains the list of required elements, which represents What to look for. The safety concern and strategy represent Why the elements are required. Lastly, the techniques and their associated building blocks describe How the required elements should be configured to respond to the safety concern.

Figure 6 – Knowledge Centered Assessment Patterns and Dimensions of Knowledge

The case studies in this paper will elaborate on and demonstrate the usage of the Knowledge Centered Assessment Patterns in more detail.


6 Review of the Related Work

6.1 Active Design Review

The active design review technique proposed by David Parnas [17, 18] attempts to overcome some of the common deficiencies found in conventional review methods. These deficiencies include developer unfamiliarity with product design goals, developers dealing with areas outside their knowledge scope, review procedures incoherent with respect to structure, and improper assumptions [17, 18]. By applying active design review techniques, it is possible to discover defects more effectively. Although this technique may increase costs in the early stages of the process, it usually results in overall savings later.

An active design review is commonly conducted by first identifying the desired properties of the system to be reviewed. Next, questionnaires that address the desired properties are prepared by those possessing the required domain knowledge. The active design review technique requires not only that the questionnaires be answered but also that the sources of the answers be documented. Parnas [17, 18] proposed the use of Function Tables to facilitate this additional documentation. The Function Table is based on the structured use of a formal language [Parnas2], and it covers the “What” aspect of the knowledge required to effectively assess the work product.

6.2 Hazard Analysis

Hazard Analysis begins in the requirements phase. An initial list of possible hazards for the system is produced using techniques such as fault tree analysis, cause–consequence analysis, fault hazard analysis, and others. Hazard Analysis techniques in all subsequent lifecycle phases verify that these hazards are handled within all artifacts of the product and sometimes append new hazards, specific to the current phase, to the list [13]. Hazard Analysis identifies the need for certain actions or processes that must be followed in each phase.
However, Hazard Analysis does not describe how to conduct the activities. Various techniques, such as checklists, Fault Tree Analysis, Hazard and Operability analysis, Failure Mode and Effects Analysis, and interface analysis, have been created to fill this gap. Nonetheless, the majority of these techniques are applicable to the requirements phase, and techniques for subsequent phases are scarce or nonexistent. Hazard Analysis usually requires a team of people with a wide variety of knowledge; hence, for abstract concepts and models such as software architecture, there is a need for more effective techniques to supplement the team's skill set [13].

6.3 Fault Tree Analysis

Fault Tree Analysis (FTA) is used as a means to predict the potential causes of failures in systems. It consists of four steps: system definition, tree construction, qualitative analysis, and quantitative analysis [13]. The system definition activity is mainly concerned with the identification of system events. Once the system has been defined, a fault tree is constructed based on the relationships between events [13].


This process yields a better understanding of the system. However, it is only useful when applied to a detailed design, for it requires in-depth knowledge of a system's construction and performance [13]. Therefore, although this increased understanding may result in the identification of requirement defects, its late application in the development process reduces its overall usefulness [13].

6.4 Failure Mode and Effects Analysis

Failure Mode and Effects Analysis (FMEA) is a technique to evaluate potential ways in which a system may fail. FMEA is an iterative process that is applied and updated many times during the design phase [30]. "The analyst needs a detailed design that includes schematics, functional diagrams, and information about the interrelationship between component assemblies" [13]. The FMEA process starts by identifying the components in the system and their known potential failures. These known failures are then placed into a table, prioritized by the probability and impact of their occurrence [13]. This provides a means to predict the reliability of the system, and it shows how the design must be improved in order to extend the operational life of the system. The technique has deficiencies, as noted in [Motorola]: "the FMEA is relatively weak in failure mode identification, as it does not provide a systematic method of evaluating system deviations".

6.5 Checklist

A checklist contains a list of items that are to be verified or assessed. A checklist may contain project-specific elements intended for a particular project, or general elements intended for more extended use [32, 28, 29]. Checklists are applicable to different stages of the software development lifecycle and are one of the most common forms of assessment tools. Numerous checklists for safety critical systems have been reported and proposed in the literature. For example, Robyn Lutz [Lutz] supports the use of checklists as a valuable verification technique.
She has proposed the addition of a specific safety critical checklist, such as the one reported by Jaffe [33], to complement already existing generic system checklists. Lutz suggests that a checklist provides a first step towards the formal specification of safety constraints, and that it may be formatted using a variety of descriptive languages, such as mathematical predicates. Examples of safety checklist elements include redundant resources, multiple processors, simultaneous processes, and systems with timing constraints.

The formal inspection method used at NASA is a formal process that applies to all stages of the software lifecycle, including architecture. In this process, to verify an architecture, a checklist is used during the inspection to determine whether the architectural description meets requirements, contains correct interfaces between architectural components, contains fault checking, and contains fault recovery [31].

The Department of Defense (DoD) also uses the concept of a checklist to communicate safety criteria to contractors. For example, a criterion may be "the system must be able to recover from any error". However, in this case the checklist is not intended to be


the primary tool for verification; rather, it is used as a list of items that the chosen verification method has to include in order to be acceptable [32]. However, there are disadvantages associated with the use of checklists.

Generally, a checklist is presented informally in a common language (e.g., English). However, as common languages are ambiguous, the list is subject to a variety of interpretations. Therefore, such a list may lead to misinterpretations of intent and potential errors. Lutz indicates that this loss of information may mostly be recaptured by re-examining the criteria on which the original list was based; however, the full understanding of the criteria may never be entirely re-established. In case studies applying Lutz's checklist to the systems testing of the spacecraft Voyager and Galileo, "192 safety-related errors were documented during integration and systems testing" [28, 29]. However, of these 192 errors, only 149 could be caught through the use of the checklists. The most common errors relating to safety critical systems in the Voyager and Galileo projects resulted from requirements that did not match the desired system functionality, erroneous data ranges, invalid system input ranges, flawed timing constraints, data overflow, and misunderstandings of how a section of software is to integrate with the rest of the system [28, 29].

Most checklists have a rather large number of items listed for verification, and often the user has little idea why an item should be checked [13]. Checklists also fail to consider the economic factors associated with the assessment process. For example, it is easy to over-engineer a product by strictly following a checklist, especially when the checklist has grown out of multiple projects. In addition, when a checklist consists of a large number of items, it is not practical to verify and assess every element; yet the checklist provides no guidance on what is necessary and sufficient for verifying and assessing safety concerns. This can be very difficult for the user, who is often pressed for time and has to decide how much effort is enough. For these reasons, checklists need a defined range of problems to which they can be effectively applied, and information on their scope should be provided as part of the checklist documentation. Unfortunately, many current implementations of checklists fail to include this vital information [13].

Nonetheless, a checklist is mostly an effective tool for communicating ‘what has to be done’, but often there is little or no information about ‘why it is done’ and ‘how much of it is enough’.

6.6 Architecture Tradeoff Analysis Method

The Architecture Tradeoff Analysis Method (ATAM) is another tool used to assess software architecture. It reveals how well certain quality goals (such as performance or modifiability) are implemented in a proposed architecture and provides insight into how


the quality goals interact with each other [2]. The purpose of the ATAM is to assess the consequences of architectural decisions in light of quality attribute requirements [2].


7 Future Evolution of the Research

7.1 Research Direction

As described, the long-term objective of our research project is to develop tools and techniques that improve engineering skills for more efficient detection and prevention of defects during the design phase of safety critical systems. In particular, our research focuses on the software subsystem and on safety-related defects. We view this research project as a journey of incremental and iterative learning and development. Figure 7 depicts the origin and destination of this journey and the many possible paths between the two.

Figure 7 – Evolution of the research

It is interesting to note that there are two different but closely related viewpoints associated with this journey. The verification viewpoint concerns the reactive process of detecting existing defects in the work products. A complementary viewpoint is the proactive process of preventing defect injection, which, we believe, can in part be achieved through better knowledge and skills in design. Our current research focus has been on early defect detection and removal during the architectural design activity. We are the first to admit that there is much more to be discovered that would provide more depth to the product of our current focus. However, an alternative argument is that, by switching viewpoints, the research team will be exposed to new subjects, which can result in the development of complementary products, that is, knowledge centered design patterns, in addition to enhancing the existing knowledge centered assessment patterns. Another compelling argument is the exposure of the research team to the critical activity of design.

[Figure 7 plots the journey from Origin to Destination across the stages Architectural Design, Component Design, and Class/Function Design, along two axes: the Verification View Point (Early Defect Detection) and the Design View Point (Defect Prevention).]

Hence, our future direction on this research project is to focus on developing knowledge centered design patterns (KCDP). Figure 8 represents the conceptual view of our future direction in conjunction with our current focus.

[Figure 8 is a class diagram relating the Research Vision (with personal, project, and business goals, and metrics with usage data), the modeling language (syntax, semantics, viewpoints, constraints), the software design activity that produces architectural design models for safety critical software, and the KCAP elements: Safety Concern, Strategy, Technique, Building Block, Assessment Checklist, Assessment Pattern, and Design Pattern, connected by AddressedBy, AssessedBy, RealizedBy, FormTheBasisOf, UtilizedIn, and ReUsedIn relationships.]

Figure 8 – Conceptual View of the current Research Focus


8 Case Studies

The following section uses three case studies to highlight the usage of the Knowledge Centered Assessment Patterns in the context of the architectural assessment process. The first case study was designed by the research team to highlight specific benefits of using KCAP. The other two case studies have been extracted from the literature and are reused here to demonstrate the application of KCAP during the architectural assessment process.

8.1 Case Study #1: A Basic Example

In this case study the software verification team enacts the architectural assessment process presented in Figure 2. The main objective of the assessment team is to verify whether the proposed architecture has effectively incorporated a mechanism for automatic failure detection. The following paragraphs describe the inputs to the process.

8.1.1 Problem Description

The portion of the proposed software architecture dealing with automatic failure detection is illustrated in Figure 9.


Figure 9 – Proposed Architecture

As illustrated, the input data enters the system in the top left corner of the diagram. This data is fed into two identical computational components. The functionality provided by the computational components is part of the safety critical aspect of the system's performance; hence, it is critical for the system to automatically detect erroneous output produced by this safety critical functionality. The results from the two computations are compared in a comparator, which assesses the result as true if the two computation units produced identical results, or false if their outputs differ. Then the information about the correctness of the output, along with the result of the computation, is fed into a second comparator. This comparator takes the result and compares it to a value provided by an oracle. An oracle is an entity that can verify a result as true or false based on predefined, time-tested logic. The output of the second comparator is the result of the computation along with the control data that indicates its state of correctness.
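The flow just described, two identical computations, a comparator, then an acceptance check against an oracle, can be sketched as a toy model. The concrete computation (`compute`), the oracle's rule, and the sample value are our own illustrative choices, not taken from the proposed architecture.

```python
def compute(x):
    # Safety-critical computation, instantiated twice with identical logic.
    return x * x

def comparator(a, b):
    # True if the redundant results agree; passes one result along.
    return a == b, a

def oracle(x, result):
    # Time-tested predefined check: a square is non-negative and its
    # integer square root recovers the magnitude of the input.
    return result >= 0 and abs(x) == int(result ** 0.5)

def pipeline(x):
    ok1, value = comparator(compute(x), compute(x))
    ok2 = oracle(x, value)
    # Output: the result plus control data indicating its correctness.
    return value, ok1 and ok2

print(pipeline(4))  # (16, True)
```

A fault injected into either computation would break the comparator's agreement, and a fault common to both would still have a chance of tripping the oracle check, which is precisely why the diagram chains the two mechanisms.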


The safety concern allocated to the system software is the ability to perform “Automatic Failure Detection”. The assessment team has to verify a proper architectural implementation of this concern. The assessment pattern used in this case is the pattern for automatic failure detection given in Appendix B. The assessment pattern provides a high-level description of the concept and an example; this information can help the assessment team to better understand the functionality that must be verified. The assessment pattern defines automatic failure detection as the practice of automatically finding faults in the output of a function. The example provided shows how this is done for a function computing the factorial of its input.

Moreover, the pattern documents a strategy (a general solution approach) and three known techniques (specific implementation instances of the strategy) for automatic failure detection. Techniques are listed along with their associated constraints. Constraints are important for verifying the appropriateness of the selected techniques in light of system requirements such as memory size or required response time. The techniques listed in the pattern are ‘Logical Redundancy’, ‘Physical Redundancy’ and ‘Acceptance Test’.

‘Logical Redundancy’ uses multiple, redundant computations to produce multiple results. These results are compared and, if they match, the result is voted to be correct. To increase the probability of fault detection, the ‘N-Version Programming’ approach can be used with the logical redundancy technique. ‘N-Version Programming’ is the practice of implementing the same computation multiple times using different logic or algorithms. ‘Physical Redundancy’ is similar in concept to ‘Logical Redundancy’, but it uses multiple CPUs to execute multiple processes at the same time; the other aspects remain the same as for ‘Logical Redundancy’.
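Since the pattern's own example is a factorial computation, an N-version flavour of logical redundancy might look like the sketch below. Both implementations and the voting rule are illustrative assumptions on our part.

```python
from math import prod

def factorial_iterative(n):
    # Version A: straightforward loop.
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def factorial_product(n):
    # Version B: same function, different algorithm (N-Version Programming).
    return prod(range(1, n + 1), start=1)

def vote(n):
    a, b = factorial_iterative(n), factorial_product(n)
    # Matching results are voted correct; a mismatch signals a fault.
    return (a, True) if a == b else (None, False)

print(vote(5))  # (120, True)
```

Because the two versions share no code, a defect in one algorithm is unlikely to be mirrored in the other, which is what makes N-version redundancy effective against implementation faults rather than only hardware glitches.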
‘Acceptance Test’ uses an oracle to determine the correctness of a result. An oracle is an entity that can determine the correctness of a result based on its time-tested, predefined logic. The result of a computation is compared to the result provided by the oracle and, if they match, the computation result is judged to be true.

The following sections describe the process envisioned to verify the proposed architecture. To assess the proposed architecture and verify automatic failure detection, the assessment team utilizes the checklist provided by the pattern. The checklist directs the assessment team to search for building blocks associated with the techniques that implement automatic failure detection. The checklist provides a short description of each building block along with its association with each defined technique.

8.1.2 Analysis

Using the assessment pattern and its checklist, the assessment team verifies whether the necessary building blocks are present, partially present, or not present in the proposed architecture. Subsequently, each building block is marked, in the appropriate column in


the checklist, based on its status in the proposed architecture. Table 2 presents the result of identifying the building blocks in the proposed architecture. In this case the assessment team identifies two computational components, two comparators, and no multiple CPUs, since the proposed architecture is based on a single CPU. After further investigation, the assessment team verifies that the two computational components implement the same functionality; hence they represent redundant computation in the proposed architecture. However, both computations are based on identical logic, that is, they do not represent ‘N-Version Programming’. It is important to note that, as described in the assessment pattern, when the Logical Redundancy technique is used without N-Version Programming, the technique is not effective at identifying implementation faults; hardware glitches, however, are still identifiable under this implementation scenario. To minimize checklist entries, in cases where redundant computations are not based on ‘N-Version Programming’, we suggest marking “N-Version Programming” in the checklist as partially met. This also indicates that N-Version Programming can be included without major structural changes to the proposed architecture. Lastly, the assessment team identifies the presence of an oracle in the proposed architecture.

Building Blocks

B1) Computation – A unit of functionality with defined inputs and outputs.
B2) Multiple CPU – The system is capable of executing more than one task at one moment in time.
B3) Comparator – Compares output from various units of functionality. Its output is one result with an indication of whether the results match or not.
B4) N-Version Programming – One unit of functionality is implemented in different ways (e.g., using different algorithms) to provide redundancy.
B5) Oracle – An entity which validates an input as correct in a given domain.

Architectural Verification and Evaluation

 #  Item                    Met  Partially Met  Not Met  Related Technique
 1  Computation             X                            T1, T2, T3
 2  Multiple CPUs                                X       T2
 3  Comparator              X                            T1, T2, T3
 4  N-Version Programming        X                       T1, T2
 5  Oracle                  X                            T3

Techniques

Logical Redundancy: B1, B3
Physical Redundancy: B1, B2, B3
Acceptance Test: B1, B3, B5

Table 2 – Result of identifying building blocks in the proposed architecture

Next, to analyze the result, the assessment team refers to the 'Techniques' section of the checklist. This section groups building blocks according to their related techniques. Recall that building blocks are the basic elements of each technique. A particular building block can be found in multiple techniques, but it is always used in a different configuration and is based on behavioral aspects unique to each technique. The Techniques section of the checklist lists the three techniques associated with the Automatic Failure Detection concern: 'Logical Redundancy', 'Physical Redundancy' and 'Acceptance Test'. Alongside each technique, the building blocks it uses are listed. It is important to note that the checklist and its content should be used by the assessment team as a tool that guides and directs the team's attention to potential problem areas related to the safety concern in the proposed architecture. The process proceeds by reviewing and analyzing the proposed architecture against the techniques documented in the assessment pattern. This is a process based on "guided discovery", in which the assessment patterns serve as a guiding tool and a source of knowledge for analyzing and assessing the proposed architecture. Given the information in the checklist, the assessment team concludes that:

- The Logical Redundancy technique is almost in place, with the exception of N-Version Programming.

- The Physical Redundancy technique is missing both the N-Version Programming and the Multiple CPU requirements.

- The Acceptance Test technique has all the required building blocks, that is, the computation, comparator and oracle building blocks.
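These conclusions can be derived mechanically from the checklist. The following is a minimal sketch in Python, assuming the technique-to-building-block sets listed in the pattern's Techniques section; the names and data structures are our own illustration, not part of any published KCAP tooling:

```python
# Each technique is a set of required building blocks; the assessment marks a
# technique complete, partial (naming the missing blocks), or missing.

TECHNIQUES = {
    "Logical Redundancy":  {"Computation", "Comparator", "N-Version Programming"},
    "Physical Redundancy": {"Computation", "Multiple CPUs", "Comparator",
                            "N-Version Programming"},
    "Acceptance Test":     {"Computation", "Comparator", "Oracle"},
}

# Building blocks the assessment team identified in the proposed architecture.
identified = {"Computation", "Comparator", "Oracle"}

def assess(techniques, identified):
    """Classify each technique against the identified building blocks."""
    report = {}
    for name, required in techniques.items():
        missing = required - identified
        if not missing:
            report[name] = "complete"
        elif missing < required:            # some, but not all, blocks present
            report[name] = "partial, missing: " + ", ".join(sorted(missing))
        else:
            report[name] = "missing"
    return report

for name, status in assess(TECHNIQUES, identified).items():
    print(f"{name}: {status}")
```

Over engineering would surface in the same sketch as identified building blocks that no satisfied technique requires.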

Hence, the proposed architecture contains the required building blocks for the Acceptance Test technique. However, the building blocks also need to meet the configuration and behavior required by the technique. This can be verified by comparing the configuration and behavior provided in the proposed architecture with the ones documented in the assessment pattern.

8.1.3 Assessment Results

The proposed architecture is capable of performing Automatic Failure Detection. This capability is based on utilizing the 'Acceptance Test' technique. However, the proposed architecture can be improved: cost and resources can be saved by eliminating additional components that are not required for the Acceptance Test technique.
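As an illustration of the Acceptance Test technique, a computation's result is accepted only if an oracle judges it valid for the domain; no second implementation of the computation is needed. The sketch below is hypothetical (a square-root unit checked by squaring its result), not the case study's actual components:

```python
import math

def computation(x: float) -> float:
    """The functional unit whose failures we want to detect."""
    return math.sqrt(x)

def oracle(x: float, result: float) -> bool:
    """Acceptance check: squaring the result must reproduce the input."""
    return abs(result * result - x) < 1e-9

def run(x: float) -> float:
    result = computation(x)
    if not oracle(x, result):
        # Automatic failure detection: reject the unacceptable result.
        raise RuntimeError("acceptance test failed")
    return result

print(run(2.0))   # accepted: sqrt(2) squared reproduces 2 within tolerance
```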


As an alternative, the proposed architecture can also be changed to implement the 'Logical Redundancy' technique. A comparator and the oracle can be eliminated at the cost of adding N-Version Programming for the computation. The resulting architecture will have timing constraints, because all redundant processes have to be executed one after another. The Physical Redundancy technique can remove this constraint, but to do so it requires multiple CPUs.

This case study also illustrates the capability of the Architecture Assessment Patterns to identify cases where the proposed architecture is "over engineered". Over engineering takes place when the proposed architecture meets its required concern but contains additional, unnecessary components. This is particularly possible when the architecture is evolving from one system version to another. The Architecture Assessment Patterns are, of course, also capable of identifying "under engineered" architectures. Under engineered cases can be detected by analyzing the checklist: the identified building blocks cannot compose any technique listed in the assessment pattern. This is an indicator of problems and of work that has to be done to produce a correctly engineered solution.

It is important to note that the Architectural Assessment Patterns are living documents and need to be improved when a new technique is identified. It is conceivable that there will be situations where no technique in the pattern matches the proposed architecture. In such a case the checklist will indicate the solution as 'under engineered'. However, this could also be an indicator of a new technique that is not yet cataloged in the pattern. The information provided by the Architecture Assessment Patterns can then be used to aid in verifying the new technique, and the new technique can be added to the pattern once its correctness is verified.
8.1.4 Improvements

In addition, the assessment pattern can be used as a tool to improve the proposed architecture. The following scenarios describe potential improvements that can be applied to the proposed architecture.

Scenario 1: Using the assessment pattern, we realize that the proposed architecture utilizes the Acceptance Test technique to meet the Automatic Failure Detection concern. However, there are additional components, a comparator and a computational unit, that are not necessary. Given this, one can eliminate the extra units and route the output of the remaining computation unit to the remaining comparator. Figure 10 represents the architectural reconfiguration for this scenario.

Scenario 2: Since Physical Redundancy is identical to Logical Redundancy with the exception of the additional CPU, this scenario could be preferred, as it provides a more stable architecture in case an additional CPU is later added to increase system throughput or failure detection.

Page 28: Knowledge Centered Assessment Patterns An Effective Tool ...pages.erau.edu/~kornecka/guidant/docs/SCAReportv10.pdf · Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant

Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant Lab Software Architecture Team

-28-

Figure 10 – Scenario 1 Reconfiguration of the Proposed Architecture

Scenario 3: Since the only element missing for a proper implementation of the Logical Redundancy technique is N-Version Programming, one can change the logic of the computational units to implement N-Version Programming. If N-Version Programming is incorporated, the second comparator and the oracle are no longer needed and can be safely removed. Figure 11 represents the architectural reconfiguration for this scenario.

Figure 11 – Scenario 3 Reconfiguration of the Proposed Architecture

[Diagram labels omitted. Figures 10 and 11 show Sender and Receiver components connected through Source/Dst. data and control ports to a single Computation (Figure 10) or to Computation Version A and Computation Version B (Figure 11), with a Comparator and, in Figure 10, an Oracle producing a Verdict.]
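Scenario 3's configuration, Logical Redundancy with N-Version Programming, can be sketched as two independently written versions of one computation whose outputs a comparator checks; the function names below are illustrative, not the case study's actual units:

```python
def total_version_a(values):
    """Version A: straightforward iterative summation."""
    total = 0
    for v in values:
        total += v
    return total

def total_version_b(values):
    """Version B: a differently implemented algorithm for the same function."""
    return sum(values)

def comparator(a, b):
    """Signal a failure when the redundant versions disagree."""
    if a != b:
        raise RuntimeError("failure detected: versions disagree")
    return a

def compute(values):
    # On a single CPU the versions run one after the other, which is the
    # timing constraint noted for Logical Redundancy in the text.
    return comparator(total_version_a(values), total_version_b(values))

print(compute([1, 2, 3]))   # → 6
```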


8.2 Case Study #2: Destruction System for the VS-40X Sounding Rocket

This case study has been taken from the paper entitled "Safety Analysis of an Evolving Software Architecture" by Rogério de Lemos [12]. Lemos' focus is on demonstrating the importance, especially in safety critical systems, of separating the interactions among system components from the computational or functional components of the system. Lemos introduces the concept of a co-operative action (CO action) object that encapsulates the collaborative activities between objects. Moreover, Lemos defines the architectural elements and presents the system architecture based on the Co-operative Object-Oriented Style. The paper demonstrates the application of this style in the evolving design of the destruction system for the VS-40X sounding rocket.

We borrowed the concept of separating and encapsulating component interactions from other system functionality and added it, in modified form, to our architecture assessment patterns. Our architectural elements and our presentation of the case study differ from Lemos' originals. We followed our recommended architecture description language to model the problem presented in the case study. In our model the architectural elements are components, connectors, ports, roles and protocols, as defined by Hofmeister, rather than objects. Our architectural model can be implemented in an object-oriented or a non-object-oriented approach as the other system requirements unfold. In addition, our usage focus for the case study differs from that of Lemos: he uses the case study to demonstrate the design of an evolving system, whereas we focus on verifying and assessing the last evolution of the system.
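The architectural elements named above (components, connectors, ports, roles and protocols, after Hofmeister) can be sketched as simple data types; the field names and the example instances are our own illustration, not a published metamodel:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Port:
    name: str
    protocol: str                 # the interaction protocol the port obeys

@dataclass
class Component:
    name: str
    ports: List[Port] = field(default_factory=list)

@dataclass
class Connector:
    name: str
    roles: List[str] = field(default_factory=list)   # e.g. Sender, Receiver

# Two components of the case study joined by a hypothetical control connector:
safety_box = Component("SafetyBox", [Port("control_out", "unlock")])
protection = Component("ProtectionSystem", [Port("control_in", "unlock")])
unlock_link = Connector("UnlockChannel", roles=["Sender", "Receiver"])

print(unlock_link.roles)   # → ['Sender', 'Receiver']
```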

8.2.1 Problem Description

"The purpose of sounding rockets is to carry scientific instruments into space. Their sub-orbital flight follows a parabolic trajectory that is appropriate for performing scientific experiments. The VS-40X is a two-stage sounding rocket which has dual purpose within the Brazilian Space Program: apart from performing scientific experiments, it will be used as an experimental platform for the new Brazilian Satellite Launcher." [12]

In this case study we use the architectural assessment process presented in Figure 2 to verify and assess the proposed architecture for the VS-40X sounding rocket. The objective of the assessment is to verify whether the proposed architecture supports component interactions in a way that minimizes the probability of a system failure and provides confidence that the remaining risks are acceptable. Inputs to the process are the proposed software architecture, the safety concerns allocated to software, and the assessment patterns. The following paragraphs elaborate on the process inputs. The safety concern allocated to software is to ensure proper software component interactions. The related assessment pattern in this case is 'Managing Component Interactions' (see Appendix B for a detailed description of this pattern). The proposed software architecture is presented in Figure 12 and described in the following paragraphs.

Page 30: Knowledge Centered Assessment Patterns An Effective Tool ...pages.erau.edu/~kornecka/guidant/docs/SCAReportv10.pdf · Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant

Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant Lab Software Architecture Team

-30-

Figure 12 – The proposed software architecture for the VS-40X sounding rocket

The operator console is a conceptual component that is responsible for the visual trajectory and radar displays, and it allows the user to initialize the system. The vehicle is represented as a subsystem and refers to the physical rocket itself; it contains the conceptual components of the safety box, the protection system, and trajectory. The safety box is responsible for determining whether it is safe to destroy the rocket. This is done to prevent disasters such as the rocket destroying itself on the launch pad. The trajectory system is responsible for the flight trajectory calculations of the rocket; it obtains data from an external inertial reference system (IRS). The protection system detects violations of the safety plan, such as the rocket moving out of its predefined flight envelope. Figure 13 represents the interaction of the components using a UML sequence diagram.

[Diagram labels omitted. Figure 12 shows the Ground Control (Operator Console, Radar, User Input) and the VS-40X System (Vehicle containing the Safety Box, Protection System and Trajectory, with the IRS), connected through Sender/Receiver data and control ports. The recovered Figure 13 labels record the first interaction: the Operator notifies the Safety Box that self-destruction is safe (1. Unlock Safety Box).]


Figure 13 – The component interactions for the proposed software architecture

Notice that in the proposed architecture each component is responsible for its own system interactions. The Operator notifies the Safety Box that self-destruction is safe. The Safety Box then sends an unlock code to the Protection System. The Trajectory continuously checks for safety plan violations; upon a violation, the Protection System is activated to destroy the rocket. This architecture clearly serves its purpose, but are its interactions safe? To verify this aspect of the architecture the verification team uses the 'Managing Component Interactions' assessment pattern.

8.2.2 Analysis

The verification team starts the process by assessing the proposed architecture for proper component interactions. The team uses the checklist associated with the Managing Component Interactions assessment pattern and attempts to identify the recommended building blocks in the proposed architecture. Table 3 shows the result of identifying the building blocks in the proposed architecture.

Managing Component Interactions

1) Controller: a component encapsulating the decision making related to component-to-component transitions. Status: Not Met. (T1)
2) Worker Component: a self-contained component with a defined purpose, performing a task. Status: Met (4 instances). (T1)
3) Breakpoint: a module giving access to the interaction it is used in. Status: Not Met. (T2)
4) Interaction Checker: a component responsible for checking interactions guarded by breakpoints for validity and for correcting any failures. Status: Not Met. (T2)
5) Interaction Rules: a component defining the correctness criteria for interactions and how to recover from a fault in these interactions. Status: Not Met. (T2)

Table 3 – Results of applying the Managing Component Interactions checklist

The checklist is completed by searching for and matching the building blocks listed in the checklist against the components in the proposed architecture. In this case only one type of building block is present: there are four worker components in the proposed architecture, namely the operator console, the safety box, the protection system and the trajectory system. As indicated by the checklist, technique 1 (T1), 'Cooperative Components', of the assessment pattern is partially present: the worker components exist, but there are no controller components to manage the interactions between them. Furthermore, as indicated by the checklist, there is no evidence of the building blocks associated with technique 2 (T2), 'Breakpoints', in the proposed architecture.

As mentioned previously, the checklist is a tool that directs the attention of the assessment team, in an objective manner, to potential areas of deficiency in the proposed architecture. This reduces the effort required to verify the proposed architecture by improving the efficiency of the search for problematic areas. Once the problematic areas have been identified and the deficiencies have been assessed and verified, the assessment patterns provide potential solutions and techniques to remedy the problems. By providing all the key information needed to identify and correct problems in the proposed architecture, the assessment patterns help the verification team improve both its efficiency and the objectivity of its decisions.

8.2.3 Assessment Results

Based on the information provided by the assessment pattern, the analysis process concludes that the proposed architecture does not manage component interactions properly, and that this is a risk to system safety. To remedy the potential problem and improve system safety, the assessment pattern provides two possible techniques.
Of the two techniques, 'Cooperative Components' is preferable because one of its required building blocks is already present in the proposed architecture. Hence, a redesign that fully implements technique 1 would be less radical than a redesign according to technique 2.

8.2.4 Improvements

The proposed architecture was redesigned according to the information provided in the 'Managing Component Interactions' assessment pattern. In order to safeguard interactions, the proposed architecture was revised using the Cooperative Components technique. Based on this technique, the interactions of each feature, such as enable-destruction, are encapsulated into two components specifically responsible for managing component interactions. Figure 14 represents the revised software architecture.


Figure 14 – Revised software architecture for the VS-40X sounding rocket

The controller component Enable-Destruction is responsible for managing the component interactions needed to accomplish the task of enabling and disabling the rocket's destruction mechanism. Its sole purpose is to manage the interactions between the Operator Console component and the Safety Box component so that this task is accomplished safely. At first glance this change might be viewed as unnecessary, adding to the system's complexity and perhaps its cost. However, the engineering value of the change becomes more obvious when other system components need to interact with the safety box. In addition, as the system evolves over time and new features and functionality are added, encapsulating the logic and rules associated with component interactions in one unit will prove a valuable design decision, reducing the cost of changing and modifying interaction logic that would otherwise be distributed among many components. In safety critical systems, leaving such logic distributed can compromise safety, which is the most critical and important quality. In the revised architecture, to enable the self-destruction mechanism of the rocket, a request is sent by the Operator Console component to the Enable-Destruction component. The Enable-Destruction component then performs the necessary safety checks and activates the self-destruction mechanism only when all the safety checks are verified. Introducing the Enable-Destruction component effectively allows all safety checks related to the ground station side to be encapsulated in one location, which greatly simplifies the interactions and improves safety. Figure 15 represents the interactions of the Operator Console, Enable-Destruction and Safety Box components using a UML sequence diagram.

[Diagram labels omitted. Figure 14 shows the Ground Control (Operator Console, Radar, User Input) and the VS-40X System (Vehicle containing the Safety Box, Protection System and Trajectory, with the IRS), together with the new Enable-Destruction and Self-Destruction controller components, connected through Sender/Receiver data and control ports.]


Figure 15 – The component interactions for the revised software architecture

Likewise, the Self-Destruction component manages the interactions related to the actual destruction of the rocket as a result of a violation of the safety plan. It is a controller component in the scheme of the Cooperative Components technique. The three worker components from the proposed architecture remain in the new architecture, but the interaction logic has been removed from their responsibility and encapsulated in the Self-Destruction component. The logic governing when it is safe to interact, and with which other components to interact, is the responsibility of Self-Destruction. In this new configuration, Self-Destruction interacts with the Safety Box to detect whether the system is safe to destroy, with the Protection System to know whether the system should be destroyed, and with the Trajectory system to detect whether the rocket is outside its flight plan. Figure 16 represents the interactions of Self-Destruction with the Safety Box, Trajectory and Protection System using a UML sequence diagram.

The benefit of introducing the Self-Destruction component comes in two forms. First, it greatly simplifies the worker components and their interactions. These components are now solely responsible for one special task and have their own, isolated safety requirements. Simpler interactions provide less room for failure; reducing complexity increases safety. The second benefit is achieved through the controller components. They organize the worker components in a pyramid-like structure rather than a chain structure. In a chain structure, a failure in one link breaks the chain. In a pyramid structure one brick can fail, but it usually takes failures in more than one brick to bring the pyramid down. This is best seen in the capabilities of the Self-Destruction component.

[Figure 15 sequence, recovered labels: 1) the Operator Console notifies the Enable-Destruction co-operative component that it is safe to enable the system's self-destruction capability (Enable Destruction); 2) Enable-Destruction notifies the Safety Box that self-destruction is safe (Unlock Safety Box).]


Figure 16 – The component interactions for the revised software architecture

As mentioned earlier, the Self-Destruction component calls the worker components to accomplish the task at hand. The worker components perform the safety checks associated with their own responsibilities, while Self-Destruction performs the safety checks associated with the interactions between the components. Hence, the two kinds of safety checks complement each other and are aligned with the allocated responsibilities.

[Figure 16 sequence, recovered labels: 1) Self-Destruction queries the Safety Box for the status of the destruction mechanism (Get Safety Status); 2) Self-Destruction queries Trajectory for the current trajectory (Get Trajectory); 3) Self-Destruction activates the Protection System if the rocket is on an invalid trajectory (Activate Protection System).]
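The three interactions of Figure 16 can be sketched as a controller in the Cooperative Components style; the worker interfaces below are hypothetical simplifications of the case study's components:

```python
class SafetyBox:
    def __init__(self, unlocked: bool):
        self.unlocked = unlocked
    def destruction_enabled(self) -> bool:
        return self.unlocked

class Trajectory:
    def __init__(self, on_plan: bool):
        self.on_plan = on_plan
    def within_flight_plan(self) -> bool:
        return self.on_plan

class ProtectionSystem:
    def __init__(self):
        self.activated = False
    def activate(self):
        self.activated = True

class SelfDestruction:
    """Controller: encapsulates when and with whom the workers interact."""
    def __init__(self, box, trajectory, protection):
        self.box, self.trajectory, self.protection = box, trajectory, protection
    def step(self) -> bool:
        # 1) query the safety box, 2) query the trajectory, 3) activate the
        # protection system only if destruction is enabled and the rocket
        # is off its flight plan.
        if self.box.destruction_enabled() and not self.trajectory.within_flight_plan():
            self.protection.activate()
        return self.protection.activated

ctrl = SelfDestruction(SafetyBox(True), Trajectory(False), ProtectionSystem())
print(ctrl.step())   # → True: unlocked and off-plan, so destruction triggers
```

Note how the workers hold no interaction logic at all: a faulty worker answer is still mediated by the controller's checks, which is the pyramid structure described above.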


8.3 Case Study #3: Industrial Robot

This case study has been taken from the paper entitled "A Case Study in Developing Complex Safety Critical Systems" by Bernd J. Krämer [11]. It represents a typical robotics application in a metal processing plant in Germany. This case study also demonstrates the use of multiple assessment patterns during the assessment process.

8.3.1 Problem Description

Figure 17 represents the basic configuration of the metal processing cell. It is assumed that an external gadget deposits individual blanks on the left end of the feed belt, one after the other, at arbitrary time intervals. The belt transfers the blanks to the rotary table at the belt's other end. Once a blank has been passed over, the table moves up and slightly rotates to bring the blank into a position, and onto a level, from which robot arm 1 can pick it up.

Figure 17 – Metal processing robotics cell configuration

After arm 1 has been loaded, the robot rotates counter-clockwise into a position in which arm 1 points at the press, and introduces the blank into the press, where it is forged while the press is closed. To increase the utilization of the press, the robot is equipped with a second arm. This arm picks up the last forged part from the press during the previous robot processing cycle; the robot then rotates counter-clockwise until arm 2 points towards the deposit belt, and arm 2 unloads the piece of metal onto the right end of the deposit belt. This belt moves processed parts to its other end, from which they are removed one by one by a second gadget in the environment. Arms 1 and 2 are mounted on different vertical levels and access the press while its movable cheek is at two corresponding vertical levels. Figure 18 shows, in a side view, the intended interaction of the two robot arms and the press, and the processing cycle of the press. The arms can operate their grippers independently. They can also be extended or retracted in the horizontal direction independently, but they must always rotate together, as they are strongly coupled.


Figure 18 – Side view of the robot and press operations

Photocells monitor the load and unload zones of the conveyer belts, while switches and other types of sensors signal the actual horizontal or vertical positions of the robot arms, the table, and the press cheek. The system is safety critical in the sense that improper operations can cause harm to the environment and to the machine itself. The work pieces are very heavy objects: if one of them falls to the ground, either because one of the belts does not stop when it is supposed to or because the robot releases its grip too early, serious harm can come to any object or person in the area. Another area of concern is the press. The robot extends its arms into the working area of the press; if the press closes too early, serious damage to the robot will occur.


Figure 19 represents the software architecture proposed by Krämer for the metal processing cell. The safety concerns associated with this application are to ensure 1) failure isolation and 2) proper interactions among software components.

Figure 19 – The proposed architecture for the robotics metal processing cell

8.3.2 Analysis

The verification team starts the process by first assessing the proposed architecture for proper component interactions. The team uses the checklist associated with the Managing Component Interactions assessment pattern and attempts to identify the recommended building blocks in the proposed architecture. Table 4 shows the result of identifying the building blocks in the proposed architecture.

Managing Component Interactions

1) Controller: a component encapsulating the decision making related to component-to-component transitions. Status: Not Met. (T1)
2) Worker Component: a self-contained component with a defined purpose, performing a task. Status: Met (5 instances). (T1)
3) Breakpoint: a module giving access to the interaction it is used in. Status: Not Met. (T2)
4) Interaction Checker: a component responsible for checking interactions guarded by breakpoints for validity and for correcting any failures. Status: Not Met. (T2)
5) Interaction Rules: a component defining the correctness criteria for interactions and how to recover from a fault in these interactions. Status: Not Met. (T2)

Table 4 – Results of applying the Managing Component Interactions checklist


As indicated in the checklist, all the components present in the proposed architecture are classified as worker components; they comprise the press, the two belts, the robot and the table. Hence, there are no special components to manage component interactions and safeguard them. The problem can be resolved by introducing a technique that safeguards the interactions. The pattern provides two techniques to remedy the problem: Cooperative Components (T1) and Breakpoints (T2).

As indicated in the checklist, none of technique 2's building blocks are met. The Breakpoints technique provides safe interactions by inserting a breakpoint component into each interaction. The breakpoint components intercept the interaction data and validate it; depending on the outcome of this validation, the breakpoint component either allows the interaction to take place or cancels it and resumes operation elsewhere. Since there are no breakpoint components in the proposed architecture, technique 2 would have to be implemented from scratch. According to the checklist, the Cooperative Components technique (T1) is partially present in the architecture: the five worker components are there, but the controller component is missing. Since T1 is partially implemented and its constraints are satisfied, it is more cost effective to add the missing controller. By including T1 in the proposed architecture we can meet the safety concern related to proper interactions among software components.

Next, the verification team proceeds to assess the proposed architecture for the Failure Isolation concern. The team uses the checklist associated with the Failure Isolation assessment pattern and attempts to identify the recommended building blocks in the proposed architecture. Table 5 shows the result of identifying the building blocks in the proposed architecture.

Failure Isolation

# | Item | Description | Status | Related Technique
1 | Presence of Modules with Low Cohesion | Group functions into specialized, independent objects. Each object can perform all its actions without relying on others. High coupling should only be present in the lowest layer. | Not Met | T3
2 | Presence of Modules with High Cohesion | All sub-units are built using the methodology of high cohesion. Similar functionality is grouped into the same unit. | Met | T1, T3
3 | Presence of Modules | Is the architecture in question dominated by modules? Each module includes … | Met | T1, T3
4 | Hardware | Has the capability to execute more than one process at one instance. | Met | T2
5 | Presence of Modules with Low Coupling | Units easily retain their functionality when separated from other units. | Met | T1
6 | Controller | An object encapsulating decision making related to object-to-object transitions. | Not Met | T4
7 | Worker Object | A self-contained component with a defined purpose performing a task. | Met | T4

Examining the checklist for each of the four techniques in the pattern, we can identify that the technique associated with introducing layers (T1) is present in the architecture. However, after further analysis and after referencing the description of the layering technique documented in the pattern, it becomes evident that the layering technique has not been implemented properly. The description clearly indicates that layering for failure isolation is effective only when the system is structured in more than one layer. Since the proposed architecture is structured in a single layer, the benefits of the technique cannot be utilized for failure isolation.

Further analysis of the checklist results indicates that technique 2 (T2) is present in the proposed architecture. T2 is based on the concept of partitioning the system into subsystems in an effort to split the problem into smaller ones. The physical structure of the application lends itself to partitioning the software architecture according to the hardware configuration, that is, the robot, the table, the conveyor belts, and the press. Each of these elements has its own controller software, and an overall system controller interfaces with each of the hardware controllers and synchronizes their actions.

According to the checklist, technique 3 (T3) is not present in the architecture, since the building block 'modules with low cohesion' is not present. This technique isolates possible sources of failure by intentionally introducing low cohesion into the system. It appears that the proposed architecture was built on the principle of high cohesion.

The last technique on the checklist is 'Separation of Concern'. This technique also appears in the pattern for managing interactions, which was applied earlier. It has already been established that the technique is not present, because controller components are missing.

8.3.3 Assessment Results

The application of the two architecture patterns to the proposed architecture has produced the following results. The architecture does not provide a proper means to manage the component interactions: worker components are present, but they interact directly with each other. The assessment patterns suggest adding controller components to the architecture to manage the component interactions. In addition, the Failure Isolation concern can be further improved by adding techniques such as 'Controller/Worker' and 'Isolation'.


8.3.4 Improvements

The results from the architecture assessment patterns guide the redesign of the system architecture. Figure 20 shows the revised architecture, which incorporates all changes identified by the architecture assessment patterns. All information required to build this architecture is provided by the patterns; specifically, the necessary information can be found in the description sections of the techniques Controller/Worker and Isolation.

Rationale behind the revised architecture: The most apparent change in the new architecture is the addition of a system controller and of controllers for each of the physical entities in the system. It is the responsibility of these controllers to control the system in a safe way.

Figure 20 – The revised architecture for the robotics metal process cell

The actual functionality of the system is located in the components that correspond to the physical entities. These components include Arm1, Arm2, Press, Table, Feed Belt and Deposit Belt. Each of these components encapsulates a physical object's capabilities; for example, a robot arm should be capable of moving the arm and of opening and closing the grip. Each controller has predefined behavior encapsulated in it. In the case of the robot controller this includes moving the arm to fetch a work piece from the feed belt, moving the arm to deposit a work piece in the press, and depositing the work piece on the deposit belt. To complete these actions the robot controller depends on other, lower-level controllers: the controllers for arms one and two. These controllers are more specialized; their possible actions are move, open grip, and close grip. To actually accomplish these tasks they are connected to the worker components Arm1 and Arm2.

[Figure 20 diagram: System Control at the top, connected to RobotController, PressController, TableController, Feed Belt Controller and Deposit Belt Controller; Arm 1 Controller and Arm 2 Controller sit below the RobotController; the worker components Arm 1, Arm 2, Press, Table, Feed Belt and Deposit Belt, together with the sensor components Press Sensor, Table Sensor, Feed Belt Sensor and Deposit Belt Sensor, form the lowest level.]
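The controller/worker hierarchy described above can be illustrated with a minimal sketch. The class and method names (Arm, ArmController, RobotController, pick, place) are assumptions made for the example, not names from the report.

```python
class Arm:
    """Worker component: encapsulates the physical arm's capabilities."""
    def __init__(self):
        self.log = []
    def move(self, pos):
        self.log.append(("move", pos))
    def grip(self, closed):
        self.log.append(("grip", closed))

class ArmController:
    """Low-level controller: predefined actions for one arm."""
    def __init__(self, arm):
        self.arm = arm
    def pick(self, pos):
        self.arm.move(pos)
        self.arm.grip(True)
    def place(self, pos):
        self.arm.move(pos)
        self.arm.grip(False)

class RobotController:
    """High-level controller: task-level behavior, delegates to arm controllers."""
    def __init__(self, arm1_ctrl, arm2_ctrl):
        self.arm1, self.arm2 = arm1_ctrl, arm2_ctrl
    def fetch_and_load(self):
        self.arm1.pick("feed_belt")   # fetch a work piece from the feed belt
        self.arm1.place("press")      # deposit the work piece in the press

arm1 = Arm()
robot = RobotController(ArmController(arm1), ArmController(Arm()))
robot.fetch_and_load()
```

Note that the worker records only capability calls; every decision about what to do next was made in a controller.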


This configuration accomplishes the following objective: all task-related information, such as what to do next, is located in the controllers, while all capabilities, such as how to do something, are located in the worker components. The architecture also uses this separation strategy to further enhance safety. Among the worker components, actual system functionality is kept separate from the data gathering components. The data gathering components include all kinds of sensors, which sense the system state. These components are kept separate to simplify system interactions and to isolate failures. For example, the sensor that tells whether the press is closed or open is of interest not only to the press itself but also to the robot arms, which have to interact with the press. In the proposed architecture this state information traveled through the press component to the robot arms; hence failures within the press component could influence the sensor data. It is much safer to send the data directly to the interested component without detours. This practice is an implementation of the 'Isolation' technique.

Another benefit of applying the separation strategy is the simplification of the safety checks that the controllers have to perform before they initiate an action. The controller for either one of the robot arms is only interested in the value of the press sensor; it has no need to interact with the feed-belt sensor. This detail of the system is hidden from the robot controller.

The revised architecture effectively allows for redundant safety checks. The system controller initiates actions according to its own plan, and the controllers at the lower levels each perform their own safety checks; only safe actions will be carried out. In the example of the press and robot interaction, two safety checks complement each other: the press will not close when a robot arm is in the press, and the robot will not insert its arm when the press is closing or closed. Should the press sensor fail, the robot will insert its arm into the press, but the press will not close, since it knows from the robot where the arms are.

As far as reuse is concerned, the parts most likely to be reused are loosely coupled. All worker components are entirely self-sufficient and only connected to controllers; they can be reused with ease. The controllers, on the other hand, can only be reused in combination with the worker components they use. For lower-level controllers like the arm controller, reuse is certainly an option, but the higher the level of a controller, the lower its reuse potential. Reusing the robot controller would mean the robot would move the same way it does in this problem.

In summary, the new architecture provides interaction management by separating behavior from capability. Controllers are used to perform redundant safety checks at multiple levels before each action of the system takes place. The environment sensors are extracted into separate components to further simplify interactions, isolate failures, and increase safety.
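The complementary safety checks in the press/robot example can be captured in a small sketch. The predicates and sensor values are invented for illustration.

```python
# Sketch of complementary, redundant safety checks: each controller
# validates its own precondition, so a single sensor failure cannot
# produce the unsafe press/arm combination on its own.

def robot_may_insert(press_sensor_state):
    # the robot arm controller checks the press sensor before inserting
    return press_sensor_state == "open"

def press_may_close(arm_positions):
    # the press controller knows from the robot where the arms are
    return "press" not in arm_positions

# A failed press sensor wrongly reports "open": the robot inserts its arm,
# but the press still refuses to close because an arm is inside it.
faulty_sensor = "open"
arm_positions = {"press"}
robot_inserts = robot_may_insert(faulty_sensor)
press_closes = press_may_close(arm_positions)
```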


9 References

1. Bowen, Jonathan, et al., High-Integrity System Specification and Design, Springer-Verlag, 1999.

2. Carnegie Mellon Software Engineering Institute, Architecture Tradeoff Analysis Method, Pittsburgh, PA, 1999.

3. Carnegie Mellon Software Engineering Institute, Capability Maturity Model for Software v1.1, Pittsburgh, PA, 1993.

4. Chillarege, Ram, et al., Orthogonal Defect Classification: A Concept for In-Process Measurements, IEEE Transactions on Software Engineering, Vol. 18, No. 11, November 1992.

5. Chillarege, Ram, Software Testing Best Practices, IBM Research, http://www.chillarege.com/authwork/TestingBestPractice.pdf

6. Cigital Labs, Software Confidence for the Digital Age: Testability, http://www.cigitallabs.com/resources/definitions/testability.html

7. Herrmann, Debra S., Software Safety and Reliability: Techniques, Approaches, and Standards of Key Industrial Sectors, IEEE Computer Society Press, 2000.

8. IBM, Testability Analysis, http://www-3.ibm.com/chips/services/testbench/ta

9. Bhandari, Inderpal, et al., In-Process Improvement through Defect Data Interpretation, IBM Systems Journal, Vol. 33, No. 1, 1994.

10. Jones, Carole, A Process-Integrated Approach to Defect Prevention, IBM Systems Journal, Vol. 24, No. 2, 1985.

11. Krämer, Bernd, A Case Study in Developing Complex Safety-Critical Systems, FernUniversität Hagen, IEEE 1060-3425/97, 1997.

12. Lemos, Rogério de, Safety Analysis of an Evolving Software Architecture, University of Kent at Canterbury, UK.

13. Leveson, Nancy, Safeware: System Safety and Computers, Addison-Wesley, 1995.

14. Mays, Robert, et al., Experiences with Defect Prevention, IBM Systems Journal, Vol. 29, No. 1, 1990.

15. Nelson, Victor P., and Carroll, Bill D., Tutorial: Fault-Tolerant Computing, IEEE Computer Society Press, New York, NY, 1987.

16. Neumann, Peter G., Computer-Related Risks, ACM Press/Addison-Wesley, 1995.

17. Parnas, David L., Inspection of Safety-Critical Software Using Program-Function Tables, Communications Research Laboratory, Department of Electrical and Computer Engineering, McMaster University, 1994.

18. Parnas, David L., Software Inspection We Can Trust, Communications Research Laboratory, Department of Electrical and Computer Engineering, McMaster University, October 1998.

19. Siewiorek, Daniel, Reliable Computer Systems, Carnegie Mellon University, Pittsburgh, PA, 1998.

20. Storey, Neil, Safety-Critical Computer Systems, Addison-Wesley, Reading, MA, 1996.

21. Coughlin, Thomas J., Designing for Testability, http://members.aol.com/prpca/designof.htm


22. University of Toronto, ECE1767: Design for Test and Testability, http://www.eecg.toronto.edu/~ece1767/

23. U.S. Army, System Safety Engineering, Software System Safety Requirements, http://www.monmouth.army.mil/cecom/safety/sservice/sssr.htm

24. Vaill, Peter B., Learning As a Way of Being: Strategies for Survival in a World of Permanent White Water, Jossey-Bass, a Wiley Company, 1996.

25. Hofmeister, Christine, et al., Applied Software Architecture, Addison-Wesley, 1998.

26. Shaw, Mary, and Garlan, David, Software Architecture: Perspectives on an Emerging Discipline, Prentice Hall, 1996.

27. IEEE, IEEE Standard for Software Architecture Descriptions (Std 1471-2000), IEEE, New York, NY, 2000.

28. Lutz, Robyn, Targeting Safety-Related Errors during Software Requirements Analysis, Jet Propulsion Laboratory, California Institute of Technology.

29. Lutz, Robyn, Evolution of Safety-Critical Requirements Post-Launch, 5th IEEE International Symposium on Requirements Engineering, August 2001.

30. Sematech, Failure Mode and Effects Analysis (FMEA): A Guide for Continuous Improvement for the Semiconductor Industry, http://www.sematech.org/public/docubase/document/0963aeng.pdf

31. NASA, Software Formal Inspections Standard, Office of Safety and Mission Assurance, NASA Headquarters, Washington, D.C.

32. Russo, Leonard L., Software System Safety Guide, Department of Defense, May 1992.

33. Jaffe, Matthew S., et al., Software Requirements Analysis for Real-Time Process Control Systems, IEEE Transactions on Software Engineering, March 1991.


10 Appendix A

10.1 Terminology

Knowledge-Centered:
- Ability to provide the know-what, the know-why, and the know-how a person needs to possess with respect to a given subject

Framework:
- A holistic approach for integrating processes, techniques, and people skills

Effective:
- Accepted and utilized by the practitioners
- Provides critical knowledge of the subject to the practitioners
- Ability to maximize defect detection as early as possible
- Ability to maximize defect prevention

Detection:
- Act of identifying and removing defects from the product
- A reactive behavior
- Based on some form of assessment process (e.g., inspection, verification)

Prevention:
- Act of preventing defects from entering the product
- A proactive behavior
- Based on identification and analysis of defect root causes and subsequent improvement of the engineering practices and practitioners' knowledge and skills

Defect:
- Anything that jeopardizes or compromises the safety and, secondarily, the economics of the product

Design: The software development process consists of three levels of design abstraction and activity:

- The design process starts with defining an architecture for the software system. The architecture development is guided by a modeling language with defined and precise syntax and semantics. The focus of this design activity is on the identification of architectural elements, the externally visible properties of these elements, and their interrelationships (interfaces, use-dependencies, etc.). At this level an architectural element would consist of a number of classes or functions to be realized.

- The software design activities include the internal design of each architectural element. At this level, design modeling techniques such as Object-Oriented or Structured design techniques can be used to design the architectural element consistent with its properties defined during the architectural design activity.

- The last level of design is focused on the specifics of each class or function, depending on the methodology used during the previous stage. The focus at this point is on class or function logic, data structure design, and coding practices.


10.2 Modeling Notations

Element | Notation Attributes | Associated Behavior
Component | Resource budget | Component behavior
Port | -- | --
Connector | Resource budget | Connector behavior
Role | -- | --
Protocol | -- | Legal sequence of interactions

Component:
- A unit of functionality which interacts with its surroundings through ports.

Port:
- The interaction point for components. Ports define both incoming and outgoing messages.

Connector:
- Mediates an interaction among components. Interactions are in terms of data and control. Connectors have roles associated with them.

Role:
- A connector's role defines the behavior of the participants in an interaction. The roles have an associated protocol.

Protocol:
- A protocol describes how components and connectors coordinate their interaction and communicate with each other.
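A minimal sketch of how these notation elements might be represented as data, assuming illustrative field names (direction, roles, protocol) that are not part of the notation itself:

```python
from dataclasses import dataclass, field

@dataclass
class Port:
    name: str
    direction: str          # "in" or "out": incoming and outgoing messages

@dataclass
class Component:
    name: str
    ports: dict = field(default_factory=dict)   # interaction points
    def add_port(self, port):
        self.ports[port.name] = port

@dataclass
class Connector:
    roles: tuple            # e.g. ("Source", "Dst."): participant behavior
    protocol: list          # legal sequence of interactions

sender = Component("Sender")
sender.add_port(Port("data_out", "out"))
receiver = Component("Receiver")
receiver.add_port(Port("data_in", "in"))
link = Connector(roles=("Source", "Dst."), protocol=["send", "ack"])
```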


11 Appendix B

11.1 Pattern Description

Concern

Concern Name: Mentions the name of the concern.

Context: A short statement answering the following questions: What does the concern address? What are its rationale and intent? This bullet is the description of a problem.

Example: To show an example of the concern being solved by a technique, this section picks one technique from the concern and solves a problem with it.

Strategy

Name: Mentions the name of the strategy.

Description: This bullet is a description of the strategy. It puts the strategy in relation to the concern. It answers the following question: In what special way does the strategy address the concern?

Limitations and Constraints: This item lists all limitations and constraints of the strategy. It answers the following questions: Are there any problems with the strategy? Are there any tradeoffs? Are there circumstances where the strategy should not be used?

Technique

Name: This bullet mentions the name of the technique.

Known As: If the technique has more than one name, the additional names are listed here.

Description: This section describes the technique. It explains what the technique is and how it addresses the concern.

Example: This section provides an example of the use of the technique on a problem.


Collaborations: This is the dynamic view of the pattern. It describes the interactions between the different components of the pattern.

Building Blocks: This bullet lists all components participating in the pattern.

Limitations and Constraints: This bullet lists any special limitations or constraints the technique might have. It answers the following questions: Are there any tradeoffs? Are there circumstances where the technique cannot be used? How well does the technique work? If a technique cannot be used in combination with another, this section includes a warning not to use the two techniques together and the reason for it.

Checklist

Checklist – Concern

Building Blocks
Name | Description

Architectural Verification and Evaluation
# | Item | Description | Met | Partially Met | Not Met | Related Technique

Techniques

Notes

11.2 Knowledge Centered Assessment Patterns

The following collection presents all the Knowledge Centered Assessment Patterns available from the ERAU/GDT lab as of May 10, 2002.

11.2.1 Automatic Failure Detection

Automatic Failure Detection

Concern Name: Automatic Failure Detection


Context: Automatic Failure Detection is a method used to identify imminent failures in a system. Identifying a failure means localizing it, determining when it happened, and determining what it has affected.

Example

In this example, failures in the computation of the factorial of a number are detected using the technique of logical redundancy. There are two redundant computations implemented using n-version programming. The first implementation uses an iterative solution and the second uses recursion.

[Diagram: a Sender component supplies data to Computation Version A and Computation Version B; both results flow into a Comparator, which passes the data and a control signal on to the Receiver.]

The results produced by the two computations are compared in the comparator. The comparator produces a result and a control signal stating whether the result is valid, depending on whether the results produced by the two calculations match.
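The factorial example above can be written out as a runnable sketch: two independently implemented versions of the computation feed a comparator, which returns the result together with a validity flag.

```python
# Logical redundancy for the factorial computation: two n-version
# implementations (iterative and recursive) checked by a comparator.

def factorial_iterative(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def factorial_recursive(n):
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

def comparator(a, b):
    # one result plus a control signal stating whether the result is valid
    return a, a == b

result, valid = comparator(factorial_iterative(5), factorial_recursive(5))
# result is 120 and valid is True, since both versions agree
```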

Strategy

Name: Comparison (Redundancy)

Description: One very common approach to failure detection is to compare the result of a computation to a known true result. The problem, however, is how to obtain this true result. All techniques in this strategy use the comparison method; they differ in the way they obtain the true result.

Limitations and Constraints

If it were possible to obtain a 100% true result, there would be no need for error detection. All true results used in the different techniques of this strategy are imperfect: they are obtained at run-time, so they are susceptible to hardware failures, and they are also, at some point, code written by engineers, with the same problems as any other piece of code. They are not useless, however; all of these techniques are proven to work to a certain degree. In other words, these techniques will detect failures, but how effective they are depends on the implementation.


T1: Logical Redundancy

Name: Logical Redundancy
Known As: None

Description: In order to check for logic errors and other sources of failure, two separate units of functionality are developed. Each unit uses the same input and produces the same output, and both execute on the same CPU in sequence. Ideally the two redundant computations are implemented using n-version programming. N-version programming works on the principle that a specific error in one implementation will most likely not be present in the other version(s). So when a unit of functionality is implemented incorrectly, its output will differ from the output of the other unit. Even if both units contain errors, there is a chance these errors do not affect the result in the same way, enabling the comparator to identify the failure. If n-version programming is not used, the technique will not catch implementation errors, but it will still catch hardware glitches: if a glitch happens in the first computation, the likelihood of another glitch happening during the execution of the second computation and affecting the result in exactly the same way is remote.

Example

[Diagram: the same logical-redundancy structure as in the concern example above: a Sender feeds Computation Version A and Computation Version B, whose results are checked by a Comparator before the data and a control signal reach the Receiver.]

Page 52: Knowledge Centered Assessment Patterns An Effective Tool ...pages.erau.edu/~kornecka/guidant/docs/SCAReportv10.pdf · Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant

Knowledge Centered Assessment Patterns 9. Sept. 2002 ERAU/Guidant Lab Software Architecture Team

-52-

Dynamic View

Perform Computation A withInput X

Perform Computation B withInput X

Compare Result

Input X Enters the Component

Return ResultReturn Valid

Return ResultReturn Invalid

Results match?Yes

No

Building Blocks:
- Computation
- N-Version Programming for the second computation
- Comparator

Limitations and Constraints: This technique is relatively expensive, because one unit of functionality has to be implemented multiple times. Since all computations execute on one CPU in sequence, there might be timing issues. In research, n-version programming is a controversial topic: some believe that errors in one implementation are likely to appear in another implementation. At any rate, the effectiveness of n-version programming depends on the quality of the implementation of the algorithms.

T2: Physical Redundancy

Name: Physical Redundancy


Known As: None

Description: This technique is identical to logical redundancy, with the addition of a second CPU. As a result, different constraints apply to this technique.

Example

[Diagram: as in logical redundancy, a Sender feeds Computation Version A and Computation Version B, whose results are checked by a Comparator before the data and a control signal reach the Receiver; here Version A executes on CPU 1 and Version B on CPU 2.]

Dynamic View:
1. Input X enters the component.
2. Perform Computation A with input X.
3. Perform Computation B with input X.
4. Compare the results.
5. If the results match, return the result and 'valid'; otherwise return the result and 'invalid'.


Building Blocks:
- Computation
- N-Version Programming for the second computation
- Comparator
- Processing hardware

Limitations and Constraints: This technique is relatively expensive, because one unit of functionality has to be implemented multiple times, and the second CPU adds hardware cost and the need to implement additional synchronization constructs. In research, n-version programming is a controversial topic: some believe that errors in one implementation are likely to appear in another implementation. At any rate, the effectiveness of n-version programming depends on the quality of the implementation of the algorithms.
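As a single-machine sketch of physical redundancy, the following uses two executor workers to stand in for the two CPUs; on a real system each version would run on its own processor with explicit synchronization. The computations and names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def square_by_addition(n):
    # version A: deliberately different algorithm (n-version programming)
    return sum(n for _ in range(n))

def square_by_multiplication(n):
    # version B
    return n * n

def compare(a, b):
    # one result plus a control signal indicating validity
    return a, a == b

# one worker per "CPU": each redundant version runs on its own execution unit
with ThreadPoolExecutor(max_workers=2) as pool:
    future_a = pool.submit(square_by_addition, 7)
    future_b = pool.submit(square_by_multiplication, 7)
    result, valid = compare(future_a.result(), future_b.result())
```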

T3: Acceptance Test

Name: Acceptance Test
Known As: None

Description: This technique uses an oracle to determine the correctness of a result. An oracle is an entity which can determine whether a result is valid in every system state; how the oracle actually does this has to be determined at the design stage. Most commonly the oracle is an executable model of the system, which produces an output for every input. This technique is not exactly a form of redundancy, because the capabilities of the oracle usually exceed the capabilities of the computation. If multiple computations are checked for errors using this technique, there is only one oracle, which is used by all of them: the oracle knows about everything, while the computations are specific. The actual failure detection happens in the comparator, which indicates disagreement between the computation's result and the oracle. In some implementations of this technique, the oracle is given priority in the event of a disagreement; it is considered infallible.

The oracle can also be used to keep the system out of unsafe states. In this case the oracle does not know about everything; it only knows about safe system states. Only errors putting the system into an unsafe state are detected; others pass through unhandled. This can greatly simplify the oracle, and for some problems this method is sufficient.


Example

[Diagram: a Sender feeds a single Computation; the computation's result and the Oracle's verdict both flow into a Comparator, which passes the data and a control signal on to the Receiver.]

Dynamic View:
1. Input X enters the component.
2. Perform the computation with input X.
3. Validate the result using the oracle.
4. If the oracle's verdict is a true result, return the result and 'valid'; otherwise return the result and 'invalid'.

Building Blocks:
- Computation
- Comparator
- Oracle


Limitations and Constraints: The oracle offers cost savings when multiple computations are to be checked for failures; if only a few have to be checked, this technique is expensive. When used for multiple computations, coupling is increased, because all computation modules depend on the presence of the oracle. A fully capable oracle might not always be possible or practical. Keep in mind, though, that an oracle keeping the system out of 'unsafe' states might serve the intended purpose.
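A sketch of the simplified, safety-envelope form of the oracle described above: it does not know the true result, only which states are safe. The temperature domain, the safe range, and the scale factor are invented for the example.

```python
# Acceptance test with a simplified oracle that only knows safe states.

SAFE_TEMP_RANGE = (0.0, 85.0)     # assumed safe operating envelope

def safety_oracle(state):
    """Verdict: is the proposed system state safe?"""
    lo, hi = SAFE_TEMP_RANGE
    return lo <= state["temperature"] <= hi

def acceptance_test(computation, inputs):
    proposed = computation(inputs)
    if safety_oracle(proposed):
        return proposed, True      # valid: pass the result through
    return proposed, False         # invalid: flag the failure

# invented computation under test
controller = lambda inp: {"temperature": inp * 1.5}
state, ok = acceptance_test(controller, 40)    # 60.0 degrees: safe
```

Note that an error producing a wrong but still safe temperature would pass through unhandled, as described above.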

Checklist

Checklist – Automatic Failure Detection

Building Blocks
Name | Description
B1) Computation | A unit of functionality with defined inputs and outputs.
B2) Multiple CPUs | The system is capable of executing more than one task at one moment in time.
B3) Comparator | Compares output from various units of functionality. Its output is one result with an indication whether the results match or not.
B4) N-Version Programming | One unit of functionality is implemented in different ways (e.g., using different algorithms) to provide redundancy.
B5) Oracle | An entity which validates an input as correct in a given domain.

Architectural Verification and Evaluation
# | Item | Description | Met | Partially Met | Not Met | Related Technique
1 | Computation | A unit of functionality with defined inputs and outputs. | | | | T1, T2, T3
2 | Multiple CPUs | The system is capable of executing more than one task at one moment in time. | | | | T2
3 | Comparator | Compares output from various units of functionality and merges them into one 'correct' result. | | | | T1, T2, T3
4 | N-Version Programming | One unit of functionality is implemented in different ways (e.g., using different algorithms) to provide redundancy. | | | |
5 | Oracle | An entity which validates input as correct within a given domain. | | | | T3


Techniques

  Logical Redundancy – B1, B3
  Physical Redundancy – B1, B2, B3
  Acceptance Test – B1, B3, B4

Notes


11.2.2 Managing Component Interactions

Concern Name

Managing Component Interactions

Context

Research by numerous respected computer scientists has shown that one of the most common sources of failures in safety-critical systems is improper interaction between components: individual components function as designed, but the sequence in which they interact breaks down. It is therefore imperative for safety-critical systems to manage components and component interactions.

Example

A robot on an assembly line has to perform three different tasks in sequence. The following architecture based on the strategy of ‘Separation of Concern’ represents the software controlling the robot:

[Component diagram: a Controller directs a Sensor and three co-operative worker components (Task 1, Task 2, Task 3); data flows from source to destination through these interactions.]

In this setup the controller functions as the brain of the robot: it decides what to do next. When the robot starts its work, the controller first calls the Sensor worker component to get data about the environment. Then the controller calls all three task components in sequence, deciding in between each task what the next best step is. The benefit of a controller/worker architecture is the isolation of behavior in one component: the controller. It decides what to do, and the task components perform the actions needed. The interactions between the two become simple, because they are one-way and do not change with system state. It is the controller that decides what to do at a given system state; hence any decision-related failures originate in the controller and no longer in the interactions.
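The robot's control cycle can be sketched in code. This is a minimal illustration of the controller/worker split described above; all class and method names are hypothetical, not taken from the report.

```python
# Illustrative sketch of the controller/worker robot: the controller holds
# all sequencing decisions, the workers only perform work. Names are
# hypothetical.

class Sensor:
    def read(self):
        # Worker: report data about the environment.
        return {"part_present": True}

class Task:
    def __init__(self, name):
        self.name = name
    def run(self, env):
        # Worker: perform one action; no sequencing decisions here.
        return f"{self.name} done"

class Controller:
    """All sequencing decisions live here; workers only perform work."""
    def __init__(self, sensor, tasks):
        self.sensor, self.tasks = sensor, tasks
    def cycle(self):
        env = self.sensor.read()          # first query the environment
        log = []
        for task in self.tasks:           # then run the tasks in sequence,
            log.append(task.run(env))     # deciding between each call
        return log

robot = Controller(Sensor(), [Task("Task 1"), Task("Task 2"), Task("Task 3")])
```

Because the workers never call one another, a sequencing fault can only originate inside `Controller.cycle`, which is the point the text makes.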


Behavior is encapsulated inside of the controller, capability inside of the task components.

Strategy Name

Separation of Concern

Description

During software development there are many concerns that have to be addressed system-wide and also component-wide. At the component level these concerns include functionality, behavior, and others. For safeguarding interactions, certain concerns are more important than others, and these important concerns need special attention. Usually concerns appear all over an architecture and are intertwined with each other, so focusing on one separate concern is very difficult. The strategy of separation of concern promotes separating concerns as cleanly as possible: once a single concern is isolated, it can be managed much more effectively. For safeguarding interactions this means isolating the concern of 'system control' from the rest of the system, since it is the origin of all component interactions.

Limitations and Constraints

Separation of concern has to adhere to the principles of low coupling and high cohesion; if it does not, the result will have the same problems as a highly coupled architecture with low cohesion. Some concerns are highly related and next to impossible to separate. The strategy's capability is directly related to how well a concern can be isolated: if a concern cannot be isolated properly, the strategy will not work properly.

T1: Cooperative Components

Name

Cooperative Components

Known As Separating Capability from Behavior

Description

One attempt to make interactions between components more rigid and safe is to completely extract behavior from all components and encapsulate it in controller components. These components become responsible for controlling the actions of all system components. The rest of the components, which remain after extracting the


control part, encapsulate the capability of the components; these are called worker components. The controller components are responsible for calling the worker components in a correct and safe sequence to accomplish the component's goal. The benefit of this architecture for interaction management is that it reduces the complexity of the interactions. Since all logic for deciding what transition to take is encapsulated inside the controller components, and the worker components only perform work, the interactions between them are straightforward: they are just requests from the controller to the worker components and do not change with system state. The problem area of deciding what to do next is encapsulated in a single location, the controller, effectively reducing the complexity of component interactions. This architecture pattern also isolates changes to behavior from changes to capability. If a task has to be changed, only one module has to be altered, while all other modules remain untouched; if the behavior of the system changes, the controller components are changed and the worker components remain the same. Another benefit of this technique is the encapsulation of behavior in one location: the controller components. Since all actions originate from this one place, it is also the location to implement any safety checks involved with these actions. This is convenient, because in a conventional architecture these checks are distributed among the components and hard to keep track of. If the system uses this technique, individual safety conditions can be checked in one place and do not have to synchronize themselves with conditions arising in other components. Safety conditions exist for every function call inside the controller component; in the event these conditions do not hold, the controller initiates alternative actions. There can be more than one controller component.

Controller components should only control as much of a system as is appropriate. For example, there should not be a single do-it-all controller; instead there should be a separate controller for each logical unit. In practice this means there will be a controller component inside each component to control component interactions, and on the next level there will be controllers to handle module interactions. The benefit of this technique is also achieved when it is used for only a part of the system; in some cases it might be sufficient to safeguard only a selected area of the system as opposed to the whole.
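The guarded-call idea above can be sketched briefly. This is a hypothetical illustration (function names and the pressure threshold are invented): the safety condition and the alternative action both live in the controller, in one place.

```python
# Hypothetical sketch: in a controller/worker design, every worker call is
# guarded by a safety condition held in the controller, with an alternative
# action for when the condition does not hold. Names and the threshold are
# illustrative assumptions.

def pressure_safe(state):
    return state.get("pressure", 0) < 100   # assumed safety condition

def run_pump(state):
    state["pump"] = "on"                    # worker: the normal action
    return "normal"

def safe_shutdown(state):
    state["pump"] = "off"                   # controller's alternative action
    return "fallback"

def controller_step(state):
    # The check and the alternative both live in the controller, so the
    # safety condition does not have to be synchronized across components.
    if pressure_safe(state):
        return run_pump(state)
    return safe_shutdown(state)
```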


Example

[Diagram: a master Controller above two Sub-Controllers; the sub-controllers drive worker components Task 1 through Task 5.]
This diagram shows the technique used in a very basic setup. There is a master controller, which initiates general software behavior; lower-level behavior is implemented in the sub-controllers. These controllers use task components to accomplish different things. The worker components do not necessarily have to be components; they can also be modules or even subsystems.

Dynamic View

The dynamic view depends on the functionality of the problem, but there is a general pattern. Behavior is always initiated by a controller component. The controller either calls another controller or a worker component to accomplish a task. When the task is done, the controller component decides on the next best move and initiates the appropriate behavior. The key concept is that a worker component can never call another: it is the controller's business to delegate, and the worker components just perform work.

Building Blocks

- Controller – A component encapsulating decision making related to component-to-component transitions.
- Worker Component – A self-contained component with a defined purpose.

Limitations and Constraints

This technique works best if the worker components are completely autonomous, that is, they do not depend on anything to accomplish their tasks. This way they can be accessed by a controller as a logical unit. If components have dependencies on each other, they should only be used as a worker component as a whole. This means that if a component is shared between tasks, the tasks that use it should be merged into one task. If this cannot be done, these tasks have an extra interaction in addition to the one with the controller component. The system is still functional in this configuration, but the extra interaction weakens the capability of the technique to safeguard against interaction failures.


Strategy Name

Monitoring Interactions

Description

One approach is to design the system as one always would and then include special external components to safeguard the interactions in the system. This strategy usually takes the form of a separate component dedicated to monitoring the interactions. This monitor component is connected to the system. It senses interactions and the data these interactions carry and validates them against a safety protocol, which it has access to. In the event of a failure the monitoring component initiates actions to handle the failure.

Limitations and Constraints

The somewhat external nature of the monitor component can cause computational overhead in the system, so for systems with very strict time limitations this strategy can cause problems. This strategy is meant to be a safety device: ideally it only monitors and never has a need to correct a failure. Under no circumstances should it be used to handle expected events; in that case it would become part of the system functionality rather than a safety construct for the system, and a safety construct for the safety construct would be needed.

T2: Breakpoints

Name

Breakpoints

Known As None

Description

One way of safeguarding the interactions is to check the validity of the interaction, and the data sent through it, at the time the interaction is active. If a failure is detected there are two options. One choice is to cancel the interaction and resume the program flow elsewhere; the second is to change the data of the interaction to a set known to produce safe behavior in the system. The technique of breakpoints allows both redirection of program flow and changing of interaction data. A breakpoint is an access point to data within a communication link between two components. Normally a component interacts with another component by calling its functions directly. With this technique a breakpoint component is introduced in the middle of this interaction: the first component calls a function of the breakpoint component, and the breakpoint component then calls the second component.


When used in the simplest mode, a breakpoint component takes the data it receives from the first component and passes it directly to the second component without altering it. In other modes the breakpoint component modifies the data before passing it to the second component, or does not pass it at all. Breakpoints alone only give access to the data of the interaction; what to do with this data is determined by the 'Interaction Checker' component. It is connected to all breakpoints in the system, and whenever a breakpoint is triggered it sends a signal to the checker. It is the checker's responsibility to determine whether an interaction guarded by one of the breakpoints is valid and safe. It does so by using the interaction-rules component, which holds the judging criteria for good interactions and the behavior for the event of a bad interaction. For a basic component-to-component interaction the setup does the following. First the component triggers the breakpoint by calling its entry function. The breakpoint activates the checker, which retrieves rules for judging the particular interaction from the rules component. If the checker judges the interaction to be valid, program flow returns to the breakpoint, which calls the second component. For the case where the interaction is judged invalid, the interaction-rules component also holds information on what to do, and there are two possible ways to react to the failure. First, the data of the interaction can be changed to fix the interaction; how to do this is included in the interaction-rules component. After fixing the data, the breakpoint injects the new data into the system by handing it to the second component. The second possibility is to cancel the interaction and resume program flow elsewhere, possibly with different data. In this case the checker does not return to the breakpoint that activated it.
It uses the interaction-rules component to find another breakpoint where it is safe and appropriate to continue execution. If need be, the interaction-rules component can also include rules to change the data of the interaction before passing it to the breakpoint where execution is to be resumed.
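The basic breakpoint setup described above can be sketched as follows. This is an illustrative sketch only: the class names echo the text, but the toy rule (non-negative payload) and repair (substitute a safe value) are invented assumptions, and the redirect-to-alternate-breakpoint path is omitted for brevity.

```python
# Sketch of a breakpoint guarding one component-to-component interaction.
# The breakpoint sits inside the communication link; the checker validates
# each interaction against the rules and may repair the data before it
# reaches the second component. Rules are illustrative toys.

class InteractionRules:
    def is_valid(self, interaction, data):
        return data >= 0            # toy rule: payload must be non-negative
    def repair(self, interaction, data):
        return 0                    # toy repair: substitute a safe value

class InteractionChecker:
    def __init__(self, rules):
        self.rules = rules
    def check(self, interaction, data):
        if self.rules.is_valid(interaction, data):
            return data
        return self.rules.repair(interaction, data)

class Breakpoint:
    """Forwards a call from the first component to the second via the checker."""
    def __init__(self, name, checker, destination):
        self.name, self.checker, self.destination = name, checker, destination
    def __call__(self, data):
        # Trigger the checker, then hand the (possibly repaired) data on.
        return self.destination(self.checker.check(self.name, data))

def component_2(data):              # the second component
    return data * 2

bp = Breakpoint("c1->c2", InteractionChecker(InteractionRules()), component_2)
```

In a full setup there would be one `Breakpoint` per interaction, all sharing the single checker and rules components, as the example diagram shows.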


Example

[Diagram: three components (Component 1, Component 2, Component 3) whose pairwise interactions each pass through a Breakpoint; all breakpoints connect to the InteractionChecker, which consults the InteractionRules component.]

In this basic example three breakpoints are used to guard the interactions between three components. There are breakpoints at all interactions between the components. The interaction-rules component holds the rules for the three interactions. The checker has the power to change the data at the three breakpoints and resume execution at any of them.

Dynamic View

[Flowchart: 1. A breakpoint is triggered. 2. The breakpoint activates the InteractionChecker. 3. The checker retrieves information about the interaction from the interaction-rules component and checks the interaction. 4. If the interaction is valid, control returns to the original breakpoint, which activates the component connected to it. 5. If the interaction is invalid, the rules determine the action: either the interaction data is corrected and control returns to the original breakpoint, or control is redirected to an alternate breakpoint; in both cases the component connected to that breakpoint is then activated.]

Building Blocks

- Breakpoint – A module giving access to the interaction it is used in.


- Interaction Checker – A component responsible for checking interactions guarded by breakpoints for validity and for correcting any failures.

- Interaction Rules – A component defining the correctness criteria for interactions and how to recover from a fault in those interactions.

Limitations and Constraints

Breakpoints can be dangerous by themselves, because they increase the degree of logical coupling of the system; by logical coupling we mean how dependent certain parts of the functionality are on others. Breakpoints add the capability to jump from one breakpoint to another at execution time, which effectively connects all parts of the system to all others. Therefore a complete safety check must include almost all aspects of the system, since any of them can be reached from anywhere within the system via a breakpoint. Care must be taken to make jumps with fatal results for the system impossible. The checker can only change the data in the interactions it controls and has no access to any state information in the components of the system. As a result the checker is able to redirect the execution of the program to another breakpoint, but the state of the system remains the same, except for what can be influenced by the data of the interaction. This limitation can either be worked around, or the checker can be given the capability to influence the state of important components.

Checklist

Checklist – Managing Component Interactions

Building Blocks

  B1) Controller – A component encapsulating decision making related to component-to-component transitions.
  B2) Worker Component – A self-contained component with a defined purpose, performing a task.
  B3) Breakpoint – A module giving access to the interaction it is used in.
  B4) Interaction Checker – A component responsible for checking interactions guarded by breakpoints for validity and for correcting any failures.
  B5) Interaction Rules – A component defining the correctness criteria for interactions and how to recover from a fault in those interactions.

Architectural Verification and Evaluation (mark each item Met / Partially Met / Not Met)

  1. Controller (related technique: T1) – A component encapsulating decision making related to component-to-component transitions.


  2. Worker Component (T1) – A self-contained component with a defined purpose, performing a task.
  3. Breakpoint (T2) – A module giving access to the interaction it is used in.
  4. Interaction Checker (T2) – A component responsible for checking interactions guarded by breakpoints for validity and for correcting any failures.
  5. Interaction Rules (T2) – A component defining the correctness criteria for interactions and how to recover from a fault in those interactions.

Techniques

  Separation of Concern – B1, B2
  Breakpoints – B3, B4, B5

Notes


11.2.3 Failure Isolation

Concern Name

Failure Isolation

Context

In the event of a failure occurring in the system, it is desirable to minimize the impact of the fault. Failure Isolation attempts to do so by isolating a failure within one area of the system, leaving all other parts to operate normally.

Example

The way partitioning and layering of the software product is done is critical. Safety-critical routines should be as small as possible and on the lowest layer possible. They should interact with as few other modules as possible, and if they do, the other modules should be at the same or a lower layer than the SC module itself. A simple example illustrates this point:

[Figure 1: a layered architecture in which the Control layer queries the Switch and commands the Ventilator, with the commands passing through intermediate layers and the I/O routines.]

Suppose a ventilator is turned on whenever the temperature in a room becomes too hot. For maintenance purposes an override switch exists which, when ON, keeps the ventilator from turning on even if the room is too hot. Let's suppose a worker is servicing the ventilator: the override switch is ON (the ventilator should not turn on), and the room becomes too hot. In a common layered architecture the control logic receives a signal from a temperature sensor that the room is too hot. It then queries the switch to see whether it is ON; if it is OFF, it tells the ventilator to turn on. This relationship is shown in Figure 1. This approach is dangerous because errors in any of the intermediate layers will affect the outcome of the operation. All modules in the calling chain become safety critical themselves, because they carry the important safety information of the switch setting. An intermediate module could corrupt the switch data through a fault or a glitch and cause the ventilator to turn on, even with the override switch ON.


To remedy this situation consider figure 2.

[Figure 2: the same layered architecture, but the Ventilator queries the Switch directly; the Control layer's command passes through the intermediate and I/O layers only as an advisory message.]

In this model, if the room becomes too hot, the control sends a signal to the ventilator. The ventilator queries the switch and, given that the switch is not ON, turns itself on. The operation sent by the control unit, 'ventilator, turn yourself on', becomes 'ventilator, turn yourself on, if you think it is safe'. The intermediate modules are still needed to complete the operation, but they are no longer safety critical: at worst they can swallow the message and not deliver it to the ventilator. In that case the ventilator will not turn on when the room gets too hot, but at least the technician will be absolutely safe when the override switch is ON. Harm can only be done if the switch and ventilator modules themselves contain errors. The SC aspect becomes isolated in the lowest layer, and the SC parts of the system become smaller.
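The remedied design can be sketched in a few lines. This is a hypothetical illustration assuming the convention that the override switch ON means the ventilator must stay off; all class and function names are invented. The point is that the safety check lives inside the ventilator, so the intermediate layer only relays an advisory request.

```python
# Sketch of the remedied design of Figure 2 (names are hypothetical).
# Assumed convention: override engaged (switch ON) means the ventilator
# must stay off. The safety decision is made by the ventilator itself.

class OverrideSwitch:
    def __init__(self, engaged):
        self.engaged = engaged      # True: maintenance override, stay off

class Ventilator:
    def __init__(self, switch):
        self.switch = switch
        self.running = False
    def request_on(self):
        # "Turn yourself on, if you think it is safe."
        if not self.switch.engaged:
            self.running = True
        return self.running

def intermediate_relay(ventilator):
    # Intermediate layer: merely forwards the advisory request. At worst it
    # swallows the message; it can never force the ventilator on.
    return ventilator.request_on()
```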

Strategy Name

System Partitioning

Description

One of the first steps when designing a software architecture is to partition the system into smaller parts. These parts can be sub-systems, layers, components, modules, and others. The way this partitioning is done influences the capability of the architecture to isolate errors. The techniques listed under this strategy describe how best to partition a system. They do not provide a solution by themselves; in fact, none of them is a complete solution. They only increase the likelihood of a failure being contained locally. Ideally, all of these techniques are present within an architecture.

Limitations and Constraints

The principal problem here is that failure isolation only concerns itself with failures of unknown origin: if the cause of a failure were known, it could be removed, eliminating the need for failure isolation altogether. The proposed architecture therefore has to provide a mechanism to isolate failures of unknown type and origin. A direct solution would include handlers for all possible errors, which is clearly not possible. The next best solution is to at least try to catch some failures, and all strategies addressing this concern are in line with this thought.


They all accomplish some level of failure isolation, but they do not guarantee total isolation.

T1: Layering

Name

Layering

Known As None

Description

The common pattern of layering can be used to separate the SC layer from the rest of the system. If the SC layer is the lowest layer, it is protected, to a certain degree, from failures happening in higher layers. Also, since there is no layer below it, a source of faults is removed altogether.

Example

[Diagram: a layer stack with the Control layer on top, intermediate layers in the middle, and the Domain layer at the bottom.]

Dynamic View

The dynamic view depends on the requirements, but one general rule exists: layers can only call components in the next higher or next lower layer. A component must never call a component in a layer not adjacent to its own.
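The adjacency rule can be stated as a tiny check. This is an illustrative sketch only; the layer names and their ordering are hypothetical.

```python
# Illustrative check of the layering rule above (layer names are
# hypothetical): a call between layers is legal only if the two layers
# are adjacent in the stack.

LAYER_INDEX = {"control": 2, "intermediate": 1, "safety_critical": 0}

def call_allowed(caller, callee):
    """Components may only call into the next higher or next lower layer."""
    return abs(LAYER_INDEX[caller] - LAYER_INDEX[callee]) == 1
```

A check like this could be run over an architecture's call graph during assessment to flag layer-skipping calls.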

Building Blocks

- Presence of Modules – the system is built in logical blocks.
- Presence of Modules with Low Coupling – module-to-module dependencies are as few as possible.
- Presence of Modules with High Cohesion – all functionality and attributes of a module are strongly related.

Limitations and Constraints None

T2: Multiple Systems

Name

Multiple Systems


Known As None

Description

Physically split the system into totally separate systems. Each separate system becomes an SC system with diminished complexity. Failure isolation can now be accomplished at a different granularity within each sub-system.

Example

[Diagram: a Control System linked to separate systems S1, S2, S3, …, Sn.]

Dynamic View

The dynamic view depends on the requirements and the system constraints. A major concern in this setup is managing system-to-system communication; this factor should be included as a possible hazard in the system's hazard list.

Building Blocks

- Hardware – multiple inter-linked platforms are available.

Limitations and Constraints

Since each system ideally executes on its own hardware, this method increases hardware cost. The communication between the systems needs to be managed, which adds code and also a potential source of problems.

T3: Isolation

Name

Isolation

Known As None

Description

Isolation promotes a different method for picking the functionality of an object. According to isolation, objects should still be picked by grouping similar functionality and attributes together, but with the added constraint that the


object must be self-contained. This means the object cannot use other objects to help it perform an action: everything the object needs to perform an action has to be contained within the object right from the start.
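The self-containment constraint can be sketched as follows. This is a hypothetical illustration reusing the ventilator example; all names and the override convention (engaged means stay off) are assumptions.

```python
# Hypothetical sketch of an isolated, self-contained object: everything the
# ventilator needs to decide (its own override state) is held inside it, so
# performing its action never depends on calls to other objects.

class IsolatedVentilator:
    def __init__(self):
        self._override_engaged = False   # state kept inside the object
        self._running = False
    def engage_override(self):
        self._override_engaged = True
    def turn_on(self):
        # The decision uses only internal state; no external lookups.
        self._running = not self._override_engaged
        return self._running

vent = IsolatedVentilator()
```

Because the object never reaches outside itself, a fault elsewhere in the system cannot corrupt the data this decision depends on.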

Example

[Diagram: the layered ventilator example, with the Control layer on top, intermediate and I/O layers below, and the Switch and Ventilator as self-contained objects.]

Dynamic View The dynamic view depends on the requirements.

Building Blocks

- Presence of Modules – the system is built in logical blocks.
- Presence of Modules with Low Coupling – module-to-module dependencies are as few as possible.
- Presence of Modules with High Cohesion – functionality and attributes of a module are strongly related.
- Presence of Modules with Low Cohesion – all functionality is grouped into an all-powerful object.

Limitations and Constraints

None

T4: Controller/Worker

Name

Controller/Worker

Known As Separating Capability from Behavior

Description

Research by numerous respected computer scientists has shown that one of the most common sources of failures in safety-critical systems is improper interaction between components: individual components function as designed, but the sequence in which they interact breaks down. One attempt to make interactions between components more rigid and safe is to completely extract behavior from all components and encapsulate it in one component, the controller. The controller component becomes responsible for controlling the actions of the other components. The rest of each component, which remains after extracting the control part, encapsulates the capability of the component. These


worker components perform the system's functions without concern for the control aspects. The controller component is responsible for calling the worker components in a correct and safe sequence to accomplish the system goals. One benefit of this architecture is that it isolates changes to behavior from changes to capability: if a task has to be changed, only one module has to be altered, while all other modules remain untouched; if the behavior of the system changes, the controller components are changed and the worker components remain the same. Another benefit of this technique is the encapsulation of behavior in one location, the controller component. Since all actions originate from this one place, it is also the location to implement any safety checks involved with these actions. This is convenient, because in a conventional architecture these checks are distributed among the objects and hard to keep track of. If the system uses this technique, individual safety conditions can be checked in one place and do not have to synchronize themselves with conditions arising in other objects. Safety conditions exist for every function call inside the controller component; in the event these conditions do not hold, the controller initiates alternative actions. There can be more than one controller component. Controller components should only control as much of a system as is appropriate: for example, there should not be a single do-it-all controller. Instead there should be a separate controller for each logical unit. In practice this means there will be a controller object inside each component to control object interactions, and on the next level there will be a controller to handle module interactions. The benefit of this technique is also achieved when it is used for only a part of the system; in some cases it might be sufficient to safeguard only a selected area of the system as opposed to the whole.

Example

[Diagram: a main Controller above two Sub-Controllers; the sub-controllers drive task objects Task 1 through Task 5.]

This diagram shows the technique used in a very basic setup. There is a main controller, which initiates general software behavior; lower-level behavior is implemented in the sub-controllers, which use task objects to accomplish different things. Keep in mind that the task objects do not necessarily have to be objects; they can also be modules or even subsystems.


Dynamic View

The dynamic view depends on the functionality of the problem, but there is a general pattern. Behavior is always initiated by a controller object. The controller either calls another controller or a worker object to accomplish the task. When the task is done, the controller object decides on the next best move and initiates the appropriate behavior. The key is that a worker object can never call another: it is the controller's business to put worker objects to work.

Building Blocks

- Controller – An object encapsulating decision making related to object-to-object transitions.
- Worker Object – A self-contained component with a defined purpose.

Limitations and Constraints

This technique works best if the worker objects are completely autonomous; that is, they do not depend on anything to accomplish their tasks. This way they can be accessed by a controller as a logical unit. If objects depend on each other, they should only be used as a worker object as a whole. This means that if an object is shared between tasks, the tasks this object is used in should be merged into one task. If this cannot be done, these tasks have an extra interaction in addition to the one with the controller object. The system is still functional in this configuration, but the extra interaction weakens the technique's capability to safeguard against interaction failures.

Checklist

Checklist – Failure Isolation

Building Blocks

B1) Modules with Low Cohesion - Modules built according to the principle of Low Cohesion have to be present. Functions are grouped into specialized, independent objects. Each object can perform all its actions without relying on others. Low Cohesion should only be present in the lowest layer.
B2) Modules with High Cohesion - All sub-units are built using the methodology of High Cohesion. Similar functionality is grouped into the same unit. Each function group becomes an object.
B3) Presence of Modules - Is the architecture in question dominated by modules?
B4) Hardware - Use more than one set of hardware.
B5) Modules with Low Coupling - Modules easily retain their functionality when separated from other modules.
B6) Controller - An object encapsulating decision making related to object-to-object transitions.

B7) Worker Object - A self-contained component with a defined purpose, performing a task.

Architectural Verification and Evaluation (each item is rated Met / Partially Met / Not Met)

1) Presence of Modules with Low Cohesion - Group functions into specialized, independent objects. Each object can perform all its actions without relying on others. Low Cohesion should only be present in the lowest layer. (Related technique: T3)
2) Presence of Modules with High Cohesion - All sub-units are built using the methodology of High Cohesion. Similar functionality is grouped into the same unit. (Related techniques: T1, T3)
3) Presence of Modules - Is the architecture in question dominated by modules? (Related techniques: T1, T3)
4) Hardware - Use more than one set of hardware. (Related technique: T2)
5) Presence of Modules with Low Coupling - Units easily retain their functionality when separated from other units. (Related technique: T1)
6) Controller - An object encapsulating decision making related to object-to-object transitions. (Related technique: T4)
7) Worker Object - A self-contained component with a defined purpose, performing a task. (Related technique: T4)

Techniques
- Layering: B2, B3, B5
- Multiple Systems: B4
- Isolation: B1, B2, B3
- Separation of Concerns: B6, B7

Notes

11.2.4 Automatic Failure Recovery

Concern Name

Automatic Failure Recovery

Context

Once a failure is detected, the next logical step is to recover from it. This recovery can happen in many ways. The preferred one is to re-run the calculation to get the correct result, but less complete methods, such as aborting the calculation and moving the system to a safe-state, are also forms of failure recovery. Depending on the situation, a method with a different set of capabilities can be chosen.

Example

(Diagram: input data flows into a computation whose output is checked by the Error Detector; on failure, execution is diverted to the Error Handler, whose corrected output rejoins the normal output path to the next unit.)

For the situation where the proper way to recover completely from a failure is known, the standard method of error handling is used. When an error is detected, the path of execution is moved to a special error handler. This code is responsible for removing the fault from the system and preparing it to resume normal operations. Once that is done, execution returns to the path normally taken when no failure occurs. A classic example is a safe division algorithm: if a division by 0 is requested, the error handler is activated and the output is changed to the default indicator value for an invalid result. The result is fed to the next unit down the execution path. Since the result is now the invalid-result indicator, the unit does not crash or behave unpredictably, because it has defined behavior for this event.
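The safe-division example above can be sketched as follows. The sentinel value (NaN as the "invalid result" indicator) and the function names are assumptions for this illustration, not part of the report.

```python
# Assumed sentinel: the "default indicator value for an invalid result".
INVALID = float("nan")

def safe_divide(a, b):
    try:
        return a / b               # normal execution path
    except ZeroDivisionError:      # error detected: division by 0
        return INVALID             # error handler substitutes the indicator

def next_unit(x):
    """Downstream unit with defined behavior for the invalid indicator."""
    if x != x:                     # NaN compares unequal to itself
        return "skipped"
    return x * 2

print(next_unit(safe_divide(10, 2)))  # 10.0
print(next_unit(safe_divide(10, 0)))  # skipped
```

Because the downstream unit recognizes the indicator, the failure is contained instead of propagating.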

Strategy Name

State Based Recovery

Description

In the event of a failure the system does not directly attempt to recover from it. Instead, it moves itself to a predefined system state, effectively masking the occurrence of the failure.

Limitations and Constraints

All of the techniques described below have to be supported by some error detection method. A system can only recover from a failure if it knows about its presence. As a result, the effectiveness of all failure recovery techniques listed here depends heavily on the error detection technique used.

T1: Continuous System State Archiving Name

Continuous System State Archiving

Known As Frequent system state backup

Description

This technique uses a state manager to continuously take snapshots of the system state during normal execution. These are stored in a state repository for later use. In the event of a failure the state manager activates one of the archived states and moves the system back in time to this state; normal operations resume from there. How the state manager decides when and where to take snapshots, how many to store, and to which state to revert in the event of a failure is problem specific. This technique is very well suited for enabling the system to abort actions in the event of a failure: at the beginning of the action a state snapshot is taken, and if a failure happens, the system reverts to the initial state. However, this capability to abort an action should not be used blindly to re-run the action. If the failure encountered during the first execution is a permanent one, like a coding error, it will reoccur every time the action is performed; in such a scenario an endless retry loop is possible, effectively blocking the system. As a result, this technique is best suited for user-initiated actions, where in the event of recovery it is the user's choice to re-run the action. If this technique is used for software-initiated actions, there has to be an upper limit on the re-runs the state manager is allowed to perform before it must handle the failure in a different manner (the safe-state technique lends itself to this).
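A hedged sketch of the snapshot-and-revert idea, including the upper limit on re-runs recommended above. The snapshot policy, the MAX_RETRIES value, and all names are illustrative assumptions, not the report's design.

```python
import copy

MAX_RETRIES = 3  # assumed upper limit before the failure must be escalated

class StateManager:
    def __init__(self, initial_state):
        self.repository = [copy.deepcopy(initial_state)]  # snapshot repository

    def snapshot(self, state):
        self.repository.append(copy.deepcopy(state))

    def revert(self):
        return copy.deepcopy(self.repository[-1])  # most recent archived state

def run_with_recovery(manager, state, action):
    manager.snapshot(state)                  # snapshot taken before the action
    for _ in range(MAX_RETRIES):
        try:
            return action(copy.deepcopy(state))
        except Exception:
            state = manager.revert()         # roll back and try again
    # Beyond the limit, handle differently (e.g. move to a safe-state).
    raise RuntimeError("escalate: move to safe-state")

# Usage: an action with a transient failure that succeeds on the second run.
attempts = {"n": 0}
def flaky(state):
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise ValueError("transient failure")
    return state["value"] + 1

mgr = StateManager({"value": 41})
print(run_with_recovery(mgr, {"value": 41}, flaky))  # 42
```

The bounded loop is the key point: a permanent failure cannot trap the system in endless re-runs.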

Example

(Diagram: the State Manager sits alongside the normal input/output path of the system's functionality; the Error Detector checks the output, while the State Manager archives snapshots in the State Snapshot Repository and restores one when a failure is signaled.)

Dynamic View

(Flowchart: Error detected? If no, return from the function block. If yes, the State Manager activates, finds an appropriate archived state, inserts it into the system, and execution continues according to the inserted state.)

Building Blocks
- Error Detector - A module or component that detects the presence of a failure in the output of another module.
- State Manager - A module controlling the system's state and capable of changing it.
- Snapshot Repository - A container module holding state snapshots of the system.

Limitations and Constraints

The system should limit the number of times it tries to recover from a failure, in order to prevent it from getting into an infinite loop.

T2: Safe-State Name

Safe-State

Known As None

Description

This technique is a variation of technique T1, 'Continuous System State Archiving'. The basic idea behind both techniques is that in the event of a failure the system state is changed, rather than a direct recovery being performed. In the case of the 'Safe-State' technique, this action starts with the error detector encountering a failure. It notifies the state manager module of the event. The state manager then moves the system into a predefined safe-state. A safe-state is defined to be a state in which the system will not harm anybody through its actions or inactions. At this point the failure recovery is complete. With the system in the safe-state, some other method assumes control and offers the user actions to move back out of the safe-state and resume normal operations. Usually a safe-state provides degraded capabilities compared to the system in normal operation: all non-essential activities of the system are stopped to exclude them as possible sources of failures. The state manager is responsible for this, and it communicates its reconfiguration actions to the rest of the system through the capability protocol module. This module contains a description of the system's current capabilities. The process charged with moving the system out of the safe-state uses this information to offer the user only those actions the system can perform in its degraded state. This technique is only marginally concerned with the nature of the failure it handles: in its pure form, any failure will move the system to the same safe-state. This is a benefit of the technique, because it allows for the handling of failures of unknown type. No matter what failed and how, as long as the failure can be detected, the system will move to the safe-state. It is also possible to have multiple, different safe-states and have the state manager choose between them on a failure-by-failure basis.
Obviously, this technique is not well suited to handling very basic failures, because the system moves to the safe-state every time. The real benefit of the technique is its capability to handle any type of detectable failure.
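As a rough illustration of the safe-state move and the capability protocol update described above; the concrete capability names and the API are invented for this sketch.

```python
class CapabilityProtocol:
    """Records which capabilities the system currently offers."""
    def __init__(self):
        self.capabilities = {"measure", "log", "transmit"}

    def restrict_to(self, allowed):
        self.capabilities = set(allowed)


class StateManager:
    # In the safe-state, all non-essential activity is stopped.
    SAFE_STATE_CAPABILITIES = {"measure"}

    def __init__(self, protocol):
        self.protocol = protocol
        self.state = "normal"

    def on_failure(self, failure=None):
        # Any detectable failure, known or unknown, triggers the same move.
        self.state = "safe"
        self.protocol.restrict_to(self.SAFE_STATE_CAPABILITIES)


proto = CapabilityProtocol()
mgr = StateManager(proto)
mgr.on_failure("unexpected sensor fault")
print(mgr.state)            # safe
print(proto.capabilities)   # {'measure'}
```

The process that later moves the system back out of the safe-state would consult `proto.capabilities` to offer only actions the degraded system can perform.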

Example

(Diagram: the Error Detector monitors the input/output path; on failure the State Manager inserts the predefined Safe State into the system and records the reduced capabilities in the Capability Protocol.)

Dynamic View

(Flowchart: Error detected? If no, return from the function block. If yes, the State Manager activates, inserts the safe-state into the system, and execution continues according to the inserted state.)

Building Blocks
- Error Detector - A module or component that detects the presence of a failure in the output of another module.
- State Manager - A module controlling the system's state and capable of changing it.
- Safe-State - A module defining a system state in which the system cannot harm anybody through its actions or inactions.
- Capability Protocol - A data store of all current system capabilities.

Limitations and Constraints

This technique can be very inflexible because it forces the system into the safe-state each time a failure happens. Sometimes it might not be necessary to immediately go

to the safe-state. Other techniques can be cheaper and maintain usability better than this technique.

Strategy Name

Total Recovery

Description

Whenever a failure occurs, a known action is taken to remove the failure, and any effects resulting from it, from the system. This is the preferred method for recovering from failures because the operation in which the failure occurred still finishes with a correct result.

Limitations and Constraints

The techniques presented here are very specific, since they handle specific failures. If a failure happens that the recovery mechanism was not designed for, it will not be caught or handled. It is strongly suggested to add a recovery mechanism suitable for catching unknown failures to the system in addition to implementing this strategy. All of the techniques described below have to be supported by some error detection method. A system can only recover from a failure if it knows about its presence. As a result, the effectiveness of all failure recovery techniques listed here depends heavily on the error detection technique used.

T3: Error Handler Name

Error Handler

Known As None

Description

When the error detector detects an error, the recovery mechanism takes the process's output and fixes it. By doing this, the recovery mechanism prevents the process from generating bad input for the next module down the line. This technique is incorporated into most current programming languages in the form of exception handling (as in C/C++). It is very basic and by itself cannot provide total recovery from all possible failures; the other techniques listed in this strategy are more powerful. Still, error handlers have distinct benefits. In both architectural and financial terms, error handlers are very cheap to produce. They can be included in the code of the module they safeguard in the form of try-catch blocks. In this case the programming language framework provides most of the modules

needed. At the other extreme, a complete solution might be implemented using custom-built modules for the error detector and the error handler. This implementation is still very basic and easy to build. The second benefit of error handlers is their capability to handle a vast number of failures. Granted, these failures might be very basic in nature, but this attribute fits the technique: a basic failure should be handled by a basic recovery technique, and anything else is overkill. To summarize, error handlers are an easy and inexpensive way to recover from basic failures, leaving the more complicated failures to the more powerful techniques.

Example

(Diagram: input data flows into a computation whose output is checked by the Error Detector; on failure, execution is diverted to the Error Handler, whose corrected output rejoins the normal output path to the next unit.)

Dynamic View

(Flowchart: Error detected? If yes, execute the Error Handler; if no, return from the function block.)

Building Blocks
- Error Detector - A module or component that detects the presence of a failure in the output of another module.
- Error Handler - A unit capable of correcting the effects of a failure in a previous module.

Limitations and Constraints

The major problem with error handlers is that they can only recover from known failures. When a specific failure happens, a specific solution must be implemented inside the error handler for that one case. If the error detector encounters a failure of unknown type, the error handler cannot recover because it does not know how to; it can only specify generic behavior for the event of an unknown failure. This technique is best used for fixing standard and obvious failures, like the exceptions defined in the programming language. To supplement it, an additional failure recovery technique should be implemented that can react to failures of unknown type.

Checklist

Checklist – Automatic Failure Recovery

Building Blocks

B1) Error Detector - A module or component that detects the presence of a failure in the output of another module.
B2) Error Handler - A unit capable of correcting the effects of a failure in a previous module.
B3) State Manager - A module controlling the system's state and capable of changing it.
B4) State Repository - A container module holding state snapshots of the system.
B5) Safe-State - A module defining a system state in which the system cannot harm anybody through its actions or inactions.
B6) Capability Protocol - A data store of all current system capabilities.

Architectural Verification and Evaluation (each item is rated Met / Partially Met / Not Met)

1) Error Detector - A module or component that detects the presence of a failure in the output of another module. (Related techniques: T1, T2, T3)
2) Error Handler - A unit capable of correcting the effects of a failure in a previous module. (Related technique: T3)
3) State Manager - A module controlling the system's state and capable of changing it. (Related techniques: T1, T2)
4) State Repository - A container module holding state snapshots of the system. (Related technique: T1)

5) Safe-State - A module defining a system state in which the system cannot harm anybody through its actions or inactions. (Related technique: T2)
6) Capability Protocol - A data store of all current system capabilities. (Related technique: T2)

Techniques
- T1) Continuous System State Archiving: B1, B3, B4
- T2) Safe-State: B1, B3, B5, B6
- T3) Error Handlers: B1, B2

Notes

11.2.5 Reconfiguration

Concern Name

Reconfiguration

Context

Reconfiguration is a method used to handle permanent failures within a module of a system. At any given moment, a module of the system can suffer an internal failure. If this failure is not handled by any safety technique, the corresponding module is broken beyond repair, and when one module of the system does not function, usually the entire system does not function. Reconfiguration addresses this issue. When a module ceases to function properly in a system employing reconfiguration, the module in question is removed from the system. The remaining system no longer includes faulty modules and is able to function. Depending on the type of reconfiguration used, the system either replaces the faulty module with a spare module, or it removes the module along with its functionality and continues operations with diminished capabilities.

Example

The simplest version of reconfiguration is n-version programming using an error detector. Let us suppose there is a safety-critical process with a defined input and output. The development team has programmed three versions of this algorithm and uses them in a three-way n-version setup. An error detection unit is connected to all three process versions to determine the final output by comparing the individual results.

(Figure 1: the input fans out to three Computation modules; their results feed the Error Detector, which produces the final output and directs the Configuration Control unit, which can remove a defective Computation module from the setup.)

In normal operation, an input to this setup goes to all three processes. They all produce the same output, and the error detector presents this result as the output. In the event of a failure in one of the processes, the error detector presents the majority result as the output and reconfigures the system. The one module with the minority result is deemed defective and is removed from the system. The remaining two modules and the voter still serve their purpose, and the overall system remains functional: it has reconfigured itself to remove the defective module from its setup. The error detector can use different algorithms to decide what to do in the event of a failure. The most basic and most commonly used method, disabling the offending unit, is presented here.
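A minimal sketch of the three-way voter with reconfiguration described above: the minority module is dropped from the active set. The voting rule and all names are assumptions for illustration; a real n-version setup would use independently developed implementations.

```python
# Three "versions" of the same computation; v3 is deliberately faulty
# so the demo exercises the reconfiguration path.
def v1(x): return x * x
def v2(x): return x * x
def v3(x): return x * x + 1

class Voter:
    def __init__(self, versions):
        self.versions = list(versions)  # the currently active modules

    def compute(self, x):
        results = [(v, v(x)) for v in self.versions]
        values = [r for _, r in results]
        majority = max(set(values), key=values.count)  # majority vote
        # Reconfigure: remove any module whose result is in the minority.
        self.versions = [v for v, r in results if r == majority]
        return majority

voter = Voter([v1, v2, v3])
print(voter.compute(4))      # 16 (majority result; v3 is removed)
print(len(voter.versions))   # 2
```

After the first disagreement the system runs on two modules; the voter and the remaining versions continue to serve their purpose.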

Strategy Name

Redundancy

Description

A system is able to compensate for the failure of one component if it has an equivalent component ready to replace the faulty one; such a system uses the strategy of redundancy. The techniques in this strategy describe different setups for redundancy, including a technique for the special case in which a component fails for which no redundancy exists. The following techniques each implement reconfiguration for one module or component. For use with multiple modules or components, these techniques should be viewed as patterns and expanded upon accordingly.

Limitations and Constraints

All of the techniques described below are limited by the fault detection technique used. Reconfiguration will only be successful if a fault can be detected and its origin determined; to do so, reconfiguration depends heavily on fault detection. As a further limiting factor, the degree of modularization determines the extent and the difficulty of implementing reconfiguration. Reconfiguration only works well with systems made up of components, because only components as a whole can be removed or added. A tightly coupled architecture can make the implementation of reconfiguration difficult and sometimes even impossible.

T1: Static Redundancy Name

Static Redundancy

Known As None

Description

Static Redundancy uses redundant modules, which are active at all times during execution. This means that in a set of three redundant modules, all three receive information and produce results until a fault is detected in one of them. When this happens, the module in question is removed from the system. The remaining two modules perform as usual, and the system remains fully functional.

Example

(Diagram: as in Figure 1, the input fans out to three active Computation modules; the Error Detector compares their results to produce the output, and the Configuration Control removes any module found defective.)

Dynamic View

(Flowchart: feed the input to the computations. Error detected? If no, return from the function block. If yes, the Configuration Manager determines and removes the faulty component. If the system is still capable of functioning after the removal, the Configuration Manager determines the result using the output of the remaining components and returns from the function block; otherwise the fail-safe procedure is activated and execution continues according to it.)

Building Blocks
- Computation - A unit with defined inputs and outputs, which performs computations.
- Configuration Control - A unit that determines and monitors the system configuration.
- Error Detector - A unit for detecting errors.
- N-Version Programming - One unit of functionality is implemented in different ways (e.g., using different algorithms) to provide redundancy.

Limitations and Constraints

Since all computations in this setup are performed in sequence, this method can be expensive in terms of CPU time. If this is a problem, see T2, 'Dynamic Redundancy'.

T2: Dynamic Redundancy Name

Dynamic Redundancy

Known As None

Description

Dynamic Redundancy uses spare modules to replace defective ones. During normal operation, only one module, or a selection of modules, is actively involved in producing results. When a module is determined to be defective, it is removed from the system and replaced with its designated spare. This technique can also be used to replace modules used in n-version programming. For example, if a system cannot have more than three-way redundancy because of performance issues, dynamic redundancy can be used to remain at the three-way redundancy level when a failure occurs. In the event of a failure of any given module, the module in question is replaced with a spare; this way the failure is removed from the system without changing the logical composition of the system. If this technique executes flawlessly, the system will remain at 100% functionality at all times. In the case of the three-way redundancy setup, a failure in any one of the processes would be handled by removing the offending process and replacing it with a spare; the three-way redundancy setup remains the same. There is no limit on the number of spare modules used.
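The spare-swap behavior can be sketched as follows; the ConfigurationControl API, the error detector, and the notion of a "defective" result are invented for this illustration.

```python
class ConfigurationControl:
    """Holds one active module and a queue of designated spares."""
    def __init__(self, active, spares):
        self.active = active
        self.spares = list(spares)

    def run(self, x, error_detector):
        result = self.active(x)
        # While the result is flagged as faulty, swap in the next spare.
        while not error_detector(result) and self.spares:
            self.active = self.spares.pop(0)   # replace the defective module
            result = self.active(x)
        if not error_detector(result):
            raise RuntimeError("no spare produced a valid result")
        return result


def broken(x):   return -1      # defective primary module
def spare_ok(x): return x + 1   # designated spare

ctrl = ConfigurationControl(broken, [spare_ok])
valid = lambda r: r >= 0        # assumed error detector: negatives are failures
print(ctrl.run(41, valid))      # 42
```

Note that the logical composition is unchanged: the caller still sees one module with the same interface, only its implementation was swapped.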

Example

(Diagram: the input feeds the active Computation modules and, when engaged, a Spare; the Error Detector compares Result 1 and Result 2 to produce the output, and the Configuration Control swaps the Spare in for a module found defective.)

Dynamic View

(Flowchart: feed the input to the computations. Error detected? If no, return from the function block. If yes, the Configuration Manager determines and removes the faulty component and inserts a spare into the configuration. If the system is still capable of functioning, the Configuration Manager determines the result using the output of the remaining components and returns from the function block; otherwise the fail-safe procedure is activated and execution continues according to it.)

Building Blocks
- Computation - A unit with defined inputs and outputs, which performs computations.
- Spare Module - A unit with defined inputs and outputs, which performs computations and is used only when another module fails.
- Configuration Control - A unit that determines and monitors the system configuration.
- Error Detector - A unit for detecting errors.
- N-Version Programming - One unit of functionality is implemented in different ways (e.g., using different algorithms) to provide redundancy.

Limitations and Constraints

This technique still uses redundancy and is therefore limited by the available CPU time, but much less so than the 'Static Redundancy' technique.

T3: Graceful Degradation Name

Graceful Degradation

Known As None

Description

A system using graceful degradation loses functionality when a failure occurs, but it does not crash. A simple example is the printer attached to a computer: when it fails, the computer is simply no longer able to print, while all other functions of the computer remain available. This method is reconfiguration in the sense of removing a defective module from the system; the system still reconfigures itself, it just does not replace the defective part.

Example

(Diagram: the input feeds Functionality 1 through Functionality 3; the Error Detector monitors their results to produce the output, and on failure the Configuration Control removes the failed functionality and sends an update to the Capability Protocol.)

Dynamic View

(Flowchart: feed the input to the computations. Error detected? If no, return from the function block. If yes, the Configuration Manager determines and removes the faulty component and updates the Capability Protocol. If the system is still capable of functioning, return from the function block with an ERROR indicator; otherwise the fail-safe procedure is activated and execution continues according to it.)

Building Blocks
- Computation - A unit with defined inputs and outputs, which performs computations.
- Configuration Control - A unit that determines and monitors the system configuration.
- Error Detector - A unit for detecting errors.
- Capability Protocol - A data store of all current system capabilities.
- N-Version Programming - One unit of functionality is implemented in different ways (e.g., using different algorithms) to provide redundancy.

Limitations and Constraints

This method will only work effectively if the rest of the system makes use of the capability protocol. An operation should not even be attempted when the capability protocol states that it cannot be done; otherwise a lot of time is wasted as the system repeatedly tries a computation destined to fail from the start. This can also end in an infinite loop.
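A minimal sketch of graceful degradation guarded by a capability protocol, as the limitation above requires: operations are not even attempted when the protocol says they are unavailable. The capability names and API are invented for this illustration.

```python
class System:
    def __init__(self):
        # The capability protocol: everything the system can currently do.
        self.capabilities = {"compute", "print"}

    def on_module_failure(self, capability):
        # Reconfigure by removing the capability instead of replacing the module.
        self.capabilities.discard(capability)

    def request(self, capability):
        # Consult the protocol first; never attempt an unavailable operation.
        if capability not in self.capabilities:
            return "unavailable"
        return "done"


sys_ = System()
sys_.on_module_failure("print")   # the printer module fails
print(sys_.request("print"))      # unavailable
print(sys_.request("compute"))    # done
```

Because callers check the protocol before acting, no time is wasted on computations destined to fail, and no retry loop can form around the missing capability.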

Checklist

Checklist – Reconfiguration

Building Blocks

B1) Computation - A unit with defined inputs and outputs, which performs computations.
B2) Configuration Control - A unit that determines and monitors the system configuration.
B3) Error Detector - A module or component that detects the presence of a failure in the output of another module.
B4) Spare Module - A unit with defined inputs and outputs, which performs computations and is used only when another module fails.
B5) N-Version Programming - One unit of functionality is implemented in different ways (e.g., using different algorithms) to provide redundancy.
B6) Capability Protocol - A data store of all current system capabilities.

Architectural Verification and Evaluation (each item is rated Met / Partially Met / Not Met)

1) Computation – A unit with defined inputs and outputs, which performs computations. Related Techniques: T1, T2, T3
2) Configuration Control – A unit which determines and monitors system configuration. Related Techniques: T1, T2, T3
3) Error Detector – A module or component which detects the presence of a failure in the output of another module. Related Techniques: T1, T2, T3
4) Spare Module – A unit with defined inputs and outputs, which performs computations, and which is only used when another module fails. Related Technique: T2
5) N-Version Programming – One unit of functionality is implemented in different ways (e.g., using different algorithms) to provide redundancy. Related Techniques: T1, T2
6) Capability Protocol – A data storage of all current system capabilities. Related Technique: T3

Techniques
Static Redundancy: B1, B2, B3, B5
Dynamic Redundancy: B1, B2, B3, B4, B5
Graceful Degradation: B1, B2, B3, B5, B6


Notes


11.2.6 Testability

Concern Name
Testability

Context
A method used to increase the number of errors that can be caught in the testing phase by providing interfaces for testing tools.

Example

A highly testable system should provide a means to integrate testing tools without compromising the safety-critical system.

[Diagram: a Testing Interface sits between an external test entity and the System, with separate control and data channels, each with a source and destination.]

Let us suppose that a system is developed with testability in mind. The system consists of many different layers of concerns, each of which interfaces with a special test layer one or more times. An example of a layer of concern would be the user interface. The test layer functions as a means of gathering information about the status of the system and sending test data to the system. The test layer itself does not possess any extensive computational functionality; it provides an interface to the system for an external entity for testing purposes. This entity is some sort of test control device, usually a process on the system itself, but separate from the system processes. The test layer only negotiates the connection between the system and the external entity. Any interpretation of the data, or generation of data to insert into the system, is left to the external entity. The point here is that the system should only perform the actions it is built for. Testing usually does not take place once the system is deployed. Therefore it is desirable to keep any test logic external and to include only minimal test code inside the system.


Another point which supports this approach is the idea of testing the final system and not an almost finished version of it. Testing a system can easily be done by changing its code to produce artificial faults or by producing input from a script instead of the user. The problem here is that the test then happens on a system which is not identical to the actual deployed system. Especially when dealing with timing issues in hard real-time systems, this can be a considerable problem. With the test layer technique, the test layer remains in the code of the system when it is deployed; it is the external test entity which is removed. The fact that the test interface remains in the system after deployment can be seen as a benefit: at any point in time, different kinds of tools can be connected to the system for different purposes.

Strategy

Name
Test Tool Interface

Description
This strategy inserts an interface for test tools into the code of the system. When the system is completed, this test interface is used to insert test data into the system and gather the test results. This allows automated testing through the use of sophisticated test tools. The following techniques all provide different testing capabilities; depending on which capabilities are needed during testing, a different collection of the listed techniques has to be implemented. Some of the techniques listed are patterns for the implementation of one test case; to use multiple test cases, the techniques have to be implemented multiple times.

Limitations and Constraints

All techniques in this strategy add code to the system. As a result, the finished system is bigger and almost certainly slower than the same system without the added code. On hard real-time systems or systems with severely constrained hardware resources, these techniques might be impossible to use.

T1: Test Layer

Name
Test Layer

Known As
Test Interface, Test API


Description
A test layer encapsulates a testing interface for the system. It contains all interface components necessary for any external or internal testing tools. The test layer is connected to the system layers below it, and the type of connection can vary. The simplest connection type is to have the system report to the test layer at certain moments in time. This just involves routing component interactions through the test layer. However, this approach links the system to the test layer and should only be used for very simple problems. More sophisticated approaches hide the existence of the test layer from the system: the test layer attaches to the system, not the system to the test layer. One of these approaches is the technique of breakpoints mentioned later in this pattern.

The test layer is the interface between the system and the test tools, and it is essential for all testability techniques. By itself it usually does not provide any testing capability; an external process is needed to accomplish this. This external process uses the test layer to gain access to the system. It extracts data from the system and runs an analysis on it. Alternatively, the external process can insert data into the system through the test layer.

The connection between the test layer and the external process can take different forms. The simplest is a component-to-component interaction within the system itself: a separate component, which executes inside the system under test, gathers data through the test layer and runs its test cases. This is the basic configuration of the test layer. More complex configurations use process-to-process communication. For example, the test layer can provide a serial port interface; in this case the external task executes on a separate system, possibly a dedicated test station, and interfaces with the system through a serial link.
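The simplest reporting connection described above might be sketched as follows. This is an illustrative Python outline under assumed names, not part of the pattern: the test layer only negotiates the connection, and when no external entity is attached (the deployed configuration), reports go nowhere.

```python
class TestLayer:
    """Negotiates the connection; holds no test logic of its own."""

    def __init__(self):
        self._listener = None

    def attach(self, listener):
        # The external test entity connects here.
        self._listener = listener

    def report(self, event, payload):
        # System-side reporting hook; a no-op when nothing is attached,
        # which is the deployed configuration.
        if self._listener is not None:
            self._listener(event, payload)


class System:
    """A stand-in for the layers of concern; reports through the test layer."""

    def __init__(self, test_layer):
        self._test_layer = test_layer

    def compute(self, x):
        result = x * 2
        self._test_layer.report("compute", result)
        return result


layer = TestLayer()
system = System(layer)

captured = []
layer.attach(lambda event, payload: captured.append((event, payload)))
system.compute(21)
print(captured)  # [('compute', 42)]
```

Note that the interpretation of the captured data lives entirely in the attached listener, mirroring the rule that the test layer itself provides no testing capability.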

Example

[Diagram: the Testing Interface sits between the System and an external test entity, with separate control and data channels, each with a source and destination.]


Dynamic View
There is no general pattern for the interaction between the system and the test layer, or for the interaction between the test layer and the external process.

Building Blocks

- Test Layer – A data injection and collection method used to verify the functionality of a variety of operational scenarios.

Limitations and Constraints
This method cannot be used if the system is under severe speed constraints. Adequate system resources, such as system memory, are required.

T2: State-Snapshot

Name
State-Snapshot

Known As
None

Description
One possible item to test is the state of the system at certain points in time. To do so, access to the state of the system must be available so that it can be analyzed by an external process. Within the system there is a state manager object. It is responsible for recording the current state of the system and placing this state record in the state repository. For testing, a test interface is provided which gives access to the recorded states.

One might try to implement this technique without the state manager, by extracting only a selected amount of state information on demand at only the location of interest. While this method of implementation serves the purpose of validating the system state, it is not expandable at all. The state manager object is very versatile and an integral part of many techniques for safety-critical and generic systems. Implementing a full state manager object is strongly recommended.
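A rough Python sketch of the state manager and snapshot repository (all names are assumptions for illustration): the state manager records the current state at predefined moments, and the repository holds the snapshots for later access through the test interface.

```python
class SnapshotRepository:
    """Container of state snapshots of the system."""

    def __init__(self):
        self._snapshots = []

    def deposit(self, snapshot):
        self._snapshots.append(snapshot)

    def all(self):
        return list(self._snapshots)


class StateManager:
    """Records the current system state into the repository."""

    def __init__(self, repository):
        self._repository = repository
        self._state = {}

    def set(self, key, value):
        self._state[key] = value

    def snapshot(self):
        # Called at predefined moments during execution; a copy is stored
        # so later state changes do not alter recorded snapshots.
        self._repository.deposit(dict(self._state))


repo = SnapshotRepository()
manager = StateManager(repo)
manager.set("mode", "init")
manager.snapshot()
manager.set("mode", "running")
manager.snapshot()

# The test interface would expose repo.all() to the external analysis process.
print(repo.all())  # [{'mode': 'init'}, {'mode': 'running'}]
```

Copying the state on each snapshot is what makes the repository safe to analyze offline while the system keeps running.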


Example

[Diagram: the system Functionality reports to a State Manager, which deposits snapshots into a State Snapshot Repository; a Testing Interface gives access to the repository.]

Dynamic View

The flow diagram proceeds as follows:

- The State Manager takes a state snapshot.
- The state snapshot is deposited in the repository.
- Execution continues, repeating the snapshot cycle, until the program exits.

This diagram shows the behavior of the part of the technique which executes as part of the actual system to be tested. All this part does is take a snapshot of the current state at predefined moments and save it in the state repository for later use.

Building Blocks

- State Manager – A module controlling the system's state; capable of changing it.
- Snapshot Repository – A container module of state snapshots of the system.
- Test Interface – An interface module giving access to all functionality related to testing.

Limitations and Constraints
There are no major problems other than hardware and timing constraints, but the implementation of the state manager can be complicated depending on the location of the state variables in the architecture. Since the state manager must be connected to all objects with state variables, the coupling of the state manager will be very high.


T3: Error-Injection

Name
Error-Injection

Known As
None

Description
One very important way of testing safety-critical systems is to inject faults to see whether the system can deal with them or not. This technique provides a fault-injection mechanism for faults related to bad data. Bad data faults are data items within the system which are in an invalid state relative to the state of the rest of the system. For example, an incorrect output of a function is bad data relative to the input of the function and also to the rest of the system, which expects good data from the function.

The most basic way to insert bad data into a system is to include a breakpoint object in a communication path between two objects. Normally an object directly invokes another object's functions. With this technique, a breakpoint object is inserted between the two: the first object interfaces with the breakpoint object, and the breakpoint object calls the function of the second object. The purpose of the breakpoint object is to give access to the data inside the communication. This access can either be 'read' to check for data consistency or 'write' to inject faults.

Fault injection uses this breakpoint object to change data from a correct value to an incorrect one after a certain computation, to see if the error handler of that computation functions properly. At the point where the error handler is supposed to have recovered from the failure, a second breakpoint is inserted. This breakpoint gives access to the output of the error handler to determine the success of the recovery method. It is optional, because the system behavior can also be used as an indicator: if the system behaves as defined for the injected fault, success is achieved.

To control the breakpoints, a test-interface object is used. It manages a whole array of breakpoint objects and has an interface to every breakpoint it has to write to. All breakpoints which give access to data notify the test-interface object of a data change. The configuration used in the example is only one of many possible solutions. Breakpoints can be inserted at almost any location and in any number. The important thing to watch out for is to insert breakpoints at points where the value of the data they measure is defined, and to measure only complete logical units of functionality.
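The 'write' side of the breakpoint idea might look like this in Python (a hedged sketch; every name here is invented for illustration): a breakpoint object sits in the communication path in front of the error handler, and the test interface rewrites correct data into a fault on its way through.

```python
class TestInterface:
    """Controls the breakpoints: records observed data and injects faults."""

    def __init__(self):
        self._overrides = {}   # breakpoint name -> faulty value to inject
        self.observed = {}

    def inject(self, name, faulty_value):
        self._overrides[name] = faulty_value

    def intercept(self, name, data):
        # Breakpoints notify the test interface of every data change;
        # if a fault is registered for this breakpoint, inject it.
        self.observed[name] = data
        return self._overrides.get(name, data)


class Breakpoint:
    """Sits between two objects and routes the data via the test interface."""

    def __init__(self, target, test_interface, name):
        self._target = target
        self._test_interface = test_interface
        self._name = name

    def __call__(self, data):
        data = self._test_interface.intercept(self._name, data)
        return self._target(data)


def error_handler(value):
    # Recovery logic under test: clamps invalid negative readings to zero.
    return max(value, 0)


ti = TestInterface()
guarded = Breakpoint(error_handler, ti, "bp1")

ti.inject("bp1", -5)      # replace the good value with a fault
recovered = guarded(7)    # 7 is overwritten by -5 on its way in
print(recovered)          # 0 -> the error handler recovered from the fault
```

A second breakpoint on the handler's output would then compare the recovered value against the defined fail-safe value, as described above.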


Example

[Diagram: two Breakpoints bracket the Functionality and its Error Handler in the data path between source and destination; both Breakpoints connect to the Testing Interface.]

Dynamic View

The flow diagram proceeds as follows:

- The functionality triggers the breakpoint.
- The breakpoint activates the Testing Interface.
- The Testing Interface injects faulty data.
- The breakpoint feeds the faulty data into the functionality.
- The functionality executes.
- The second breakpoint is triggered.
- The second breakpoint activates the Testing Interface.
- The Testing Interface compares the intercepted data to the expected data.
- The Test Interface records the result of the test.
- The second breakpoint returns execution to the system.

Building Blocks

- Breakpoint – A module giving access to the data of the module-to-module communication it is used in.
- Test Interface – An interface module giving access to all functionality related to testing.

Limitations and Constraints
The breakpoint objects and the test-interface object add computation time to the system. For hard real-time systems, this additional CPU time might be impossible to spare. The added size of the compiled code might be too much for limited environments such as embedded systems.

T4: Input-Output Testing

Name
Input-Output Testing

Known As
IO Testing

Description
One way to read data from a system is to include a breakpoint object in a communication path between two objects. Normally an object directly invokes another object's functions. With this technique, a breakpoint object is inserted between the two: the first object interfaces with the breakpoint object, and the breakpoint object calls the function of the second object. The purpose of the breakpoint object is to give access to the data inside the communication. This access can either be 'read' to check for data consistency or 'write' to inject faults.

IO testing uses this breakpoint object to gain access to data before and after a computation, to see if it is consistent. At these points, the breakpoints notify the test-interface object of a data change event. Within the test interface, any logging or validation of the data gathered by the breakpoints is done or initiated. The configuration used in the example is only one of many possible solutions. Breakpoints can be inserted at almost any location and in any number. The important thing to watch out for is to insert breakpoints at points where the value of the data they measure is defined, and to measure only complete logical units of functionality.
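The 'read' side of the same breakpoint idea can be sketched as follows (illustrative Python, assumed names): two read-only breakpoints bracket a computation and notify the test interface of each data-change event, so input and output can be checked for consistency.

```python
class TestInterface:
    """Collects data-change notifications from the breakpoints."""

    def __init__(self):
        self.log = []

    def on_data(self, name, data):
        self.log.append((name, data))


class Breakpoint:
    """Read-only tap inserted into a module-to-module communication path."""

    def __init__(self, target, test_interface, name):
        self._target = target
        self._test_interface = test_interface
        self._name = name

    def __call__(self, data):
        # Notify the test interface of the data change, then pass it on
        # unmodified ('read' access only).
        self._test_interface.on_data(self._name, data)
        return self._target(data)


def functionality(x):
    # The computation whose input/output consistency is being checked.
    return x + 1


ti = TestInterface()
# One breakpoint before the computation and one after it.
after_bp = Breakpoint(lambda result: result, ti, "output")
before_bp = Breakpoint(lambda x: after_bp(functionality(x)), ti, "input")

result = before_bp(4)
print(result)  # 5
print(ti.log)  # [('input', 4), ('output', 5)]
```

Validation of the logged pairs (e.g., checking that output equals input plus one) would be done or initiated within the test interface, as the description states.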


Example

[Diagram: two Breakpoints bracket the Functionality in the data path between source and destination; both Breakpoints connect to the Testing Interface.]

Dynamic View

The flow diagram proceeds as follows:

- The functionality triggers the breakpoint.
- The breakpoint activates the Testing Interface.
- The Testing Interface records the input data.
- The breakpoint feeds the data into the functionality.
- The functionality executes.
- The second breakpoint is triggered.
- The second breakpoint activates the Testing Interface.
- The Testing Interface compares the intercepted data to the expected data.
- The Test Interface records the result of the test.
- The second breakpoint returns execution to the system.

Building Blocks

- Breakpoint – A module giving access to the data of the module-to-module communication it is used in.
- Test Interface – An interface module giving access to all functionality related to testing.


Limitations and Constraints
The breakpoint objects and the test-interface object add computation time to the system. For hard real-time systems, this additional CPU time might be impossible to spare. The added size of the compiled code might be too much for limited environments such as embedded systems.

Checklist

Checklist – Testability Building Blocks

B1) Test Interface – An interface module giving access to all functionality related to testing.
B2) Breakpoint – A module giving access to the data of the module-to-module communication it is used in.
B3) State Manager – A module controlling the system's state; capable of changing it.
B4) Snapshot Repository – A container module of state snapshots of the system.

Architectural Verification and Evaluation (each item is rated Met / Partially Met / Not Met)

1) Test Layer – A data injection and collection method used to verify the functionality of a variety of operational scenarios. Related Techniques: T1, T2, T3, T4
2) Breakpoint – A module giving access to the data of the module-to-module communication it is used in. Related Techniques: T3, T4
3) State Manager – A module controlling the system's state; capable of changing it. Related Technique: T2
4) Snapshot Repository – A container module of state snapshots of the system. Related Technique: T2

Techniques
Test Layer: B1
State Snapshot: B1, B3, B4
Fault Injection: B1, B2
Input-Output Testing: B1, B2

Notes
