
Final Report on

Workshop on Integrated Approach for Fault Tolerance - Current State and Future Requirements

compiled by:

Pankaj Jalote Satish K. Tripathi

Institute of Advanced Computer Studies and Department of Computer Science

University of Maryland College Park, MD 20742

Abstract: Fault tolerance is a very broad topic. The field has produced many valuable results, and many important areas of research remain. This workshop brought together researchers in four areas - applications, operating systems, hardware, and modelling - in an attempt to take an integrated view of fault tolerance. This report presents the outcome of the discussions of the different groups and a description of the presentations made at the start of the workshop.

1. Background

The use of computers in critical applications such as aircraft flight control and industrial processes continues to increase. As it does so, there is a corresponding increase in the need for computing systems that will continue to operate despite failures. Perhaps the best approach for achieving such dependable computing is the use of integrated fault tolerance, where the system is designed from the ground up to support dependable computing for the given application. With a clean slate, the system designer can "allocate" portions of the functionality required to implement dependability to the different levels of the system rather than being forced a priori to use a given architecture or operating system. It is with this view that a workshop was organized to discuss the issues involved in an integrated approach to fault tolerance.

The workshop was held at the Greenbelt Marriott, on May 4-5, under the auspices of the University of Maryland Institute of Advanced Computer Studies. The goal of the workshop was to take an integrated view of fault tolerance, combining fault tolerance approaches in the areas of hardware, operating systems, and application software. About 50 people from various universities and research organizations across the country attended the workshop.

The participants were divided into four working groups. Each group had about 12 participants and a discussion leader. The four working groups and their discussion leaders were:

Funding for the workshop was provided by the University of Maryland Institute of Advanced Computer Studies and by the Office of Naval Research under grant N00014-89-J-1867.


1. Hardware fault tolerance Prof. Dhiraj K. Pradhan, University of Massachusetts

2. Operating system aspects Prof. Richard D. Schlichting, University of Arizona

3. Application level fault tolerance Dr. John C. Knight, Software Productivity Consortium

4. Modeling and evaluation Prof. Kishor S. Trivedi, Duke University

The rationale behind viewing a computing system as consisting of three layers - hardware, operating system, and application - was that these layers provide a high level view of a computer system and correspond to the three major areas of fault tolerance research. Modeling and evaluation is clearly an important and somewhat neglected area, so it was decided to have a discussion group for it as well.

At the opening of the workshop, some initial presentations were made. There were a total of 6 speakers. Four of these were the four group leaders, and the remaining two were Dan Palumbo from NASA Langley and Robert Dancy from IBM Federal Systems Division.

Mr. Palumbo is at NASA Langley, and his talk was entitled "Fault tolerance - NASA's perspective". Mr. Dancy is a senior member of the technical staff of the air traffic control program of IBM Federal Systems Division, and his talk was entitled "Issues in fault tolerance design - large scale, complex systems". Representatives of NASA and the air traffic control system were chosen because these are the major application areas for fault tolerance.

The initial presentations set the tone for the discussions. At the start, each working group held separate discussions. The goal of each group was to identify the major issues for fault tolerance in its area. On the second day, the application and systems groups were merged due to substantial overlap of interests. In between the group discussions we had joint sessions where each group leader presented the outcome of their discussions and solicited feedback from the other participants.

In the end, each group identified the areas and problems where current solutions are sufficient. In addition, each group compiled a list of desirable fault tolerance properties that are currently not well understood or well supported. This was a major outcome of the workshop, as it identifies the areas toward which future research should be directed. There was a lot of lively discussion in each working group, and in the final joint session a number of research directions were listed by each group leader.

The next section of this report contains a brief description of the initial presentations. Sections 3, 4, and 5 contain the reports of the different groups about their group discussions. A list of attendees, who took part in the discussions, is given in section 6.

2. INITIAL PRESENTATIONS

2.1. Fault Tolerance - NASA's Perspective Dan Palumbo, NASA Langley

Aircraft and spacecraft flight systems are two domains in which NASA is working to advance fault tolerance. The reliability requirements for these two areas can be quite different. A flight critical commercial aircraft system must be certified to have a probability of failure of 1.0E-9 for a 10 hour flight. Military aircraft typically require a probability of failure of 1.0E-7 for a 3 hour mission. An unmanned space probe may be designed with a goal of 0.99 probability of mission success for a 5 year mission.
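To make these figures concrete, the sketch below converts each requirement into an equivalent constant failure rate. The report does not prescribe any particular failure model, so the exponential (constant-rate) assumption and the resulting numbers are purely illustrative.

```python
import math

def required_failure_rate(p_fail: float, mission_hours: float) -> float:
    """Constant failure rate lambda such that
    P(failure during mission) = 1 - exp(-lambda * t) equals p_fail."""
    return -math.log(1.0 - p_fail) / mission_hours

# Commercial flight-critical system: P(fail) <= 1e-9 over a 10 hour flight.
print(required_failure_rate(1e-9, 10))            # ~1e-10 failures per hour

# Military aircraft: P(fail) <= 1e-7 over a 3 hour mission.
print(required_failure_rate(1e-7, 3))             # ~3.3e-8 failures per hour

# Space probe: 0.99 probability of success over a 5 year mission.
print(required_failure_rate(0.01, 5 * 365 * 24))  # ~2.3e-7 failures per hour
```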

Anticipating the increased application of digital electronics in these areas, NASA has been working since the late 1970's to develop fault-tolerant designs and advanced reliability prediction methods. The early efforts that are typified by the fault-tolerant systems FTMP and SIFT focused primarily on the computing site. Today, with need growing for highly integrated, distributed systems, designs emphasize a building block approach and stress inter-computer networking. The Advanced Information Processing System (AIPS) has been developed based on the reliability and computational requirements of the full spectrum of applications that NASA expects to encounter in the 1990's.

Reliability tools such as CARE III that served well for analyzing the FTMP have evolved into sophisticated analysis packages that can analyze the multiple computing sites, the networks, and the hundreds of I/O devices that make up today's advanced integrated real-time control systems. The Hybrid Automated Reliability Predictor (HARP) and the Abstract Semi-Markov Specification Interface to the SURE Tool (ASSIST) are two examples of the current generation of reliability tools.

Fundamental concepts of fault-tolerant design have been established. These include replication, synchronization, consistent data distribution, and reconfiguration. As these concepts gelled, it became apparent that the question of how to design a fault-tolerant system to achieve high reliability was beginning to be overshadowed by the problem of demonstrating that the design itself was correct and reliable. N-version design strategies have been suggested as an approach to highly reliable designs. However, experimental results have shown that different designs may not be entirely independent. Thus the amount of reliability gained by replicating the design is uncertain. An alternative approach is to prove the design to be correct. Although cumbersome, proof of correctness techniques are making substantial gains and are beginning to be used to verify commercially available products.

Another issue in fault tolerant design is environmental effects. Prominent concerns are lightning strikes, high energy radio frequency radiation, and other causes of transient upset. NASA has sponsored research in which a heavily instrumented aircraft has flown through thunderstorms for the purpose of gathering data on the propagation of the lightning-induced electromagnetic fields. This data was subsequently used to derive math models which could be used to predict the field strengths. Currently, two digital engine controllers are being instrumented for transient upset testing based on the lightning data.

In a recent workshop sponsored by NASA Langley on the validation of flight critical systems, the subject of integrated software tools was often raised. Software tools are seen as a technology which enables the design and production of the coming generation of integrated systems. The term integrated system in this context refers to the organization of various system functions (flight control, trajectory planning, vehicle management...) in a seamless design. Many functions may occupy a single processor, as the term integrated implies. Yet, some functions may be distributed across many processors.

It is becoming evident that in such tightly coupled systems many inter-functional dependencies arise which are not anticipated by classical design methods. And when the dependencies are known, the impact of these dependencies is often lost as the system matures. The end result is that design decisions are made which violate one or more design assumptions that have become obscured. The system then contains hard-to-find latent design faults. It has been recommended by the working group committee that one of the functions of an integrated design tool be a kind of bookkeeping where design constraints are maintained and propagated throughout the design and manufacture of the system.

In the Integrated Airframe/Propulsion Control System Architecture program (IAPSA), a fighter flight control system was integrated with two engine control systems. An architecture was defined which was based on the AIPS building blocks. A single quad Fault Tolerant Processor (FTP) contained the entire flight control system. A triplex FTP was dedicated to each engine. The FTPs were connected by an inter-computer network. Each FTP had two I/O networks connected to its associated sensors and actuators.

This system was subjected to "pre-validation" reliability and performance analysis. The pre-validation exercise was as much of a test for the reliability and performance tools as it was for the system design. With respect to the system design, several validation issues were raised. The network design was very complex. This hinders not only the validation of the correctness of the design but also the extraction of faithful reliability and performance models. Transient recovery time was too long. This is due in part to the FTP's tight coupling, which makes it necessary to vote the entire RAM portion of memory before a channel may rejoin the active configuration. Finally, the priority based preemptive scheduler is easy to use, but the resulting performance (as determined by deadline margin) is difficult to predict and, therefore, hard to validate.

As might be expected, the analysis tools were found to have deficiencies also. Besides the monumental bookkeeping task, as mentioned above, two prime shortcomings were noted. The creation of models, whether performance or reliability, is not an exact science. An analyst is continually asking himself if the current model contains enough detail to faithfully model the target system. Once a good model is obtained, it often requires a great deal of computer resources to execute. This is especially true for the larger systems such as IAPSA.

In closing then, it becomes evident that, in the area of high reliability, the prime issue is not how to arrange the hardware components so that a physical failure can be tolerated, but how to design the system so that the confidence in the design can be raised to a level commensurate with the physical fault tolerance that is now achievable.

2.2. Application Level Fault Tolerance John C. Knight, Software Productivity Consortium

This is a summary of the initial presentation in the area of applications and summarizes the author's views of what constitute the most important research areas. It is by no means comprehensive or complete.

Hardware Fault Tolerance In Hardware

Faults in hardware are often dealt with by the hardware itself. The hardware contains elements that are designed to deal with its own faults, and the effect is usually transparent to the application. Techniques such as N-modular redundancy (NMR) typify this approach. Some techniques, such as hybrid redundancy, though based in hardware, require some support either from the system software or the application in order to effect recovery.

In general, it is important to have a complete system view of the application of fault tolerance. The hardware cannot be considered in isolation. With this in mind, the most important areas that need to be addressed from the application's point of view when designing fault-tolerant hardware include:

(1) Interaction between the application and the hardware fault tolerance mechanism. For any specific technique used within the hardware, it is important to define exactly what the application needs to know about the use of the technique and when it needs to know it. The correct placement of responsibility and the resulting interaction is not viewed systematically at present. More specifically, the degree of masking that the hardware system provides for the faults that it deals with must be carefully characterized. For example, are the effects of a particular technique "instant", or would a real-time system have to be concerned about recovery impacting a deadline? Similarly, are the effects of any technique "permanent"? In particular, does an application have to anticipate future degradation?

(2) The cost-effectiveness of the various approaches. There are many tradeoffs that can be considered if a system view is taken in the provision of dependability. For example, would it be preferable to use a distributed target in which the application was involved in recovery rather than some form of NMR of which the application would be ignorant?

Hardware Fault Tolerance Supported By Software.

The use of software support for fault tolerance in hardware is extensive. Many target architectures, in particular distributed and multiprocessor systems, are well suited to tolerating hardware faults but cannot do so without software support. The full potential of such architectures is not yet realized.

Major areas needing continued attention include:

(1) Programming languages for distributed and multiprocessor systems. In many cases, applications must be able to express the processing that will be used after a failure reduces the available hardware. Despite the fact that some work has already been done in this area, programming languages need to be developed that allow the concise statement of what the application needs.

(2) Scheduling and placement algorithms for distributed and multiprocessor systems. Optimal scheduling is being studied for various important target architectures, but such studies do not usually include the requirement for meeting revised deadlines after failure. Similarly, placement algorithms need to address post-failure constraints.

(3) Graceful degradation in distributed database systems. The treatment of hardware faults in distributed database systems is essential to ensure data integrity but also provides the opportunity for increased availability and performance.

(4) Support for fail-stop hardware components. Fail-stop components are the fundamental building blocks of target hardware systems that are to provide graceful degradation. Their efficient construction is fundamental to the provision of high quality, fault-tolerant distributed and multi-processor systems.

Software Fault Tolerance Supported By Hardware.

This area has received relatively little attention in the past although some progress has been made. There has been some work in error detection and some work in support for backward error recovery. However, this is potentially a very valuable facility because it offers the possibility of isolation of the problem, i.e., the software and its associated fault, from the treatment of the situation, i.e., the hardware support. The general area of support for application-level fault tolerance within the hardware needs to be examined very carefully. Specifically, there is a substantial need for better hardware support in the areas of:


(1) Simple approaches to error detection. Checks such as address and general range checks are typically performed within the software. The overhead in this case is often considerable, and the temptation to eliminate such checks is substantial despite the fact that the efficacy of such checks in error detection is well established. Hardware support is an obvious and attractive alternative. (A minimal software-level sketch of such checks appears after this list.)

(2) Complex approaches to error detection. Hardware support for checking path conditions, call sequences, etc., might be extremely valuable.

(3) Timing checks. In many systems, but particularly in real-time control systems, there is a general need for multiple, high-resolution timers within the hardware. An adequate supply of timers for use as watchdog timers is a simple way of providing error detection, outside the software itself, against a common cause of software difficulties.

(4) Protection systems. High performance protection, such as a capability based scheme, can only be provided with hardware support. The Intel iRMX-432 is a nice example of what can be done.
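As a point of reference for items (1) and (3), the sketch below shows the kind of range checking and watchdog timing that is typically coded in software today, and whose overhead motivates the call for hardware support. It is a minimal illustration only; the helper names (check_range, Watchdog) are hypothetical and not from the report.

```python
import threading

def check_range(name: str, value: float, lo: float, hi: float) -> float:
    """Executable range assertion: raise on an out-of-range value."""
    if not (lo <= value <= hi):
        raise ValueError(f"range check failed: {name}={value} not in [{lo}, {hi}]")
    return value

class Watchdog:
    """Software watchdog: if reset() is not called within 'timeout' seconds,
    the handler fires. A hardware timer would remove this bookkeeping from
    the application entirely."""
    def __init__(self, timeout: float, handler):
        self.timeout, self.handler = timeout, handler
        self._timer = None

    def reset(self):
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout, self.handler)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()

# Example use in a control loop iteration:
wd = Watchdog(timeout=0.5, handler=lambda: print("deadline missed, start recovery"))
wd.reset()
altitude = check_range("altitude", 10_500.0, 0.0, 60_000.0)  # plausible-value check
wd.stop()
```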

Software Fault Tolerance In Software.

In general, handling unanticipated software faults with software fault tolerance is unrealistic. Although techniques such as N-version programming have been studied in some detail, it is not clear that these techniques can be relied upon to provide dependable software systems. On the other hand, dealing with particular fault categories using specialized techniques is more realistic. For example, it might be possible to define software structures that allow detection and recovery from fairly arbitrary timing faults. Similarly, there are approaches to certain classes of faults that are provable, such as robust data structures. The major research challenge in this area is the development of a cost effective way of achieving high reliability in software, not necessarily through the use of any form of fault tolerance.

Experimentation

There is a general need to engage in industrial strength demonstrations of feasibility, performance, and realism of many proposed techniques in fault tolerance. Experimental evaluation is needed to provide hard evidence of the applicability of techniques. It is well known that small scale studies do not necessarily scale up and that results achieved in the laboratory might not hold when exposed to typical conditions in industry. The performance of large scale experimental evaluations will also reveal new problem areas and new insight. The major challenge in this area is the acquisition of sufficient funding to permit the necessary studies to be performed.

2.3. Integrated Fault Tolerance - Operating System Aspects Richard D. Schlichting, University of Arizona

General Issues

Many of the fundamental issues in designing a dependable computing system revolve around the appropriate use of abstraction. In other words, the key to good system design here - as it is for any type of system - is to develop the "right" abstractions for each level in the system hierarchy. Although the use of abstraction is important in all areas, our focus here is on abstractions that are related in some way to supporting computations that can tolerate failures.

Developing good abstractions is a non-trivial task for a number of reasons. One is that it is not at all obvious what properties go into making up a good abstraction, or how to use existing abstractions to define new ones. Examples of abstractions usually conceded to have desirable properties include such things as transactions, stable storage, atomic broadcast, and distributed virtual memory. Distilling out the essential characteristics that make abstractions such as these valuable currently seems to be more of an art than a science.

A second problem is determining the level at which a given abstraction should be implemented. For example, should atomic broadcast in a distributed system be implemented in the hardware, by the operating system, or left to the application? There are obviously many factors to be considered, yet it seems reasonable to raise the question as to whether the end-to-end argument or some variant is applicable. That is, are there aspects of fault-tolerant computing that are best left to the application since implementing it at a lower level would only lead to unnecessary and wasteful duplication of effort? For example, it seems less than optimal to implement transactions in a general-purpose operating system since many database applications will implement the abstraction in any event for performance or security reasons.

Given a particular abstraction, the system builder must also develop implementation tech- niques that fit the particular situation. For example, if atomic broadcast is deemed a reasonable abstraction, which of the existing implementation techniques are most appropriate? Or is it, in fact, more appropriate to try to invent a new algorithm that is more carefully tuned to the requirements of the given application? Although undoubtedly a large investment, such an approach may be necessary if, for instance, performance considerations are critical.

A final problem in developing good abstractions is devising appropriate techniques for evaluating the abstractions and their implementations. This process typically involves evaluating the tradeoffs between different possibilities in the solution space using some combination of analytical models, simulation studies, or actual implementation experience. For example, a prototype implementation of a particular abstraction might reveal it as preferable to an alternative with richer semantics because of a significantly cheaper implementation. Of course, the exact nature of the tradeoffs involved and the way in which they are balanced depends heavily on the given application. It is also worth noting that this evaluation process can be difficult and time-consuming, especially if a prototype implementation is involved. This is especially true for fault-tolerant software, which is very difficult to test due to the random and asynchronous nature of failures.

Operating System Aspects

Given the framework outlined above, the basic problem of the operating system designer is to answer the question: "Given the abstractions provided by the hardware, what dependability-related abstractions should the operating system implement for the application?" These abstractions can be divided roughly into three categories: those related to processor fault models, those concerned with data aspects of a program, and those concerned with control aspects. We discuss each in turn.

A processor abstraction is, in general, concerned with presenting a simplified failure model to the application by hiding certain types of failures. For example, the operating system could choose to implement the abstraction of a continuously operating processor to the application by using multiple redundant processors. Or the abstraction could be a fail-stop processor, i.e., a processor whose only failure is a detectable crash.


The second type of abstraction comprises those concerned with providing increased dependability for data storage. The classic example here is stable storage, a virtual storage device that suffers no failures and whose contents survive failures of the associated processor. Abstractions that provide log support for transaction-oriented systems also fall into this category.
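To make the stable storage abstraction concrete, here is a minimal single-node sketch in the spirit of the classic two-copy-plus-checksum scheme. It is illustrative only: the file layout and helper names (stable_write, stable_read) are assumptions, and a real implementation would also have to place the copies on independent devices and handle partial writes more carefully.

```python
import os, zlib

def _put(path: str, data: bytes) -> None:
    record = len(data).to_bytes(4, "big") + data + zlib.crc32(data).to_bytes(4, "big")
    with open(path, "wb") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())          # force this copy to the device before continuing

def _get(path: str):
    try:
        with open(path, "rb") as f:
            record = f.read()
        n = int.from_bytes(record[:4], "big")
        data, crc = record[4:4 + n], int.from_bytes(record[4 + n:4 + n + 4], "big")
        return data if zlib.crc32(data) == crc else None
    except OSError:
        return None

def stable_write(name: str, data: bytes) -> None:
    # Update the two copies one after the other, so at least one copy
    # is intact if a crash interrupts the update.
    _put(name + ".copy0", data)
    _put(name + ".copy1", data)

def stable_read(name: str) -> bytes:
    for suffix in (".copy0", ".copy1"):
        data = _get(name + suffix)
        if data is not None:
            return data
    raise IOError("both copies of the stable value are damaged")

stable_write("log-head", b"last committed transaction id = 42")
print(stable_read("log-head"))
```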

A final type of abstraction comprises those concerned with control aspects of programming. Such abstractions typically provide improved failure semantics, the form of which often depends on the application. Atomic transactions, redundant processes, and replicated RPC can all be classified in this sense as control abstractions.

When developing and evaluating operating system abstractions, various factors need to be considered. One such factor is the machine architecture on which the abstraction will be implemented. That is, it stands to reason that the type of abstraction desirable or possible may be different for a single processor system than for a distributed or multiprocessor system.

A second important factor already alluded to is the application domain. Different applications have vastly different dependability requirements and characteristics that must be considered. For example, database applications are much different from real-time applications, which are in turn different from general purpose applications.

There are also the usual tradeoffs between the power and expressiveness of the abstractions on one hand, and the implementation cost and efficiency on the other. These tradeoffs are again very application dependent. For example, in a general purpose operating system, it may be desirable to support a weaker form of file replication that does not guarantee consistency between copies in all failure scenarios. Although such a mechanism does not have the rich semantics of database-style consistency, it is significantly cheaper to implement and sufficiently powerful for many general-purpose applications.

Another relevant question for operating systems designers to address is how a given abstraction should be implemented. The factors to be considered here are very similar to those outlined above: machine architecture, application domain, and expressiveness/efficiency tradeoffs. In addition, the kind of failure model approximated by the hardware and dictated by the application is also relevant. For example, the implementation techniques required to implement an atomic broadcast given a fail-stop model are usually much different than if arbitrary (i.e., Byzantine) failures must be considered.

Lastly, two other operating system issues are worth mentioning. The first concerns the potential impact of increased support for fault-tolerant applications on the size and complexity of the operating system. The negative impact from this increase can be substantial; not only does it make the software more difficult to develop and test, it can also adversely affect performance of applications, even those that do not use the fault-tolerance features. To cope with this increased complexity requires the development and use of operating system structuring techniques. For example, one traditional approach that can be employed is the use of abstractions within the levels of the operating system itself. In other words, build up a hierarchy of fault-tolerant abstractions by the use of level-structuring rather than attempting to implement the application interface directly. Such an approach also facilitates making the operating system itself fault-tolerant by providing appropriate "internal" abstractions.

Finally, and perhaps most importantly, techniques and methodologies for formal reasoning about fault-tolerant programs need to be developed. As the complexity and importance of systems for critical applications increase, the informal techniques of the past will need to be augmented with new, more rigorous approaches. After all, the most wonderful abstraction in the world is of little use if its implementation is incorrect.


Conclusions

The key to developing dependable computing systems for critical applications, as it is for any type of computing, is the appropriate use of abstraction. By designing and implementing a hierarchy of abstractions within the various levels of a system, the special difficulties associated with the construction of dependable systems can be reduced. Nevertheless, the inherent complications induced by the need to worry about failures will always make the design, implementation, and verification of fault-tolerant systems more complex than for other systems.

Acknowledgments

Many of the issues outlined above were based on an informal poll. The author wishes to acknowledge the contribution of the following respondents: G. Andrews, P. Bernstein, J. Black, H. Garcia-Molina, M. Herlihy, N. Hutchinson, P. Jalote, R. LeBlanc, R. Olsson, L. Peterson, C. Pu, J. Purtilo, F. Schneider, D. Taylor, and W. Weihl.

2.4. Issues in Hardware Dhiraj K. Pradhan, University of Massachusetts

The issues discussed in hardware fault tolerance were quite diverse and included topics such as fault models, the use of error correcting codes, the need for CAD tools for fault tolerant chip design, and the need for innovative research in system design. Also discussed was the need for close interaction between the fault tolerance research community and system designers. The myth is that hardware has become reliable; the reality is that although advances in device technology have made individual components more reliable, the overall system reliability continues to be a major problem because of the ever-increasing system complexity.

A good example of this can be found in memory system design, where the number of cells in a memory chip has increased dramatically. For example, the 64K chip of a few years back is soon to be replaced by new memory chips of several million bits. The physical size of the modern chip is not significantly larger than that of the earlier chips with significantly smaller capacity; the major improvement in density has been achieved through shrinking device dimensions. This has, in turn, resulted in reliability problems. For example, the soft error problem in memory devices is significantly larger because the smaller dimensions of the present-day device make it more susceptible. Therefore, the overall reliability of memory systems continues to be a major concern. Also, it is to be recognized that as chip size increases, the problem of manufacturing defects becomes more serious. This is particularly important because it can be hard to distinguish between certain manufacturing defects and operational faults; that is, lurking manufacturing defects often result in operational faults during the early stages of operational life. Therefore, it was felt that hardware fault tolerance research should integrate the study of manufacturing defects as well as operational faults. A framework may be developed which allows for sharing of redundancy between the two competing requirements: yield enhancement and reliability improvement. Fundamental to this is the development of realistic fault models for both maturing and emerging technologies such as CMOS and BiCMOS. The traditional stuck-at fault model may soon prove to be inadequate.

Important classes of operational faults are the transient and intermittent faults. Overcoming the effects of these faults requires error correction and on-line error detection. The most effective means of achieving this is through the use of error correcting codes. The use of error correcting codes in computers is a distinct discipline from traditional coding theory, which was developed for error control in communications. There are peculiar challenges to the use of error correcting codes in computers. For example, the speed of encoding and decoding is an important concern in computers. The error models also tend to be different in computer error control; corrected bit errors in multiple words are peculiar to computers. Therefore vigorous research should be continued in coding for computers. Coding can often make use of redundancy more effectively than brute force duplication in TMR type approaches. An important example of this is the (4,2) concept employed in the digital telephone exchange system of Philips. With redundancy less than that required for TMR, better error control than TMR is achieved through the use of clever coding techniques.
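The report does not detail any particular code, but a small example conveys the flavor of the redundancy-versus-coverage tradeoff mentioned above. The sketch below implements a textbook Hamming(7,4) code, which corrects any single-bit error in a 4-bit data word using only 3 check bits, compared with the 8 extra bits that triplicating the word (TMR) would require.

```python
# Hamming(7,4): bit positions 1..7 hold p1 p2 d1 p3 d2 d3 d4.
def hamming74_encode(d):                      # d = [d1, d2, d3, d4], bits as 0/1
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                         # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                         # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                         # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):                      # c = 7 received bits
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3           # 0 = no error, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1                  # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], syndrome

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[5] ^= 1                                  # inject a single-bit error at position 6
decoded, pos = hamming74_decode(code)
assert decoded == word and pos == 6
```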

One of the major obstacles to the use of on-chip fault tolerance is the lack of CAD tools that allow one to incorporate fault tolerance into chip designs. The wide acceptance of scan designs has been the result of readily available CAD tools. Similar tool development for fault tolerant chip design will be a major step towards automated fault tolerant hardware design.

Also, more effective fault tolerant hardware designs may be possible by considering the specific application at hand. For example, a co-processor designed for a specific application can be made more readily fault tolerant than general purpose hardware. The trade-off in designing general purpose hardware versus special purpose hardware, from the point of view of fault tolerance, needs to be studied.

In summary, hardware fault tolerance continues to be a major concern because of ever increasing system complexity. An entire system of a few years back has now become a single chip. Present day systems consist of several such chips, and thus a fault in a single component on a single chip can result in serious failures. Therefore, vigorous research needs to be continued on fault tolerance at the chip level as well as at the subsystem and system levels. This research should include technological considerations as well as manufacturing considerations. New and novel techniques need to be devised to provide fault tolerance at the chip level, closer to the site of the fault. This will make it possible to achieve fault tolerance with minimal performance degradation.

2.5. Issues in Evaluation Kishor S. Trivedi, Duke University

Computer system evaluation has been traditionally separated into performance evaluation (under fault free conditions) and dependability evaluation. It is clear that both these types of evaluations yield only a partial picture of system behavior. It is increasingly being recognized that performance evaluation in the presence of faults is needed. Likewise, separate evaluation of software and hardware is no longer adequate. Systems that are being designed and/or used generally possess concurrency, resource sharing and contention, fault tolerance, and degradable performance. Furthermore, these systems are rather complex. Techniques and tools of system evaluation should, in a single framework, permit system performance, dependability and performability evaluation. These tools should permit the evaluation of hardware-software systems. They should allow the user to reflect concurrency, resource contention, fault tolerance and degradable performance. The techniques and tools should explicitly address the fact that the systems being evaluated are rather complex. Automatic means of interfacing the design database and the evaluation tool must be provided. Calibration of evaluation models must be facilitated by means of databases that keep track of data on individual system elements, the environment and applications. Validation and verification of evaluation tools and individual evaluations must be carefully considered.

Among the approaches to evaluation of fault tolerance, actual implementation and measurement is the most believable but also the most expensive approach, while the modeling approaches (Monte Carlo simulation and analytical methods) often make unsupported assumptions for tractability. It is then natural to integrate the measurement techniques with modeling in order to obtain a cost effective method of system evaluation. The same is true of modeling techniques themselves. Simulation models are capable of incorporating detailed system behavior that analytical models may have to approximate. Hybrid models that combine the power of simulation with the efficiency of analytical models achieve the best of both worlds. Within the analytical domain, hierarchical models that at once reflect design abstractions and at the same time avoid a cumbersome one-level model are clearly desirable.

It can be deduced, therefore, that several types of integration are desired from the evaluation perspective:

Integration of System Design with Evaluation
Integration of Measurement Data and Modeling (Calibration and Validation)
Integrated Evaluation of Software and Hardware
Integration of Performance and Dependability Evaluation
Integrating Simulation and Analytic Models
Integrating Combinatorial and Markovian Models

Moderate progress has been made on the last three items, but all of these topics need a good deal of work.

The complexity of the systems being evaluated gives rise to complexity in evaluation. Three kinds of complexity can be discerned in model-based evaluation:

Desired Measure(s) of Effectiveness
Model Largeness
Model Stiffness

Measure(s) of Effectiveness

The measures of effectiveness can be divided into performance, dependability and combined measures. For instance, throughput (under failure free conditions) is a performance measure, reliability and availability are dependability measures, and the expected number of jobs completed in a given interval is a composite measure (assuming that the effects of failures are taken into account). Common analytic models for performance evaluation are product-form queuing networks, stochastic Petri nets, directed acyclic graphs, (semi-)Markov chains, or hierarchical combinations of these. Reliability block diagrams, fault trees, (semi-)Markov chains, stochastic Petri nets or hierarchical combinations of these are commonly used as analytic models of dependability. (Semi-)Markov reward models are commonly used for composite measures of performance and dependability.
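As a small illustration of the composite measures mentioned above, consider a two-state (up/down) Markov model with failure rate lambda and repair rate mu. The sketch below computes steady-state availability and the expected number of jobs completed in an interval, treating throughput as a reward earned only in the up state; the numerical values are invented for the example and are not from the report.

```python
import math

lam, mu = 1e-3, 0.5          # failure and repair rates (per hour), illustrative only
c = 100.0                    # job completions per hour while the system is up
T = 1000.0                   # observation interval (hours), system assumed up at t = 0

A = mu / (lam + mu)                                        # steady-state availability
def p_up(t):                                               # instantaneous availability
    return A + (lam / (lam + mu)) * math.exp(-(lam + mu) * t)

# Expected up time in (0, T) is the integral of p_up, which has a closed form here.
expected_up_time = A * T + lam / (lam + mu) ** 2 * (1 - math.exp(-(lam + mu) * T))
expected_jobs = c * expected_up_time                       # Markov reward: rate c in "up"

print(f"steady-state availability = {A:.6f}")
print(f"expected jobs completed in {T:.0f} h = {expected_jobs:.1f}")
```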

Measures can also be divided into user-oriented and system-oriented. Response time is in the former category while throughput is in the latter. They can be further classified into means, higher moments, or distributions. For instance, in a non-real-time system we may be content with just the mean response time. But for a real-time system, the distribution of response time is required in order to determine and/or minimize the probability of violating a deadline. It should be clear that the complexity of evaluation increases as we go from the computation of the mean to that of the distribution. System evaluation can be done assuming steady-state has been reached, or a transient analysis may be carried out. Transient analysis will clearly require more effort than steady-state analysis.


Much work is needed in identifying the most appropriate set of measures for a given application, formulation of models for computing these measures and numerical solution techniques that are most suited in each case.

Model Largeness

Generating and solving large models for performance, dependability and performability is a continuing challenge to the modeler. Largeness tolerance and largeness avoidance are two broad approaches to the problem. In the former approach, we provide a higher level language for describing models and an automatic means of translation. In this way the modeler does not have to directly face the largeness, although the generation, storage and solution of large underlying models is still needed. A number of different modeling languages have evolved. Some are specialized to an application domain (e.g., SAVE and METFAC for availability modeling; ASSIST and HARP for reliability modeling; QNAP for performance modeling) while others are more general purpose (e.g., METASAN, SPNP and SHARPE).

The largeness avoidance approach, on the other hand, is based on the belief that hierarchical composition of different model types can be exploited to avoid the generation of a large, one-level model altogether. This is the philosophy adopted in SHARPE in a general sense and in HARP in a particular sense. State truncation is another way to avoid largeness, and it is routinely used in HARP, SAVE and SPNP models.

We must point out that the use of hierarchical models often (but not always) implies an approximation. Likewise, state truncation involves an approximation. Techniques to bound the errors due to such approximations need to be explored. Besides these approximations, many other types of errors and approximations can and do occur in models; a modeler must watch out for these errors.

Model Stiffness

Computer systems analysts are often called upon to capture phenomena occurring at widely varying rates in a single model. Solution of such stiff models poses numerical difficulties. Although some recent progress has been made in solving stiff Markov and Markov reward models, more work is necessary in this direction.
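A stiff model arises whenever failure rates (events per thousands of hours) and repair or performance rates (events per hour or per second) appear in the same generator matrix. The sketch below builds a small continuous-time Markov chain with rates spanning five orders of magnitude and obtains its transient state probabilities with a matrix exponential, which sidesteps the step-size problems an explicit ODE integrator would face; the three-state model and its rates are invented purely for illustration.

```python
import numpy as np
from scipy.linalg import expm

# States: 0 = both units up, 1 = one unit failed, 2 = system down.
lam, mu = 1e-5, 1.0          # per-hour failure and repair rates: a stiff 1e5 ratio
Q = np.array([
    [-2 * lam,      2 * lam,   0.0],
    [      mu, -(mu + lam),    lam],
    [     0.0,          mu,    -mu],
])                           # generator matrix: each row sums to zero

p0 = np.array([1.0, 0.0, 0.0])          # start with both units up
for t in (1.0, 100.0, 10_000.0):        # hours
    pt = p0 @ expm(Q * t)               # transient solution p(t) = p(0) e^{Qt}
    print(f"t = {t:>8.0f} h  probability of system-down state = {pt[2]:.3e}")
```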

3. REPORT BY APPLICATION/SYSTEM GROUPS: OS AND APPLICATION ISSUES

John C. Knight and Richard D. Schlichting

After preliminary discussions held separately, it became clear that the concerns of the applications group were very similar and closely related to those of the operating systems group. Subsequently, the groups merged for discussion, and this is a joint report of their conclusions.

This section presents two general sets of results. The first, contained in the next five subsections, is an assessment by the group participants of things that currently can and cannot be done in various areas. The second is a list of recommendations for both practice and research by the community.


3.1. Cannot Do - General

(1) It is not generally possible to overdesign software in the same sense that hardware is often over-designed deliberately to ensure reliability; for example, parts subject to physical loads are made thicker and thereby stronger than strictly necessary.

(2) There are no provable, adaptable scheduling algorithms for use with distributed computing systems that support graceful degradation.

(3) There are no provable, adaptable placement algorithms for use with distributed computing systems that support graceful degradation.

(4) There are no generally applicable techniques to support graceful recovery in distributed systems.

3.2. Error Detection - Can Do

(1) Error detecting and correcting codes.

(2) Error correcting and detecting data structures.

(3) Executable assertions.

(4) Diagnosis of some hardware problems in software.

3.3. Error Detection - Cannot Do

(1) Distinguish reliably between hardware and software failures.

(2) Build "fail-stop" software in the sense of "fail-stop" machines.

(3) Implement fail-stop machines efficiently.

(4) Reliably detect errors resulting from design faults in both hardware and software.

(5) Classify design faults so as to permit a tailored response.

3.4. State Repair - Can Do

(1) Slow checkpointing.

(2) Logging so as to provide reliable transaction processing.

(3) Replicate data reliably using commit protocols.

(4) Achieve interactive consistency in the face of Byzantine faults.

3.5. State Repair - Cannot Do

(1) Fast checkpointing.

(2) Recovery in multiprocessors following partial failure.

(3) Fault isolation in software.

(4) State repair for forward recovery.

3.6. Recommendations

(1) Demonstration (or evaluation) systems should be built in a laboratory environment to facilitate risk reduction. These would be large-scale, instrumented systems built with promising but unproven ideas. They would be compared with low-risk systems built in parallel with conventional techniques.


(2) Operating systems need to be constructed so that they can easily be tailored to the specific fault-tolerance requirements of a given application. An application should not have to suffer loss of performance or reliability because of operating system support for unnecessary functionality.

(3) Language primitives and techniques need to be developed to support extensive reasonableness and consistency checks in production software systems.

(4) More and better hardware building blocks need to be developed to allow for a diverse range of targets for various applications. The range of needs and the characteristics of the various applications need to be determined in order to allow the definition of these hardware parts.

(5) A cooperative project between application, system, and hardware designers is needed to design hardware facilities to support comprehensive software error detection. Questions that need to be resolved include: the exact facilities to be provided; the degree of parallelism that might be effected in the hardware during checking; how the application interface should signal the detection of an error; and how subsequent recovery would be handled.

(6) A flexible, well-structured system is required to allow an application to be informed of erroneous states detected by either hardware or system software. Such a system must include an associated semantics of continuation. Many constraints must be kept in mind for such a system, for example the fact that real-time processing may be required.

(7) Expert systems should not be used for critical applications.

(8) Design diversity should not be relied upon to deal with design faults.

(9) Excess complexity should not be introduced in an effort to improve reliability since the complexity may itself reduce reliability.

(10) The technology of fault tolerance should not be "oversold" by its practitioners lest the community form the opinion that acceptable dependability can be obtained routinely from the technology.

4. REPORT OF HARDWARE GROUP Dhiraj K. Pradhan and Niraj Jha

The hardware fault tolerance issues that were discussed in the workshop can be broadly classified into the following categories:

(i) Fault modeling
(ii) Design for testability
(iii) Error detecting/correcting codes
(iv) Chip-level fault tolerance
(v) System-level fault tolerance
(vi) Yield improvement
(vii) CAD tools for fault tolerance
(viii) Interaction between industry and academia

4.1. Fault Modeling

In order to test a chip in a reasonable amount of time, we need to model faults at an appropriate level. Traditionally, the stuck-at fault model has been used. This model assumes that the lines in the circuit behave in a permanently stuck-at logic 0 or stuck-at logic 1 fashion. However, this model has been shown to be inadequate for the most dominant technology today, CMOS. Stuck-open and stuck-on fault models, which model permanently non-conducting and permanently conducting transistors respectively, have been shown to occur in practice. Another type of fault which has largely been ignored in the past, but which occurs frequently in practice, is the bridging fault. In order to design fault tolerant systems with very high reliability, it is important to use highly reliable chips. Therefore, we should use a comprehensive fault model which includes stuck-at, stuck-open, stuck-on and bridging faults.
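As a toy illustration of how a fault model drives test generation, the sketch below simulates single stuck-at faults on the lines of a tiny gate-level circuit and reports which input vectors detect each fault (i.e., make the faulty output differ from the good one). The circuit, y = (a AND b) OR c, and the helper names are invented for the example; stuck-open and bridging faults would require a richer, sequence- or transistor-level model than this.

```python
from itertools import product

def simulate(a, b, c, fault=None):
    """Evaluate y = (a AND b) OR c with an optional stuck-at fault.
    fault = (line_name, stuck_value); lines are 'a', 'b', 'c', 'n' (AND output), 'y'."""
    def v(name, val):
        return fault[1] if fault and fault[0] == name else val
    a, b, c = v("a", a), v("b", b), v("c", c)
    n = v("n", a & b)
    return v("y", n | c)

lines = ["a", "b", "c", "n", "y"]
for line, stuck in product(lines, (0, 1)):
    detecting = [bits for bits in product((0, 1), repeat=3)
                 if simulate(*bits) != simulate(*bits, fault=(line, stuck))]
    print(f"{line} stuck-at-{stuck}: detected by {detecting or 'no vector'}")
```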

4.2. Design for Testability

With chips of more than a million transistors already available, test generation is becoming an increasingly difficult problem. Scan designs were introduced in the last decade to reduce the problem of testing sequential circuits to the problem of testing combinational circuits. Though this considerably eases the burden on a testing engineer, the testing problem remains quite complex. Built-in self-test (BIST) techniques were introduced to avoid the need for test generation. It is encouraging to see both the BIST techniques for chip testing and the boundary scan technique for board-level testing becoming more popular. However, most of these techniques are only geared towards detecting stuck-at faults. New research is needed to extend these techniques to detect stuck-open faults. These faults require sequences of vectors, which are more difficult to generate and apply to a circuit.

Important research also needs to be carried out in the design of memories for testability. Test-related research issues include test set invalidation due to circuit delays and charge distribution. We need efficient techniques for designing testable CMOS circuits whose test sets are robust in the presence of the above problems.

The testing of dynamic CMOS circuits has been largely ignored even though they are much easier to test than static CMOS circuits. Dynamic CMOS circuits also have area and speed advantages over static CMOS circuits. Thus, we need to focus more on these circuits.

Design for testability (DFT) has become a necessity in this era of VLSI. For fault-tolerant systems, these techniques assume an even greater importance.

4.3. Error Detecting/Correcting Codes

In order to tolerate faults, we first need to detect them and, if possible, to correct the errors generated by them. Coding for error control remains an extremely important issue. Coding often allows for low cost error detection and fault capture. However, error control coding problems for computers can be quite different from those in communications. For example, in VLSI circuits unidirectional errors are quite common. Optimal codes are already known which can detect unidirectional errors in any number of bits. More efficient t-unidirectional error detecting (t-UED) codes can be used when we need protection against errors in at most t bits. The hardware overhead is smaller if t-UED codes are used, and hence they are very attractive. Other codes which can correct a small number of errors and detect many more have also been proposed. While it is important to emphasize the efficiency of the codes, one should not overlook the complexity of encoding and decoding. An encoder/decoder which requires small hardware and time overhead is likely to be more popular.
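One well-known all-unidirectional-error-detecting code is the Berger code, which appends to the information bits a binary count of their zeros; any error that flips bits in only one direction is then caught, because the information bits and the check field cannot both change consistently. The report does not single out this code, so the sketch below is offered only as a concrete illustration of the kind of code under discussion.

```python
def berger_encode(info: str) -> str:
    """info is a bit string, e.g. '1011'. Check field = binary count of zeros."""
    k = len(info)
    check_width = k.bit_length()              # enough bits to count up to k zeros
    check = format(info.count("0"), f"0{check_width}b")
    return info + check

def berger_check(codeword: str, k: int) -> bool:
    info, check = codeword[:k], codeword[k:]
    return int(check, 2) == info.count("0")

word = berger_encode("1011")                  # -> '1011' + '001'
assert berger_check(word, 4)

# A unidirectional error (here two 1 -> 0 flips in the information bits) is detected:
corrupted = "0001" + word[4:]
assert not berger_check(corrupted, 4)
```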

4.4. Chip-level Fault Tolerance

In order to concurrently detect errors during normal operation one can use self-checking circuits. The chip area overhead of these circuits can be as low as 15-20%. With the increasing complexity of chips, the overhead is likely to go down even further. These circuits are based on error-detecting codes. They are much more attractive than circuits in which a module is duplicated and the outputs of the two modules are compared. Since transient faults are said to form up to 90-95% of all faults, error detecting circuits ought to become a standard method for providing low-cost error detection in the future.

4.5. System-level Fault Tolerance

The traditional approach to system-level fault tolerance has been to use massive redundancy in hardware or time. However, some new techniques have recently been developed which, surprisingly, reduce both the hardware and the time overhead at the same time. The (4,2) concept recently used by Philips in their switching system is an example of how new techniques, superior to brute force approaches like TMR and NMR, can be employed. Another innovative idea in system design is the use of algorithm-based fault tolerance techniques.
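Algorithm-based fault tolerance exploits the structure of a particular computation rather than replicating hardware; the classic textbook example is checksum-augmented matrix multiplication. The sketch below shows the idea in floating point: append a column-sum row to A and a row-sum column to B, and the product then carries checksums that expose a corrupted element. This is a generic illustration, not a description of any system discussed at the workshop.

```python
import numpy as np

def abft_matmul(A, B, inject=None, rtol=1e-9):
    """Multiply A and B with row/column checksums; 'inject' optionally
    corrupts one product element (row, col, delta) to demonstrate detection."""
    Ac = np.vstack([A, A.sum(axis=0)])                   # append column-checksum row
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])    # append row-checksum column
    Cf = Ac @ Br                                         # full (n+1) x (n+1) product
    if inject is not None:
        r, c, delta = inject
        Cf[r, c] += delta                                # simulated transient fault
    C = Cf[:-1, :-1]
    row_ok = np.isclose(C.sum(axis=1), Cf[:-1, -1], rtol=rtol)
    col_ok = np.isclose(C.sum(axis=0), Cf[-1, :-1], rtol=rtol)
    if row_ok.all() and col_ok.all():
        return C
    raise ArithmeticError(
        f"checksum mismatch at element ({int(np.argmin(row_ok))}, {int(np.argmin(col_ok))})")

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
C = abft_matmul(A, B)                         # clean run passes the checks
try:
    abft_matmul(A, B, inject=(1, 2, 5.0))
except ArithmeticError as e:
    print(e)                                  # checksum mismatch at element (1, 2)
```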

4.6. Yield Improvement

The need for using redundant modules for yield improvement is well recognized. Much work has been done on developing schemes for reconfiguring a defective circuit into a defect-free one. Recent results indicate that on-chip use of coding can enhance yield quite effectively. Such approaches need to be further explored.

4.7. CAD Tools for Fault Tolerance

One reason why self-checking circuits and other schemes which are based on codes have not found the popularity that they deserve may be that CAD tools to design these circuits are not available. Designers who are not aware of these results tend to avoid learning about them and incorporating them in their chips. One way around this would be to start developing CAD tools which aid in the development of circuits with on-chip error detection and correction capabilities. Similar tools can be developed for incorporating fault tolerance at other levels too.

4.8. Interaction Between Industry and Academia

Concern was expressed over the lack of sufficient interaction between industry and academia in fault-tolerant computing. More universities need to offer courses to expose future engineers to this area so that they are not reluctant to incorporate fault tolerant schemes into their designs later.

At another level, sharing of data between industry and academia can be mutually beneficial. For example, data about chip failure statistics can help researchers develop more realistic fault models. More interaction is also needed to make university researchers aware of the problems that are important for industry to solve.

5. REPORT BY THE EVALUATION GROUP Kishor S. Trivedi

Throughout the workshop, participants focused on the following key problems of performance modeling: What are the measures of effectiveness, and how are they computed? What instruments are used to gather the performance/error data? What are the modeling languages? How do we integrate models, such as combinatorial, queuing, and Markovian models, while assessing a complex system? How do we integrate measures, such as performance and dependability, to assess system behavior more effectively? How do we integrate different types of faults, such as design faults and operational faults, in one model?

Choosing the appropriate measures is highly application dependent. One measure may effectively express the characteristics of one application while being altogether inappropriate for another. Commonly used measures are fault-free performance (response time and utilization), dependability (such as reliability, availability, safety), performance in the presence of faults (such as performability), cost and price of a commercial product, and area or size of an IC chip.

Generally speaking, the performance measures of interest in a fault tolerant system are the following: availability, reliability, response time, utilization, speed-up and scalability. The mean values of these measures are commonly used to describe the behavior of the system, though variances and distribution functions of the measures are needed in certain situations.

There are several modeling techniques that are used to evaluate fault-tolerant systems. Analytical models, such as queuing models and Markov models, provide algebraic or numerical results which can predict the performance of those systems that can be adequately modeled by these methods. Unfortunately, these models require relatively strict and often unrealistic assumptions to make the mathematical analysis tractable. Monte Carlo simulation is an alternative to the analytic approach. The advantage of Monte Carlo simulation is its simplicity. However, it is usually very expensive. Moreover, it does not enjoy the simple representation of results found in the analytic cases. If some parameters in the model change, another simulation has to be carried out, and generally the results from the previous simulation make no contribution to the new results. An exception noted here is the work of Suri and Ho.
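The contrast between the two approaches can be seen on even the simplest repairable system. The sketch below estimates steady-state availability by simulating alternating exponential up and down periods and compares the estimate with the closed-form answer mu / (lambda + mu); the rates are arbitrary illustrative values, not data from the workshop.

```python
import random

lam, mu = 0.01, 1.0        # failure and repair rates (per hour), illustrative only
horizon = 1_000_000.0      # simulated hours

def simulate_availability(seed: int = 1) -> float:
    rng = random.Random(seed)
    t = up_time = 0.0
    while t < horizon:
        ttf = rng.expovariate(lam)             # time to failure (up period)
        up_time += min(ttf, horizon - t)
        t += ttf
        if t >= horizon:
            break
        t += rng.expovariate(mu)               # repair time (system down)
    return up_time / horizon

print(f"Monte Carlo estimate : {simulate_availability():.5f}")
print(f"Analytic mu/(lam+mu) : {mu / (lam + mu):.5f}")
```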

Among the analytic models, Markov models, queuing models, reliability graphs, and fault trees are the most popular techniques. The last three are more efficient but restrictive, while Markov models are more general but the number of states creates a problem. Stochastic Petri nets reduce the burden of large model construction. Largeness avoidance and fixed-point iteration seem to be the way of the future.

Calibration is another important issue that drew much attention in this workshop. It is not clear if there exist simple guidelines to determine the parameters of a model.

More research on handling large models, stiff models, and models with nonexponential distributions is needed.

6. LIST OF ATTENDEES

The following people attended the workshop, and contributed to the group discussions.

Ashok Agrawala, University of Maryland
Paul Ammann, Software Productivity Consortium
Jo Atlee, University of Maryland
Richard Beigel, Johns Hopkins University
J.P. Black, University of Waterloo
Roy H. Campbell, University of Illinois
Sharat Chandran, University of Maryland
A.T. Dahbura, AT&T Bell Laboratories
Robert Dancey, IBM
Carl Elks, NASA Langley Research Center
A.C. Fu, Princeton University
Hector Garcia-Molina, Princeton University
Amrit Goel, Syracuse University
Maurice Herlihy, Carnegie-Mellon University
Yennum Huang, University of Maryland
Pankaj Jalote, University of Maryland


Niraj Jha, Princeton University
K. Kant, Pennsylvania State University
K. H. Kim, University of California at Irvine
John Knight, Software Productivity Consortium
Philip E. Kramer, Stratus Computer
S. Y. Kung, Princeton University
Clifford Lau, Office of Naval Research
Ming-Yee Lai, X'IS System Architecture
Richard J. LeBlanc, Georgia Institute of Technology
Nancy Leveson, MIT
De-Ron Liang, University of Maryland
Sam Lomonaco, University of Maryland-Baltimore
Gerald Masson, Johns Hopkins University
Frank Mathur, Hacienda Heights, CA
Michelle McElvany, Allied-Signal Aerospace Co.
J.F. Meyer, University of Michigan
Daniel Mosse, University of Maryland
Kazuo Nakajima, University of Maryland
John Palaimo, RADC/COEE
Dan Palumbo, NASA Langley Research Center
David V. Pitts, University of Lowell
Dhiraj K. Pradhan, University of Massachusetts
Paul J. Prisaznuk, AEEC Aeronautical Radio, INC.
Jim Purtilo, University of Maryland
Arthur S. Robinson, Systems Technology Dev. Corp.
Kenneth Salem, University of Maryland
John R. Samson, Honeywell INC.
Richard D. Schlichting, University of Arizona
Zary Segall, Carnegie-Mellon University
Liuba Shrira, MIT
Deepinder Sidhu, University of Maryland-Baltimore
John A. Sjogren, NASA Langley Research Center
Jim Smith, Office of Naval Research
David Stotts, University of Maryland
Gregory Sullivan, Johns Hopkins University
David J. Taylor, University of Waterloo
Philip Thambidurai, Allied-Signal Aerospace
Satish K. Tripathi, University of Maryland
Kishor Trivedi, Duke University
William Weihl, MIT
Tom Wilkes, University of Lowell
Jeanette Wing, Carnegie-Mellon University
Steve Young, NASA Langley Research Center
