00888642

Upload: ankit-batra

Post on 09-Apr-2018

224 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/8/2019 00888642

    1/14

    Concurrent Exception Handling and Resolutionin Distributed Object Systems

    Jie Xu, Alexander Romanovsky, and Brian Randell

    AbstractWe address the problem of how to handle exceptions in distributed object systems. In a distributed computing environment,

    exceptions may be raised simultaneously in different processing nodes and thus need to be treated in a coordinated manner.

    Mishandling concurrent exceptions can lead to catastrophic consequences. We take two kinds of concurrency into account: 1) Several

    objects are designed collectively and invoked concurrently to achieve a global goal and 2) multiple objects (or object groups) that are

    designed independently compete for the same system resources. We propose a new distributed algorithm for resolving concurrent

    exceptions and show that the algorithm works correctly even in complex nested situations, and is an improvement over previous

    proposals in that it requires only O(nmaxN2) messages, thereby permitting quicker response to exceptions.

    Index TermsConcurrent exeception handling, distributed systems, exception resolution, nested atomic actions, object-oriented

    programming.

    1 INTRODUCTION

    CONCURRENT and distributed computing systems oftengive rise to very complex asynchronous and interactingactivities. The provision of error recovery is an extremelydifficult problem in such circumstances [24]. In order tocontrol this complexity and, hence, facilitate the provisionof error recovery, some way of restricting interaction andcommunication will be required. The concept of coordinatedatomic actions (or CA actions) that has been developed atNewcastle University [29], [30], [31] is one such mechanismfor the strict enclosure of interaction and recovery activities.A CA action coordinates error recovery between multiple

    interacting objects in an object-oriented (OO) or object-based system by integrating two complementary conceptsconversations [24] (together with a technique for concur-rent exception handling [5]) and transactions [19]. The workreported in this paper is essentially a continuation ofresearch on CA actions, concentrating on technical details ofconcurrent exception handling and resolution.

    Using the CA action framework as one kind of controlabstraction, we establish an exception model that clarifiesconcepts such as exception context, declarations, andpropagation in a distributed object system. We develop anabortion mechanism in order to tackle the problem ofhandling exceptional situations across nested levels of CAactions. In a distributed computing environment, anotherdelicate problem is that of concurrent exceptions [5]different processing nodes may raise different exceptionsand the exceptions may be raised simultaneously. This canfurther complicate the process of exception handling. Takea twin-engine aircraft as an example. If just the left (or right)

    engine fails, the controls can be adjusted appropriately to

    compensate for the loss of the left (right) engine. However,

    if both the right and left engine fail at the same time,

    handling both engine-loss exceptions separately and se-

    quentially can lead to catastrophic consequences. Another

    example is that of a distributed control system composed of

    several components. Two exceptions, Fire_Alarm and

    Gas_Leakage, may be raised concurrently by different

    components. It is obvious that handling such exceptions

    sequentially (in either order) is not adequate and a new

    method that can handle a combined occurrence of these two

    exceptions has to be developed.There are a variety of reasons why several exceptions

    may be raised concurrently. For example, in practice, it is

    often difficult to interrupt the normal operations of the

    other nodes immediately after an exception has been raised,

    so new exceptions may be raised in these other nodes before

    they are informed of the initial exception. In addition,

    concurrently raised exceptions may be merely a manifesta-

    tion in multiple nodes of a system-wide exception. In

    Section 3.2, we present a detailed discussion of the necessity

    of coping with concurrent exceptions and argue that a

    hierarchy-based approach is essential in order to find a

    higher-order exception that can cover all the exceptionsraised concurrently. This further requires a distributed

    scheme for determining the proper recovery strategy and

    for involving all the related nodes in the recovery activity.In 1986, Campbell and Randell [5] introduced the first

    algorithm (referred to as the CR algorithm) for concurrent

    exception resolution in a process-oriented system. We

    identify a number of issues involved in the CR algorithm

    and present a new distributed algorithm relying on strictly

    defined, practically oriented assumptions. This new, object-

    oriented algorithm is of lower message complexity than the

    original solution and it is formally proven and implemented

    in distributed Ada 95 [1].

    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 10, OCTOBER 2000 1019

    . J. Xu is with the Department of Computer Science, University of Durham,DH1 3LE, UK. E-mail: [email protected].

    . A. Romanovsky and B. Randell are with the Department of ComputerScience, University of Newcastle upon Tyne, NE1 7RU, UK.

    Manuscript received 4 Sept. 1996; revised 23 May 2000; accepted 28 Aug.2000.For information on obtaining reprints of this article, please send e-mail to:

    [email protected], and reference IEEECS Log Number 100287.1045-9219/00/$10.00 2000 IEEE

  • 8/8/2019 00888642

    2/14

    2 FAULTS, ERRORS, AND EXCEPTION HANDLING

    In this paper, we consider a distributed system consisting ofnodes connected by a communication network. The objectsthat run on network nodes communicate with each other bymessage passing. The CA action framework takes bothhardware and software faults into account [29]. Hardware

    faults include crashes of, or transient faults in, nodes or thecommunication network. Erroneous information mayspread through communication channels. However, asassumed by others in similar research on exceptions [8],[12], we assume that exception handling embedded in aCA action needs to be effective and responsible just forsome specific expected conditions, i.e., errors detected bytests and programs running on fault-free hardware,including those detected and signaled by hardware or bylower level services such as file services.

    2.1 Error Recovery and Exception Mechanisms

    Fault-tolerant software detects errors produced by faultsand employs error recovery techniques to restore normalcomputation. Forward error recovery (mostly exceptionhandling schemes) is based on the use of redundant datathat repairs the system by analyzing the detected error andputting the system into a correct state. In contrast, backwarderror recovery returns the system to a previous (presumedto be) error-free state without requiring detailed knowledgeof the errors.

    An exception (handling) mechanism is a (more language-oriented) control structure that allows programmers todescribe the replacement of the normal program execution by an exceptional execution when occurrence of anexception (i.e., inconsistency with the program specificationand, hence, an interruption to the normal flow of control) isdetected (see [8] for rigorous and thorough discussion).There have been some considerable efforts in developingformal semantics of exception handling (e.g., ML [23] andHoare's s1 otherwise s2 construct). The exceptionmechanism is usually considered an essential part of amodern language (see Ada 95 [1], C++, Eiffel [22], and Java,for example). Ideally, it should be coherent with thelanguage, the entire programming paradigm, and thedesign methodology.

    For any given exception mechanism, exception contexts[5], namely regions in which the same exceptions aretreated in the same way, have to be declared (very oftenthey are blocks or procedure bodies). Each context shouldhave a set of associated exception handlers, one of which iscalled when a corresponding exception is raised. There aredifferent exception models: The termination exception modelassumes that, when an exception is raised, the correspond-ing handler copes with the exception and completes the block execution; the resumption model assumes that thehandler recovers the program state and the program thencontinues the execution from the operation following thatwhich raised the exception. If the handler for the raisedexception does not exist in the context or it is not able torecover the program, then the exception will be propagated.Such exception propagation often goes through a chain of

    procedure calls or of nested blocks. The appropriate handler

    is sought in the exception context containing the contextwhich raised or propagated the exception.

    Exception handling and the provision of error recoveryare extremely difficult in concurrent and distributedsystems. Much previous work has focused on the con-currency aspect of such systems (see the subsequentdiscussion) without addressing the other important aspect

    distribution. Note that each node in a distributed systemmay possess a separate memory; as a consequence, softwaresegments executing on different nodes will reside in disjointaddress spaces and so must communicate by the exchangeof messages over relatively narrow bandwidth communica-tion channels [2]. The time of message passing is thereforenot negligible. Obviously, implementing coordinated re-covery in a loosely coupled distributed system is a moredifficult problem than designing similar schemes undereither a uniprocessor or a multiprocessor environment withcommon memory. Special protocols must be designed inorder to ensure timely response to exceptions and coordi-nated error recovery in spite of physical distribution.

    2.2 Conversations and Concurrent ExceptionHandling

    The concept of conversations was first proposed in [24] andintended to provide joint backward error recovery ofconcurrent processes that have been designed to cooperate by exchanging information. Each process participating insuch a conversation must save its state on entering it. Whileinside a conversation, a process can only communicate withother processes in the same conversation. If any processfails its acceptance test, then every participating processwill roll back to the saved state and retry, perhaps using analternate algorithm. Processes can enter a conversationasynchronously, but (at least notionally) must leave it at the

    same time once the acceptance test in each process has beensatisfied. This general idea has been extended in severaldifferent ways in later research (see [20] for comprehensivediscussion).

    A systematic approach to concurrent exception handlingwas developed in [5] by integrating an exception mechan-ism into the conversation scheme and extending the well-known atomic action paradigm [14]. A set of exceptions isassociated with each action (i.e., each conversation). Eachprocess participating in an action has a set of handlers forall or some of the predefined exceptions. When any of theseexceptions is raised in any process, appropriate handlers(for the same exception in all processes) will be initiated in

    all action participants. The notion of exception resolutionand the resolution mechanism proposed in [5] are criticalwhile considering multiple concurrent exceptions. Theexception tree structure introduced in [5] has advantagesover simple exception priorities for resolving these excep-tions. (An exception tree consists of all exceptions asso-ciated with the action and imposes a partial order on themso that a higher exception has a handler which is intendedto handle any lower level exception.) However, theidentification of exceptions and the establishment ofexception trees are not a simple task for an actualapplication. The developer of the application has to analyzetypical abnormal situations to define exceptions and the

    exception tree based on application-specific characteristics,

    1020 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 10, OCTOBER 2000

  • 8/8/2019 00888642

    3/14

    fault assumptions, and available resources. The workreported in [31] demonstrated how these issues wereaddressed in an industrial case study.

    The nesting of actions (or conversations) presents newproblems not normally encountered in nonnested circum-stances. For example, an exception may be raised in aprocess that has not yet entered a nested action, althoughsome participants of the same containing action where the

    exception is raised have entered this nested action. Theauthors of [5] suggested two methods for overcoming thisdifficulty. The first method is to wait until the nested actionis completeda natural decision because the execution ofthe nested action is invisible and indivisible for thecontaining action (see Fig. 1a, where P1, P2, and P3 areparticipating processes). An alternative method is to raisean abortion exception in all participants of the nested actionin order to abort the nested action completely, as shown inFig. 1b. (One can implement an abortion handler in eachparticipating process of a nested action or let the underlyingsupport mechanism handle any abortion exception.) Thesecond method seems to be more practical. First of all, it can

    be the case that a process detecting an error is expected toenter the nested action but will never be able to, so otherprocesses in the nested action would wait forever for it tocontinue execution. Second, for real-time systems, it wouldbe more predictable to abort the nested action than to waitfor its completion [5].

    There has been relatively little work on implementationsof coordinated error recovery in a distributed system.Implementations of distributed process-oriented conversa-tions are discussed in [13], [32]. Of these, [32] focused ontwo particular conversation schemes (i.e., the name-linkedrecovery block and the abstract data type) and addressedvarious implementation issues which are specific to the

    chosen conversation structures. The work in [13] discusseda distributed implementation of the conversation schemeusing broadcasts, assuming that all processes will entereach conversation simultaneously. Neither approach can beused directly to implement the CA action scheme becausethey focus on some particular schemes, with no support forforward error recovery and for OO systems. The Archelanguage introduced in [12] allows the application pro-grammer to implement a function that can resolve theexceptions propagated from several objects (i.e., differentimplementations) of the same type. The resolution functiontakes all exceptions that have been raised and not handledin those objects as input parameters and returns the only

    concerted exception that will be handled in the context of

    the calling object. Although the Arche approach is object-oriented, it cannot be used for CA actions since it supportsonly a limited kind of concurrency, which relies heavily onthe underlying multifunction call feature. As a result, it can be used for NVP-type schemes [3], but is not suitable forcooperative concurrency and recovery of several objectswith different types. Moreover, parameterized exceptions,though suitable for Arche, cannot be used directly for

    CA actions because the handler for an exception which wasnot raised may be still called once exception resolution isrequired.

    2.3 Exception Mechanisms in RealisticOO Languages

    In reality, exceptions in OO languages can be declaredeither as classes, objects, or strings [10], [26], whileexception handlers can be declared and attached to thelevel of statements, methods, classes, or objects. Somelanguages, such as C++, Modula-3, and Arche [12], onlyallow exception handlers to be attached to statements andothers, such as Lore [10], Eiffel [22], Guide [4], extendedC++ [21], and extended Ada [9], permit exception handlers

    to be attached to methods and objects or classes. Suchflexible attachment has numerous advantages:

    1. a clear separation of an object's abnormal behaviorfrom its normal one, in accordance with the conceptof an idealized fault-tolerant component [14];

    2. object/class recovery provided at the object level;3. exceptions associated with types; and4. software layering which facilitates the design of

    fault-tolerant systems.

    There is evidence from practice [9] as well: The use of objectexception handlers can decrease program complexity andfacilitate program design, maintenance, and reuse.

    Object/class exception propagation is another importanttopic. Lore, Eiffel, and Guide propagate exceptions throughthe call chain; the exception context is associated not onlywith the method execution, but also with the object/classitself. In extended Ada, exceptions cannot be passed out ofclass handlers, whereas extended C++ propagates the classexception along the object creation chain (which may ormay not coincide with the call chain).

    There are only a few concurrent OO languages, such asJava, Ada 95 [1], Arche, and Guide, known to us that haveexception handling features. Java has a C++-like exceptionmechanism, without specially coping with concurrency-related (or multithreaded) issues. Ada 95 allows handlers to

    be called in several concurrent tasks when an exception has

    XU ET AL.: CONCURRENT EXCEPTION HANDLING AND RESOLUTION IN DISTRIBUTED OBJECT SYSTEMS 1021

    Fig. 1. Two methods for treating nested actions while an exception is raised.

  • 8/8/2019 00888642

    4/14

    been raised in one of them. This language has a limitedform of concurrent-specific exception propagationanexception will be propagated to both calling and calledtasks if it is raised during the rendezvous. Arche permits

    user-defined resolution of concurrent exceptions among agroup of objects that belong to different implementations ofa given type, which, unfortunately, is not generallyapplicable to the coordination of multiple interacting objectswith different types. For our purposes, we need anexception model applicable to any group of interactingobjects whether or not of the same type.

    3 CA ACTIONS: CONCURRENCY, COORDINATION,AND EXCEPTION RESOLUTION

    The CA action scheme presents a general technique forachieving fault tolerance in concurrent and distributedOO software by integrating conversation-type structures,

    transactions, and concurrent exception handling into auniform conceptual framework [29]. This technique allowscomplex OO software to be designed in a disciplined andstructured way. CA actions take two kinds of concurrencyinto account: cooperating and competing [14]. Several objectscan be designed collectively by different programmers (orteams) and invoked concurrently in order to achieve certain joint and global goals. But, these objects must cooperatewithin the boundaries of a CA action. Competitiveconcurrency may also exist in such systems since two ormore separately designed objects can compete concurrentlyfor the same system resources (i.e., objects). More precisely,CA actions use conversations as a mechanism for control-

    ling concurrency and communication between objects thathave been designed to cooperate with each other (referredto as participating objects of the CA action). Shared externalobjects are controlled by the associated transaction mechan-ism that guarantees the ACID properties (atomicity,consistency, isolation, durability [19]). In particular, objectsthat are external to the CA action, and can hence be sharedwith other actions and objects concurrently, must be atomicand individually responsible for their own integrity.

    Fig. 2 shows an example in which two participatingobjects enter a CA action in order to play the correspondingroles. Within the CA action, two concurrent roles commu-nicate with each other and manipulate the external objects

    cooperatively in pursuit of some common goal. However,

    during the execution of the CA action, an exception e israised by one of the roles. The other role is then informed ofthe exception and both roles transfer control to theirrespective exception handlers, H1 and H2 for this particular

    exception, which attempt to perform forward error recov-ery. The effects of erroneous operations on external objectsare repaired by putting the objects into new correct states sothat the CA action is able to exit with an acceptableoutcome. Two participating objects leave the CA actionsynchronously at the end of the action. (As an alternative toperforming forward error recovery, the CA action couldundo the effects of operations on the external objects, rollback, and then try again, possibly using diversely designedsoftware alternates.)

    In principle, exception handling (or forward errorrecovery in general) can and should be integrated into theCA action framework. But, this cannot be achieved before

    we find appropriate solutions to three fundamentalproblems: exception model, exceptions in nested actions,and exception resolutionit is these problems that are thefocus of this paper.

    3.1 Exception Model for CA Actions

    To be generally applicable, our model assumes thatexceptions may be declared in any of the ways discussedin Section 2.3. The exceptions that can be raised within aCA action must be declared, together with the actiondeclaration, and exception handlers must be associatedwith participating objects of the CA action. Thus, when aparticipating object enters the action, it enters the corre-

    sponding exception context. A subset of these participatingobjects may further enter a nested CA action, which has allthe properties of a nested transaction in terms of atomicobjects. Note that an exception is raised within a CA action,but signaled from a nested action to its containing action.Since the nesting of CA actions causes the nesting ofexception contexts, it must thus be guaranteed that eachparticipating object of the nested action is associated withan appropriate set of handlers. In practice, such associationcould be done either statically or dynamically. Once theassociation is provided, clear semantics of exceptionpropagation can be enforced easily. Exceptions can bepropagated along nested exception contexts, corresponding

    to the chain of nested CA actions.

    1022 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 10, OCTOBER 2000

    Fig. 2. Forward error recovery in a CA action.

  • 8/8/2019 00888642

    5/14

    However, the issue of how handlers are associatedcorrectly with the exception context depends upon theparticular way objects enter a CA action and uponpeculiarities of the target OO system. If an object entersan action through an operation call and stays in it until thatoperation is completed, then operation-level exceptionswould be appropriate for the required association. Other-

    wise, only object/class level exceptions can be used if theexception context cannot be changed dynamically. Here, wesimply adhere to the termination model of exceptionsinany exceptional situations, handlers take over the duties ofparticipating objects in a CA action and complete the actioneither successfully or by signaling a failure exception to thecontaining action.

    Because a CA action may cope with two kinds ofconcurrency, external shared objects must be treatedexplicitly when forward error recovery is requested. Wedo not impose strict rules on the use of atomic objectsduring forward recovery, but require that these atomicobjects should be always left to be in a consistent state

    immediately after recovery. It is particularly important tonotice that an exception raised within the CA action doesnot necessarily cause restoration of all the atomic objects totheir prior states. The appropriate exception handlers maywell be able to lead them to new valid states (see Fig. 3a,where O1 and O2 are participating objects).

    If any of the external shared objects fails to reach acorrect state, a failure exception must be signaled to thecontaining CA action. A CA action can start a new attemptafter backward error recovery has been performed on theexternal objects, as illustrated in Fig. 3b. A new associatedtransaction will be then issued [29]. In principle, objectprogrammers have the freedom of choosing appropriate

    policies in order to guarantee the consistency of externalatomic objects. There would be a wide spectrum ofapplication-specific strategies: from simple correction ofthe erroneous states through handlers to the bottom lineof relying on undoing all previous modifications.

    3.2 Necessity of Concurrent Exception Resolution

    One of the most basic questions as to the importance of ourwork concerns whether it is necessary to deal withconcurrent exceptions; there may be a very low, negligibleprobability that multiple exceptions arise in a system at theexactly same time. However, after a careful examination ofthe problem, we argue the necessity of coping withconcurrent exceptions for the following reasons:

    . It is, in practice, difficult to interrupt the normaloperations of all participating objects immediatelyafter one of them has raised an exception. Theprobability that new exceptions are raised in otherobjects before they are informed of this exception ismuch higher in a distributed system than in a uni- ormultiprocessor system with common main memory.

    .

    Because no error detection tools are perfect, thelatent period of an error is not negligible and, so,erroneous information can be spread within the boundaries of a CA action; thus, several errorsoccurring concurrently in different objects can be thesymptoms of a different, more serious fault [5].

    . Due to the nesting of CA actions and the timeinterval, which may be lengthy, between the firstoccurrence of an exception in a nested action and theend of that nested action (which may lead to furtherexceptions), several successive exceptions raised indifferent containing and nested actions can be, ineffect, concurrent.

    . Different participating objects can be involved in

    nested CA actions at different levels so that theirexception contexts may be different.

    . In distributed systems, the overall hardware-relatedfailure probability is relatively higher and they aremore difficult to program without design faults thancentralized systems [7].

    . Finally, very often there is a correlation betweenerrors so that they happen in a very short period oftime in different participating objects. On one hand,due to hardware-related operational errors, severalnodes can be affected by the same bad conditions orby a channel through which traffic between severalnodes can be damaged. On the other hand, because

    participating objects of a CA action were designedcooperatively from a given specification, an error inthe specification or cooperative misunderstandingduring the design could affect several or all of theparticipating objects.

    When exceptions are raised concurrently, we cannot

    simply handle them sequentially in some order. A tradi-

    tional priority-based method is not adequate, either. Recall

    the aircraft example in the Introduction section. The

    exceptions raised concurrently in two engines must be

    handled cooperatively and at the same time. A hierarchy-

    based approach is therefore needed in order to find a single

    higher-order exception that can cover all the exceptions

    XU ET AL.: CONCURRENT EXCEPTION HANDLING AND RESOLUTION IN DISTRIBUTED OBJECT SYSTEMS 1023

    Fig. 3. Error recovery in external atomic objects. (a) Forward error recovery. (b) Backward error recovery.

  • 8/8/2019 00888642

    6/14

    raised concurrently. This further requires a distributedresolution scheme for determining the correct recoverystrategy and for involving all the participating objects in therecovery activity.

    3.3 Concurrent Exception Resolution: Issues andDifficulties

    There are three sources of exceptions defined in [5]. Thefirst source is of exceptions that are raised during theexecution of the application code; the second is ofexceptions signaled by the nested action; the third is ofexceptions that are raised because participants of the atomicaction received information about an exception raised insome other participant, but have no handler for it, so theyhave to examine the exception tree and find and raise a newappropriate exception (for which there is a handler). This ismainly because not all of the exceptions declared in theaction declaration will necessarily have associated handlersin each participant of the action (although each participantcould contain the use of the default handler). In [5], eachparticipant knows only a reduced local tree of exceptionswith specific handlers and has to look through it afterraising each exception and after each resolution. However,repeated search of the local tree could cause a kind ofdomino effect; in certain cases, the CA action will fail inspite of there being several handlers implemented inrespective participating objects. This could happen if theexception tree is organized as a directed chain: Consider anaction A which has the exception tree TA and twoparticipating objects O1 and O2, with two reduced trees(i.e., sets of exceptions with specific handlers) TO1 and TO2,respectively.

    e X I b P b Q b R b S b T

    b U b V

    yI X I b Q b S b U

    yP X P b R b T b V

    Note that if exception e8 is raised in O2 and O1 isinformed of it, then O1 has to find the appropriate exceptionto raise (in this example it is e7). When O2 is informed ofe7, it has to raise the further exception e6, in which case e5will be raised in O1, and so on. Therefore, in this example,any exception will always lead to further exceptions untilthe root of the exception tree is reached. To solve thisproblem, we will assume in our new mechanism that eachparticipating object has handlers for all exceptions declared

    in a given action. It is, we argue, a natural assumption sinceparticipating objects are implemented cooperatively and allof them should be involved in any activity of exceptionhandling.

    Another important issue is how to raise an abortionexception in nested actions. The CR algorithm relies heavilyon such abortion, but assumes that the related operationscan be provided by the underlying system. However, wefound that this is not a trivial problem. Consider fourconcurrent objects, O1, O2, O3, and O4, in several nestedatomic actions (see Fig. 4). If O1 detects an error and thusraises an exception, O2, O3, and O4 will be informedsubsequently of the exception. Since O1 may know nothing

    about nested actions A2 and A3, O2, O3, and O4 are

    responsible for actual abortion of these actions. Several

    problems (which were not adequately discussed in [5]) are

    as follows:

    1. A3 should be aborted before A2;2. O2, O3, and O4 are responsible for aborting A2;3. If O1 was supposed to enter A2 and A3 but failed to

    do so due to an error (O1 is thus a belated participantfor A2 and A3), O2, O3, and O4 could wait for it tocomplete the abortion of A2 and A3 (so abortionhandlers must be used that have been implementedin a very special way in order to avoid deadlocks);

    4. If O2 raises an exception as well, all A3 participants(maybe including O1 as a belated participant, see thereason above) must participate in error recovery; anylower level resolution performed by O2 should beignored when the resolution is started by O1 withinA1 (however, since belated participants can partici-pate in the resolution only when they enter thenested action, the entire protocol execution for

    resolution has to be delayed); and5. To abort nested actions, only abortion handlersshould be executed because the execution of otherhandlers would not guarantee correct abortion.

    Hence, all exceptions signaled by abortion handlers in a

    nested action have to be ignored unless the action is nested

    directly in the action where an exception was raised (e.g., all

    signaling from within the nested action A3, but not from A2,

    will be ignored for the resolution performed by Action A1).In fact, from the point of view of a single object that

    cooperates with other objects in a distributed system, our

    exception model can be described in the following way: If

    an exception is raised in the object, it is guaranteed that ahandler for the exception or for an exception with a higher

    order in the resolution tree will be executed. An object may

    be interrupted at any time and forced to perform an

    exception handler even if it did not encounter any problem

    of its own. This includes the situation in which the current

    action the object participates in is aborted and the object is

    thus forced to execute its corresponding abortion handler.

    We shall describe a new algorithm for exception resolution

    below, based on a set of carefully defined assumptions,

    which allow us to overcome all the above-mentioned

    problems.

    1024 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 10, OCTOBER 2000

    Fig. 4. An example of four objects in nested actions.

  • 8/8/2019 00888642

    7/14

    4 DISTRIBUTED ALGORITHM FOR EXCEPTIONRESOLUTION

    According to our model, objects may enter a CA actionasynchronously. A (centralized or decentralized) managerof CA actions has 1) to guarantee that all participatingobjects will wait for each other on the acceptance test line

    while using backward error recovery or 2) to invokeexception handlers for the same exception in these objectsin order to provide coordinated forward error recovery.

    4.1 Assumptions and Definitions

    It is assumed that, for a given CA action, each participatingobject knows all other participating objects of the sameaction and has the same exception tree (which is staticallydeclared). Each object also has a name list of the nestedactions it participates in. The currently innermost action forthe object is called the active CA action (for that object).Again, note that nested actions can end their executions bysignaling an exception to the containing CA action.

    In order for action Ai to abort a nested action, an abortion

    exception must be raised within the nested action and anyactivity of the nested action stopped (including any nestedresolution in progress and execution of any handlers). Eachobject in this nested action then starts the correspondingabortion handler. In general, when an object in its activeaction Ai+k needs to take part in the abortion of a chain ofthe nested actions Ai+1 (the outermost), Ai+2, ..., Ai+k (theinnermost), it must execute abortion handlers in the order(i + k), (i + k 1), ..., (i + 1), ignoring any exception whichmay be signaled to a containing action. During the processof abortion, only the exception signaled by abortionhandlers of directly nested action Ai+1 is allowed to beraised in the containing action Ai. This is simply because

    any handler for a specific exception cannot be called inthose actions which have to be aborted.

    An object transits from the normal state to the abnormalstate when 1) an exception is raised within the object or 2) itreceives the message concerning an exception in one ormore other objects. Again, it is important to notice that anobject in the exceptional state, of action Ai, may raise afurther exception which is signalled by abortion handlers ofthe nested action Ai+1. However, we assume that only onesuch exception can be raised within action Ai. The handlerfor the exception is intended to perform the simplelast-will recovery (see the discussion in [5]). We allowfor the possibility that the abortion handlers of the nested

    action Ai+1 signal different exceptions to the containingaction Ai (though, in accordance with [5], we believe that, inpractice, the same exceptions should be signaled from allparticipating objects of action Ai+1). If there exists any

    belated participating object of action Ai+1, the abortionhandlers of other participating objects will not have to waitfor it in order to carry out abortion promptly.

    Let CA-action be the outermost CA action. We defineGCA-action as the group of all participating objects {O1, O2, ...,Oi, ..., Oj, ...}, where each object Oi has a unique number andall objects are ordered (e.g., object names and the lexico-graphic ordering could be used). Such ordering helps todynamically identify a unique object among objects that

    raised exceptions and the chosen object will be responsible

    for performing actual exception resolution. Let A be theactive action of Oi and GA be the corresponding set ofparticipating objects. We assume that each object Oi has thefollowing data structures:

    list LEirecords exceptions that have been raised orsuspended states of objects that have halted normalcomputation;

    stack SAistores the exception context and the exceptiontree corresponding to each of the nested CA actions.

    In the interests of simplicity and brevity, we assume thatapplication-related message passing is treated indepen-dently. In our algorithm, only the following specificmessages are used:

    Exception(A, Oi, E) is sent by object Oi to all participatingobjects of actionA when an exception E is raised within Oi;

    Suspended(A, Oi, S) is sent by each object Oi that does notraise any exception, but has received Exception orSuspended messages from the others;

    Commit(A, E) is sent by a chosen object in action A to allparticipating objects after it completes resolution ofexceptions, where E is the resolved exception. Acorresponding handler for E will be called by each objectonce it receives this Commit message.

    4.2 The Algorithm

    Our algorithm is based on the general support provided bythe underlying system, including FIFO message sending/receiving between objects and calls to abortion handlers.During its execution, a participating object Oi may be in oneof the following states (denoted by S(Oi)): N = Normal, X =Exceptional (if an exception was raised in Oi), and S =

    Suspended (ifOi has to stop the normal computation due tothe exceptions raised in other objects). In addition, 3 stands for put in and A for sent to in the descriptionof our algorithm. (For a possible implementation, aprocess running this algorithm may be associated witheach pair (Oi, A) of objects and CA actions.)

    Algorithm 1:

    For any Oi, S(Oi): = N; and empty LEi, SAi;loop

    if Oi enters A then

    3 SAi; consume messages having arrived;end if;

    if Oi completes A thendelete the top element in SAi;

    S(Oi) : = N if A ends with success or S(Oi) : = X if A endswith failure;

    leave A (synchronously);end if;

    if Ei is raised in Oi then

    S(Oi) : = X; 3 LEi;Exception(A, Oi, Ei) A all Oj in GA;exception information A external objects (used by Oiwithin A);

    end if;

    XU ET AL.: CONCURRENT EXCEPTION HANDLING AND RESOLUTION IN DISTRIBUTED OBJECT SYSTEMS 1025

  • 8/8/2019 00888642

    8/14

    if Oi receives Exception(A*, Oj, Ej) or

    Suspended(A*, Oj, S) thenif A* contains or equals A then // is the top element

    in SAi// or 3 LEi;exception information A uninformed external objects(used by Oi within A*);

    if A* contains A thenabort all nested actions until A*;delete the elements in SAi until ;delete all elements except or

    in LEi;if Eab is raised by the abortion handler then

    S(Oi): = X; 3 LEi;Exception(A*, Oi, Eab) A all Oj in GA*;else S(Oi): = S; 3 LEi;

    Suspended(A*, Oi, S) A all Oj in GA*;end if;

    else if S(Oi) = N then

    S(Oi) : = S; 3 LE

    i;

    Suspended(A*, Oi, S) A all Oj in GA*;end if;

    end if;

    else retain the Exception or Suspended message till

    Oi enters A*;end if;

    end if;

    if Oi has all states, X or S, of other objects within A// is the top element in SAiand Oi has the biggest number among objects with thestate X then

    resolve exceptions in LEi; //find E in the exception treeCommit(A, E) A all objects in GA {Oi};empty LEi and handle E;

    end if;

    if Oi receives Commit(A*, E) thenif = the top element in SAi then empty LEi

    and handle E;end if;

    end if;

    end loop

    4.3 Two Examples

    Before we present the proof of the algorithm, let us considertwo examples which demonstrate how our algorithm

    works.

    Example 1. Assume that three objects, O1, O2, and O3,

    participate in the action A. If exceptions E1 and E2 are

    raised in O1 and O2 concurrently, then the three objects

    will undertake the following steps:O1: sends Exception to O2 and O3. Later on, O1 may

    receive Exception from O2 and Suspended from O3. Itthen waits for the Commit message. Once it receivesCommit(A, E), it will start handling the resolving

    exception E.

    O2: sends Exception to O1 and O3. Later on, it mayreceive Exception from O1 and Suspended from O3.O2 then resolves the exceptions E1 and E2 (becausename(O2) > name(O1)), finds the resolving exception E,sends Commit(A, E) to O1 and O3, and starts handling E.

    O3: receives Exception from O1 or O2 (no matterwho arrived first) and sends the Suspended message to

    O1 and O2 while suspending any normal computation.Once O3 receives Commit(A, E) from O2, it will starthandling E.

    Example 2. Assume that four objects, O1, O2, O3, and O4,participate in a set of nested CA actions (see Fig. 4). Iftwo errors are detected in both O1 and O2 and, hence,exceptions E1 and E2 are raised simultaneously, then thefour objects will undertake the following steps, respec-tively (this example demonstrates how a resolutionstarted in the nested action A3 is eliminated by theresolution performed by the containing action A1):

    O1: sends Exception to O2, O3, and O4. Later on, itreceives suspended from O3 and O4 and Exception(A1,

    O2, E3) from O2 (assuming that an exception E3 wassignaled by the abortion handler in O2 within Action A2).O1 then waits for Commit message (because name(O2) >name(O1)). Once O1 receives Commit(A1, E) from O2, itwill start handling E.

    O2: sends Exception to O3 (but O3 is a belatedparticipant for Action A3 in Fig. 4) and waits for somemessage(s) from O3. Because O3 is not yet in A3, thisException message cannot reach O3. When O2 receivesException from O1, it has to abort nested CA actionsA3 and A2. Since the abortion handler in A2 signals afurther exception E3 to A1, O2 will send Exception(A1,O2, E3) to O1, O3, and O4. During or after the abortion

    process it should receive Suspended from O3 and O4.O2 then resolves the exceptions E1 and E3 (becausename(O2) > name(O1)), finds the resolved exception E,sends Commit(A1, E) to O1, O3, and O4, and startshandling E.

    O3: receives Exception from O1 and has to abort A2.We assume no further exception is signaled by theabortion handler in O3. Thus, O3 sends Suspended toO1, O2, and O4. It should also receive Suspended fromO4 and Exception(A1, O2, E3) from O2. Once O3receives Commit(A1, E) from O2, it will start handling E.

    O4: takes the action similar to that of O3, but it is not a belated participant for Action A3. Once O4 receivesCommit(A1, E) from O2, it will start handling E.

    4.4 Correctness and Complexity

    In order to prove the correctness of our algorithm, werestate the following assumptions:

    Assumption 1. Dependable communication between objects is guaranteed, i.e., no message loss or corruption.

    Assumption 2. FIFO message passing between objects issupported by the target system, i.e., two messages from objectOi will arrive at object Oj in the same order as they were sent.

    For a specific distributed system, let Tmmax be themaximum time of message passing between two objects in

    the system, Treso be the upper bound of the time spent in

    1026 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 10, OCTOBER 2000

  • 8/8/2019 00888642

    9/14

    resolving current exceptions, Tabort be the maximumpossible time for an object to abort one nested CA action,nmax be the maximum number of nesting levels of CAactions (if no nesting, then nmax = 0), and max bemaximum possible time of handling an (resolved) exception.

    The correctness criteria for the proposed algorithm aretwo-fold: 1) There is no deadlock and 2) the resolvedexception does cover all the exceptions raised concurrentlywithin the outermost action. The no deadlock property will

    guarantee that the algorithm will produce a result even-tually and the second criterion will ensure that thedelivered result by the algorithm is actually the intendedresult. To further facilitate the understanding of our proofs,Fig. 5 shows a simple state transition diagram thatillustrates possible state changes of a participating object.

    We now show that no deadlock is possible in ourproposed algorithm.

    Lemma 1. Consider N objects that interact within nested CAactions. For any object Oi, if it reaches the state X (exceptional)or S (suspended), it will complete exception handlingultimately in at most T, where

    Pnmx Qmmx nmxort

    nmx Ireso mxX

    Proof. In order to prove the above bound, let us considerthe worst case, i.e., an object that raises an exception is inthe innermost CA action and each time the abortion of anested action occurs right at the end of exceptionhandling within that nested action.

    Without loss of generality, assume that an object Oi inthe innermost action raises an exception and changes itsstate into X. It will send the exception message to all theother objects, by Assumption 1, which will arrive at themin Tmmax. Since there are no further nested actions withinthe innermost action, any message from the other objectsabout an exception or suspended state would come to Oiin at most 2Tmmax. Note that actual exception resolutionmay take Treso. Therefore, Oi will receive a resolved

    exception and then complete exception handling in atmost (3Tmmax + Treso + max).

    IfOi has not yet left the innermost action, but a furtherexception occurs in its direct containing action, then theabortion of the innermost action will be required. Afterthe abortion, Oi will send either an abortion exception orsuspended message to other objects, which will arrive atthem in (Tabort + Tmmax). Oi will then receive the resolved

    exception (or resolve the exceptions by itself) in at most

    (Treso + Tmmax) and complete exception handling withinmax. The whole process costs at most (2Tmmax + Tabort +

    Treso + max).In the worst case, the above process could be repeated

    nmax times until the outermost CA action is reached.Totally, the repeated process will cost at most

    nmax(2Tmmax + Tabort + Treso + max). Adding the timespent in the innermost action, we thus have that

    Pnmx Qmmx nmxort

    nmx Ireso mxY

    namely, object Oi will complete exception handlingultimately and leave the outermost CA action. tu

    By Lemma 1, we know that any object will completeexception handling within a finite time bound. Therefore,deadlock during the process of exception handling will beimpossible while executing the proposed algorithm. How-ever, in order to prove the entire correctness of theproposed algorithm, we must show that any resolvedexception is a proper cover of the multiple concurrentexceptions that have been raised so far.

    Lemma 2. For a given CA action A, if no exception is raised inany containing action of A, then no more new exceptions willbe raised within A once the exception resolution starts.

    Proof. Assume that, to the contrary, a new exceptionmessage arrives at the resolving object after it hasstarted the resolution. Note that, from the proposedalgorithm, the resolving object must know all the states(X or S) of the participating objects in A before it can begin any actual resolution. Hence, by Assumption 2,the only possibility is that the newly arriving exceptionis caused by an abortion event, namely, A must beaborted by some containing action, contradicting the

    assumption that no exception is raised in any contain-ing action of A. tu

    Lemma 3. Consider N objects that interact within nestedCA actions. If multiple exceptions are raised concurrently,an ultimate resolved exception that covers all the exceptionswill be generated by the proposed algorithm.

    Proof. An exception that is raised in the containingCA action will abort any effect the nested action mayhave made or be making (even if a resolved exception forthe nested action has been identified and the correspond-ing exception handling has been in operation). Note that,however, the number of nesting levels is finite and

    bounded by nmax. Abortion will be no longer possible if

    XU ET AL.: CONCURRENT EXCEPTION HANDLING AND RESOLUTION IN DISTRIBUTED OBJECT SYSTEMS 1027

    Fig. 5. State changes of a participating object.

  • 8/8/2019 00888642

    10/14

    the current active action A is the outermost CA action. ByLemma 2, the exception resolution will start finally andno more new exceptions will be raised. tu

    From Lemmas 2 and 3, we know that a resolved

    exception will always cover all the currently existingexceptions. Any further exception will cause the abortionof any effect of previous resolutions and trigger the newexception resolution. Because deadlock is not possible, thefinal resolved exception will be raised in the end. Wetherefore have the conclusion below.

    Theorem 1. The proposed algorithm is deadlock-free and always performs correct exception resolution.

    Without the nesting of CA actions, it is obvious that themessage complexity of our algorithm is O(N2) messages,where N is the number of the objects participating in the

    outermost CA action. More precisely,

    1. When only one exception is raised and there are nonestedactions,thenthenumberofmessagesis( N+1)(N 1), i.e., (N 1)Exceptions, (N 1)2Suspendeds,and (N 1)Commit messages;

    2. When all N objects have the exceptions raisedsimultaneously, then the number of messages is still(N + 1) (N 1), i.e., N (N 1)Exceptions and(N 1)Commit messages.

    From the proposed algorithm, we can see that thenumber of messages is, in fact, independent of the numberof concurrent exceptions, which is a great improvementover our previous algorithm in [26]. Taking the nesting ofactions into account, we have the theorem below.

    Theorem 2. In the worst case, our proposed algorithm requiresexactly nmax (N2 1) messages.

    Note that the CR algorithm [5] is of complexity O(nmax N3). Our previous algorithm in [26] could use nmax 3N(N 1) messages. Our new algorithm is less complexbecause only one object (rather than all the objects) resolvesmultiple exceptions and only one object needs to send theCommit message. In the interest of fault tolerance, thealgorithm can be easily extended to the use of a group of

    objects that are responsible for performing resolution and

    producing the Commit messages. This only contributes aconstant factor to its total complexity.

    4.5 Implementation and Performance-RelatedAnalysis

    In order to implement the resolution algorithm and supportreliable message passing, a practical way could be to usegroup communication and a group membership service[16]. Participating objects in a CA action could be treated asmembers of a closed group which multicasts servicemessages to all members. Another way would be the useof reflection and meta-objects [21]. The algorithm can beprogrammed as a meta-protocol connecting a set of meta-objects: one for each CA action participant. Exceptions,handlers, and exception contexts should be first classobjects. Such implementation would allow the dynamicchange of different resolution algorithms (e.g., centralizedor decentralized) to be transparent to the application

    programmer. We can use Open C++ [6] as a testbed whichoffers ObjectCommunities as a group communicationfeature and simplifies transactions as a particularly practicalsystem for small experiments. Practical experiments ofusing this language to implement distributed replicatedobjects have been very successful [11].

    We have recently implemented a prototype of theresolution mechanism and the CA action supporting systemin Ada 95 [1] (with the standard features of the DistributedAnnex) in order to identify and tackle implementation andperformance-related issues. We have chosen Ada 95 (theGNAT Ada 95 compiler, public release 3.04, on SunOS 5.4)because it is one of few standard OO languages that have

    features for distributed programming. Besides, its elaboratefeatures for concurrent programming, such as protectedobjects, asynchronous transfer of control, and conditionalentry calls, greatly simplify the task of programming therun time support and ensuring the data consistency.

    Fig. 6 shows the system architecture for our prototypeimplementation. For each given CA action, its participatingobjects, called roles, are located on separate computingnodes. Communication and interaction between these rolesare supported by a message passing subsystem (MPS).Received messages are first kept in a cyclic buffer beforebeing consumed. A run-time system (caaRTS) that supportsthe execution of CA actions is established, together with

    MPS. This support system is decentralized in the sense that

    1028 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 10, OCTOBER 2000

    Fig. 6. Architecture for a prototype implementation.

  • 8/8/2019 00888642

    11/14

    every distributed node has a copy of caaRTS. This basic CA

    action support offers the main CA action features: (nested)action entry points and exits, raising and signaling ofexceptions, abortion of (nested) actions, and calls tohandlers. In addition, we have also implemented a basicprotocol for participating objects to leave a CA actionsynchronously.

    Note that an exception may interrupt the normalcomputation or cause the abortion of the nested actions.We use the Ada 95 asynchronous transfer of control (ATC)to interrupt the action execution; the exception context ofeach CA action consists of the ATC blocks of its participat-ing objects. The exception context in a participating objecthas an abortion handler and a set of action handlers. Every

    partition has a copy of the resolution function and of theresolution tree so as to ensure that the handlers for the sameexception are called in all participating objects. The typescommon to all participating objects are declared in packagePure, which is used in compiling all packages; it includesthe names of all exceptions, the lists of participants of allactions, the types declaring all object states, and all types ofmessages.

    This prototype shows that the protocol is easy toimplement: The entire implementation has about 1,000lines of code, 800 of which form the partition executiveand only 300 of those deal with exception handling andresolution. The protocol fits well with the structure of the

    modern distributed systems. It demonstrates how toextend the basic CA action executive by just addingnew functionalities to it. The implemented protocol isgeneral and can be easily moved to other distributedsystems, perhaps with minor adjustment as to perfor-mance enhancement.

    A simple application using nested CA actions was runon the prototype implementation. The application programwas executed in a 20 times loop and the execution timemeasured (in seconds). One of the experiment sets was setup as follows: One role of a containing action raises anexception and its nested actions have to be aborted. Afurther exception is raised by the abortion handler. A high-

    order exception (covering both exceptions) is then identified

    and raised in all the roles. We varied three parameters,

    Tmmax, Tabort, and Treso, in order to examine different effectsof these factors on the execution time (where Tmmax is the

    maximum time of message passing between two concurrent

    execution threads in the system, Tabort is the maximum

    possible time for a thread to abort a nested action, and Tresois the upper bound of the time spent in resolving multiple

    exceptions).The values for these three parameters in our experiments

    are chosen based on two main considerations: 1) their

    reality and 2) our ability to analyze the overheads

    introduced by exception resolution by varying the para-

    meters. The values chosen for Tmmax correspond well to

    several types of practical distributed systems. The intervalof values for Tabort is decided based on the fact that the

    abortion of atomic actions in practice requires some

    access to discs in order to restore the state of affected

    objects. The values for parameter Tabort are chosen for the

    situations in which exception trees have a fairly large size

    so that any search in them can take a considerable time.

    For example, let Tmmax = 0.2s, Tabort = 0.1s, and Treso =

    0.3s; the execution of the application will take about

    94.36s. Table 1 presents some typical data with varying

    values of Tmmax, Tabort, and Treso.

    XU ET AL.: CONCURRENT EXCEPTION HANDLING AND RESOLUTION IN DISTRIBUTED OBJECT SYSTEMS 1029

    TABLE 1Performance-Related Results

    Fig. 7. Effect on total execution time.

  • 8/8/2019 00888642

    12/14

    The experimental data obtained are essentially consistentwith the theoretical analysis presented previously. Fig. 7illustrates the effect on the total execution time of theapplication system. When Tmmax is limited within 1.0s, thecost of message passing has a minor impact on the totalexecution time. The execution time will increase dramati-

    cally once the time of message passing becomes longer thanone second. On the other hand, with an increase in Treso orTabort, the total execution time has a very gentle and linearchange. This demonstrates, at least by this given prototypeimplementation, that the cost of message exchanges is stillof the major concern, while concurrent exception handlingdoes not introduce a high run-time overhead.

    Another set of our experiments is to compareAlgorithm 1 with the original CR algorithm developedin [5]. The CR algorithm was implemented by modifyingour algorithm appropriately, and the same applicationprogram was used to collect the related data. The totalexecution time was then measured with respect todifferent T

    resoand T

    mmax. Fig. 8 demonstrates the major

    change of the total execution time when varying Tmmaxwith Treso = 0.3s and when varying Treso with Tmmax =1.0s. A big difference in the execution time can beobserved even with a fixed N (N = 3 in our case). Inparticular, the difference becomes more obvious with anincrease of the time of message passing (see Fig. 8a). Theprocedure for exception resolution is called N (N 1) (N 2).

    5 CONCLUSIONS

    The concept of CA actions offers a general and convenientframework for designing distributed and concurrent

    OO software. This paper has focused on important technicaldetails of concurrent exception handling and resolutionunder the CA action framework (though the results aregenerally applicable to other atomic action schemes). Oursolutions are intended for a wide set of OO languages andfor practical systems that interact with their environments;such systems typically are incapable of simple backwardrecovery. The OO exception model developed in this paperextends and improves the models which may be found insequential OO languages and the nonconcurrent models forsome concurrent OO languages.

    How to correctly cope with nested CA actions inexceptional situations is a significant and delicate problem,

    especially in a distributed computing environment. In [5],

    the authors presented just a draft of their resolutionalgorithm, without discussing assumptions under whichthe algorithm may work. The semantics of the operation ofraising abortion exceptions in nested actions and of theresuming/suspending mechanism in such nested actionswere not addressed clearly. We have developed an abortion

    mechanism that coordinates recovery measures used in both participating objects of nested actions and externalatomic objects. A new distributed algorithm for exceptionresolution has been designed to handle concurrent raisingof multiple exceptions in interacting objects.

    We have applied our approach to realistic industrial

    applications by developing robust software to control

    different types of production cells. A production cell model,

    based on a metal-processing plant in Karlsruhe, Germany,

    was first created by the FZI (Forschungszentrum Informa-

    tik) in 1993 [15] in order to evaluate different formal

    methods and to explore their practicability for industrial

    applications. Since then, this case study has attracted wide

    attention and has been investigated by over 35 different

    research groups. In 1996, the FZI presented the specification

    of two extended models, called the Fault-Tolerant Produc-

    tion Cell [17] and Real-Time Production Cell [18]. We have

    recently designed and implemented two control software

    systems for these case studies [27], [31]. Analytical and

    experimental results have demonstrated the feasibility of

    our approach to concurrent exception handling and showed

    that the approach makes it possible to reason about

    complex exceptional behaviour of a system in a very

    disciplined and structured way.Future research directions would be in two primary

    areas in terms of the further development of the CA actionconcept. The first is the introduction of an appropriatelinguistic mechanism for specifying (nested) CA actions in adistributed environment. Second, it is important to inves-tigate the implementation-related issues with CA action'srun-time support mechanisms.

    ACKNOWLEDGMENTS

    This work was supported by the ESPRIT Long TermResearch Project 20072 on Design for Validation (DeVa)and IST Programme RTD Project 1999-11585 on DependableSystems of Systems (DSoS) and by the EPSRC Flexx Project

    on Flexible Software.

    1030 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 10, OCTOBER 2000

    Fig. 8. Performance-related comparison of Algorithm 1 and the CR algorithm.

  • 8/8/2019 00888642

    13/14

    REFERENCES[1] Information TechnologyProgramming LangauagesAda. Language

    and Standard Libraries, ISO/IEC 8652:1995(E), Intermetrics, Inc.,1995.

    [2] C. Atkinson, Object-Oriented Reuse, Concurrency and Distribution.Addison-Wesley, 1991.

    [3] A. Avizienis, The N-Version Approach to Fault-Tolerant Soft-ware, IEEE Trans. Software Eng., vol. 11, no. 12, pp. 1,491-1,501,

    Dec. 1985.[4] R. Balter, S. Lacourte, and M. Riveill, The Guide Language,

    Computer J., vol. 37, no. 6, pp. 521-530, 1994.[5] R.H. Campbell and B. Randell, Error Recovery in Asynchronous

    Systems, IEEE Trans. Software Eng., vol. 12, no. 8, pp. 811-826,Aug. 1986.

    [6] S. Chiba and T. Masuda, Designing an Extensible DistributedLanguage with a Meta-Level Architecture, Proc. European Conf.Object-Oriented Programming '93, pp. 482-501, 1993.

    [7] F. Cristian, Understanding Fault-Tolerant Software, Comm.ACM, vol. 34, no. 2, pp. 56-78, 1991.

    [8] F. Cristian, Exception Handling and Tolerance of SoftwareFaults, Software Fault Tolerance, M. Lyu, pp. 81-107, Wiley, 1995.

    [9] Q. Cui and J. Gannon, Data-Oriented Exception Handling, IEEETrans. Software Eng., vol. 18, no. 5, pp. 393-401, May 1992.

    [10] C. Dony, Exception Handling and Object-Oriented Program-ming: Towards a Synthesis, Proc. European Conf. Object-OrientedProgramming/Object-Oriented Programming, Systems, Languages, and

    Applications '90, pp. 322-330, 1990.[11] J. Fabre, V. Nicomette, T. Perennou, R.J. Stroud, and Z. Wu,

    Implementing Fault Tolerant Application Using ReflectiveObject-Oriented Programming, Proc. 25th Int'l Symp. Fault-Tolerant Computing, pp. 489-598, June 1995.

    [12] V. Issarny, An Exception Handling Mechanism for ParallelObject-Oriented Programming: Towards Reusable, Robust Dis-tributed Software, J. Object-Oriented Programming, vol. 6, no. 6,pp. 29-40, 1993.

    [13] P. Jalote, Using Broadcast for Multiprocess Recovery, Proc. SixthDistributed Computing Systems Symp., pp. 582-589, 1986.

    [14] P.A. Lee and T. Anderson, Fault Tolerance: Principles and Practice,second ed. Wien: Springer-Verlag, 1990.

    [15] C. Lewerentz and T. Lindner, Formal Development of ReactiveSystems: Case Study Production Cell. Springer-Verlag, 1995.

    [16] L. Liang, S.T. Chanson, and G.W. Neufeld, Process Groups andGroup Communications: Classifications and Requirements,Computer, vol. 23, no. 2, pp. 56-66, Feb. 1990.

    [17] A. Lo tzbeyer, Task Description of a Fault-Tolerant ProductionCell, version 1.6, available from http://www.fzi.de/prost/projects/korsys/korsys.html, 1996.

    [18] A. Lo tzbeyer and R. Mu hlfeld, Task Description of a FlexibleProduction Cell with Real Time Properties, ForschungszentrumInformatik, Karlsruhe, Germany (ftp.fzi.de/pub/prost/projects/korsys/task_descr_flex_2_2.ps.gz), 1996.

    [19] N.A. Lynch, M. Merrit, W.E. Wehil, and A. Fekete, AtomicTransactions. Morgan Kaufmann, 1993.

    [20] Software Fault Tolerance, M.R. Lyu, ed. Wiley, 1995.[21] P. Maes, Concepts and Experiments in Computational Reflec-

    tion, Proc. conf. Object-Oriented Programming, Systems, Languages,and Applications 87, ACM SIGPLAN Notices, vol. 22, no. 12,pp. 147-155, 1987.

    [22] B. Meyer, Eiffel: The Language. New York: Prentice Hall, 1992.[23] L.C. Paulson, ML for the Working Programmers. Cambridge Univ.

    Press, 1996.[24] B. Randell, System Structure for Software Fault Tolerance, IEEE

    Trans. Software Eng., vol. 1, no. 2, pp. 220-232, 1975.[25] A. Romanovsky, I. Shturtz, and V. Vassilyev, Designing Fault-

    Tolerant Objects in Object-Oriented Programming, Proc. SeventhInt'l Conf. Technology of Object-Oriented Languages and Systems(TOOLS EUROPE 92), pp. 199-205, 1992.

    [26] A. Romanovsky, J. Xu, and B. Randell, Exception Handling andResolution in Distributed Object-Oriented Systems, Proc. 16thIEEE Int'l Conf. Distributed Computing Systems, pp. 545-552, May1996.

    [27] A. Romanovsky, J. Xu, and B. Randell, Coordinated ExceptionHandling in Real-Time Distributed Object Systems, Int'l J.Computer Systems Science and Eng., special issue on object-oriented

    real-time distributed systems, vol. 14, no. 4, pp. 197-208, 1999.

    [28] C.M.F. Rubira, Structuring Fault-Tolerant Object-Oriented Sys-tems Using Inheritance and Delegation, PhD thesis, Univ. ofNewcastle upon Tyne, 1994.

    [29] J. Xu, B. Randell, A. Romanovsky, C. Rubira, R.J. Stroud, and Z.Wu, Fault Tolerance in Concurrent Object-Oriented Softwarethrough Coordinated Error Recovery, Proc. 25th Int'l Symp. Fault-Tolerant Computing, pp. 499-508, June 1995.

    [30] J. Xu, B. Randell, C.M.F. Rubira-Calsavara, and R.J. Stroud,Toward an Object-Oriented Approach to Software Fault Toler-

    ance, Recent Advances in Fault-Tolerant Parallel and DistributedSystems, D.K. Pradhan and D.R. Avresky, eds., IEEE CS Press,pp. 226-233, Sept. 1995.

    [31] J. Xu, B. Randell, A. Romanovsky, R.J. Stroud, A.F. Zorzo, E.Canver, and F. von Henke, Rigorous Development of a Safety-Critical System Based on Coordinated Atomic Actions, Proc. 29thInt'l Symp. Fault-Tolerant Computing, pp. 68-75, June 1999.

    [32] S.M. Yang and K.H. Kim, Implementation of the ConversationScheme in Message-Based Distributed Computer Systems, IEEETrans. Parallel and Distributed Systems, vol. 3, no. 5, pp. 555-572,Sept. 1992.

    Jie Xu received the PhD degree from theUniversity of Newcastle upon Tyne in advancedfault-tolerant software. He is a lecturer incomputer science at the University of Durham,United Kingdom. From 1990 to 1998, Dr. Xu waswith the Computing Laboratory at Newcastle,where he was promoted to senior researcher in1995. He moved to a lectureship at Durham in1998 and cofounded the Distributed SystemsEngineering Group and the DPART Laboratory

    supporting highly dependable distributed computing. Dr. Xu haspublished more than 100 book chapters and research papers in areasof system-level fault diagnosis, exception handling, software faulttolerance, and large-scale distributed applications. He has beeninvolved in several research projects on dependable distributedcomputing systems, including two EC-sponsored ESPRIT BRA projectsand one ESPRIT Long Term Research project. He is principalinvestigator of the FTNMS project on fault-tolerant mechanisms formultiprocessors and co-Investigator of the EPSRC Flexx project onhighly dependable and flexible software. Dr. Xu is an editor of IEEEDistributed Systems OnLine, chair of the IEEE SRDS Workshop onReliable Distributed Object Systems, and tutorial speaker for the IEEE/IFIP International Conference on Dependable Systems and Networks.He has given invited lectures in international colloquiums and served assession chair and PC member of various international conferences andworkshops.

    XU ET AL.: CONCURRENT EXCEPTION HANDLING AND RESOLUTION IN DISTRIBUTED OBJECT SYSTEMS 1031

  • 8/8/2019 00888642

    14/14

    Alexander Romanovsky received an MScdegree in applied mathematics from MoscowState University in 1976 and a PhD degree incomputer science from St. Petersburg StateTechnical University in 1988. He was with thisuniversity from 1984-1996, doing research andteaching. In 1991, he worked as a visitingresearcher at ABB Ltd Computer ArchitectureLab Research Center, Switzerland. In 1993, he

    was a visiting researcher at the Istituto diElaborazione della Informazione, CNR, Pisa, Italy. In 1993-1994, hewas a postdoctoral fellow in the Department of Computing Science, theUniversity of Newcastle upon Tyne (United Kingdom). In 1992-1998, hewas involved in the Predictably Dependable Computing SystemsESPRIT Basic Research Action and the Design for Validation ESPRITBasic Project. In 1998-2000, he worked on the Diversity in Safety CriticalSoftware (DISCS) EPSRC/UK Project. He is now a senior researchassociate with the Department of Computing Science, University ofNewcastle upon Tyne, working on the Dependable Systems of SystemsEC IST RTD Project. Dr. Romanovsky coedited the IEEE Transactionson Software Engineering special issue on current trends in exceptionhandling and organized ECOOP 2000 Workshop on Exception Handlingin Object-Oriented Systems; he is now a program committee member ofthe Fourth IEEE International Symposium on Object-Oriented Real-Time Distributed Computing. His research interests include softwarefault tolerance, software diversity, concurrent programming, concurrent

    object-oriented and object-based languages, real time systems, excep-tion handling, operating systems, software engineering, and distributedsystems. He has coauthored more than 120 scientific papers, bookchapters, reports, and a patent in these and related areas.

    Brian Randell graduated in mathematics fromImperial College, London, in 1957 and joined theEnglish Electric Company, where he led a teamwhich implemented a number of compilers,including the Whetstone KDF9 Algol compiler.From 1964 to 1969, he was with IBM, mainly atthe IBM T.J. Watson Research Center in theUnited States, working on operating systems, thedesign of ultra-high speed computers, and

    system design methodology. He then became aprofessor of computing science at the University of Newcastle uponTyne, where, in 1971, he set up the project which initiated research intothe possibility of software fault tolerance, and introduced the recoveryblock concept. Subsequent major developments included the New-castle Connection and the prototype Distributed Secure System. He hasbeen Principal Investigator on a succession of research projects inreliability and security funded by the Science Research Council (nowEngineering and Physical Sciences Research Council), the Ministry ofDefence, and the European Strategic Programme of Research inInformation Technology (ESPRIT). Most recently, he has had the role ofProject Director of DeVa, the ESPRIT Long Term Research Project onDesign for Validation, and of CaberNet, the ESPRIT Network ofExcellence on Distributed Computing Systems Architectures, and isnow leading an IST Research Project on Malicious- and Accidental-FaultTolerance for Internet Applications (MAFTIA) and an IST RTD Projecton Dependable Systems of Systems (DSOS). He has published nearly

    200 technical papers and reports and is the coauthor or editor of sevenbooks.

    1032 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 10, OCTOBER 2000