Reliability Prediction for Fault-Tolerant Software Architectures

Franz Brosch¹, Barbora Buhnova², Heiko Koziolek³, Ralf Reussner⁴

¹ Research Center for Information Technology (FZI), Karlsruhe, Germany
² Masaryk University, Brno, Czech Republic
³ Industrial Software Systems, ABB Corporate Research, Ladenburg, Germany
⁴ Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany


ABSTRACT

Software fault tolerance mechanisms aim at improving the reliability of software systems. Their effectiveness (i.e., reliability impact) is highly application-specific and depends on the overall system architecture and usage profile. When examining multiple architecture configurations, such as in software product lines, it is a complex and error-prone task to include fault tolerance mechanisms effectively. Existing approaches for reliability analysis of software architectures either do not support modelling fault tolerance mechanisms or are not designed for an efficient evaluation of multiple architecture variants. We present a novel approach to analyse the effect of software fault tolerance mechanisms in varying architecture configurations. We have validated the approach in multiple case studies, including a large-scale industrial system, demonstrating its ability to support architecture design, and its robustness against imprecise input data.

Categories and Subject Descriptors

D.2.11 [Software]: SOFTWARE ENGINEERING - Software Architectures; D.2.4.g [Software]: SOFTWARE ENGINEERING - Software/Program Verification - Reliability

General Terms

Software Engineering, Reliability, Design

Keywords

Component-Based Software Architectures, Reliability Prediction, Fault Tolerance, Software Product Lines

1. INTRODUCTION

Software fault tolerance (FT) mechanisms mask faults in software systems and prevent them from resulting in a failure. FT mechanisms are established on different abstraction levels, such as exception handling on the source code level, watchdog and heart-beat as design patterns, and replication on the architecture level [20, 18]. FT mechanisms are commonly used to improve the reliability of software systems (i.e., the probability of failure-free operation in a given time span). The effect of an FT mechanism (i.e., the extent of such an improvement) is, however, non-trivial to quantify because it highly depends on the application context.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. QoSA '11, Boulder, Colorado, USA. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

The challenge of assessing FT mechanisms in different contexts becomes particularly apparent on the architecture level, when evaluating different architectural configurations, and even more so in the design of software product lines (SPL) [7]. An SPL has a core assembly of software components and variation points where application-specific software can be filled in. Thus, it can result in many different products all sharing the same core with different configurations at the variation points.

Existing approaches for reliability prediction [11, 13, 15] either do not support modelling FT mechanisms (e.g., [5, 8, 12, 21]) or do not allow for explicit definition and reuse of core modelling artefacts, and hence are difficult to apply in varying contexts, as with architecture design and SPLs (e.g., [24, 26]).

The contribution of this paper is an approach to analyse the effect of a software fault tolerance mechanism depending on the overall system architecture and usage profile. The approach is novel as it (i) takes software fault tolerance mechanisms explicitly into account, and (ii) reuses model parts for effective evaluation of architectural alternatives or system configurations. The approach is well suited for software product lines, which are used to formulate and illustrate the approach. It builds upon the Palladio Component Model and an associated reliability prediction approach [4], which includes both software reliability and hardware availability as influencing factors. Our tool support allows architects to design the architecture with UML-like models, which are automatically transformed to Markov-based prediction models and evaluated to determine the expected system reliability.

The remainder of this paper is structured as follows. Section 2 outlines our approach and explains the steps involved. Section 3 details the models used in our approach, and Section 4 explains how these models are formalised and analysed to predict the system reliability. Section 5 evaluates the approach on two case studies. Section 6 delimits our work from related approaches. Finally, Section 7 draws conclusions and sketches future directions.

2. PREDICTION PROCESS

This section outlines our reliability prediction approach for fault-tolerant software architectures, software families, and software product lines (SPL) in particular. According to Clements et al. [7], an SPL is defined as “a set of software-intensive systems that share a common, managed set of features satisfying the specific needs of a particular market segment or mission and that are developed from a common set of core assets in a prescribed way”. Our modelling approach provides support only for the technical part of an SPL or software family, as we assume that domain engineering and asset scoping have been performed before modelling the architecture and reusable components.

Our approach iteratively follows eight steps depicted in Fig. 1. First, the software architect creates a CoreAssetBase, which models interfaces, reusable software components, and their (abstract) behaviour (step 1). The CoreAssetBase is enriched with software failure probabilities of actions forming component (service) behaviour (step 2). Afterwards, the software architect can include different FT mechanisms, such as recovery blocks (explained in Section 3.2) or redundancy (step 3), either as additional components or directly into already modelled component behaviours. The FT mechanisms allow for different configurations, e.g., the number of retries or replicated instances.

Figure 1: Process activities and artifacts
Steps: 1. Model components, interfaces, and behaviour; 2. Model failure probabilities in behaviours; 3. Model fault tolerance, adjust configurations; 4. Model resource environment, HW availability; 5. Model products, allocation, usage; 6. Model transformation; 7. Markov chain analysis; 8. Product instantiation (if results are OK).
Artifacts: Core Asset Base, Resource environments, Products, DTMCs.

The software architect then creates a ResourceEnvironment to model hardware resources (step 4) and specific Products (step 5), including component allocation and system usage information. Combined with the CoreAssetBase, these models are transformed into multiple discrete-time Markov chains (step 6), from which system reliability predictions and sensitivity analyses can be deduced (step 7). If the prediction results do not satisfy the reliability requirements, the FT mechanisms can be reconfigured and/or the resource environment and products adjusted iteratively. Otherwise, the modelled products are deemed sufficient for the requirements, and the product can safely be instantiated from the core asset base (step 8). The following two sections describe the models and the predictions in detail.

3. RELIABILITY MODELLING

This section introduces our meta model for describing the reliability characteristics of software product lines, which are used to formulate our approach. The model and its reliability solver are implemented using the Eclipse Modeling Framework (EMF) and build upon the Palladio Component Model (PCM) [2]. An excerpt of our meta model is depicted in Fig. 2; for a full documentation, refer to our website¹.

¹ http://sdqweb.ipd.kit.edu/wiki/ReliabilityPrediction

The model allows expressing variation points on different levels:

• architectural level: using composite components to encapsulate subsystems and enable their replacement
• component level: selecting different component implementations for a given specification
• component implementation level: using component parameters with values for specific product configurations
• resource level: using allocation references expressing different deployment schemes for product line configurations

Our model is better suited for our purposes than UML extended with the MARTE-DAM profile [3] because it allows model reuse through the core asset base and is reduced to concepts needed for the prediction. The following sections explain the core asset base (Section 3.1), the fault-tolerance mechanisms (Section 3.2), the resource environment (Section 3.3) and the product (Section 3.4) in detail.

3.1 Core Asset Base

The modelling element CoreAssetBase of our meta model (cf. Fig. 2) represents a repository for elements assembled into products and contains ComponentTypes, Interfaces, and FailureTypes. ComponentTypes can either be atomic PrimitiveComponents or hierarchically structured CompositeComponents with nested inner components.

Composite components allow the core asset base to contain whole architecture fragments (e.g., SPL core assemblies) that can be reused in different products. Such a core assembly can have optionally required interfaces as variation points. Component types are associated with interfaces through ProvidedRoles or RequiredRoles, and can export ComponentParameters that allow for implementation-level variation points. Fig. 3 shows an excerpt of an instance of a core asset base including a composite component (4) and a component parameter (value).

For reliability analyses, the model requires constructs to express the behaviour of component services in terms of using hardware resources and calling other components. Therefore, a component type can contain a number of ServiceBehaviours that specify the actions executed upon calling a specific Signature of one of the component’s provided interfaces. The behaviour may consist of InternalActions, ExternalCallActions, and control flow constructs, such as probabilistic branches and loops.

An internal action represents a component-internal computation and can contain multiple FailureOccurrences and ResourceDemands. FailureOccurrences model software failures during the execution of a service using a probability. These failure probabilities can be determined with different techniques [11, 5, 17], such as reliability growth modelling, defect prediction based on code metrics, statistical testing, or fault injection.

An external call action represents a call to other components and thus references a Signature contained in one of the required interfaces of the current component. Fig. 4 shows a small example of a service behaviour containing internal actions and external call actions. Notice that the transitions of the BranchAction are labelled with input parameter dependencies (e.g., P(X ≤ 1)), because the concrete values for the input parameters (e.g., X) are not known to the component developer providing the specification. The component developer should keep the component as reusable as possible, i.e., make no assumptions about the usage profile. The developer specifies parameter dependencies as an influencing factor on the component reliability. The dependencies are automatically resolved during system analysis, when the usage profile is known. Hence, our approach explicitly considers the influence of the given system-level usage profile on the system reliability by propagating the system-level input parameters through the component-based architecture [4].

Figure 2: Excerpt of our meta model for specifying reliability characteristics in a software product line
Attributes shown in the excerpt: FailureOccurrence (probability), ProcessingResource (mttf, mttr), LinkResource (failProb), UserCall (probability), ResourceDemand (demand).

Figure 3: Example instance of a core asset base

Figure 4: Service behaviour example (partial view)

3.2 Fault-tolerance Meta-Model

To model fault tolerance within component services, we use the concept of recovery blocks, which is analogous to exception handling in object-oriented programming. It is represented with a RecoveryBlockAction that contains an acyclic sequence of at least two RecBlockBehaviours. The first behaviour models the normal execution within a service, while the following behaviours handle failures of certain types and initiate alternative actions (analogous to try and catch blocks in exception handling). Each behaviour contains an inner sequence of AbstractActions that again can contain any type of action and even nested recovery blocks.

Figure 5: Recovery-block example and its semantics

RecBlockBehaviour | Handled failures | Possible failures
1                 | none             | A, B, C, D
2                 | B, C             | B, C
3                 | C, D             | B, D

Consider the example in Fig. 5 of a recovery block action with three RecBlockBehaviours and four different failure types A, B, C, and D. During the first behaviour, all failure types can occur. Failures of type A cannot be recovered and lead to a failure of the whole block. The second behaviour handles failures of types B and C, and the third behaviour handles failures of types C and D. In the second behaviour, failures of types B and C are again possible: failures of type B then lead to a failure of the block, while failures of type C (like failures of type D from the first behaviour) are handled by the third behaviour. Notice that failures of type C cannot escape from this recovery block, as they are handled by behaviour 3 and cannot occur again during the execution of behaviour 3.

As a RecBlockBehaviour can contain arbitrary behavioural constructs, such as internal actions, calls, branches, loops, and nested recovery block actions, recovery block actions are a flexible means to model FT mechanisms. They allow modelling exception handling, checkpoint and restart, process pairs, recovery blocks, or consensus recovery blocks, and therefore enable reliability predictions for a large class of existing FT mechanisms. If an external call action is embedded into a RecBlockBehaviour, errors from the called component (and any other component further down the call stack) can be handled. The case studies in Section 5 show different possible usages of recovery block actions.

3.3 Resource Environment

A ResourceEnvironment contains ResourceContainers modelling server nodes and LinkResources modelling network connections. Each resource container can contain a number of ProcessingResources like CPUs or hard disks. To include hardware availability in the calculation of the system reliability, each processing resource contains a mean time to failure (MTTF) and a mean time to repair (MTTR) attribute. These values can be determined from vendor specifications and/or former experience [23]. Link resources contain failure probabilities for network problems, which can be determined using simple test series.

A resource environment model can be reused across different Products using allocation references (i.e., mappings of components to resources). Furthermore, it is possible to have multiple resource environments (e.g., different server sizes or high availability servers), which then constitute an additional variation point or feature of the product line. Components in the core asset base are decoupled from concrete resource environments, because they only refer to abstract ProcResourceTypes, but not to concrete ProcessingResources. Thus, it is possible to connect a product to a specific resource environment through an allocation reference without the need to alter the core asset base.

3.4 Product

A Product contains a number of ComponentInstances wired through AssemblyConnectors and accessed through a UserInterface. Only components with matching interfaces may be composed in a product. It is possible to connect different component instances complying with the same interfaces at different points in the architecture, thus implementing architecture-level variation points. Through the compositions, the overall system behaviour is defined as a connection of all service behaviours (i.e., after composition, external call actions in service behaviours can be replaced by the service behaviours of the called services).

Component instances can contain ComponentConfigurations that realize implementation-level variation points. We introduced these parameters in our former work [4]. They are data values that can model different configurations or features of a component implementation and change component behaviour.

Component instances within a Product must be allocated to ResourceContainers through an allocation reference that models a deployment on a certain server. Thus, the component reliability depends on the availability of the server’s hardware resources employed by the component instance. The ProductUsage contains a number of UserCalls with different probabilities, which model the system usage profile of a product.

4. RELIABILITY PREDICTION

This section first describes how the system reliability (under a specified usage profile) is calculated on the software layer (Section 4.1) and then integrated with calculations on the hardware and network layers (Sections 4.2, 4.3).

4.1 Software Layer

For the software layer, the approach derives the reliability from a Markov model that reflects all possible execution paths through a Product architecture, and their corresponding probabilities.

Figure 6: Markov model for service behaviours

Fig. 6 shows how a ServiceBehaviour represented through a sequence of $n$ AbstractActions is transformed into an absorbing discrete-time Markov chain. The chain contains an initial state $I$, an absorbing success state $S$, one state $A_i$ for each action, and one absorbing failure state $F_j$ for each of the $m$ user-defined software FailureTypes. A transition from $A_i$ to $F_j$ denotes that action $i$ might exhibit a failure of type $j$ upon execution, with a probability of $fp_j(A_i)$. The probability of failure of the whole behaviour $fp(Beh)$ is the probability to reach any of the failure states $F_j$ (and not the success state $S$) from the initial state $I$:

$$fp(Beh) = 1 - \prod_{i=1}^{n} \Bigl(1 - \sum_{j=1}^{m} fp_j(A_i)\Bigr)$$

For the success probability of the behaviour $sp(Beh)$, we have:

$$sp(Beh) = 1 - fp(Beh)$$
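The sequence formula can be sketched in a few lines of Python. This is a minimal illustration, not the authors' solver: the function names and the dictionary representation of an action (failure type mapped to its probability) are assumptions made here. The per-type probability $fp_j(Beh)$ is not given explicitly in the text; the second function derives it from the chain structure in Fig. 6, where action $A_i$ is only reached if all earlier actions succeeded.

```python
from math import prod


def behaviour_failure_probability(actions):
    """fp(Beh) for a sequence of actions.

    `actions` is a list of dicts mapping a failure type to fp_j(A_i)
    (hypothetical representation chosen for this sketch).
    """
    return 1.0 - prod(1.0 - sum(fp.values()) for fp in actions)


def behaviour_failure_by_type(actions):
    """Probability of being absorbed in each failure state F_j.

    Derived from the chain structure: action A_i is reached with the
    product of the success probabilities of all earlier actions.
    """
    reach = 1.0  # probability of reaching the current action
    result = {}
    for fp in actions:
        for failure_type, p in fp.items():
            result[failure_type] = result.get(failure_type, 0.0) + reach * p
        reach *= 1.0 - sum(fp.values())
    return result


# Illustrative values only (not taken from the paper).
example = [{"A": 0.1}, {"A": 0.1, "B": 0.2}]
total = behaviour_failure_probability(example)
per_type = behaviour_failure_by_type(example)
```

The per-type probabilities sum to the total failure probability, which is a quick consistency check on both functions.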

The calculation of $fp_j(A_i)$ depends on the type of the action $A_i$. For InternalActions, the failure probabilities $fp_j(A_{internal})$ are given as a direct input to the model (cf. Fig. 2). LoopActions, BranchActions, ExternalCallActions, and RecoveryBlockActions have nested ServiceBehaviours $Beh$ (again sequences of AbstractActions), which need to be evaluated in a recursive step first. For loops, we have:

$$fp_j(A_{loop}) = \sum_{i=1}^{k} \Bigl( P(c_i) \cdot \sum_{l=0}^{c_i - 1} \bigl( sp(Beh)^{l} \cdot fp_j(Beh) \bigr) \Bigr)$$

with a finite set of loop iteration counts $\{c_1, \dots, c_k\} \subseteq \mathbb{N}$, each with its probability of occurrence $P(c_i)$. For branches, we have:

$$fp_j(A_{branch}) = \sum_{i=1}^{k} \bigl( P(Beh_i) \cdot fp_j(Beh_i) \bigr)$$

with a finite set of nested behaviours $\{Beh_1, \dots, Beh_k\}$ and their probabilities of occurrence $P(Beh_i)$. An ExternalCallAction fails if the called behaviour fails:

$$fp_j(A_{call}) = fp_j(Beh)$$
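The loop and branch formulas translate directly into code. The following Python sketch assumes that $sp(Beh)$ and $fp_j(Beh)$ of the nested behaviours have already been computed in the recursive step; the function names and argument shapes are hypothetical and chosen only for illustration.

```python
def loop_failure_probability(iteration_dist, sp_beh, fpj_beh):
    """fp_j(A_loop) for a loop whose body has success probability `sp_beh`
    and type-j failure probability `fpj_beh`.

    `iteration_dist` maps each iteration count c_i to its probability P(c_i).
    The inner sum adds the probability of failing in iteration l + 1 after
    l successful iterations, for l = 0 .. c_i - 1.
    """
    return sum(
        p_count * sum(sp_beh ** l * fpj_beh for l in range(count))
        for count, p_count in iteration_dist.items()
    )


def branch_failure_probability(branches):
    """fp_j(A_branch) for a list of (P(Beh_i), fp_j(Beh_i)) pairs."""
    return sum(p_branch * fpj for p_branch, fpj in branches)


# Illustrative values only (not taken from the paper).
loop_fp = loop_failure_probability({2: 1.0}, sp_beh=0.9, fpj_beh=0.1)
branch_fp = branch_failure_probability([(0.3, 0.1), (0.7, 0.2)])
```

With a loop that always runs twice, the result is the failure probability of the first iteration plus that of the second iteration weighted by the success of the first, which matches the sequence formula for two identical actions.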

Recovery blocks are characterized through a sequence of nested behaviours $[Beh_1, \dots, Beh_k]$. Having $fp_j(Beh_i)$ for every behaviour and failure type, each $Beh_i$ can be represented with a trivial Markov model $T(Beh_i)$, as illustrated in Fig. 7, and the recovery-block model is constructed as their combination (following the semantics illustrated in Fig. 5).

Figure 7: Markov models for recovery behaviours

For each model $T(Beh_i)$, let $I_i$ be its initial state, $S_i$ its success state, and $F_{ij}$ its failure state for each failure type $j$. To connect the isolated trees into a single Markov chain, we add $m + 2$ states, namely $I$, $S$, and $F_j$ for each of the $m$ failure types $j$, and the following transitions:

• $I \xrightarrow{1.0} I_1$,
• $S_i \xrightarrow{1.0} S$ for each $i \in \{1, \dots, k\}$,
• $F_{ij} \xrightarrow{1.0} I_x$, where $x \in \{i + 1, \dots, k\}$ is the index of the closest tree handling failure type $j$, if such a tree exists,
• and if not, $F_{ij} \xrightarrow{1.0} F_j$ for each $F_{ij}$ with no outgoing transitions.

Finally, the failure probabilities $fp_j(A_{recovery})$ are computed as the probability of reaching $F_j$ from $I$ (via summation of the multiplied probabilities over available paths).
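Because failures are only ever forwarded to a later behaviour, the chain is acyclic and the absorption probabilities can be accumulated in a single forward pass. The Python sketch below illustrates this under stated assumptions: the function name and the `(failure probabilities, handled types)` tuple layout are invented here, and the numeric probabilities are illustrative only. The handled and possible failure types follow the example of Fig. 5.

```python
def recovery_block_probabilities(behaviours):
    """Return (success probability, {failure type: fp_j(A_recovery)}).

    `behaviours` is a list of (fp, handles) pairs, where `fp` maps a failure
    type to fp_j(Beh_i) and `handles` is the set of types Beh_i handles.
    """
    k = len(behaviours)
    enter = [0.0] * k  # probability mass arriving at I_i
    enter[0] = 1.0     # transition I -> I_1 with probability 1.0
    success = 0.0
    failed = {}
    for i, (fp, _handles) in enumerate(behaviours):
        mass = enter[i]
        success += mass * (1.0 - sum(fp.values()))  # S_i -> S
        for failure_type, p in fp.items():
            # Closest later behaviour that handles this failure type, if any.
            handler = next(
                (x for x in range(i + 1, k) if failure_type in behaviours[x][1]),
                None,
            )
            if handler is None:
                # F_ij -> F_j: the failure leaves the recovery block.
                failed[failure_type] = failed.get(failure_type, 0.0) + mass * p
            else:
                # F_ij -> I_x: the failure is handled by behaviour x.
                enter[handler] += mass * p
    return success, failed


# Structure from Fig. 5; the probability 0.1 per failure is hypothetical.
fig5 = [
    ({"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.1}, set()),
    ({"B": 0.1, "C": 0.1}, {"B", "C"}),
    ({"B": 0.1, "D": 0.1}, {"C", "D"}),
]
success, failed = recovery_block_probabilities(fig5)
```

For this input, no probability mass ends in a type C failure, in line with the observation in Section 3.2 that failures of type C cannot leave the recovery block.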

Based on the described calculations, the success probability of the topmost behaviour $sp(Beh_T)$ (sequence of system services invoked by the user) can be calculated in a recursive way, yielding the system-level reliability with respect to the software layer.

4.2 Hardware Layer

For ProcessingResources, the approach employs a hardware availability model, and integrates this model with the software layer for a combined consideration of software and hardware failures. Having the $MTTF_i$ and $MTTR_i$ values for all resources $\{r_1, \dots, r_p\}$, we first calculate the steady-state availability $A_i$ of $r_i$:

$$A_i = \frac{MTTF_i}{MTTF_i + MTTR_i}$$

and interpret it as the probability of $r_i$ being available when requested at an arbitrary point in time. Furthermore, we define the set of physical system states as the set $\{s_1, \dots, s_q\}$ where each $s_j$ is a combination of the individual resource states. We distinguish two possible states for each resource, being either available (OK) or unavailable (FAIL). Assuming independent hardware failures, the probability $P(s_j)$ for the system to be in state $s_j$ is the product of the individual state probabilities. For example, for a system with 2 resources, the probability for $r_1$ being OK and $r_2$ being FAIL is:

$$P\bigl((r_1 = OK) \wedge (r_2 = FAIL)\bigr) = A_1 \cdot (1 - A_2)$$
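The enumeration of physical system states can be sketched as follows. With two states per resource, $p$ resources yield $2^p$ combinations. The function names and the tuple-of-booleans state encoding are assumptions made for this Python illustration, and the availabilities in the example are arbitrary.

```python
from itertools import product


def availability(mttf, mttr):
    """Steady-state availability A_i = MTTF_i / (MTTF_i + MTTR_i)."""
    return mttf / (mttf + mttr)


def physical_state_probabilities(availabilities):
    """Enumerate all OK/FAIL combinations with their probabilities.

    `availabilities` maps a resource name to A_i. Returns the ordered list
    of resource names and a dict mapping each state (a tuple of booleans,
    True meaning OK, in that order) to P(s_j). Independent hardware
    failures are assumed, as in the text.
    """
    names = sorted(availabilities)
    states = {}
    for combo in product([True, False], repeat=len(names)):
        p = 1.0
        for name, ok in zip(names, combo):
            p *= availabilities[name] if ok else 1.0 - availabilities[name]
        states[combo] = p
    return names, states


# Illustrative values only: r1 with A = 0.9, r2 with A = 0.8.
names, states = physical_state_probabilities(
    {"r1": availability(9.0, 1.0), "r2": availability(8.0, 2.0)}
)
p_r1_ok_r2_fail = states[(True, False)]  # A_1 * (1 - A_2)
```

The state probabilities sum to one, since the states partition all possible resource conditions.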

Figure 8: Combined HW/SW consideration

Fig. 8 illustrates the combined consideration of the software and the hardware layer. Upon a user call, the system might be in any of the physical system states $s_j$. This is reflected by a transition from the initial state $I$ to $s_j$ with the corresponding state probability $P(s_j)$. Being in $s_j$, the execution might either fail due to an unavailable hardware resource accessed by system control flow ($s_j$ to $R_i$), or with a software failure (represented through transitions from $s_j$ to $F_k$), or it might succeed ($s_j$ to $S$).

To calculate the failure probabilities $fp_k(Beh_T|s_j)$, $k \in \{1, \dots, m + p\}$, we need to incorporate the failures due to hardware unavailability into the software layer model (as shown in Fig. 6). To this end, we set the hardware-caused failure probability $fp_k(Beh_T|s_j)$, $k > m$, of InternalActions to 1.0 if they require a ProcessingResource that is not available under the given physical system state $s_j$. If an internal action requires two or more unavailable resources, the failure probability is distributed evenly among these. In this manner, the software layer is evaluated independently for each physical system state, and the overall success probability of service execution (represented by its topmost behaviour $Beh_T$) is computed as the weighted sum over the success probabilities of all physical system states:

$$sp(Beh_T) = \sum_{j=1}^{q} \bigl( P(s_j) \cdot sp(Beh_T|s_j) \bigr)$$

where for each physical system state we have:

$$sp(Beh_T|s_j) = 1 - \sum_{k=1}^{m+p} fp_k(Beh_T|s_j)$$

Thanks to the separation of service behaviour according to the individual physical system states, the approach can better reflect the correlation of subsequent resource demands (i.e., a resource accessed twice within a single service execution is likely to be either OK in both cases or FAIL in both cases), caused by significantly longer resource failure and repair times compared to the execution time of a single user-triggered service [22].
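The weighted sum over physical system states, and the correlation effect just described, can be illustrated with a small Python sketch. It is deliberately simplified and rests on assumptions not made in the text: the topmost behaviour is a plain sequence of internal actions, each given as a hypothetical pair of total software failure probability and required resource names, and a hardware failure is taken to pre-empt any software failure of the same action. The even distribution of hardware failures among several unavailable resources is omitted because it does not affect the success probability.

```python
from itertools import product
from math import prod


def system_success_probability(actions, availabilities):
    """sp(Beh_T) as the weighted sum of sp(Beh_T | s_j) over all states.

    `actions`: list of (software failure probability, set of required
    resource names). `availabilities`: resource name -> A_i.
    """
    names = sorted(availabilities)
    total = 0.0
    for combo in product([True, False], repeat=len(names)):
        up = {name for name, ok in zip(names, combo) if ok}
        p_state = prod(
            availabilities[n] if n in up else 1.0 - availabilities[n]
            for n in names
        )
        sp_state = 1.0
        for sw_fp, required in actions:
            if not required <= up:
                # A required resource is unavailable in this state, so the
                # action fails with probability 1.0.
                sp_state = 0.0
                break
            sp_state *= 1.0 - sw_fp
        total += p_state * sp_state
    return total


# Illustrative values only: the CPU is accessed twice in one execution.
actions = [(0.1, {"cpu"}), (0.2, {"cpu"})]
sp = system_success_probability(actions, {"cpu": 0.9, "hdd": 0.8})
```

In this example the CPU availability of 0.9 enters the result once (0.9 · 0.9 · 0.8 = 0.648), whereas treating the two accesses as independent would count it twice (0.9² · 0.9 · 0.8 ≈ 0.583) and underestimate the success probability.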

4.3 Network Layer

To incorporate the network layer into the model, we assume that each call routed over a LinkResource involves two message transports (request and return), and that each transport might fail with the given failure probability of the link $fp(L)$. We adapt the software layer model (see Section 4.1) by a differentiation of ExternalCallActions into local calls and remote calls. We keep the failure probabilities for local calls:

$$fp_j(A_{localCall}) = fp_j(Beh)$$

but incorporate the link failure probability into the calculation for remote calls:

$$fp_j(A_{remoteCall}) = fp_j(Beh) \cdot fp(L)^2$$

Thus, we enable coarse-grained consideration of network reliability, without going into the details of more sophisticated network simulations, which are out of scope of this paper.

5. EVALUATION

5.1 Goals and Setting

This section serves to validate our reliability and fault tolerance modelling approach. The goals of the validation are (i) to demonstrate how the new FT modelling techniques can support design decisions, (ii) to provide a rationale for the validity of our models and their resilience to imprecise input parameters, and (iii) to show the effectiveness of our models for SPLs.

Regarding goal (ii), validating reliability prediction against measured values is inherently difficult, as failures are rare events, and the necessary time to observe a statistically relevant number of failures is infeasibly long for high-reliability systems. Several existing approaches therefore limit their validation to demonstrating examples and sensitivity analyses (e.g., [8, 10, 12, 22, 25]), showing how the approaches can be used to learn about system failure behaviour, and proving the robustness of prediction results against imprecise input data at design time. A number of authors involve real-world industrial software systems in their validation (e.g., [5, 16, 26]). We follow the same path, and additionally compare our numerically calculated predictions against a more realistic, but also more time-consuming queueing network simulation [4], in order to at least partially validate prediction accuracy. The simulation has fewer assumptions than the analytical solution. It takes system execution times (encoded into ResourceDemands) into account and lets resources fail and be repaired according to their MTTF and MTTR, not based on the simplified steady-state availability.

We have validated our approach on a number of different systems: a distributed business reporting system, the Common Component Modelling Example (CoCoME), a web-based media store product line [2], an industrial control system product line [17], and the SLA@SOI Open Reference Case. The models for these systems can be retrieved from our website¹. In the following, we describe the predictions for the web-based media store (Section 5.2) and the industrial control system (Section 5.3) in detail. The media store allows us to present all reliability predictions, while the predictions for the industrial control system have been obfuscated for confidentiality reasons.

5.2 Case Study I: Web-based Media Store

The media store model is inspired by common web-service-based data storage solutions and has similar functionality to the iTunes Music Store [2]. The system provides a centralized storage for media files, such as audio or video files, and a corresponding up- and download functionality. The media store product line contains three standard product configurations: standard, comfort, and power. More configurations are possible by instantiating the feature model in Fig. 9.

Figure 9: Feature model of the media store SPL
Features shown under the root Media Store: UserInteractionChoice, FileLoaderChoice, EncoderChoice, DataAccessChoice, UserInteraction, UserInteractionComfort, UserInteractionFT, FileLoader, FileLoaderPower, FileLoaderFT, Encoder, EncoderPower, EncoderFT, DataAccess, DataAccessFT, CacheHitRate. Legend: Mandatory, Optional, Or, Alternative.

Fig. 10 summarizes the different products and design alternatives of the media store product line. The core functionality is provided through four component types: UserInteraction, FileLoader, Encoding, and DataAccess. For some of these components, alternative comfort or power variants are present in the core asset base for the different product variants (cf. Fig. 9). The database and the DataAccess component are deployed on one (or optionally two) separated database server(s).

Figure 10: Media store products and design alternatives

During up- and download of media files, different types of failures may occur in the involved component instances: A BusinessLogicFailure may occur during the processing of user requests in the UserInteraction[Comfort] component. A CacheAccessFailure may occur in the FileLoader[Power] component induced by malfunctioning cache memory. Bugs in the compression algorithm of the Encoder[Power] component may lead to an EncodingFailure. A DataAccessFailure may occur in the DataAccess component due to internal database errors or faults in the database server’s file system. Additionally, as hardware failures, a CommunicationFailure, CPUFailure, and/or HDDFailure can occur.

Figure 11: Service behaviour of DataAccessFT
The ServiceBehaviour DataAccessFT.retrieveFile() consists of a RecoveryBlockAction with two RecBlockBehaviours: behaviour 1 contains the ExternalCallAction DataAccess[Main].retrieveFile(); behaviour 2 handles CPUFailure and contains the ExternalCallAction DataAccess[Backup].retrieveFile().

[Figure omitted. Panels: (a) System reliability for all design alternatives (system reliability and system failure reduction per product type: standard, comfort, power; series: no FT, UserFT, FileFT, EncFT, DataFT); (b) Failure probabilities per failure type (per product type); (c) System reliability changes for altered input data (system reliability per altered failure type; series: no FT, EncFT, FileFT, UserFT, DataFT); (d) Failure probabilities depending on usage profile (failure probability per failure type and system reliability over the number of requested files, 2 to 20).]

Figure 12: Media store prediction results

For illustrative purposes, we set the software-level failure probabilities to $10^{-5}$ for each individual failure occurrence in the model, with the following exceptions distinguishing the products: in the UserInteractionComfort component, the probability of BusinessLogicFailures rises to $10^{-4}$ because of the more complex business logic compared to the standard variant. Compression algorithms are generally complex and may fail with a probability of $10^{-4}$ in the Encoder component, and with $2 \times 10^{-4}$ in the EncoderPower component. For all hardware resources, we assume an MTTF of one year and an MTTR of 50 minutes, implying an availability of 99.99%. In other settings, these values could have been extracted from log files of existing similar systems.
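The stated availability follows from the standard steady-state relation availability = MTTF / (MTTF + MTTR). The following minimal Python sketch reproduces this arithmetic; the function name and the 365-day year are assumptions made here for illustration and are not part of the described tooling.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Fraction of time a resource is up; both arguments use the same time unit."""
    return mttf / (mttf + mttr)


# MTTF of one year and MTTR of 50 minutes, both expressed in minutes
availability = steady_state_availability(MINUTES_PER_YEAR, 50)
print(f"{availability:.2%}")  # prints 99.99%
```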

Fault tolerance mechanisms may be optionally introduced into each media store product, in terms of additional components, which are shown in grey in Fig. 10. For example, a UserInteractionFT component may be put in front of the UserInteraction[Comfort] component. It has the ability to buffer incoming requests, to re-initialise the business logic in case of a BusinessLogicFailure, and to retry the failed request. As another example, the DataAccessFT component may be used to handle CPUFailures on the main DB server by redirecting calls to the backup server. Fig. 11 illustrates the service behaviour of the file retrieval call, which consists of a single RecoveryBlockAction with two RecBlockBehaviours.

Each described fault tolerance mechanism can be used for each product, and more than one mechanism may be applied in parallel. We focus on cases where at most one mechanism is used.
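To make the effect of a two-stage recovery block such as DataAccessFT more tangible, the following Python sketch computes its success probability under strong simplifying assumptions: failure types of one behaviour are mutually exclusive, the backup fails independently of the main server, and all numbers are hypothetical. It is an illustration only, not the Markov chain transformation performed by the tool.

```python
def recovery_block_success(primary_failures, handled, backup_failures):
    """Success probability of a recovery block with one primary and one backup behaviour.

    primary_failures / backup_failures: dict mapping failure type -> probability
    that this failure ends the respective behaviour (assumed mutually exclusive).
    handled: set of failure types after which the backup behaviour is tried.
    """
    p_primary_ok = 1.0 - sum(primary_failures.values())
    p_handled = sum(p for t, p in primary_failures.items() if t in handled)
    p_backup_ok = 1.0 - sum(backup_failures.values())
    # Either the primary succeeds, or it fails in a handled way and the backup succeeds.
    return p_primary_ok + p_handled * p_backup_ok


# Hypothetical failure probabilities, chosen for illustration only
main = {"CPUFailure": 1e-4, "DataAccessFailure": 1e-5}
backup = {"CPUFailure": 1e-4, "DataAccessFailure": 1e-5}

without_ft = 1.0 - sum(main.values())
with_ft = recovery_block_success(main, {"CPUFailure"}, backup)
print(without_ft, with_ft)
```

In this sketch, only the handled failure type (CPUFailure) is masked; an unhandled DataAccessFailure of the main behaviour still leads to a failure of the whole block.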

The usage profile of the media store consists of 20% upload calls and 80% download calls, an average of 10 requested files per call, and a probability of 0.3 for files to be large, i.e., requiring compression during upload. We calculated the expected system reliability for each product and design alternative. Each calculation took less than one second on a standard PC with a 2.2 GHz CPU and 2.00 GB RAM.

To provide evidence about the possible decision support for different design alternatives (goal (i) in Section 5.1), Fig. 12(a) shows the system reliability for each product and fault tolerance alternative. The comfort product has the lowest reliability because of the included statistics functionality, which involves additional computing and requests to the database for storage and retrieval. The power product has the highest system reliability, as the high hit rate in the FileLoaderPower cache decreases the number of necessary database accesses. Employing the DataAccessFT component has the highest effect compared to the design alternatives without fault tolerance. Notice that the FT mechanisms have different influences in the different variants. For example, the UserInteractionFT is most effective for the comfort variant.

Fig. 12(b) provides more detail and shows the probability of a system failure due to a certain failure type. Summarized over all products, CommunicationFailures, CPUFailures, and HDDFailures are the most likely causes of a system failure. The risk of a CommunicationFailure is especially high for the comfort product, which requires many database accesses and corresponding network traffic. Thus, a software architect may recognize the need to introduce new fault tolerance mechanisms for these failures.

To demonstrate the robustness of our model to imprecise input data (goal (ii) in Section 5.1), we first examined the robustness of the reliability prediction to alterations in the input failure probabilities. We changed the failure probabilities of the components of the comfort product one at a time by multiplying them by $10^{-1}$. We also increased the hardware resource availability from 99.99% to 99.999%, one at a time for each resource. Fig. 12(c) shows the new system reliabilities for the different fault tolerance variants, indicating that the ranking of the design alternatives is almost identical over the different failure type alterations. The DataAccessFT is always top-ranked. However, rank changes do occur in case of altered BusinessLogicFailure and CacheAccessFailure probabilities, indicating that these probabilities should be estimated as carefully as possible.
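The one-at-a-time procedure can be sketched in a few lines of Python. The reliability function below is a toy series model in which an FT alternative fully masks one failure type; this model, the alternative names, and all probabilities are hypothetical and do not reproduce the values behind Fig. 12(c).

```python
def system_reliability(failure_probs, masked=frozenset()):
    """Toy series model: the system succeeds if no unmasked failure occurs."""
    r = 1.0
    for failure_type, p in failure_probs.items():
        if failure_type not in masked:
            r *= 1.0 - p
    return r


def ranking(failure_probs, alternatives):
    """Design alternatives sorted from most to least reliable."""
    return sorted(
        alternatives,
        key=lambda a: system_reliability(failure_probs, alternatives[a]),
        reverse=True,
    )


# Hypothetical inputs, for illustration only
baseline = {
    "BusinessLogicFailure": 1e-4,
    "EncodingFailure": 1e-4,
    "CacheAccessFailure": 1e-5,
    "CPUFailure": 2e-4,
}
alternatives = {
    "noFT": frozenset(),
    "UserFT": frozenset({"BusinessLogicFailure"}),
    "DataFT": frozenset({"CPUFailure"}),
}

base_rank = ranking(baseline, alternatives)
rank_changes = {}
for failure_type in baseline:
    altered = dict(baseline)
    altered[failure_type] *= 0.1  # alter one input at a time by a factor of 10^-1
    rank_changes[failure_type] = ranking(altered, alternatives) != base_rank

print(base_rank)
print(rank_changes)
```

Inputs whose alteration flips the ranking are the ones that deserve the most careful estimation.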

As a second sensitivity analysis, Fig. 12(d) focuses on the power product without fault tolerance. It shows the sensitivity of the failure probabilities per failure type to the number of requested files (i.e., a change to the usage profile). For most failure types, the failure probabilities rise with the number of files, as more database accesses and message transports over the network are required, as well as more file compressions and cache accesses. The business logic, however, is independent of the number of files, which keeps the BusinessLogicFailure probability constant. CPUFailures and HDDFailures do not influence the system reliability for more than 8 files. For cases with 2 to 8 files there is a chance that all files are found in the cache, and no database access is necessary, thus lowering the failure probability.

To analyse prediction accuracy, we ran the reliability simulation for each of the three products in the UserInteractionFT variant for $10^7$ simulated seconds, and obtained a deviation from the analytical results between 0.0006% and 0.0067%. Fig. 13 shows the results. The ranking of the three considered variants is confirmed by the simulation, which indicates that the analytical results are sufficiently accurate.
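The idea of cross-checking an analytical prediction against a simulation can be illustrated with a small Python sketch. It uses a per-request Monte Carlo experiment on a toy series system with hypothetical failure probabilities, which is a simplification of (and not a substitute for) the time-based reliability simulation described above.

```python
import random


def analytical_reliability(failure_probs):
    """Series system: every step must succeed (independence assumed)."""
    r = 1.0
    for p in failure_probs:
        r *= 1.0 - p
    return r


def simulated_reliability(failure_probs, runs, seed=42):
    """Fraction of simulated requests that pass all steps without a failure."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    successes = 0
    for _ in range(runs):
        if all(rng.random() >= p for p in failure_probs):
            successes += 1
    return successes / runs


probs = [1e-3, 2e-3, 5e-4]  # hypothetical per-step failure probabilities
predicted = analytical_reliability(probs)
simulated = simulated_reliability(probs, runs=200_000)
deviation = abs(predicted - simulated)
print(predicted, simulated, deviation)
```

The deviation shrinks roughly with the square root of the number of simulated requests, which is why long simulation runs are needed to confirm small reliability differences.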

[Figure omitted. Predicted versus simulated system reliability per product type in the UserFT variant.]

Figure 13: Media store simulation results

The effectiveness of the approach for SPLs (goal (iii) in Section 5.1) is demonstrated by the fact that nearly all model parts can be reused throughout the media store products and design alternatives. Only some ComponentInstances are specific to certain alternatives and need to be connected via additional AssemblyConnectors: the UserInteractionComfort, EncodingPower, FileLoaderPower, and all FT components.

5.3 Case Study II: Industrial Control System

As a second case study, we analysed the reliability of a large-scale industrial control system from ABB, which is used in many different domains, such as power generation, pulp and paper handling, or oil and gas processing. The system is implemented in several million lines of C++ code. On a high abstraction level, the core of the system consists of eight software components that can be flexibly deployed on multiple servers depending on the system capacity required by customers. Fig. 14 depicts a possible configuration of the system with four servers. The names of the components and their failure probabilities have been obfuscated for confidentiality reasons.

The upper part of Fig. 14 shows the ServiceBehaviours for the components C7 and FT2. The components FT1 and FT2 have been introduced into the model, inspired by existing FT mechanisms. FT1 is able to restart component C1 upon failed requests. FT2 is able to query two redundant instances of component C4, which are deployed on different servers, thereby implementing fault tolerance against hardware failures.

[Figure omitted. Lower part: the components C1 to C8, C4', FT1, FT2, Ext1, and Ext2 deployed on Server1, Server2, Server3, Server3', and Server4. Upper part: the ServiceBehaviour of C7 (InternalActions, a FailureOccurence annotation with an elided probability, and a BranchAction with probabilities 0.805, 0.193, and 0.002 and ExternalCallActions to C5 and C6) and the ServiceBehaviour of FT2 (a RecBlockAction with two RecBlockBehaviours issuing ExternalCallActions to C4 and C4').]

Figure 14: Control system product line model with two exemplary service behaviours

The reliability of the core system has been analysed in a former study [17], where no fault tolerance mechanisms or product variants were considered. For this case study, we reused the failure probabilities from the former study, which had been determined using software reliability growth models based on bug tracker data. We also reused the transition probabilities between the components, which had been determined using source code instrumentation and system execution. The hardware reliability parameters were based on vendor specifications.
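How component failure probabilities and transition probabilities combine into a system reliability can be sketched with a small architecture-level Markov model in the spirit of Cheung [6]. The Python code below is a simplified illustration under stated assumptions (one final component without outgoing transitions, independent component failures, hypothetical component names and numbers); it is not the transformation implemented by our tool.

```python
def reliability_from(start, reliabilities, transitions, final, sweeps=200):
    """Probability of successful termination when control starts in `start`.

    reliabilities[c]: probability that component c executes without failure
    transitions[c]:   dict successor -> transition probability (sums to 1)
    final:            component after whose successful execution the run ends
    """
    x = {c: 0.0 for c in reliabilities}
    # Fixed-point iteration of x_c = R_c * sum_j P_cj * x_j, with x_final = R_final
    for _ in range(sweeps):
        for c in reliabilities:
            if c == final:
                x[c] = reliabilities[c]
            else:
                x[c] = reliabilities[c] * sum(
                    p * x[succ] for succ, p in transitions.get(c, {}).items()
                )
    return x[start]


# Hypothetical three-component architecture, for illustration only
reliabilities = {"A": 0.99, "B": 0.98, "C": 0.97}
transitions = {"A": {"B": 0.6, "C": 0.4}, "B": {"C": 1.0}}

system_reliability = reliability_from("A", reliabilities, transitions, final="C")
print(round(system_reliability, 7))  # prints 0.9487764
```

For this acyclic example, the result can be checked by hand: 0.99 × (0.6 × 0.98 × 0.97 + 0.4 × 0.97) = 0.9487764.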

The industrial control system is realized as a product line and sold to customers in different variants depending on their requirements. Fig. 15 shows a small excerpt of variants in terms of a feature model. There are many more possible variants, as third-party components can be integrated into the system via standardized interfaces. The components C1-C8 are mandatory. For component C4, there are two alternative implementations (C41 and C42), which address different customer requirements. There are two external components, Ext1 and Ext2, which can be optionally included into the core system. The feature model also includes the different FT mechanisms as variants.
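The size of the variant space spanned by such an excerpt can be illustrated with a short Python sketch. It assumes, for illustration only, that Ext1, Ext2, FT1, and FT2 can each be selected independently and that exactly one implementation of C4 is chosen; the actual group semantics of the feature model may be stricter.

```python
from itertools import product

MANDATORY = ["C1", "C2", "C3", "C5", "C6", "C7", "C8"]
C4_ALTERNATIVES = ["C41", "C42"]          # exactly one implementation of C4
OPTIONAL = ["Ext1", "Ext2", "FT1", "FT2"]  # assumption: independently optional


def enumerate_variants():
    """All feature selections permitted under the assumptions stated above."""
    variants = []
    for c4 in C4_ALTERNATIVES:
        for flags in product([False, True], repeat=len(OPTIONAL)):
            chosen = [name for name, on in zip(OPTIONAL, flags) if on]
            variants.append(frozenset(MANDATORY + [c4] + chosen))
    return variants


variants = enumerate_variants()
print(len(variants))  # prints 32
```

Even this small excerpt yields 32 configurations, which indicates why evaluating each variant with a separately built model quickly becomes impractical.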

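[Figure omitted. Feature tree rooted at Control System with the features C1, C2, C3, C4 (alternatives C41 and C42), C5, C6, C7, C8, Ext (Ext1, Ext2), FT (FT1, FT2), and further elided features ([…]). Legend: Mandatory, Optional, Or, Alternative.]

Figure 15: Feature model of the control system product line variants (excerpt)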

For the scope of this paper, we restricted the reliability analysis to the core system (standard) and three different variants. Variant 1 uses component C42 instead of C41 but is otherwise identical to the core system. Variant 2 incorporates the external component Ext1, which is only connected to component C4 (cf. Fig. 14). Variant 3 incorporates component Ext2, which is connected to components C1, C2, C4, and C6. These variants correspond to realistic configurations, which have formerly been sold to customers.

To demonstrate the decision support for different alternatives (goal (i) in Section 5.1), we analysed how the predicted system reliability varies for the different variants and FT mechanisms (Fig. 16(a)). The actual values are obfuscated for confidentiality reasons. Variant 1 is predicted to be the most reliable. Introducing FT1 generally yields a higher increase in reliability than introducing FT2, which includes adding an additional server for the redundant instance of component C4. The impact of FT2 on system reliability is less pronounced for Variant 1 than for the other variants, because it already uses a more reliable version of component C4. Thus, the software architect can decide whether the increased costs for adding an additional server for realising FT2 in this variant are justified.

[Figure omitted. Panels: (a) Prediction results for different variants (system reliability per product type: Standard, Variant 1, Variant 2, Variant 3; series: no FT, FT1, FT2, FT1+FT2); (b) Sensitivity to component failure probabilities (system reliability over component failure probability; series: C1, C2, C4, C5).]

Figure 16: Exemplary prediction results for the industrial control system

To show the robustness of the models against imprecise input parameters (goal (ii) in Section 5.1), we conducted a sensitivity analysis modifying the failure probabilities of selected components (Fig. 16(b)). The system reliability is most sensitive to the component reliability of C1, as its curve has the steepest slope. The system reliability is robust to the component reliability of C5. Overall, the model behaves linearly, and the deviations of the system reliability are comparatively small relative to the changes in individual component failure probabilities. In this case, the ranking of the design alternatives remained robust against uncharacteristically high variations of the component failure probabilities.
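The notion of a slope per component can be made concrete with a central finite difference. The Python sketch below uses a toy model in which each component contributes a factor (1 - f) raised to its expected number of visits per run; this model, the visit counts, and the failure probabilities are hypothetical and unrelated to the obfuscated data of the case study.

```python
def system_reliability(failure_probs, visits):
    """Toy model: each component contributes (1 - f) ** expected_visits."""
    r = 1.0
    for component, f in failure_probs.items():
        r *= (1.0 - f) ** visits[component]
    return r


def sensitivity(component, failure_probs, visits, step=1e-6):
    """Central finite-difference slope of system reliability w.r.t. one failure probability."""
    up = dict(failure_probs)
    down = dict(failure_probs)
    up[component] += step
    down[component] -= step
    return (system_reliability(up, visits) - system_reliability(down, visits)) / (2 * step)


# Hypothetical inputs, for illustration only
failure_probs = {"C1": 1e-4, "C2": 1e-4, "C4": 1e-4, "C5": 1e-4}
visits = {"C1": 5, "C2": 2, "C4": 1, "C5": 0.01}

sensitivities = {c: sensitivity(c, failure_probs, visits) for c in failure_probs}
most_sensitive = min(sensitivities, key=sensitivities.get)  # steepest negative slope
print(most_sensitive, sensitivities)
```

In this toy setting, the frequently visited component dominates the sensitivity, while a rarely visited component has almost no influence on the system reliability.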

For a comparison between numerically computed predictions and simulation data, we ran a simulation for each variant for $10^6$ simulated seconds. The mean error between the numerically computed and simulated system reliability across all variants was 0.0077 percent. The ranking of the variants was the same for the simulation results as for the numerical results. We conclude that the numerical calculations were sufficiently precise in this case.

To show the effectiveness of the approach for SPLs (goal (iii) in Section 5.1), we quantified the number of changes necessary to model the product variants in our case. For Variant 1, a single ComponentInstance and AssemblyConnector had to be added to the standard Product and deployed to the respective ResourceContainer. This did not require the adjustment of transition probabilities. For Variants 2 and 3, also only single ComponentInstances had to be added to the standard Product.

6. RELATED WORK

Our method for architectural fault tolerance modelling is related to approaches on software architecture reliability modelling [13, 21], fault tolerance modelling on the level of software architecture [18], and reliability engineering for software product lines [7].

Multiple surveys on software architecture reliability modelling are available [11, 13, 15]. R. C. Cheung [6] was among the first to propose architectural reliability modelling using Markov chains. Some recent approaches refine such models to support compositionality [21] and different failure modes [10], but do not regard fault tolerance mechanisms. L. Cheung et al. [5] use hidden Markov models to determine component failure probabilities for Markov chain architecture models. Further approaches in this area apply the UML modelling language [8, 12] or are specifically tailored to service-oriented systems [27], but also do not include fault tolerance mechanisms or support for reusing model artefacts in different contexts, such as product configurations.

Some approaches do tackle the problem of incorporating fault tolerance mechanisms into the architectural prediction models [18]. Sharma and Trivedi [25] include additional states and recovery transitions in architecture-level Markov models to model component restarts or system reboots. Wang et al. [26] provide constructs for Markov chains to model replicated components. Kanoun et al. [16] model fault tolerance of hardware/software systems using generalized stochastic Petri nets. These approaches do not consider component-internal control and data flow, and how it is influenced by error handling constructs. Thus, they may yield inaccurate predictions when fault-tolerant software behaviour deviates from the specific cases considered by the authors. Furthermore, none of these approaches supports reusing model artefacts.

Considering reliability during the design of a software product line is a major challenge, because different product variants may have different influences on the expected reliability. Immonen [14] proposes the ‘reliability and availability prediction’ (RAP) method for SPLs. RAP, however, does not support compositional models, hardware reliability, or explicit fault tolerance mechanisms. Olumofin et al. [19] tailor the architecture trade-off analysis method to evaluate SPLs for different quality attributes. They focus on the identification of scenarios but provide no architectural model or predictions. Dehlinger et al. [9] introduce the PLFaultCAT tool to analyse SPL safety using fault tree analysis. Their models do not reflect the software architecture and therefore complicate evaluating different design alternatives. Auerswald et al. [1] model product families of embedded systems using block diagrams, but provide no usage profile model or quantitative reliability prediction.

7. CONCLUSIONS

We presented an approach to support the design of reliable and fault-tolerant software architectures and software families. Our approach allows modelling different architectural alternatives and product line configurations from a shared core asset base and offers a flexible way to include many different fault tolerance mechanisms. A tool transforms the models into Markov chains and calculates the system reliability involving both software and hardware reliabilities. We evaluated our approach in multiple case studies and demonstrated its value to support architectural design decisions, its robustness against imprecise input data, and its effectiveness for SPLs.

Our approach provides a new perspective for designing software architectures and families. It allows software architects to validate their designs during early development stages and supports their design decisions quantitatively. As the effectiveness of different fault tolerance mechanisms is highly context-dependent, our approach enables software architects to quickly analyse many different alternatives and rule out poor design choices. This can potentially lead to more reliable systems, which are built more cost-effectively because late life-cycle changes for better reliability can be avoided.

In future work, we aim to include more sophisticated hardware reliability modelling techniques into our approach to offer more refined predictions. We will extend our tool for automated sensitivity analyses and design optimisation. Our prediction approach can potentially be extended for other quality attributes, such as performance or security.

8. ACKNOWLEDGMENTS

This work was supported by the European Commission as part of the EU-projects SLA@SOI (grant No. FP7-216556) and Q-ImPrESS (grant No. FP7-215013). Furthermore, we thank Igor Lankin for his support in developing the approach.

9. REFERENCES

[1] M. Auerswald, M. Herrmann, S. Kowalewski, and V. Schulte-Coerne. Software Product-Family Engineering, volume 2290 of LNCS, chapter Reliability-Oriented Product Line Engineering of Embedded Systems, pages 237–280. Springer, 2001.

[2] S. Becker, H. Koziolek, and R. Reussner. The Palladio Component Model for Model-Driven Performance Prediction. Journal of Systems and Software, 82(1):3–22, 2009.

[3] S. Bernardi, J. Merseguer, and D. Petriu. A dependability profile within MARTE. Software and Systems Modeling, pages 1–24, 2009.

[4] F. Brosch, H. Koziolek, B. Buhnova, and R. Reussner. Parameterized Reliability Prediction for Component-based Software Architectures. In Proc. of QoSA’10, volume 6093 of LNCS, pages 36–51. Springer, 2010.

[5] L. Cheung, R. Roshandel, N. Medvidovic, and L. Golubchik. Early prediction of software component reliability. In Proc. of ICSE’08, pages 111–120. ACM Press, 2008.

[6] R. C. Cheung. A User-Oriented Software Reliability Model. IEEE Trans. Softw. Eng., 6(2):118–125, 1980.

[7] P. Clements and L. Northrop. Software Product Lines: Practices and Patterns. Addison-Wesley, 2001.

[8] V. Cortellessa, H. Singh, and B. Cukic. Early reliability assessment of UML based software models. In Proc. of WOSP’02, pages 302–309. ACM, 2002.

[9] J. Dehlinger and R. R. Lutz. PLFaultCAT: A Product-Line Software Fault Tree Analysis Tool. Automated Software Engineering, 13(1):169–193, 2006.

[10] A. Filieri, C. Ghezzi, V. Grassi, and R. Mirandola. Reliability Analysis of Component-Based Systems with Multiple Failure Modes. In Proc. of CBSE’10, volume 6092 of LNCS, pages 1–20. Springer, 2010.

[11] S. S. Gokhale. Architecture-Based Software Reliability Analysis: Overview and Limitations. IEEE Trans. on Dependable and Secure Computing, 4(1):32–40, 2007.

[12] K. Goseva-Popstojanova, A. Hassan, A. Guedem, W. Abdelmoez, D. E. M. Nassar, H. Ammar, and A. Mili. Architectural-Level Risk Analysis Using UML. IEEE Trans. on Softw. Eng., 29(10):946–960, 2003.

[13] K. Goseva-Popstojanova and K. S. Trivedi. Architecture-based approach to reliability assessment of software systems. Performance Evaluation, 45(2-3):179–204, 2001.

[14] A. Immonen. Software Product Lines, chapter A Method for Predicting Reliability and Availability at the Architecture Level, pages 373–422. Springer, 2006.

[15] A. Immonen and E. Niemela. Survey of reliability and availability prediction methods from the viewpoint of software architecture. Software and Systems Modeling, 7(1):49–65, 2008.

[16] K. Kanoun and M. Ortalo-Borrel. Fault-tolerant system dependability-explicit modeling of hardware and software component-interactions. IEEE Transactions on Reliability, 49(4):363–376, 2000.

[17] H. Koziolek, B. Schlich, and C. Bilich. A Large-Scale Industrial Case Study on Architecture-based Software Reliability Analysis. In Proc. 21st International Symposium on Software Reliability Engineering (ISSRE’10). IEEE Computer Society, 2010. To appear.

[18] H. Muccini and A. Romanovsky. Architecting Fault Tolerant Systems. Technical Report CS-TR-1051, University of Newcastle upon Tyne, 2007.

[19] F. G. Olumofin and V. B. Misic. Extending the ATAM Architecture Evaluation to Product Line Architectures. In Proc. of WICSA’05, pages 45–56. IEEE Computer Society, 2005.

[20] B. Randell. System structure for software fault tolerance. In Proc. Int. Conf. on Reliable software, pages 437–449. ACM, 1975.

[21] R. H. Reussner, H. W. Schmidt, and I. H. Poernomo. Reliability prediction for component-based software architectures. Journal of Systems and Software, 66(3):241–252, 2003.

[22] N. Sato and K. S. Trivedi. Accurate and efficient stochastic reliability analysis of composite services using their compact Markov reward model representations. In Proc. of SCC’07, pages 114–121. IEEE Computer Society, 2007.

[23] B. Schroeder and G. A. Gibson. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage, 3(3):8, 2007.

[24] V. Sharma and K. Trivedi. Quantifying software performance, reliability and security: An architecture-based approach. Journal of Systems and Software, 80:493–509, 2007.

[25] V. S. Sharma and K. S. Trivedi. Reliability and Performance of Component Based Software Systems with Restarts, Retries, Reboots and Repairs. In Proc. of ISSRE’06, pages 299–310. IEEE, 2006.

[26] W.-L. Wang, D. Pan, and M.-H. Chen. Architecture-based software reliability modeling. Journal of Systems and Software, 79(1):132–146, 2006.

[27] Z. Zheng and M. R. Lyu. Collaborative reliability prediction of service-oriented systems. In Proc. of ICSE’10, pages 35–44. ACM Press, 2010.
