Reliability Prediction for Fault-Tolerant Software Architectures
Franz Brosch1, Barbora Buhnova2, Heiko Koziolek3, Ralf Reussner4
1 Research Center for Information Technology (FZI), Karlsruhe, Germany
2 Masaryk University, Brno, Czech Republic
3 Industrial Software Systems, ABB Corporate Research, Ladenburg, Germany
4 Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
[email protected], [email protected], [email protected], [email protected]
ABSTRACT
Software fault tolerance mechanisms aim at improving the reliability of software systems. Their effectiveness (i.e., reliability impact) is highly application-specific and depends on the overall system architecture and usage profile. When examining multiple architecture configurations, such as in software product lines, it is a complex and error-prone task to include fault tolerance mechanisms effectively. Existing approaches for reliability analysis of software architectures either do not support modelling fault tolerance mechanisms or are not designed for an efficient evaluation of multiple architecture variants. We present a novel approach to analyse the effect of software fault tolerance mechanisms in varying architecture configurations. We have validated the approach in multiple case studies, including a large-scale industrial system, demonstrating its ability to support architecture design, and its robustness against imprecise input data.
Categories and Subject Descriptors
D.2.11 [Software]: SOFTWARE ENGINEERING—Software Architectures; D.2.4.g [Software]: SOFTWARE ENGINEERING—Software/Program Verification—Reliability
General Terms
Software Engineering, Reliability, Design

Keywords
Component-Based Software Architectures, Reliability Prediction, Fault Tolerance, Software Product Lines
1. INTRODUCTION
Software fault tolerance (FT) mechanisms mask faults in software systems and prevent them from resulting in failures. FT mechanisms are established on different abstraction levels, such as exception handling on the source code level, watchdog and heart-beat as design patterns, and replication on the architecture level [20, 18]. FT mechanisms are commonly used to improve the reliability of software systems (i.e., the probability of failure-free operation in a given time span). The effect of an FT mechanism (i.e., the extent of such an improvement) is however non-trivial to quantify, because it highly depends on the application context.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
QoSA '11 Boulder, Colorado, USA
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.
The challenge of assessing FT mechanisms in different contexts becomes particularly apparent on the architecture level, when evaluating different architectural configurations, and even more in the design of software product lines (SPL) [7]. An SPL has a core assembly of software components and variation points where application-specific software can be filled in. Thus, it can result in many different products, all sharing the same core with different configurations at the variation points.
Existing approaches for reliability prediction [11, 13, 15] either do not support modelling FT mechanisms (e.g., [5, 8, 12, 21]) or do not allow for explicit definition and reuse of core modelling artefacts, and hence are difficult to apply in varying contexts, as with architecture design and SPLs (e.g., [24, 26]).
The contribution of this paper is an approach to analyse the effect of a software fault tolerance mechanism depending on the overall system architecture and usage profile. The approach is novel as it (i) takes software fault tolerance mechanisms explicitly into account, and (ii) reuses model parts for effective evaluation of architectural alternatives or system configurations. The approach is ideally suited for software product lines, which are used to formulate and illustrate it. It builds upon the Palladio Component Model and an associated reliability prediction approach [4], which includes both software reliability and hardware availability as influencing factors. Our tool support allows architects to design the architecture with UML-like models, which are automatically transformed into Markov-based prediction models and evaluated to determine the expected system reliability.
The remainder of this paper is structured as follows. Section 2 outlines our approach and explains the steps involved. Section 3 details the models used in our approach, and Section 4 explains how these models are formalised and analysed to predict the system reliability. Section 5 evaluates the approach on two case studies. Section 6 delimits our work from related approaches. Finally, Section 7 draws conclusions and sketches future directions.
2. PREDICTION PROCESS
This section outlines our reliability prediction approach for fault-tolerant software architectures, software families, and software product lines (SPL) in particular. According to Clements et al. [7], an SPL is defined as "a set of software-intensive systems that share a common, managed set of features satisfying the specific needs of a particular market segment or mission and that are developed from a common set of core assets in a prescribed way". Our modelling approach provides support only for the technical part of an SPL or software family, as we assume that domain engineering and asset scoping have been performed before modelling the architecture and reusable components.
Our approach iteratively follows eight steps depicted in Fig. 1. First, the software architect creates a CoreAssetBase, which models interfaces, reusable software components, and their (abstract) behaviour (step 1). The CoreAssetBase is enriched with software failure probabilities of actions forming component (service) behaviour (step 2). Afterwards, the software architect can include different FT mechanisms, such as recovery blocks (explained in Section 3.2) or redundancy (step 3), either as additional components or directly within already modelled component behaviours. The FT mechanisms allow for different configurations, e.g., the number of retries or replicated instances.
Figure 1: Process activities and artifacts (steps 1-8: model components, interfaces, and behaviour; model failure probabilities in behaviours; model fault tolerance and adjust configurations; model resource environment and HW availability; model products, allocation, and usage; model transformation; Markov chain analysis; product instantiation if the results are OK — producing the Core Asset Base, Resource Environments, Products, and DTMCs)
The software architect then creates a ResourceEnvironment to model hardware resources (step 4) and specific Products (step 5), including component allocation and system usage information. Combined with the CoreAssetBase, these models are transformed into multiple discrete-time Markov chains (step 6), from which system reliability predictions and sensitivity analyses can be deduced (step 7). If the prediction results do not satisfy the reliability requirements, the FT mechanisms can be reconfigured and/or the resource environment and products adjusted iteratively. Otherwise, the modelled products are deemed sufficient for the requirements, and the product can safely be instantiated from the core asset base (step 8). The following two sections describe the models and the predictions in detail.
3. RELIABILITY MODELLING
This section introduces our meta model for describing the reliability characteristics of software product lines, which are used to formulate our approach. The model and its reliability solver are implemented using the Eclipse Modeling Framework (EMF) and build upon the Palladio Component Model (PCM) [2]. An excerpt of our meta model is depicted in Fig. 2; for a full documentation, refer to our website1.
1http://sdqweb.ipd.kit.edu/wiki/ReliabilityPrediction
The model allows variation points to be expressed on different levels:
• architectural level: using composite components to encapsulate subsystems and enable their replacement
• component level: selecting different component implementations for a given specification
• component implementation level: using component parameters with values for specific product configurations
• resource level: using allocation references expressing different deployment schemes for product line configurations
Our model is better suited for our purposes than UML extended with the MARTE-DAM profile [3], because it allows model reuse through the core asset base and is reduced to the concepts needed for the prediction. The following sections explain the core asset base (Section 3.1), the fault-tolerance mechanisms (Section 3.2), the resource environment (Section 3.3), and the product (Section 3.4) in detail.
3.1 Core Asset Base
The modelling element CoreAssetBase of our meta model (cf. Fig. 2) represents a repository for elements assembled into products and contains ComponentTypes, Interfaces, and FailureTypes. ComponentTypes can either be atomic PrimitiveComponents or hierarchically structured CompositeComponents with nested inner components.
Composite components allow the core asset base to contain whole architecture fragments (e.g., SPL core assemblies) that can be reused in different products. Such a core assembly can have optionally required interfaces as variation points. Component types are associated with interfaces through ProvidedRoles or RequiredRoles, and can export ComponentParameters that allow for implementation-level variation points. Fig. 3 shows an excerpt of an instance of a core asset base including a composite component (4) and a component parameter (value).
For reliability analyses, the model requires constructs to express the behaviour of component services in terms of using hardware resources and calling other components. Therefore, a component type can contain a number of ServiceBehaviours that specify the actions executed upon calling a specific Signature of one of the component's provided interfaces. The behaviour may consist of InternalActions, ExternalCallActions, and control flow constructs, such as probabilistic branches and loops.
An internal action represents a component-internal computation and can contain multiple FailureOccurrences and ResourceDemands. FailureOccurrences model software failures during the execution of a service using a probability. These failure probabilities can be determined with different techniques [11, 5, 17], such as reliability growth modelling, defect prediction based on code metrics, statistical testing, or fault injection.
An external call action represents a call to other components and thus references a Signature contained in one of the required interfaces of the current component. Fig. 4 shows a small example of a service behaviour containing internal actions and external call actions. Notice that the transitions of the BranchAction are labelled with input parameter dependencies (e.g., P(X ≤ 1)), because the concrete values for the input parameters (e.g., X) are not known to the component developer providing the specification. The component developer should keep the component as reusable as possible, i.e., make no assumptions about the usage profile. The developer specifies parameter dependencies as an influencing factor on the component reliability. The dependencies are automatically resolved during system analysis, when the usage profile is known. Hence, the influence of the given system-level usage profile on the system reliability is explicitly considered by our approach by propagating the system-level input parameters through the component-based architecture [4].

Figure 2: Excerpt of our meta model for specifying reliability characteristics in a software product line (classes shown include CoreAssetBase, ComponentType, PrimitiveComponent, CompositeComponent, Interface, Signature, FailureType, ProvidedRole, RequiredRole, ComponentParameter, ServiceBehaviour, AbstractAction, InternalAction, ExternalCallAction, BranchAction, LoopAction, RecoveryBlockAction, RecBlockBehaviour, FailureOccurrence with probability, ResourceDemand, Product, ComponentInstance, ComponentConfiguration, AssemblyConnector, UserInterface, ProductUsage, UserCall with probability, ResourceEnvironment, ResourceContainer, ProcessingResource with mttf and mttr, LinkResource with failProb, and ProcResourceType)
Figure 3: Example instance of a core asset base (a CoreAssetBase with primitive components 1-3, a composite component 4, failure types A-D, and a component parameter "value")
Figure 4: Service behaviour example (partial view): a ServiceBehaviour with a BranchAction whose transitions carry parameter dependencies P(X ≤ 1) and P(X > 1), internal actions (one with a FailureOccurrence of probability 0.00012 and a demand on a CPU ProcessingResourceType), external call actions referencing a Signature, and a RecoveryBlockAction with two RecBlockBehaviours
3.2 Fault-tolerance Meta-Model
To model fault tolerance within component services, we use the concept of recovery blocks, which is analogous to exception handling in object-oriented programming. It is represented with a RecoveryBlockAction that contains an acyclic sequence of at least two RecBlockBehaviours. The first behaviour models the normal execution within a service, while the following behaviours handle failures of certain types and initiate alternative actions (analogous to try and catch blocks in exception handling). Each behaviour contains an inner sequence of AbstractActions that again can contain any type of action and even nested recovery blocks.
Figure 5: Recovery-block example and its semantics (RecBlockBehaviour 1 handles no failures and may raise failures A, B, C, D; RecBlockBehaviour 2 handles B and C and may raise B, C; RecBlockBehaviour 3 handles C and D and may raise B, D)
Consider the example in Fig. 5 of a recovery block action with three RecBlockBehaviours and four different failure types A, B, C, D. During the first behaviour all failure types can occur. Failures of type A cannot be recovered and lead to a failure of the whole block. The second behaviour handles failures of type B and C, and the third behaviour handles failures of type C and D. In the second behaviour, failures B and C are again possible; its failures of type B lead to a failure of the block, while failures of type C (and, from the first behaviour, type D) are handled by the third behaviour. Notice that failures of type C cannot escape this recovery block, as they are handled by behaviour 3 and cannot occur again during the execution of behaviour 3.
As a RecBlockBehaviour can contain arbitrary behavioural constructs, such as internal actions, calls, branches, loops, and nested recovery block actions, recovery blocks are a flexible means to model FT mechanisms. They allow modelling exception handling, checkpoint and restart, process pairs, recovery blocks, or consensus recovery blocks, and are therefore capable of supporting reliability predictions for a large class of existing FT mechanisms. If an external call action is embedded into a RecBlockBehaviour, errors from the called component (and any other component down the caller stack) can be handled. The case studies in Section 5 show different possible usages of recovery block actions.
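The failure-type propagation of Fig. 5 can be sketched programmatically. This is a minimal illustration using our own set-based representation (not the paper's meta model) to determine which failure types can escape a recovery block:

```python
# Sketch of the recovery-block semantics from Fig. 5 (helper names are
# our own): compute which failure types can escape a recovery block.
def escaping_failures(possible, handled):
    """possible[i]: failure types behaviour i may raise;
    handled[i]: failure types behaviour i recovers from.
    A failure of type t raised by behaviour i transfers control to the
    closest behaviour x > i with t in handled[x]; if none exists,
    the failure escapes the recovery block."""
    k = len(possible)
    escaping, reachable, work = set(), {0}, [0]
    while work:
        i = work.pop()
        for t in possible[i]:
            handler = next((x for x in range(i + 1, k) if t in handled[x]), None)
            if handler is None:
                escaping.add(t)
            elif handler not in reachable:
                reachable.add(handler)
                work.append(handler)
    return escaping

# Fig. 5: behaviour 1 may raise A, B, C, D and handles nothing;
# behaviour 2 handles B, C and may raise B, C;
# behaviour 3 handles C, D and may raise B, D.
esc = escaping_failures(
    possible=[{"A", "B", "C", "D"}, {"B", "C"}, {"B", "D"}],
    handled=[set(), {"B", "C"}, {"C", "D"}],
)
print(sorted(esc))  # ['A', 'B', 'D'] -- type C never escapes the block
```

This reproduces the observation above: A escapes from behaviour 1, B can escape from behaviours 2 and 3, D from behaviour 3, and C is always recovered.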
3.3 Resource Environment
A ResourceEnvironment contains ResourceContainers modelling server nodes and LinkResources modelling network connections. Each resource container can contain a number of ProcessingResources like CPUs or hard disks. To include hardware availability in the calculation of the system reliability, each processing resource carries a mean time to failure (MTTF) and a mean time to repair (MTTR) attribute. These values can be determined from vendor specifications and/or former experience [23]. Link resources contain failure probabilities for network problems, which can be determined using simple test series.
A resource environment model can be reused across different Products using allocation references (i.e., mappings of components to resources). Furthermore, it is possible to have multiple resource environments (e.g., different server sizes or high-availability servers), which then constitute an additional variation point or feature of the product line. Components in the core asset base are decoupled from concrete resource environments, because they only refer to abstract ProcResourceTypes, but not to concrete ProcessingResources. Thus, it is possible to connect a product to a specific resource environment through an allocation reference without the need to alter the core asset base.
3.4 Product
A Product contains a number of ComponentInstances wired through AssemblyConnectors and accessed through a UserInterface. Only components with matching interfaces may be composed in a product. It is possible to connect different component instances complying to the same interfaces at different points in the architecture, thus implementing architecture-level variation points. Through the compositions, the overall system behaviour is defined as a connection of all service behaviours (i.e., after composition, external call actions in service behaviours can be replaced by the service behaviours of the called services).
Component instances can contain ComponentConfigurations that realize implementation-level variation points. We introduced these parameters in our former work [4]. They are data values that can model different configurations or features of a component implementation and change component behaviour.
Component instances within a Product must be allocated to ResourceContainers through an allocation reference that models a deployment on a certain server. Thus, the component reliability depends on the availability of the server's hardware resources employed by the component instance. The ProductUsage contains a number of UserCalls with different probabilities, which model the system usage profile of a product.
4. RELIABILITY PREDICTION
This section first describes how the system reliability (under a specified usage profile) is calculated on the software layer (Section 4.1) and then integrated with calculations on the hardware and network layers (Sections 4.2, 4.3).
4.1 Software Layer
For the software layer, the approach derives the reliability from a Markov model that reflects all possible execution paths through a Product architecture and their corresponding probabilities.
Figure 6: Markov model for service behaviours (a sequence of action states A1 ... An between an initial state I and an absorbing success state S; each Ai has transitions labelled fp1(Ai) ... fpm(Ai) to the absorbing failure states F1 ... Fm, and a success transition labelled 1 − ∑ fpi(Ai))
Fig. 6 shows how a ServiceBehaviour represented through a sequence of n AbstractActions is transformed into an absorbing discrete-time Markov chain. The chain contains an initial state I, an absorbing success state S, one state Ai for each action, and one absorbing failure state Fj for each of the m user-defined software FailureTypes. A transition from Ai to Fj denotes that action i might exhibit a failure of type j upon execution, with a probability of fpj(Ai). The probability of failure of the whole behaviour fp(Beh) is the probability of reaching any of the failure states Fj (and not the success state S) from the initial state I:
fp(Beh) = 1 − ∏_{i=1}^{n} ( 1 − ∑_{j=1}^{m} fpj(Ai) )

For the success probability of the behaviour sp(Beh), we have:

sp(Beh) = 1 − fp(Beh)
The calculation of fpj(Ai) depends on the type of the action Ai. For InternalActions, the failure probabilities fpj(Ainternal) are given as a direct input to the model (cf. Fig. 2). LoopActions, BranchActions, ExternalCallActions, and RecoveryBlockActions have nested ServiceBehaviours Beh (again sequences of AbstractActions), which need to be evaluated in a recursive step first. For loops, we have:

fpj(Aloop) = ∑_{i=1}^{k} ( P(ci) · ∑_{l=0}^{ci−1} ( sp(Beh)^l · fpj(Beh) ) )
with a finite set of loop iteration counts {c1, ..., ck} ⊆ ℕ, each with its probability of occurrence P(ci). For branches, we have:

fpj(Abranch) = ∑_{i=1}^{k} ( P(Behi) · fpj(Behi) )

with a finite set of nested behaviours {Beh1, ..., Behk} and their probabilities of occurrence P(Behi). An ExternalCallAction fails if the called behaviour fails:

fpj(Acall) = fpj(Beh)
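The recursive evaluation of these formulas can be sketched as follows. The dict-based action encoding is our own illustrative representation, not the paper's EMF meta model:

```python
# Minimal sketch of the recursive failure-probability calculation.
# Actions are dicts with a "kind"; behaviours are action sequences.
def fp_action(action):
    """Per-failure-type failure probabilities fp_j(A) of a single action."""
    kind = action["kind"]
    if kind == "internal":                       # fp_j given as direct model input
        return dict(action["fp"])
    if kind == "call":                           # fp_j(A_call) = fp_j(Beh)
        return fp_behaviour(action["behaviour"])
    if kind == "branch":                         # sum of P(Beh_i) * fp_j(Beh_i)
        out = {}
        for p, beh in action["branches"]:
            for j, v in fp_behaviour(beh).items():
                out[j] = out.get(j, 0.0) + p * v
        return out
    if kind == "loop":                           # sum over iteration counts c_i
        fp_b = fp_behaviour(action["behaviour"])
        sp_b = 1.0 - sum(fp_b.values())
        out = {}
        for p, c in action["iterations"]:
            weight = sum(sp_b ** l for l in range(c))  # survive l runs, fail in run l+1
            for j, v in fp_b.items():
                out[j] = out.get(j, 0.0) + p * weight * v
        return out
    raise ValueError(f"unknown action kind: {kind}")

def fp_behaviour(actions):
    """fp_j(Beh): probability of reaching failure state F_j before success."""
    out, survive = {}, 1.0
    for a in actions:
        fpa = fp_action(a)
        for j, v in fpa.items():
            out[j] = out.get(j, 0.0) + survive * v   # fail with type j here
        survive *= 1.0 - sum(fpa.values())           # survive this action
    return out

# Example: an internal action followed by a loop that executes twice.
beh = [
    {"kind": "internal", "fp": {"A": 0.01}},
    {"kind": "loop", "iterations": [(1.0, 2)],
     "behaviour": [{"kind": "internal", "fp": {"B": 0.1}}]},
]
total_fp = sum(fp_behaviour(beh).values())  # 0.01 + 0.99 * 0.19 = 0.1981
```

For a plain sequence of internal actions this per-type summation agrees with the product formula fp(Beh) = 1 − ∏(1 − ∑ fpj(Ai)) given above.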
Recovery blocks are characterized through a sequence of nested behaviours [Beh1, ..., Behk]. Having fpj(Behi) for every behaviour and failure type, each Behi can be represented with a trivial Markov model T(Behi), as illustrated in Fig. 7, and the recovery-block model constructed as their combination (following the semantics illustrated in Fig. 5).
Figure 7: Markov models for recovery behaviours (each T(Bi) is a tree with initial state Ii, success state Si reached with probability 1 − ∑ fpj(Bi), and failure states Fi1 ... Fim reached with probabilities fp1(Bi) ... fpm(Bi); T(B1) handles no failure types, while T(B2) ... T(Bk) each handle their respective sets of failure types)
Let for each model T(Behi) be Ii its initial state, Si its success state, and Fij its failure state for each failure type j. To connect the isolated trees into a single Markov chain, we add m + 2 states, namely I, S, and one Fj for each failure type j, and the following transitions:
• I → I1 with probability 1.0,
• Si → S with probability 1.0 for each i ∈ {1, ..., k},
• Fij → Ix with probability 1.0, where x ∈ {i + 1, ..., k} is the index of the closest tree handling failure type j, if such a tree exists,
• and if not, Fij → Fj with probability 1.0 for each Fij with no outgoing transitions.
Finally, the failure probabilities fpj(Arecovery) are computed as the probability of reaching Fj from I (via summation of the multiplied probabilities over the available paths).
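Following these transition rules, fpj(Arecovery) can be computed without building the Markov chain explicitly, by propagating the probability mass that enters each behaviour. A sketch with assumed data structures (not the paper's implementation):

```python
# Sketch: fp[i] maps failure type -> fp_j(Beh_i); handled[i] is the set of
# failure types behaviour i recovers from (behaviour 0 handles none).
def fp_recovery(fp, handled):
    k = len(fp)
    mass = [0.0] * k        # probability of entering behaviour i
    mass[0] = 1.0           # I -> I_1 with probability 1.0
    out = {}                # resulting fp_j(A_recovery)
    for i in range(k):
        for j, p in fp[i].items():
            handler = next((x for x in range(i + 1, k) if j in handled[x]), None)
            if handler is None:
                out[j] = out.get(j, 0.0) + mass[i] * p   # F_ij -> F_j (absorbing)
            else:
                mass[handler] += mass[i] * p             # F_ij -> I_x (retry)
    return out

# Numeric variant of the Fig. 5 example (probabilities invented for
# illustration): type C is always handled and never appears in the result.
res = fp_recovery(
    fp=[{"A": 0.01, "B": 0.02, "C": 0.03, "D": 0.04},
        {"B": 0.05, "C": 0.06},
        {"B": 0.07, "D": 0.08}],
    handled=[set(), {"B", "C"}, {"C", "D"}],
)
```

Because every handler lies strictly after the behaviour it recovers (the sequence is acyclic), a single forward pass over the behaviours suffices.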
Based on the described calculations, the success probability of the topmost behaviour sp(BehT) (the sequence of system services invoked by the user) can be calculated in a recursive way, yielding the system-level reliability with respect to the software layer.
4.2 Hardware Layer
For ProcessingResources, the approach employs a hardware availability model and integrates it with the software layer for a combined consideration of software and hardware failures. Having the MTTFi and MTTRi values for all resources {r1, ..., rp}, we first calculate the steady-state availability Ai of ri:

Ai = MTTFi / (MTTFi + MTTRi)

and interpret it as the probability of ri being available when requested at an arbitrary point in time. Furthermore, we define the set of physical system states as the set {s1, ..., sq}, where each sj is a combination of the individual resource states. We distinguish two possible states for each resource, being either available (OK) or unavailable (FAIL). Assuming independent hardware failures, the probability P(sj) for the system to be in state sj is the product of the individual state probabilities. For example, for a system with 2 resources, the probability for r1 being OK and r2 being FAIL is:

P((r1 = OK) ∧ (r2 = FAIL)) = A1 · (1 − A2)
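The steady-state availabilities and the physical-system-state probabilities follow directly from these definitions; a sketch (the MTTF/MTTR values are arbitrary):

```python
from itertools import product

def availability(mttf, mttr):
    """Steady-state availability A_i = MTTF_i / (MTTF_i + MTTR_i)."""
    return mttf / (mttf + mttr)

def state_probabilities(avails):
    """P(s_j) for every combination of per-resource states (True = OK),
    assuming independent hardware failures."""
    probs = {}
    for states in product((True, False), repeat=len(avails)):
        p = 1.0
        for a, ok in zip(avails, states):
            p *= a if ok else (1.0 - a)   # A_i if OK, 1 - A_i if FAIL
        probs[states] = p
    return probs

a1, a2 = availability(1000.0, 10.0), availability(500.0, 25.0)
probs = state_probabilities([a1, a2])
p_mixed = probs[(True, False)]   # P(r1 = OK and r2 = FAIL) = A1 * (1 - A2)
```

The q = 2^p state probabilities always sum to 1, which makes them usable as weights in the combined model of Section 4.2.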
Figure 8: Combined HW/SW consideration (from the initial state I, transitions with probabilities P(s1) ... P(sq) lead to the physical system states s1 ... sq; from each sj, the execution reaches a hardware failure state R1 ... Rp, a software failure state F1 ... Fm, or the success state S, e.g. with probabilities fp1(BehT|s1) and sp(BehT|sq))
Fig. 8 illustrates the combined consideration of the software and the hardware layer. Upon a user call, the system might be in any of the physical system states sj. This is reflected by a transition from the initial state I to sj with the corresponding state probability P(sj). Being in sj, the execution might either fail due to an unavailable hardware resource accessed by the system control flow (sj to Ri), or with a software failure (represented through transitions from sj to Fk), or it might succeed (sj to S). To calculate the failure probabilities fpk(BehT|sj), k ∈ {1, ..., m + p}, we need to incorporate the failures due to hardware unavailability into the software layer model (as shown in Fig. 6). To this end, we set the hardware-caused failure probability fpk(BehT|sj), k > m, of InternalActions to 1.0 if they require a ProcessingResource that is not available under the given physical system state sj. If an internal action requires two or more unavailable resources, the failure probability is distributed evenly among them. In this manner, the software layer is evaluated independently for each physical system state, and the overall success probability of service execution (represented by its topmost behaviour BehT) is computed as the weighted sum over the success probabilities of all physical system states:
sp(BehT) = ∑_{j=1}^{q} ( P(sj) · sp(BehT|sj) )

where for each physical system state we have:

sp(BehT|sj) = 1 − ∑_{k=1}^{m+p} fpk(BehT|sj)
Thanks to the separation of service behaviour according to the individual physical system states, the approach can better reflect the correlation of subsequent resource demands (i.e., a resource accessed twice within a single service execution is likely to be either OK in both cases or FAIL in both cases), caused by significantly longer resource failure and repair times compared to the execution time of a single user-triggered service [22].
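The weighted sum over physical system states can be sketched as a self-contained helper; the resource values and the per-state success function below are invented for illustration:

```python
from itertools import product

def availability(mttf, mttr):
    return mttf / (mttf + mttr)   # A_i = MTTF_i / (MTTF_i + MTTR_i)

def system_success_probability(resources, sp_given_state):
    """sp(Beh_T) = sum over states s_j of P(s_j) * sp(Beh_T | s_j).
    resources: list of (MTTF, MTTR) pairs; sp_given_state maps a tuple of
    per-resource OK flags to the software-layer success probability."""
    avails = [availability(f, r) for f, r in resources]
    total = 0.0
    for states in product((True, False), repeat=len(avails)):
        p = 1.0
        for a, ok in zip(avails, states):
            p *= a if ok else (1.0 - a)      # P(s_j), independent failures
        total += p * sp_given_state(states)  # weight sp(Beh_T | s_j)
    return total

# Toy example: one resource; the service succeeds with probability 0.999
# when the resource is OK and always fails when it is unavailable.
sp = system_success_probability(
    [(1000.0, 10.0)],
    lambda states: 0.999 if states[0] else 0.0,
)
```

In the real approach, sp_given_state would be the software-layer evaluation of Section 4.1 with hardware-caused internal-action failure probabilities set to 1.0 for unavailable resources.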
4.3 Network Layer
To incorporate the network layer into the model, we assume that each call routed over a LinkResource involves two message transports (request and return), and that each transport might fail with the given failure probability of the link fp(L). We adapt the software layer model (see Section 4.1) by differentiating ExternalCallActions into local calls and remote calls. We keep the failure probabilities for local calls:

fpj(AlocalCall) = fpj(Beh)

but incorporate the link failure probability into the calculation for remote calls:

fpj(AremoteCall) = fpj(Beh) · fp(L)²

Thus, we enable a coarse-grained consideration of network reliability, without going into the details of more sophisticated network simulations, which are out of the scope of this paper.
5. EVALUATION
5.1 Goals and Setting
This section serves to validate our reliability and fault tolerance modelling approach. The goals of the validation are (i) to demonstrate how the new FT modelling techniques can support design decisions, (ii) to provide a rationale for the validity of our models and their resilience to imprecise input parameters, and (iii) to show the effectiveness of our models for SPLs.

Regarding goal (ii), validating reliability predictions against measured values is inherently difficult, as failures are rare events, and the necessary time to observe a statistically relevant number of failures is infeasibly long for high-reliability systems. Several existing approaches therefore limit their validation to demonstrating examples and sensitivity analyses (e.g., [8, 10, 12, 22, 25]), showing how the approaches can be used to learn about system failure behaviour, and proving the robustness of prediction results against imprecise input data at design time. A number of authors involve real-world industrial software systems in their validation (e.g., [5, 16, 26]). We follow the same path, and additionally compare our numerically calculated predictions against a more realistic, but also more time-consuming, queueing network simulation [4], in order to at least partially validate prediction accuracy. The simulation has fewer assumptions than the analytical solution. It takes system execution times (encoded into ResourceDemands) into account and lets resources fail and be repaired according to their MTTF and MTTR, not based on the simplified steady-state availability.
We have validated our approach on a number of different systems: a distributed business reporting system, the common component modelling example (CoCoME), a web-based media store product line [2], an industrial control system product line [17], and the SLA@SOI Open Reference Case. The models for these systems can be retrieved from our website1. In the following, we describe the predictions for the web-based media store (Section 5.2) and the industrial control system (Section 5.3) in detail. The media store allows us to present all reliability predictions, while the predictions for the industrial control system have been obfuscated for confidentiality reasons.
5.2 Case Study I: Web-based Media Store
The media store model is inspired by common web-service-based data storage solutions and has similar functionality to the iTunes Music Store [2]. The system provides a centralized storage for media files, such as audio or video files, and a corresponding up- and download functionality. The media store product line contains three standard product configurations: standard, comfort, and power. More configurations are possible by instantiating the feature model in Fig. 9.
Figure 9: Feature model of the media store SPL (the Media Store has four mandatory choices — UserInteractionChoice, FileLoaderChoice, EncoderChoice, and DataAccessChoice — with alternative features UserInteraction/UserInteractionComfort/UserInteractionFT, FileLoader/FileLoaderPower/FileLoaderFT, Encoder/EncoderPower/EncoderFT, and DataAccess/DataAccessFT, plus an optional CacheHitRate parameter)
Fig. 10 summarizes the different products and design alternatives of the media store product line. The core functionality is provided through four component types: UserInteraction, FileLoader, Encoding, and DataAccess. For some of these components, alternative comfort or power variants are present in the core asset base for the different product variants (cf. Fig. 9). The database and the DataAccess component are deployed on one (or optionally two) separate database server(s).
Figure 10: Media store products and design alternatives (component instances UserInteraction[Comfort] or UserInteractionFT, FileLoader[Power] or FileLoaderFT, Encoding[Power] or EncodingFT, and DataAccess or DataAccessFT, connected through the interfaces IUserInteraction, IFileLoader, IEncoding, and IDataAccess, and allocated to resource containers each providing a CPU and an HDD processing resource)
During up- and download of media files, different types of failures may occur in the involved component instances: a BusinessLogicFailure may occur during the processing of user requests in the UserInteraction[Comfort] component. A CacheAccessFailure may occur in the FileLoader[Power] component, induced by malfunctioning cache memory. Bugs in the compression algorithm of the Encoder[Power] component may lead to an EncodingFailure. A DataAccessFailure may occur in the DataAccess component due to internal database errors or faults in the database server's file system. Additionally, as hardware failures, a CommunicationFailure, CPUFailure, and/or HDDFailure can occur.
Figure 11: Service behaviour of DataAccessFT (the ServiceBehaviour of DataAccessFT.retrieveFile() contains a RecoveryBlockAction whose first RecBlockBehaviour calls DataAccess[Main].retrieveFile() and whose second RecBlockBehaviour handles CPUFailure by calling DataAccess[Backup].retrieveFile())
[Figure 12: Media store prediction results — (a) system reliability for all design alternatives; (b) failure probabilities per failure type; (c) system reliability changes for altered input data; (d) failure probabilities depending on usage profile]

For illustrative purposes, we set the software-level failure probabilities to 10^-5 for each individual failure occurrence in the model, with the following exceptions distinguishing the products: in the UserInteraction[Comfort] component, the probability of BusinessLogicFailures rises to 10^-4 because of the more complex business logic compared to the standard variant. Compression algorithms are generally complex and may fail with a probability of 10^-4 in the Encoding component, and with 2 × 10^-4 in the Encoding[Power] component. For all hardware resources, we assume an MTTF of one year and an MTTR of 50 minutes, implying an availability of 99.99%. In other settings, these values could have been extracted from log files of existing similar systems.
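The quoted hardware availability follows from the standard steady-state formula, availability = MTTF / (MTTF + MTTR). A minimal sketch (illustrative only, not part of the original tooling) reproduces the 99.99% figure from the assumed MTTF and MTTR:

```python
# Steady-state availability of a hardware resource: MTTF / (MTTF + MTTR).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600 minutes

def availability(mttf_minutes: float, mttr_minutes: float) -> float:
    """Fraction of time the resource is operational."""
    return mttf_minutes / (mttf_minutes + mttr_minutes)

a = availability(MINUTES_PER_YEAR, 50.0)  # MTTF one year, MTTR 50 minutes
print(f"{a:.4%}")  # ≈ 99.99%
```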
Fault tolerance mechanisms may optionally be introduced into each media store product, in terms of additional components, which are shown in grey in Fig. 10. For example, a UserInteractionFT component may be put in front of the UserInteraction[Comfort] component. It has the ability to buffer incoming requests, to re-initialise the business logic in case of a BusinessLogicFailure, and to retry the failed request. As another example, the DataAccessFT component may be used to handle CPUFailures on the main DB server by redirecting calls to the backup server. Fig. 11 illustrates the service behaviour of the file retrieval call, which consists of a single RecoveryBlockAction with two RecBlockBehaviours.

Each described fault tolerance mechanism can be used for each product, and more than one mechanism may be applied in parallel. We focus on cases where at most one mechanism is used.
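Under the simplifying assumptions that the alternative behaviours fail independently and that every failure type is handled, the failure probability of such a recovery block is simply the product of the behaviours' failure probabilities. A hypothetical sketch (the numbers are illustrative, not taken from the model):

```python
from functools import reduce

def recovery_block_failure(p_fail_behaviours):
    """Failure probability of a recovery block: it fails only if every
    alternative behaviour fails (independence assumed; in the actual
    model, only the handled failure types trigger the next behaviour)."""
    return reduce(lambda acc, p: acc * p, p_fail_behaviours, 1.0)

# Main and backup database call each failing with probability 1e-4:
p = recovery_block_failure([1e-4, 1e-4])
print(p)  # about 1e-8: four orders of magnitude better than a single call
```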
The usage profile of the media store consists of 20% upload calls and 80% download calls, an average of 10 requested files per call, and a probability of 0.3 for files to be large, i.e., requiring compression during upload. We calculated the expected system reliability for each product and design alternative. Each calculation took less than one second on a standard PC with a 2.2 GHz CPU and 2 GB RAM.
To provide evidence about the possible decision support for different design alternatives (goal (i) in Section 5.1), Fig. 12(a) shows the system reliability for each product and fault tolerance alternative. The comfort product has the lowest reliability because of the included statistics functionality, which involves additional computing and requests to the database for storage and retrieval. The power product has the highest system reliability, as the high hit rate in the FileLoader[Power] cache decreases the number of necessary database accesses. Employing the DataAccessFT component has the highest effect compared to the design alternatives without fault tolerance. Notice that the FT mechanisms have different influences in the different variants. For example, the UserInteractionFT is most effective for the comfort variant.
Fig. 12(b) provides more detail and shows the probability of a system failure due to a certain failure type. Summarized over all products, CommunicationFailures, CPUFailures, and HDDFailures are the most probable causes of a system failure. The risk of a CommunicationFailure is especially high for the comfort product, which requires many database accesses and corresponding network traffic. Thus, a software architect may recognize the need to introduce new fault tolerance mechanisms for these failures.
To demonstrate the robustness of our model to imprecise input data (goal (ii) in Section 5.1), we first examined the robustness of the reliability prediction to alterations in the input failure probabilities. We changed the failure probabilities of the components of the comfort product one at a time by multiplying them by 10^-1. We also increased the hardware resource availability from 99.99% to 99.999%, one at a time for each resource. Fig. 12(c) shows the new system reliabilities for the different fault tolerance variants, indicating that the ranking of the design alternatives is almost identical over the different failure type alterations. The DataAccessFT is always top-ranked. However, rank changes do occur in the case of altered BusinessLogicFailure and CacheAccessFailure probabilities, indicating that these probabilities should be estimated as carefully as possible.
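The one-at-a-time alteration scheme can be sketched as follows; the series-system formula and the probabilities are simplified stand-ins for the full Markov analysis and the model's actual values:

```python
def system_reliability(component_fail_probs):
    """Toy series model: the system succeeds only if no component
    fails (a stand-in for the full Markov chain evaluation)."""
    r = 1.0
    for p in component_fail_probs:
        r *= 1.0 - p
    return r

base = {"User": 1e-4, "File": 1e-5, "Enc": 2e-4, "Data": 1e-5}
for name in base:  # alter one failure probability at a time
    altered = dict(base)
    altered[name] *= 0.1  # multiply by 10^-1, as in the study
    print(f"{name} altered: {system_reliability(altered.values()):.6f}")
```

The component whose alteration moves the result the most is the one whose estimate deserves the most care, which is the rationale behind the ranking check described above.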
As a second sensitivity analysis, Fig. 12(d) focuses on the power product without fault tolerance. It shows the sensitivity of the failure probabilities per failure type to the number of requested files (i.e., a change to the usage profile). For most failure types, the failure probabilities rise with the number of files, as more database accesses and message transports over the network are required, as well as more file compressions and cache accesses. The business logic, however, is independent of the number of files, which keeps the BusinessLogicFailure probability constant. CPUFailures and HDDFailures do not influence the system reliability for more than 8 files. For cases with 2 to 8 files, there is a chance that all files are found in the cache and no database access is necessary, thus lowering the failure probability.
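The growth of the per-call failure probability with the number of requested files follows the usual independence argument; a simplified sketch, ignoring the cache effect and using illustrative probabilities:

```python
def call_failure_prob(p_per_file: float, n_files: int) -> float:
    """Per-call failure probability when each of the n requested files
    triggers an independent access that fails with probability p."""
    return 1.0 - (1.0 - p_per_file) ** n_files

# A per-file failure type scales with the number of files...
print(call_failure_prob(1e-5, 2))   # ~2e-5
print(call_failure_prob(1e-5, 20))  # ~2e-4
# ...whereas a constant per-call activity (e.g. the business logic)
# contributes the same probability regardless of n_files.
```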
To analyse prediction accuracy, we ran the reliability simulation for each of the three products in the UserInteractionFT variant for 10^7 simulated seconds and observed a deviation from the analytical results between 0.0006% and 0.0067%. Fig. 13 shows the results. The ranking of the three considered variants is confirmed by the simulation, which indicates that the analytical results are sufficiently accurate.
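This kind of cross-check can be sketched as a Monte Carlo estimate converging to an analytical value (illustrative only; the actual simulation executes the behavioural models rather than a single Bernoulli trial):

```python
import random

def simulate_success_rate(p_fail: float, n_runs: int, seed: int = 42) -> float:
    """Estimate the success probability by repeated random trials."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n_runs) if rng.random() >= p_fail)
    return successes / n_runs

p_fail = 1e-3
estimate = simulate_success_rate(p_fail, 10**6)
analytical = 1.0 - p_fail
print(abs(estimate - analytical))  # small deviation, shrinking as n_runs grows
```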
[Figure 13: Media store simulation results — predicted vs. simulated system reliability for the standard, comfort, and power products in the UserFT variant]
The effectiveness of the approach for SPLs (goal (iii) in Section 5.1) is demonstrated by the fact that nearly all model parts can be reused throughout the media store products and design alternatives. Only some ComponentInstances are specific to certain alternatives and need to be connected via additional AssemblyConnectors: the UserInteraction[Comfort], Encoding[Power], FileLoader[Power], and all FT components.
5.3 Case Study II: Industrial Control System

As a second case study, we analysed the reliability of a large-scale industrial control system from ABB, which is used in many different domains, such as power generation, pulp and paper handling, or oil and gas processing. The system is implemented in several million lines of C++ code. At a high abstraction level, the core of the system consists of eight software components that can be flexibly deployed on multiple servers, depending on the system capacity required by customers. Fig. 14 depicts a possible configuration of the system with four servers. The names of the components and their failure probabilities have been obfuscated for confidentiality reasons.
The upper part of Fig. 14 shows the ServiceBehaviours for the components C7 and FT2. The components FT1 and FT2 have been introduced into the model, inspired by existing FT mechanisms. FT1 is able to restart component C1 upon failed requests. FT2 is able to query two redundant instances of component C4, which are deployed on different servers, thereby implementing fault tolerance against hardware failures.
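The benefit of FT2's redundant instance can be approximated with the standard parallel-redundancy formula, assuming independent server failures (the figures below are illustrative, not the obfuscated values from the study):

```python
def redundant_availability(a_server: float, n_replicas: int) -> float:
    """Probability that at least one of n independently failing
    replica servers is available."""
    return 1.0 - (1.0 - a_server) ** n_replicas

print(redundant_availability(0.9999, 1))  # single server: ~0.9999
print(redundant_availability(0.9999, 2))  # with one redundant replica: ~0.99999999
```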
The reliability of the core system has been analysed in a former study [17], where no fault tolerance mechanisms or product variants were considered. For this case study, we reused the failure probabilities from the former study, which had been determined using software reliability growth models based on bug tracker data. We also reused the transition probabilities between the components, which had been determined using source code instrumentation and system execution. The hardware reliability parameters were based on vendor specifications.

[Figure 14: Control system product line model with two exemplary service behaviours — components C1–C8, FT1, FT2, Ext1, and Ext2 deployed on Server1–Server4; the ServiceBehaviour of C7 branches with probabilities 0.805, 0.193, and 0.002 to ExternalCallActions on C5 and C6 and an InternalAction with a FailureOccurrence; the ServiceBehaviour of FT2 is a RecoveryBlockAction with two RecBlockBehaviours calling C4 and C4']
The industrial control system is realized as a product line and sold to customers in different variants depending on their requirements. Fig. 15 shows a small excerpt of the variants in terms of a feature model. There are many more possible variants, as third-party components can be integrated into the system via standardized interfaces. The components C1–C8 are mandatory. For component C4, there are two alternative implementations (C41 and C42), which address different customer requirements. There are two external components, Ext1 and Ext2, which can optionally be included in the core system. The feature model also includes the different FT mechanisms as variants.
[Figure 15: Feature model of the control system product line variants (excerpt) — mandatory components C1–C3 and C5–C8; alternative implementations C41 and C42 of C4; optional external components Ext1 and Ext2; optional FT mechanisms FT1 and FT2]
For the scope of this paper, we restricted the reliability analysis to the core system (standard) and three different variants. Variant 1 uses component C42 instead of C41 but is otherwise identical to the core system. Variant 2 incorporates the external component Ext1, which is only connected to component C4 (cf. Fig. 14). Variant 3 incorporates component Ext2, which is connected to components C1, C2, C4, and C6. These variants correspond to realistic configurations, which have formerly been sold to customers.
To demonstrate the decision support for different alternatives (goal (i) in Section 5.1), we analysed how the predicted system reliability varies for the different variants and FT mechanisms (Fig. 16(a)). The actual values are obfuscated for confidentiality reasons. Variant 1 is predicted to be the most reliable. Introducing FT1 generally yields a higher increase in reliability than introducing FT2, which requires adding an additional server for the redundant instance of component C4. The impact of FT2 on system reliability is less pronounced for variant 1 than for the other variants, because variant 1 already uses a more reliable version of component C4. Thus, the software architect can decide whether the increased costs for adding an additional server to realise FT2 in this variant are justified.
[Figure 16: Exemplary prediction results for the industrial control system — (a) prediction results for the different variants (no FT, FT1, FT2, FT1+FT2); (b) sensitivity to the failure probabilities of components C1, C2, C4, and C5]

To show the robustness of the models against imprecise input parameters (goal (ii) in Section 5.1), we conducted a sensitivity analysis modifying the failure probabilities of selected components (Fig. 16(b)). The system reliability is most sensitive to the component reliability of C1, as its curve has the steepest slope. The system reliability is robust to the component reliability of C5. Overall, the model behaves linearly, and the deviations of the system reliability are small compared to the changes in the individual component failure probabilities. In this case, the ranking of the design alternatives remained robust against uncharacteristically high variations of the component failure probabilities.
For a comparison between numerically computed predictions and simulation data, we ran a simulation for each variant for 10^6 simulated seconds. The mean error between the numerically computed and simulated system reliability across all variants was 0.0077 percent. The ranking of the variants was the same for the simulation results as for the numerical results. We conclude that the numerical calculations were sufficiently precise in this case.
To show the effectiveness of the approach for SPLs (goal (iii) in Section 5.1), we quantified the amount of changes necessary to model the product variants in our case. For Variant 1, a single ComponentInstance and AssemblyConnector had to be added to the standard Product and deployed to the respective ResourceContainer. This did not require the adjustment of transition probabilities. For Variants 2 and 3, also only single ComponentInstances had to be added to the standard Product.
6. RELATED WORK

Our method for architectural fault tolerance modelling is related to approaches on software architecture reliability modelling [13, 21], fault tolerance modelling on the level of software architecture [18], and reliability engineering for software product lines [7].
Multiple surveys on software architecture reliability modelling are available [11, 13, 15]. R. C. Cheung [6] was among the first to propose architectural reliability modelling using Markov chains. Some recent approaches refine such models to support compositionality [21] and different failure modes [10], but do not consider fault tolerance mechanisms. L. Cheung et al. [5] use hidden Markov models to determine component failure probabilities for Markov chain architecture models. Further approaches in this area apply the UML modelling language [8, 12] or are specifically tailored to service-oriented systems [27], but also do not include fault tolerance mechanisms or support for reusing model artefacts in different contexts, such as product configurations.
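For concreteness, the Markov chain approach pioneered by Cheung [6] can be sketched in a few lines: components become states of an absorbing chain, transition probabilities are weighted by component reliabilities, and system reliability is the probability of reaching the success state. All numbers below are hypothetical:

```python
import numpy as np

R = np.array([0.999, 0.998, 0.9995])  # hypothetical component reliabilities
P = np.array([[0.0, 0.7, 0.3],        # hypothetical transition probabilities
              [0.0, 0.0, 1.0],        # component 2 always calls component 3
              [0.0, 0.0, 0.0]])       # component 3 terminates the service
Q = np.diag(R) @ P                    # transitions surviving component failures
N = np.linalg.inv(np.eye(3) - Q)      # expected visits to each component
exit_success = R * (1.0 - P.sum(axis=1))  # prob. of exiting successfully per state
reliability = float(N[0] @ exit_success)  # execution starts in component 1
print(reliability)  # ≈ 0.9971
```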
Some approaches do tackle the problem of incorporating fault tolerance mechanisms into architectural prediction models [18]. Sharma and Trivedi [25] include additional states and recovery transitions in architecture-level Markov models to capture component restarts or system reboots. Wang et al. [26] provide constructs for Markov chains to model replicated components. Kanoun et al. [16] model fault tolerance of hardware/software systems using generalized stochastic Petri nets. These approaches do not consider component-internal control and data flow and how it is influenced by error handling constructs. Thus, they may yield inaccurate predictions when fault-tolerant software behaviour deviates from the specific cases considered by the authors. Furthermore, none of these approaches supports reusing model artefacts.
Considering reliability during the design of a software product line is a major challenge, because different product variants may have different influences on the expected reliability. Immonen [14] proposes the 'reliability and availability prediction' (RAP) method for SPLs. RAP, however, does not support compositional models, hardware reliability, or explicit fault tolerance mechanisms. Olumofin et al. [19] tailor the architecture trade-off analysis method to evaluate SPLs for different quality attributes. They focus on the identification of scenarios but provide no architectural model or predictions. Dehlinger et al. [9] introduce the PLFaultCAT tool to analyse SPL safety using fault tree analysis. Their models do not reflect the software architecture and therefore complicate the evaluation of different design alternatives. Auerswald et al. [1] model product families of embedded systems using block diagrams, but provide no usage profile model or quantitative reliability prediction.
7. CONCLUSIONS

We presented an approach to support the design of reliable and fault-tolerant software architectures and software families. Our approach allows modelling different architectural alternatives and product line configurations from a shared core asset base and offers a flexible way to include many different fault tolerance mechanisms. A tool transforms the models into Markov chains and calculates the system reliability, involving both software and hardware reliabilities. We evaluated our approach in multiple case studies and demonstrated its value to support architectural design decisions, its robustness against imprecise input data, and its effectiveness for SPLs.
Our approach provides a new perspective for designing software architectures and families. It allows software architects to validate their designs during early development stages and supports their design decisions quantitatively. As the effectiveness of different fault tolerance mechanisms is highly context-dependent, our approach enables software architects to quickly analyse many different alternatives and rule out poor design choices. This can potentially lead to more reliable systems, which are built more cost-effectively because late life-cycle changes for better reliability can be avoided.
In future work, we aim to include more sophisticated hardware reliability modelling techniques into our approach to offer more refined predictions. We will extend our tool for automated sensitivity analyses and design optimisation. Our prediction approach can potentially be extended to other quality attributes, such as performance or security.
8. ACKNOWLEDGMENTS

This work was supported by the European Commission as part of the EU projects SLA@SOI (grant no. FP7-216556) and Q-ImPrESS (grant no. FP7-215013). Furthermore, we thank Igor Lankin for his support in developing the approach.
9. REFERENCES

[1] M. Auerswald, M. Herrmann, S. Kowalewski, and V. Schulte-Coerne. Software Product-Family Engineering, volume 2290 of LNCS, chapter Reliability-Oriented Product Line Engineering of Embedded Systems, pages 237–280. Springer, 2001.
[2] S. Becker, H. Koziolek, and R. Reussner. The Palladio Component Model for Model-Driven Performance Prediction. Journal of Systems and Software, 82(1):3–22, 2009.
[3] S. Bernardi, J. Merseguer, and D. Petriu. A dependability profile within MARTE. Software and Systems Modeling, pages 1–24, 2009.
[4] F. Brosch, H. Koziolek, B. Buhnova, and R. Reussner. Parameterized Reliability Prediction for Component-based Software Architectures. In Proc. of QoSA'10, volume 6093 of LNCS, pages 36–51. Springer, 2010.
[5] L. Cheung, R. Roshandel, N. Medvidovic, and L. Golubchik. Early prediction of software component reliability. In Proc. of ICSE'08, pages 111–120. ACM Press, 2008.
[6] R. C. Cheung. A User-Oriented Software Reliability Model. IEEE Trans. Softw. Eng., 6(2):118–125, 1980.
[7] P. Clements and L. Northrop. Software Product Lines: Practices and Patterns. Addison-Wesley, 2001.
[8] V. Cortellessa, H. Singh, and B. Cukic. Early reliability assessment of UML based software models. In Proc. of WOSP'02, pages 302–309. ACM, 2002.
[9] J. Dehlinger and R. R. Lutz. PLFaultCAT: A Product-Line Software Fault Tree Analysis Tool. Automated Software Engineering, 13(1):169–193, 2006.
[10] A. Filieri, C. Ghezzi, V. Grassi, and R. Mirandola. Reliability Analysis of Component-Based Systems with Multiple Failure Modes. In Proc. of CBSE'10, volume 6092 of LNCS, pages 1–20. Springer, 2010.
[11] S. S. Gokhale. Architecture-Based Software Reliability Analysis: Overview and Limitations. IEEE Trans. on Dependable and Secure Computing, 4(1):32–40, 2007.
[12] K. Goseva-Popstojanova, A. Hassan, A. Guedem, W. Abdelmoez, D. E. M. Nassar, H. Ammar, and A. Mili. Architectural-Level Risk Analysis Using UML. IEEE Trans. on Softw. Eng., 29(10):946–960, 2003.
[13] K. Goseva-Popstojanova and K. S. Trivedi. Architecture-based approach to reliability assessment of software systems. Performance Evaluation, 45(2-3):179–204, 2001.
[14] A. Immonen. Software Product Lines, chapter A Method for Predicting Reliability and Availability at the Architecture Level, pages 373–422. Springer, 2006.
[15] A. Immonen and E. Niemela. Survey of reliability and availability prediction methods from the viewpoint of software architecture. Software and Systems Modeling, 7(1):49–65, 2008.
[16] K. Kanoun and M. Ortalo-Borrel. Fault-tolerant system dependability: explicit modeling of hardware and software component-interactions. IEEE Transactions on Reliability, 49(4):363–376, 2000.
[17] H. Koziolek, B. Schlich, and C. Bilich. A Large-Scale Industrial Case Study on Architecture-based Software Reliability Analysis. In Proc. 21st International Symposium on Software Reliability Engineering (ISSRE'10). IEEE Computer Society, 2010. To appear.
[18] H. Muccini and A. Romanovsky. Architecting Fault Tolerant Systems. Technical Report CS-TR-1051, University of Newcastle upon Tyne, 2007.
[19] F. G. Olumofin and V. B. Misic. Extending the ATAM Architecture Evaluation to Product Line Architectures. In Proc. of WICSA'05, pages 45–56. IEEE Computer Society, 2005.
[20] B. Randell. System structure for software fault tolerance. In Proc. Int. Conf. on Reliable Software, pages 437–449. ACM, 1975.
[21] R. H. Reussner, H. W. Schmidt, and I. H. Poernomo. Reliability prediction for component-based software architectures. Journal of Systems and Software, 66(3):241–252, 2003.
[22] N. Sato and K. S. Trivedi. Accurate and efficient stochastic reliability analysis of composite services using their compact Markov reward model representations. In Proc. of SCC'07, pages 114–121. IEEE Computer Society, 2007.
[23] B. Schroeder and G. A. Gibson. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage, 3(3):8, 2007.
[24] V. Sharma and K. Trivedi. Quantifying software performance, reliability and security: An architecture-based approach. Journal of Systems and Software, 80:493–509, 2007.
[25] V. S. Sharma and K. S. Trivedi. Reliability and Performance of Component Based Software Systems with Restarts, Retries, Reboots and Repairs. In Proc. of ISSRE'06, pages 299–310. IEEE, 2006.
[26] W.-L. Wang, D. Pan, and M.-H. Chen. Architecture-based software reliability modeling. Journal of Systems and Software, 79(1):132–146, 2006.
[27] Z. Zheng and M. R. Lyu. Collaborative reliability prediction of service-oriented systems. In Proc. of ICSE'10, pages 35–44. ACM Press, 2010.