seng 521 software reliability & software...

SENG 521: Software Reliability & Software Quality

Chapter 5: Overview of Software Reliability Engineering

Department of Electrical & Computer Engineering, University of Calgary

B.H. Far （[email protected]）


http://www.enel.ucalgary.ca/People/far/Lectures/SENG521/

Reliability Theory

Reliability theory developed apart from the mainstream of probability and statistics, and was used primarily as a tool to help nineteenth century maritime and life insurance companies compute profitable rates to charge their customers. Even today, the terms “failure rate” and “hazard rate” are often used interchangeably.

Probability of survival of merchandise after one MTTF is $R = e^{-1} \approx 0.37$.

From Engineering Statistics Handbook
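The 0.37 figure can be checked with a minimal sketch, assuming a constant failure rate so that R(t) = exp(-t / MTTF). The function name and the 1000-hour MTTF are illustrative assumptions, not taken from the slides.

```python
import math


def survival_probability(t: float, mttf: float) -> float:
    """Return R(t) = exp(-t / MTTF), assuming a constant failure rate."""
    if mttf <= 0:
        raise ValueError("mttf must be positive")
    return math.exp(-t / mttf)


# Hypothetical MTTF of 1000 hours; the result does not depend on the value
# chosen, because only the ratio t / MTTF matters.
mttf_hours = 1000.0
r_at_mttf = survival_probability(t=mttf_hours, mttf=mttf_hours)
print(f"R(MTTF) = {r_at_mttf:.2f}")  # prints "R(MTTF) = 0.37"
```

In other words, under this assumption only about 37% of units are expected to survive to the mean time to failure.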

Reliability: Natural System

Natural system life cycle.

Aging effect: Life span of a natural system is limited by the maximum reproduction rate of the cells.

Figure from Pressman’s book

Reliability: Hardware

Hardware life cycle.

Useful life span of a hardware system is limited by the age (wear out) of the system.


Reliability: Software

Software life cycle.

Software systems are changed (updated) many times during their life cycle.

Each update adds to the structural deterioration of the software system.


Software vs. Hardware

Software reliability doesn’t decrease with time, i.e., software doesn’t wear out.

Hardware faults are mostly physical faults, e.g., fatigue.

Software faults are mostly design faults, which are harder to measure, model, detect and correct.

Software vs. Hardware

Hardware failure can be “fixed” by replacing a faulty component with an identical one, therefore no reliability growth.

Software problems can be “fixed” by changing the code in order to have the failure not happen again, therefore reliability growth is present.

Software does not go through production phase the same way as hardware does.

Conclusion: hardware reliability models may not be used identically for software.

Reliability: Science

Exploring ways of implementing “reliability” in software products.

Reliability Science’s goals:
  Developing “models” (regression and aggregation models) and “techniques” to build reliable software.
  Testing such models and techniques for adequacy, soundness and completeness.

What is Engineering?

Engineering = Analysis + Design + Construction + Verification + Management

What is the problem to be solved?
What characteristics of the entity are used to solve the problem?
How will the entity be realized?
How is it constructed?
What approach is used to uncover errors in design and construction?
How will the entity be supported in the long term?

Reliability: Engineering /1

Engineering of “reliability” in software products.

Reliability Engineering’s goal: developing software to reach the market
  With “minimum” development time
  With “minimum” development cost
  With “maximum” reliability
  With “minimum” expertise needed
  With “minimum” available technology

Reliability: Engineering /2

Software quality means getting the right balance among development cost, development time, people, technology and reliability.

[Figure: Minimum & Maximum (Cost, Time, People, Technology, Reliability) -> SRE -> Optimum]

Pick quantitative representations for the 5 factors (cost, time, people, technology and reliability) and measure them!

What is SRE? /1

Software Reliability Engineering (SRE) is a multi-faceted discipline covering the software product lifecycle.

It involves both technical and management activities in three basic areas:
  Software Development and Maintenance
  Measurement and Analysis of reliability data
  Feedback of reliability information into the software lifecycle activities.


What is SRE? /2

SRE is a practice for quantitatively planning and guiding software development and test, with emphasis on reliability and availability.

SRE simultaneously does three things:
  It ensures that product reliability and availability meet user needs.
  It delivers the product to market faster.
  It increases productivity, lowering product life-cycle cost.

In applying SRE, one can vary relative emphasis placed on these three factors.


Software Reliability Engineering (SRE) Process


Reference

Dr. Musa’s Software Reliability Engineering, 2nd Ed., Chapter 1


SRE: Process /1

There are 5 steps in the SRE process (for each system to test):

1. Define necessary reliability
2. Develop operational profiles
3. Prepare for test
4. Execute test
5. Apply failure data to guide decisions

SRE: Process /2

[Figure: Modified version of the SRE Process. Ref: Musa’s book, 2nd Ed.]

SRE: Process /2

The Define Necessary Reliability, Develop Operational Profiles, and Prepare for Test activities all start during the Requirements (and perhaps architectural analysis) phase of the software development process.

They all extend to varying degrees into the Design and Implementation phase, as they can be affected by it.

The Execute Test and Guide Test activities coincide with the Test phase.


SRE: Necessary Reliability

Define what “failure” means for the software product.

Choose a common measure for all failure intensities, either failures per some natural unit or failures per hour.

Set the total system failure intensity objective (FIO) for the software/hardware system.

Compute a developed software FIO by subtracting the total of the FIOs of all hardware and acquired software components from the system FIOs.

Use the developed software FIOs to track the reliability growth during system test (later on).
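The subtraction step can be sketched as follows. This is only an illustration: the function name and all numbers are hypothetical, and every FIO is assumed to be expressed in the same unit (here, failures per 1000 hours).

```python
from typing import Sequence


def developed_software_fio(system_fio: float, acquired_fios: Sequence[float]) -> float:
    """Subtract the FIOs of hardware and acquired software from the system FIO."""
    remaining = system_fio - sum(acquired_fios)
    if remaining <= 0:
        raise ValueError("acquired components already use up the whole system FIO")
    return remaining


# Hypothetical figures, all in failures per 1000 hours.
system_fio = 100.0
acquired = [10.0, 25.0, 15.0]  # e.g. hardware, operating system, purchased library

print(developed_software_fio(system_fio, acquired))  # prints 50.0
```

A non-positive result means the system objective cannot be met with the chosen hardware and acquired components, so the sketch raises an error instead of returning a budget.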


Failure Intensity Objective (FIO)

Failure intensity (λ) is defined as failures per natural unit (or time), e.g., 3 alarms per 100 hours of operation, 5 failures per 1000 transactions, etc.

Failure intensity of a cascade (serial) system is the sum of failure intensities for all of the components of the system.

For exponential model, where z(t) is the hazard rate:

$$z(t) = \lambda$$

$$\lambda_{system} = \lambda_1 + \lambda_2 + \cdots + \lambda_n = \sum_{i=1}^{n} \lambda_i$$
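The summation rule for a serial system can be cross-checked numerically. The sketch below assumes independent components that each follow the exponential model; the component intensities and the 10-hour mission time are made-up values.

```python
import math
from typing import Sequence


def series_failure_intensity(component_intensities: Sequence[float]) -> float:
    """Failure intensity of a serial system: the sum over its components."""
    return sum(component_intensities)


# Hypothetical component failure intensities, in failures per hour.
lambdas = [0.002, 0.001, 0.0005]
lambda_system = series_failure_intensity(lambdas)

# Cross-check for a 10-hour mission: the system survives only if every
# component survives, so the component reliabilities multiply.
t = 10.0
r_product = math.prod(math.exp(-lam * t) for lam in lambdas)
r_from_sum = math.exp(-lambda_system * t)

print(f"lambda_system = {lambda_system:.4f} failures/hour")
print(f"R from product = {r_product:.6f}, R from summed intensity = {r_from_sum:.6f}")
```

The two reliabilities agree because multiplying exponentials adds their exponents, which is why the intensities simply sum.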

How to Set FIO?

Setting FIO in terms of system reliability (R) or availability (A):

$$\lambda = \frac{-\ln R}{t} \quad \text{or} \quad \lambda \approx \frac{1 - R}{t} \ \text{ for } R \geq 0.95$$

$$\lambda = \frac{1 - A}{A \, t_m}$$

λ is failure intensity
R is reliability
A is availability
t is natural unit (time, etc.)
t_m is downtime per failure
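A small sketch of setting the FIO from a reliability or availability target, using λ = -ln(R)/t, the approximation (1 - R)/t, and λ = (1 - A)/(A·t_m). Function names and the example targets are assumptions chosen for illustration.

```python
import math


def fio_from_reliability(r: float, t: float = 1.0) -> float:
    """Exact form: lambda = -ln(R) / t."""
    if not 0.0 < r <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return -math.log(r) / t


def fio_from_reliability_approx(r: float, t: float = 1.0) -> float:
    """Approximation lambda ~ (1 - R) / t, intended for R of about 0.95 or more."""
    return (1.0 - r) / t


def fio_from_availability(a: float, t_m: float) -> float:
    """lambda = (1 - A) / (A * t_m), with t_m the downtime per failure."""
    if not 0.0 < a <= 1.0 or t_m <= 0:
        raise ValueError("need 0 < A <= 1 and t_m > 0")
    return (1.0 - a) / (a * t_m)


# Hypothetical targets: R = 0.99 over a 1-hour mission, and A = 0.999 with
# 0.1 hours (6 minutes) of downtime per failure.
print(f"exact   : {fio_from_reliability(0.99):.6f} failures/hour")
print(f"approx  : {fio_from_reliability_approx(0.99):.6f} failures/hour")
print(f"from A  : {fio_from_availability(0.999, 0.1):.6f} failures/hour")
```

For high reliability targets the exact and approximate values differ by well under one percent, which is why the simpler form is usually adequate there.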

Reliability vs. Failure Intensity

Reliability for 1 hour mission time    Failure intensity
0.36800                                1 failure / hour
0.90000                                105 failures / 1000 hours
0.95900                                1 failure / day
0.99000                                10 failures / 1000 hours
0.99400                                1 failure / week
0.99860                                1 failure / month
0.99900                                1 failure / 1000 hours
0.99989                                1 failure / year
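The table rows can be reproduced with R = exp(-λ·t) for a 1-hour mission. The sketch below is a check under the exponential model, assuming a 30-day month and a 365-day year; those period lengths are assumptions, not stated in the slides.

```python
import math

# Assumed period lengths in hours (30-day month, 365-day year).
HOURS_PER_PERIOD = {
    "hour": 1,
    "day": 24,
    "week": 7 * 24,
    "month": 30 * 24,
    "year": 365 * 24,
}


def reliability_for_mission(lam_per_hour: float, mission_hours: float = 1.0) -> float:
    """R = exp(-lambda * t) for a mission of the given length."""
    return math.exp(-lam_per_hour * mission_hours)


for period, hours in HOURS_PER_PERIOD.items():
    r = reliability_for_mission(1.0 / hours)
    print(f"1 failure / {period:<5} -> R(1 h) = {r:.5f}")
```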


SRE: Operation

An operation is a major system logical task, which returns control to the system when complete.

An operation is an input event that affects the course of behavior of software.

Example: operations for a Web proxy server
  Connect internal users to external Web
  Email internal users to external users
  Email external users to internal users
  DNS request by internal users
  Etc.


SRE: Operational Mode

Operational mode is a distinct pattern of system use and/or set of environmental conditions that may need separate testing due to likelihood of stimulating different failures.

Example:
  Time (time of year, day of week, time of day)
  Different user types (customer or user)
  User experience (novice or expert)

The same operation may appear in different operational modes with different probabilities.


SRE: Operational Profile

An operational profile is a complete set of operations with their probabilities of occurrence (during the operational use of the software).

An operational profile is a description of the distribution of input events that is expected to occur in actual software operation.

The operational profile of the software reflects how it will be used in practice.

[Figure: probability of occurrence plotted by operation and operational mode]

SRE: System Operational Profile

A system operational profile must be developed for all of the system’s important operational modes.

There are four principal steps in developing an operational profile:

1. Identify the operation initiators (i.e., user types, external systems, and the system itself)
2. List the operations invoked by each initiator
3. Determine the occurrence rates
4. Determine the occurrence probabilities by dividing the occurrence rates by the total occurrence rate
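The four steps can be sketched in a few lines. The initiators, operation names and occurrence rates below are invented for a Web proxy server and serve only to illustrate the final division step.

```python
from collections import defaultdict
from typing import Dict

# Steps 1-3 with hypothetical data:
# initiator -> {operation: occurrences per hour}
rates_by_initiator: Dict[str, Dict[str, float]] = {
    "internal user": {
        "Connect to external Web": 6000,
        "Email to external users": 2500,
        "DNS request": 500,
    },
    "external user": {
        "Email to internal users": 1000,
    },
}


def occurrence_probabilities(rates: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Step 4: divide each operation's rate by the total occurrence rate."""
    per_operation: Dict[str, float] = defaultdict(float)
    for operations in rates.values():
        for operation, rate in operations.items():
            per_operation[operation] += rate
    total_rate = sum(per_operation.values())
    if total_rate <= 0:
        raise ValueError("total occurrence rate must be positive")
    return {op: rate / total_rate for op, rate in per_operation.items()}


profile = occurrence_probabilities(rates_by_initiator)
for operation, probability in profile.items():
    print(f"{operation:<26} {probability:.2f}")
```

Rates are summed per operation first, so an operation invoked by several initiators still gets a single probability in the profile.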


SRE: Prepare for Test

The Prepare for Test activity uses the operational profiles to prepare test cases and test procedures.

Test cases are allocated in accordance with the operational profile.

Test cases are assigned to the operations by selecting from all the possible intra-operation choices with equal probability.

The test procedure is the controller that invokes test cases during execution.
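One possible way to turn these two rules into code is sketched below. The largest-remainder rounding is an assumption chosen so that the allocation adds up exactly; the profile, the operation names and the intra-operation choices are hypothetical.

```python
import math
import random
from typing import Dict, List


def allocate_test_cases(profile: Dict[str, float], total_cases: int) -> Dict[str, int]:
    """Split total_cases across operations in proportion to their probabilities."""
    if not math.isclose(sum(profile.values()), 1.0):
        raise ValueError("profile probabilities must sum to 1")
    exact = {op: p * total_cases for op, p in profile.items()}
    allocation = {op: math.floor(x) for op, x in exact.items()}
    # Hand out the cases lost to rounding, largest fractional part first.
    leftover = total_cases - sum(allocation.values())
    by_fraction = sorted(exact, key=lambda op: exact[op] - allocation[op], reverse=True)
    for op in by_fraction[:leftover]:
        allocation[op] += 1
    return allocation


def pick_choice(choices: List[str], rng: random.Random) -> str:
    """Select one intra-operation choice, each with equal probability."""
    return rng.choice(choices)


# Hypothetical profile and intra-operation choices for one operation.
profile = {"Web connect": 0.60, "Email out": 0.25, "Email in": 0.10, "DNS request": 0.05}
dns_choices = ["known host", "unknown host", "malformed name"]

allocation = allocate_test_cases(profile, total_cases=200)
rng = random.Random(521)  # fixed seed only to make the sketch repeatable
dns_cases = [pick_choice(dns_choices, rng) for _ in range(allocation["DNS request"])]

print(allocation)
print(dns_cases)
```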


SRE: Execute Test

Allocate test time among the associated systems and types of test (feature, load, regression, etc.).

Invoke the test cases at random times, choosing operations randomly in accordance with the operational profile.

Identify failures, along with when they occur.

This information will be used in Apply Failure Data and Guide Test.
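Choosing operations randomly in accordance with the operational profile amounts to weighted random sampling. The sketch below uses Python's standard `random.choices`; the profile values and the seed are hypothetical, and the timing of invocations and failure logging are left out.

```python
import random
from collections import Counter
from typing import Dict, List


def choose_operations(profile: Dict[str, float], runs: int, rng: random.Random) -> List[str]:
    """Draw `runs` operations at random, weighted by occurrence probability."""
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=runs)


# Hypothetical operational profile.
profile = {"Web connect": 0.60, "Email out": 0.25, "Email in": 0.10, "DNS request": 0.05}

rng = random.Random(521)  # fixed seed only to make the sketch repeatable
sequence = choose_operations(profile, runs=10_000, rng=rng)

counts = Counter(sequence)
for op, p in profile.items():
    print(f"{op:<12} expected {p:.2f}  observed {counts[op] / len(sequence):.2f}")
```

Over many runs the observed frequencies approach the profile, so test effort concentrates on the operations users invoke most often.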


Types of Test

Certification Test: Accept or reject (binary decision) an acquired component for a given target failure intensity.

Feature (Unit) Test: A single execution of an operation with interaction between operations minimized.

Load Test: Testing with field use data and accounting for interactions.

Regression Test: Feature tests after every build involving significant change, i.e., check whether a bug fix worked.


SRE: Apply Failure Data

Plot each new failure as it occurs on a reliability demonstration chart.

Accept or reject software (operations) using reliability demonstration chart.

Track reliability growth as faults are removed.


Release Criteria

Consider releasing the product when:

1. All acquired components pass certification test
2. Test terminated satisfactorily for all the product variations and components with the failure intensity reaching the target λF

For better confidence, we usually require the λ/λF ratio to be below 0.5 (confidence factor).
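The two release criteria and the confidence factor could be combined into a simple check such as the one below. This is a sketch under stated assumptions: the function, its parameters and the sample data are hypothetical, and "test terminated satisfactorily" is reduced to comparing the measured failure intensity λ against the target λF.

```python
from typing import Mapping


def ready_for_release(
    certification_passed: Mapping[str, bool],
    measured_intensity: Mapping[str, float],
    target_intensity: float,
    max_ratio: float = 0.5,
) -> bool:
    """Return True when both release criteria hold.

    certification_passed: acquired component -> passed certification test
    measured_intensity:   product variation or component -> measured lambda
    target_intensity:     the target lambda_F (same unit as the measurements)
    max_ratio:            required bound on lambda / lambda_F (0.5 by default)
    """
    if target_intensity <= 0:
        raise ValueError("target intensity must be positive")
    all_certified = all(certification_passed.values())
    all_on_target = all(
        lam / target_intensity < max_ratio for lam in measured_intensity.values()
    )
    return all_certified and all_on_target


# Hypothetical data, intensities in failures per 1000 hours.
certified = {"database driver": True, "TLS library": True}
measured = {"desktop build": 18.0, "server build": 22.0}

print(ready_for_release(certified, measured, target_intensity=50.0))
```

Setting `max_ratio=1.0` corresponds to merely reaching the target without the extra confidence margin.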

Collect Field Data

SRE for the software product lifecycle.

Collect field data to use in succeeding releases either using automatic reporting routines or manual collection, using a random sample of field sites.

Collect data on failure intensity and on customer satisfaction and use this information in setting the failure intensity objective for the next release.

Measure operational profiles in the field and use this information to correct the operational profiles we estimated.

Collect information to refine the process of choosing reliability strategies in future projects.


However …

Practical implementation of an effective SRE program is a non-trivial task.

Mechanisms for collection and analysis of data on software product and process quality must be in place.

Fault identification and elimination techniques must be in place.

Other organizational abilities such as the use of reviews and inspections, reliability based testing and software process improvement are also necessary for effective SRE.

