SENG 521 Software Reliability & Software Quality
Chapter 5: Overview of Software Reliability Engineering
Department of Electrical & Computer Engineering, University of Calgary
B.H. Far ([email protected])
http://www.enel.ucalgary.ca/People/far/Lectures/SENG521/
Reliability Theory
Reliability theory developed apart from the mainstream of probability and statistics, and was used primarily as a tool to help nineteenth century maritime and life insurance companies compute profitable rates to charge their customers. Even today, the terms “failure rate” and “hazard rate” are often used interchangeably.
Probability of survival of merchandise after one MTTF is $R = e^{-1} \approx 0.37$.
From Engineering Statistics Handbook
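The 0.37 figure follows if one assumes a constant failure rate $\lambda$ (the exponential model), an assumption the slide does not state explicitly. A short derivation under that assumption:

```latex
R(t) = e^{-\lambda t}, \qquad
\mathrm{MTTF} = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}
\quad\Longrightarrow\quad
R(\mathrm{MTTF}) = e^{-\lambda \cdot \frac{1}{\lambda}} = e^{-1} \approx 0.37
```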
Reliability: Natural System
Natural system life cycle.
Aging effect: Life span of a natural system is limited by the maximum reproduction rate of the cells.
Figure from Pressman’s book
Reliability: Hardware
Hardware life cycle.
Useful life span of a hardware system is limited by the age (wear out) of the system.
Figure from Pressman’s book
Reliability: Software
Software life cycle.
Software systems are changed (updated) many times during their life cycle.
Each update adds to the structural deterioration of the software system.
Figure from Pressman’s book
Software vs. Hardware
Software reliability doesn’t decrease with time, i.e., software doesn’t wear out.
Hardware faults are mostly physical faults, e.g., fatigue.
Software faults are mostly design faults, which are harder to measure, model, detect and correct.
Software vs. Hardware
Hardware failure can be “fixed” by replacing a faulty component with an identical one, therefore no reliability growth.
Software problems can be “fixed” by changing the code in order to have the failure not happen again, therefore reliability growth is present.
Software does not go through production phase the same way as hardware does.
Conclusion: hardware reliability models may not be used identically for software.
Reliability: Science
Exploring ways of implementing “reliability” in software products.
Reliability Science’s goals:
- Developing “models” (regression and aggregation models) and “techniques” to build reliable software.
- Testing such models and techniques for adequacy, soundness and completeness.
What is Engineering?
Engineering = Analysis + Design + Construction + Verification + Management
- What is the problem to be solved?
- What characteristics of the entity are used to solve the problem?
- How will the entity be realized?
- How is it constructed?
- What approach is used to uncover errors in design and construction?
- How will the entity be supported in the long term?
Reliability: Engineering /1
Engineering of “reliability” in software products.
Reliability Engineering’s goal: developing software to reach the market
- With “minimum” development time
- With “minimum” development cost
- With “maximum” reliability
- With “minimum” expertise needed
- With “minimum” available technology
Reliability: Engineering /2
Software quality means getting the right balance among development cost, development time, people, technology and reliability.
[Figure: SRE seeks the optimum between the minimum and the maximum for cost, time, people, technology and reliability]
Pick quantitative representations for the 5 factors (cost, time, people, technology and reliability) and measure them!
What is SRE? /1
Software Reliability Engineering (SRE) is a multi-faceted discipline covering the software product lifecycle.
It involves both technical and management activities in three basic areas:
- Software Development and Maintenance
- Measurement and Analysis of reliability data
- Feedback of reliability information into the software lifecycle activities.
What is SRE? /2
SRE is a practice for quantitatively planning and guiding software development and test, with emphasis on reliability and availability.
SRE simultaneously does three things:
- It ensures that product reliability and availability meet user needs.
- It delivers the product to market faster.
- It increases productivity, lowering product life-cycle cost.
In applying SRE, one can vary the relative emphasis placed on these three factors.
Software Reliability Engineering (SRE) Process
Reference
Dr. Musa’s Software Reliability Engineering, 2nd Ed., Chapter 1
SRE: Process /1
There are 5 steps in the SRE process (for each system to test):
1. Define necessary reliability
2. Develop operational profiles
3. Prepare for test
4. Execute test
5. Apply failure data to guide decisions
SRE: Process /2
Modified version of the SRE Process
Ref: Musa’s book 2nd Ed
SRE: Process /2
The Develop Operational Profiles and Prepare for Test activities all start during the Requirements (and perhaps architectural analysis) phase of the software development process.
They all extend to varying degrees into the Design and Implementation phase, as they can be affected by it.
The Execute Test and Guide Test activities coincide with the Test phase.
SRE: Necessary Reliability
Define what “failure” means for the software product.
Choose a common measure for all failure intensities, either failures per some natural unit or failures per hour.
Set the total system failure intensity objective (FIO) for the software/hardware system.
Compute a developed software FIO by subtracting the total of the FIOs of all hardware and acquired software components from the system FIO.
Use the developed software FIO to track the reliability growth during system test (later on).
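The subtraction step can be sketched in a few lines of Python. The function name and the numbers are hypothetical and chosen only for illustration; all intensities must share the same unit (here, failures per 1000 hours).

```python
def developed_software_fio(system_fio, acquired_fios):
    """Developed software FIO = system FIO minus the FIOs of all
    hardware and acquired software components (same unit throughout)."""
    remaining = system_fio - sum(acquired_fios)
    if remaining <= 0:
        # The acquired components alone already exceed the system budget.
        raise ValueError("acquired components use up the whole system FIO")
    return remaining


# Hypothetical budget, in failures per 1000 hours.
system_fio = 100.0
acquired_fios = [30.0, 20.0, 10.0]  # hardware, OS, third-party library
developed_fio = developed_software_fio(system_fio, acquired_fios)
print(developed_fio)
```

A non-positive result signals that the system objective cannot be met with the chosen acquired components, which is why the sketch raises an error instead of returning a value.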
Failure Intensity Objective (FIO)
Failure intensity (λ) is defined as failures per natural unit (or time), e.g., 3 alarms per 100 hours of operation, 5 failures per 1000 transactions, etc.
Failure intensity of a cascade (serial) system is the sum of failure intensities for all of the components of the system.
For the exponential model, where the hazard rate is constant, $z(t) = \lambda$:
$\lambda_{system} = \lambda_1 + \lambda_2 + \cdots + \lambda_n = \sum_{i=1}^{n} \lambda_i$
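A minimal Python sketch of the serial-system rule, assuming the exponential model so that component reliability is R(t) = exp(-λt). The component intensities are hypothetical. The check at the end illustrates why the intensities add: the product of the component reliabilities equals the reliability computed from the summed intensity.

```python
import math


def series_failure_intensity(component_intensities):
    """Failure intensity of a cascade (serial) system: sum of the parts."""
    return sum(component_intensities)


def reliability(failure_intensity, t):
    """Exponential model: probability of no failure during t."""
    return math.exp(-failure_intensity * t)


# Hypothetical component intensities, in failures per hour.
lambdas = [0.002, 0.003, 0.005]
mission_time = 10.0  # hours

lam_sys = series_failure_intensity(lambdas)
r_sys = reliability(lam_sys, mission_time)

# A serial system survives only if every component survives.
r_product = math.prod(reliability(lam, mission_time) for lam in lambdas)
print(lam_sys, r_sys, r_product)
```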
How to Set FIO?
Setting FIO in terms of system reliability (R) or availability (A):
$\lambda = \dfrac{-\ln R}{t}$ or $\lambda \approx \dfrac{1 - R}{t}$ for $R \geq 0.95$
$\lambda = \dfrac{1 - A}{t_m A}$
λ is failure intensity
R is reliability
A is availability
t is natural unit (time, etc.)
t_m is downtime per failure
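The conversions can be tried out with a short Python sketch. It assumes the standard relations λ = -ln(R)/t, the approximation λ ≈ (1 - R)/t for R ≥ 0.95, and λ = (1 - A)/(t_m·A). The function names and sample values are illustrative only.

```python
import math


def fio_from_reliability(r, t):
    """Exact conversion under the exponential model: lambda = -ln(R) / t."""
    return -math.log(r) / t


def fio_from_reliability_approx(r, t):
    """Approximation lambda ~ (1 - R) / t, reasonable for R >= 0.95."""
    return (1.0 - r) / t


def fio_from_availability(a, downtime_per_failure):
    """lambda = (1 - A) / (t_m * A), with t_m the downtime per failure."""
    return (1.0 - a) / (downtime_per_failure * a)


# Hypothetical targets: R = 0.99 over a 1-hour mission,
# A = 0.999 with 0.1 hour (6 minutes) of downtime per failure.
exact = fio_from_reliability(0.99, 1.0)
approx = fio_from_reliability_approx(0.99, 1.0)
from_avail = fio_from_availability(0.999, 0.1)
print(exact, approx, from_avail)  # all in failures per hour
```

The exact and approximate values differ only in the fifth decimal place at R = 0.99, which is why the simpler form is commonly used for high reliability targets.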
Reliability vs. Failure Intensity
Reliability for 1 hour mission time    Failure intensity
0.36800                                1 failure / hour
0.90000                                105 failures / 1000 hours
0.95900                                1 failure / day
0.99000                                10 failures / 1000 hours
0.99400                                1 failure / week
0.99860                                1 failure / month
0.99900                                1 failure / 1000 hours
0.99989                                1 failure / year
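The table entries can be reproduced approximately with R = exp(-λt) for t = 1 hour. The sketch below assumes continuous operation, with a day taken as 24 hours, a week as 168 hours, a month as 720 hours (30 days) and a year as 8760 hours; these period lengths are assumptions, not stated in the slides.

```python
import math

# Failure intensities converted to failures per hour (assumed period lengths).
intensity_per_hour = {
    "1 failure / hour": 1.0,
    "105 failures / 1000 hours": 0.105,
    "1 failure / day": 1.0 / 24,
    "10 failures / 1000 hours": 0.010,
    "1 failure / week": 1.0 / 168,
    "1 failure / month": 1.0 / 720,
    "1 failure / 1000 hours": 0.001,
    "1 failure / year": 1.0 / 8760,
}

MISSION_TIME = 1.0  # hours

reliability_table = {
    label: math.exp(-lam * MISSION_TIME)
    for label, lam in intensity_per_hour.items()
}

for label, r in reliability_table.items():
    print(f"{r:.5f}  {label}")
```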
SRE: Operation
An operation is a major system logical task, which returns control to the system when complete.
An operation is an input event that affects the course of behavior of software.
Example: operations for a Web proxy server
- Connect internal users to external Web
- Email internal users to external users
- Email external users to internal users
- DNS request by internal users
- Etc.
SRE: Operational Mode
Operational mode is a distinct pattern of system use and/or set of environmental conditions that may need separate testing due to the likelihood of stimulating different failures.
Example:
- Time (time of year, day of week, time of day)
- Different user types (customer or user)
- User experience (novice or expert)
The same operation may appear in different operational modes with different probabilities.
SRE: Operational Profile
An operational profile is a complete set of operations with their probabilities of occurrence (during the operational use of the software).
An operational profile is a description of the distribution of input events that is expected to occur in actual software operation.
The operational profile of the software reflects how it will be used in practice.
[Figure: probability of occurrence plotted against operation and operational mode]
SRE: System Operational Profile
System operational profile must be developed for all of its important operational modes.
There are four principal steps in developing an operational profile:
1. Identify the operation initiators (i.e., user types, external systems, and the system itself)
2. List the operations invoked by each initiator
3. Determine the occurrence rates
4. Determine the occurrence probabilities by dividing the occurrence rates by the total occurrence rate
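The last step, dividing each occurrence rate by the total, can be sketched as follows. The operations are taken from the Web proxy example earlier in the chapter; the hourly occurrence rates are invented for illustration.

```python
def operational_profile(occurrence_rates):
    """Turn occurrence rates into occurrence probabilities by dividing
    each rate by the total occurrence rate."""
    total = sum(occurrence_rates.values())
    if total <= 0:
        raise ValueError("total occurrence rate must be positive")
    return {op: rate / total for op, rate in occurrence_rates.items()}


# Hypothetical occurrence rates (operations per hour) for a Web proxy server.
rates = {
    "Connect internal users to external Web": 600,
    "Email internal users to external users": 250,
    "Email external users to internal users": 100,
    "DNS request by internal users": 50,
}

profile = operational_profile(rates)
for op, p in profile.items():
    print(f"{p:.2f}  {op}")
```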
SRE: Prepare for Test
The Prepare for Test activity uses the operational profiles to prepare test cases and test procedures.
Test cases are allocated in accordance with the operational profile.
Test cases are assigned to the operations by selecting from all the possible intra-operation choices with equal probability.
The test procedure is the controller that invokes test cases during execution.
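One way to allocate a fixed budget of test cases in proportion to the operational profile is sketched below. The largest-remainder rounding is an assumption made here so that the counts add up to the budget; the slides do not prescribe a rounding rule. The operation names and integer occurrence rates are hypothetical.

```python
def allocate_test_cases(occurrence_rates, total_cases):
    """Split total_cases across operations in proportion to their
    (integer) occurrence rates, using largest-remainder rounding."""
    total_rate = sum(occurrence_rates.values())
    counts, remainders = {}, {}
    for op, rate in occurrence_rates.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        counts[op], remainders[op] = divmod(rate * total_cases, total_rate)
    leftover = total_cases - sum(counts.values())
    # Hand the leftover cases to the operations with the largest remainders.
    for op in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[op] += 1
    return counts


# Hypothetical occurrence rates (operations per hour).
rates = {"connect": 610, "email_out": 240, "email_in": 100, "dns": 50}
allocation = allocate_test_cases(rates, 25)
print(allocation)
```

Within each operation, the individual test cases would then be drawn with equal probability from the possible intra-operation choices, for example with `random.choice` over the list of variants.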
SRE: Execute Test
Allocate test time among the associated systems and types of test (feature, load, regression, etc.).
Invoke the test cases at random times, choosing operations randomly in accordance with the operational profile.
Identify failures, along with when they occur.
This information will be used in Apply Failure Data and Guide Test.
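Choosing operations randomly in accordance with the operational profile amounts to weighted random sampling. A minimal sketch using the standard library's `random.choices` follows; the profile values and the function name are hypothetical, and the scheduling of random invocation times is left out.

```python
import random
from collections import Counter


def pick_operations(profile, n, seed=None):
    """Draw n operations at random, weighted by the operational profile."""
    rng = random.Random(seed)  # seeded generator for repeatable test runs
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n)


# Hypothetical operational profile (probabilities sum to 1).
profile = {"connect": 0.60, "email_out": 0.25, "email_in": 0.10, "dns": 0.05}

run = pick_operations(profile, 10_000, seed=1)
observed = Counter(run)
for op in profile:
    print(op, observed[op] / len(run))
```

Over a long run the observed frequencies approach the profile probabilities, so the most heavily used operations receive the most test exposure.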
Types of Test
Certification Test: Accept or reject (binary decision) an acquired component for a given target failure intensity.
Feature (Unit) Test: A single execution of an operation with interaction between operations minimized.
Load Test: Testing with field use data and accounting for interactions.
Regression Test: Feature tests after every build involving significant change, i.e., check whether a bug fix worked.
SRE: Apply Failure Data
Plot each new failure as it occurs on a reliability demonstration chart.
Accept or reject software (operations) using the reliability demonstration chart.
Track reliability growth as faults are removed.
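The slides do not give the chart's construction. Reliability demonstration charts of this kind are commonly built on Wald's sequential probability ratio test, and the sketch below follows that approach. The discrimination ratio γ = 2 and the risk levels α = β = 0.1 are assumed defaults, and the function name is hypothetical. Time is normalized by multiplying it by the failure intensity objective.

```python
import math


def demonstration_decision(failure_number, normalized_time,
                           gamma=2.0, alpha=0.1, beta=0.1):
    """Sequential test of 'intensity = objective' against
    'intensity = gamma * objective' (assumed parameters).

    failure_number  -- failures observed so far
    normalized_time -- elapsed time multiplied by the failure intensity objective
    """
    accept_bound = math.log(beta / (1.0 - alpha))
    reject_bound = math.log((1.0 - beta) / alpha)
    # Log-likelihood ratio of the two hypotheses for a Poisson failure process.
    log_ratio = (failure_number * math.log(gamma)
                 - (gamma - 1.0) * normalized_time)
    if log_ratio <= accept_bound:
        return "accept"
    if log_ratio >= reject_bound:
        return "reject"
    return "continue"


print(demonstration_decision(0, 2.2))  # long failure-free run
print(demonstration_decision(1, 1.0))  # not enough evidence yet
print(demonstration_decision(4, 0.5))  # many failures very early
```

With these assumed parameters, a failure-free run of about 2.2 normalized time units is enough to accept, while four failures within the first half unit lead to rejection.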
Release Criteria
Consider releasing the product when:
1. All acquired components pass certification test
2. Test terminated satisfactorily for all the product variations and components with the failure intensity reaching the target λF
For better confidence, we usually require the λ/λF ratio to be below 0.5 (confidence factor).
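The two release criteria can be combined into a simple check. This is only a sketch: the function and variable names are hypothetical, and the ratios are the estimated failure intensity divided by the target λF for each product variation or component.

```python
def ready_for_release(certifications_passed, intensity_ratios, max_ratio=0.5):
    """Return True when every acquired component passed certification and
    every lambda / lambda_F ratio is below the confidence factor."""
    all_certified = all(certifications_passed)
    all_below = all(ratio < max_ratio for ratio in intensity_ratios.values())
    return all_certified and all_below


# Hypothetical status near the end of system test.
certifications = [True, True]                       # acquired components
ratios = {"base product": 0.40, "variation A": 0.45}

print(ready_for_release(certifications, ratios))
```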
Collect Field Data
SRE for the software product lifecycle.
Collect field data to use in succeeding releases either using automatic reporting routines or manual collection, using a random sample of field sites.
Collect data on failure intensity and on customer satisfaction and use this information in setting the failure intensity objective for the next release.
Measure operational profiles in the field and use this information to correct the operational profiles we estimated.
Collect information to refine the process of choosing reliability strategies in future projects.
However …
Practical implementation of an effective SRE program is a non-trivial task.
Mechanisms for collection and analysis of data on software product and process quality must be in place.
Fault identification and elimination techniques must be in place.
Other organizational abilities such as the use of reviews and inspections, reliability based testing and software process improvement are also necessary for effective SRE.