
In-process metrics for software testing

by S. H. Kan, J. Parrish, and D. Manlove

In-process tracking and measurements play a critical role in software development, particularly for software testing. Although there are many discussions and publications on this subject and numerous proposed metrics, few in-process metrics are presented with sufficient experiences of industry implementation to demonstrate their usefulness. This paper describes several in-process metrics whose usefulness has been proven with ample implementation experiences at the IBM Rochester AS/400® software development laboratory. For each metric, we discuss its purpose, data, interpretation, and use and present a graphic example with real-life data. We contend that most of these metrics, with appropriate tailoring as needed, are applicable to most software projects and should be an integral part of software testing.

Measurement plays a critical role in effective software development. It provides the scientific basis for software engineering to become a true engineering discipline. As the discipline has been progressing toward maturity, the importance of measurement has been gaining acceptance and recognition. For example, in the highly regarded software development process assessment and improvement framework known as the Capability Maturity Model, developed by the Software Engineering Institute at Carnegie Mellon University, process measurement and analysis and utilizing quantitative methods for quality management are the two key process activities at the Level 4 maturity.1,2

In applying measurements to software engineering, several types of metrics are available, for example, process and project metrics versus product metrics, or metrics pertaining to the final product versus metrics used during the development of the product. From the standpoint of project management in software development, it is the latter type of metrics that is the most useful—the in-process metrics. Effective use of good in-process metrics can significantly enhance the success of the project, i.e., on-time delivery with desirable quality.

Although there are numerous discussions and publications in the software industry on measurements and metrics, few in-process metrics are described with sufficient experiences of industry implementation to demonstrate their usefulness. In this paper, we intend to describe several in-process metrics pertaining to the test phases in the software development cycle for release and quality management. These metrics have gone through ample implementation experiences in the IBM Rochester AS/400* (Application System/400*) software development laboratory for a number of years, and some of them likely are used in other IBM development organizations as well. For those readers who may not be familiar with the AS/400, it is a midmarket server for e-business. To help meet the demands of enterprise e-commerce applications, the AS/400 features native support for key Web-enabling technologies. The AS/400 system software includes microcode supporting the hardware, the Operating System/400* (OS/400*), and many licensed program products supporting the latest technologies.

© Copyright 2001 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.


The size of the AS/400 system software is currently about 45 million lines of code. For each new release, the development effort involves about two to three million lines of new and changed code.

It should be noted that the objective of this paper is not to research and propose new software metrics, although not all the metrics discussed may be familiar to everyone. Rather, its purpose is to discuss the usage of implementation-proven metrics and address practical issues in the management of software testing. We confine our discussion to metrics that are relevant to software testing after the code is integrated into the system library. We do not include metrics pertaining to the front end of the development process such as design review, code inspection, or code integration and driver builds. For each metric, we discuss its purpose, data, interpretation and use, and where applicable, pros and cons. We also provide a graphic presentation where possible, based on real-life data. In a later section, we discuss in-process quality management vis-à-vis these metrics and a metrics framework that we call the effort/outcome paradigm. Before the conclusion of the paper, we also discuss the pertinent question: How do you know your product is good enough to ship?

Since the examples in this paper are based on experiences with the AS/400, it would be useful to outline the software development and test process for the AS/400 as the overall context. The software development process for the AS/400 is a front-end focused model with emphases on key intermediate deliverables such as architecture, design and design verification, code integration quality, and driver builds. For example, the completion of a high-level design review is always a key event in the system schedule and managed as a key intermediate deliverable. At the same time, testing (development and independent testing) and customer validation are the key phases with an equally strong focus. As Figure 1 shows, the common industry model of testing includes functional test, system test, and customer beta trials before the product is shipped. Integration and solution test can occur before or after a product is first shipped and is often conducted by customers because a customer's integrated solution may consist of products from different vendors.

For AS/400, the first formal test after unit test and code integration into the system library consists of component test (CT) and component regression test (CRT), which is equivalent to functional test. Along the way, a stress test is also conducted in a large network environment with performance workload running in the background to stress the system. When significant progress is made in component test, the product level test (PLT), which focuses on the products and subsystems of the AS/400 system (e.g., Transmission Control Protocol/Internet Protocol, database, client access, clustering), starts. Network test is a specific product level test focusing on communications and related error recovery processes. The above tests are conducted by the development teams (CT, CRT) or by joint effort between the development teams and the independent test team (PLT, stress test). The tests done by the independent test group include the install test and the software system integration test (called Software RAISE—for reliability, availability, install, service, environment), which is preceded by its acceptance test—the system test acceptance test (STAT). Because AS/400 is a highly integrated system, these different tests each play an important role in achieving a quality deliverable for each release of the software system. As Figure 1 shows, several early customer programs also start in the back end of our development process, and some of them normally run until 30 days after general availability (GA) of the product.

In-process metrics for software testing

In this section, we discuss what in-process metrics are used in software testing.

Test progress S curve (plan, attempted, actual). Test progress tracking is perhaps the most important and basic tracking for managing software testing. The metric we recommend is a test progress S curve over time, with the x-axis representing the time unit (preferably in weeks) and the y-axis representing the number of test cases or test points. By S curve we mean that the data are cumulative over time and resemble an "s" shape as a result of the period of intense test activity, causing a steep planned test ramp-up. For the metric to be useful, it should contain the following information on the same graph:

● Planned progress over time in terms of number of test cases or number of test points to be completed successfully by week
● Number of test cases attempted by week
● Number of test cases completed successfully by week

The purpose of this metric is obvious—to track actual testing progress against plan and therefore to be able to be proactive upon early indications that testing activity is falling behind. It is well-known that when the schedule is under pressure in software development, it is normally testing (especially development testing, i.e., unit test and component test or functional verification test) that is impacted (cut or reduced). Schedule slippage occurs day by day and week by week. With a formal metric in place, it is much more difficult for the team to ignore the problem, and they will be more likely to take actions.

From the project planning perspective, the request for an S curve forces better planning (see further discussion in the following paragraphs).

Figure 2 shows the component test (functional verification test) metric at the end of the test for a major release of the AS/400 software system.

Figure 1 AS/400 software testing cycle (common industry model: function test, system test, beta, GA, integration/solution test; AS/400 development tests: component test (CT), CRT, software stress test, product level test (PLT), network test, software stress regression, supported by a large network environment (LNE); AS/400 independent test: software install test (SIT), STAT, Software RAISE; early customer programs: customer invitational program, IS nonproduction, IS production, beta nonproduction, beta production, business partner beta)


As can be seen from the figure, the testing progress plan is expressed in terms of a line curve, which is put in place in advance. The lightly shaded bars are the cumulative number of test cases attempted, and the red bars represent the number of successful test cases. With the plan curve in place, each week when the test is in progress, two more bars (one for attempted and one for successful completion) are added to the graph. This example shows that during the rapid test ramp-up period (the steep slope of the curve), for some weeks the test cases attempted were slightly ahead of plan (which is possible), and the successes were slightly behind plan.
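To make the weekly mechanics concrete, the following is a minimal sketch of how the three cumulative series behind such an S curve can be built from weekly counts; the data and variable names are illustrative and are not taken from the AS/400 tooling.

```python
# Minimal sketch of cumulative test-progress (S curve) tracking.
# Weekly counts are assumed to be available; the numbers are illustrative only.

from itertools import accumulate

planned_per_week    = [200, 800, 2500, 6000, 9000, 7000, 3000, 1000, 500]
attempted_per_week  = [180, 750, 2600, 6100, 8800, 6900, 2900,  950,   0]
successful_per_week = [150, 700, 2400, 5800, 8500, 6700, 2800,  900,   0]

# Cumulative totals give the "S" shape when plotted against week number.
planned_cum    = list(accumulate(planned_per_week))
attempted_cum  = list(accumulate(attempted_per_week))
successful_cum = list(accumulate(successful_per_week))

for week, (p, a, s) in enumerate(zip(planned_cum, attempted_cum, successful_cum), start=1):
    print(f"week {week:2d}: planned-to-date {p:6d}  attempted {a:6d}  successful {s:6d}")
```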

Because some test cases may be more important than others, it is not a rare practice in software testing to assign test scores to the test cases. Using test scores is a normalized approach that provides more accurate tracking of test progress. The assignment of scores or points is normally based on experience, and for AS/400, teams usually use a 10-point scale (10 for the most important test cases and 1 for the least). As mentioned before, the unit of tracking for this S curve metric can be any unit that is appropriate. To make the transition from basic test-case S-curve tracking to test-point tracking, the teams simply enter the test points into the project management tool (for the overall plan initially and for actuals every week). The example in Figure 3 shows test point tracking for a product level test, which was underway for a particular function of the AS/400 system. It should be noted that there is always an element of subjectivity in the assignment of weights. The weights and the resulting test scores should be determined in the testing planning stage and remain unchanged during the testing process. Otherwise, the purpose of this metric will be lost in the reality of schedule pressures. In software engineering, weighting and test score assignment remains an interesting area where more research is needed. Possible guidelines from such research will surely benefit the planning and management of software testing.
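As a small illustration of the test-point idea, the sketch below rolls importance scores (on the 10-point scale mentioned above) into weighted progress percentages; the case identifiers, scores, and status flags are invented for the example.

```python
# Illustrative sketch: tracking progress by weighted test points instead of raw case counts.
# The 10-point importance scale follows the practice described in the text; the data are made up.

test_cases = [
    # (case id, importance score 1..10, attempted, successful)
    ("TC-001", 10, True,  True),
    ("TC-002",  7, True,  False),
    ("TC-003",  3, False, False),
    ("TC-004", 10, True,  True),
]

total_points      = sum(score for _, score, _, _ in test_cases)
attempted_points  = sum(score for _, score, attempted, _ in test_cases if attempted)
successful_points = sum(score for _, score, _, ok in test_cases if ok)

print(f"attempted: {100.0 * attempted_points / total_points:.1f}% of planned test points")
print(f"successful: {100.0 * successful_points / total_points:.1f}% of planned test points")
```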

For tracking purposes, testing progress can also be weighted by some measurement of coverage. For example, test cases can be weighted by the lines of code tested. Coverage weighting and test score assignment consistency become increasingly important in proportion to the number of development groups involved in a project. Lack of attention to tracking consistency across functional areas can result in a misleading view of true system progress.

Figure 2 Testing progress S curve example (y-axis: test cases; x-axis: week ending, May through January; series: planned, attempted, successful)


When a plan curve is in place, an in-process target can be set up by the team to reduce the risk of schedule slippage. For instance, a disparity target of 15 percent or 20 percent between attempted (or successful) and planned test cases can be used to trigger additional actions versus a business-as-usual approach. Although the testing progress S curves, as shown in Figures 2 and 3, give a quick visual status of the progress against the total plan and plan-to-date (the eye can quickly determine if testing is ahead or behind on planned attempts and successes), it may be difficult to discern the exact amount of slippage. This is particularly true for large testing efforts, where the number of test cases is in the hundreds of thousands, as in our first example. For that reason, it is useful to also display testing status in tabular form, as in Table 1. The table also shows underlying data broken out by department and product or component, which helps to identify problem areas. In some cases, the overall test curve may appear to be on schedule, but because some areas are ahead of schedule, they may mask areas that are behind when progress is only viewed at the system level. Of course, testing progress S curves are also used for functional areas and for specific products.
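A disparity trigger of this kind can be expressed as a simple check; the sketch below assumes the 15-20 percent trigger levels from the text, and the counts are illustrative.

```python
# Sketch of an in-process disparity check against the plan line.
# The 15-20 percent trigger levels come from the text; the function name and data are illustrative.

def behind_plan(planned_to_date: int, actual_to_date: int, trigger: float = 0.15) -> bool:
    """Return True when actual progress trails the plan-to-date by more than the trigger fraction."""
    if planned_to_date == 0:
        return False
    disparity = (planned_to_date - actual_to_date) / planned_to_date
    return disparity > trigger

print(behind_plan(100000, 88000))        # False: 12 percent behind, within the 15 percent trigger
print(behind_plan(100000, 78000, 0.20))  # True: 22 percent behind the plan-to-date
```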

When an initial plan curve is put in place, the curve should be subject to brainstorming and challenge. For example, if the curve shows a very steep ramp-up in a short period of time, the project manager may challenge the team with respect to how doable the plan curve is or what the team's specific actions to execute the plan are. As a result, better planning will be achieved. It should be cautioned that before the team settles on a plan curve and uses it as a criterion to track progress, a critical evaluation of what the plan curve represents should be made. For example, is the total test suite considered effective? Does the plan curve represent high test coverage? What are the rationales behind the sequences of test cases in the plan? This type of evaluation is important because once the plan curve is in place, the visibility of this metric tends to draw the whole team's attention to the disparity between attempts, successes, and the plan.

Once the plan line is set, any changes to the plan should be reviewed. Plan slips should be evaluated against the project schedule. In general, the baseline plan curve should be maintained as a reference. Ongoing changes to the planned testing schedule can mask schedule slips by indicating that attempts are on track, whereas the plan is actually moving forward.

Figure 3 Testing progress S curve—test points tracking (y-axis: test points; x-axis: week ending dates, 12/10/99 through 03/03/00; series: planned, attempted, successful)

In addition to what is described above, this metric can be used for release-to-release or project-to-project comparisons, as the example in Figure 4 shows.

For release-to-release comparisons, it is important to use weeks before product general availability (GA) as the time unit for the x-axis. By referencing the GA dates, the comparison provides a true in-process status of the current release. In Figure 4, we can see that the release represented by the red, thicker curve is more back-end loaded than the release represented by the blue, thinner curve. In this context, the metric is both a quality and a schedule statement for the release, as late testing will affect late cycle defect arrivals and hence the quality of the final product. With this type of comparison, the project team can plan ahead (even before the actual start of testing) to mitigate the risks.

Figure 4 Testing plan curve—release-to-release comparison (y-axis: percent of CTs planned; x-axis: weeks to GA, -42 to 0)

Table 1 Test progress tracking—plan, attempted, successful

              Plan to Date        Percent     Percent      Plan Not Attempted   Percent     Percent
              (# of Test Cases)   Attempted   Successful   (# of Test Cases)    Attempted   Successful
                                  of Plan     of Plan                           of Total    of Total
System              60577           90.19       87.72            5940            68.27       66.1
Dept. A              1043           66.83       28.19             346            38.83       15.6
Dept. B               708           87.29       84.46              90            33.68       32.59
Dept. C             33521           87.72       85.59            4118            70.6        68.88
Dept. D             11275           96.25       95.25             423            80.32       78.53
Dept. E              1780           98.03       94.49              35            52.48       50.04
Dept. F              4902          100          99.41               0            96.95       95.93
Product A           13000           70.45       65.1             3841            53.88       49.7
Product B            3976           89.51       89.19             417            66.82       66.5
Product C            1175           66.98       65.62             388            32.12       31.4
Product D             277            0           0                277             0           0
Product E             232            6.47        6.47             217             3.78        3.7


To implement this metric, the testing execution plan needs to be laid out in terms of the weekly target, and actual data need to be tracked on a weekly basis. For small to medium projects, such planning and tracking activities can use common tools such as Lotus 1-2-3** or other project management tools. For large, complex projects, a stronger tools support facility normally associated with the development environment may be needed.

For AS/400, the data are tracked via the DCR (Design Change Request—a tool for project management and technical information for all development items) tool, which is a part of the IDSS/DEV2000 (Integrated Development Support System) environment. The DCR is an on-line tool to track important characteristics and the status of each development item. For each line item in the release plan, one or more DCRs are created. Included in the DCRs are descriptions of the function to be developed, components that are involved, major design considerations, interface considerations (e.g., application programming interface and system interface), performance considerations, Double Byte Character Set (DBCS) requirements, component work summary sections, lines of code estimates, machine readable information (MRI), design review dates, code integration schedule, testing plans (number of test cases and dates), and actual progress made (by date). For the system design control group reviews, DCRs are the key documents with the information needed to ensure design consistency across the system. For high-level design reviews, DCRs are often included in the review package together with the design document itself. The DCR process also functions as the change-control process at the front end of the development process. Specifically, for each integration of code into the system library by developers, a reason code (a DCR number) is required. For test progress tracking, development or test teams enter data into DCRs, databases are updated daily, and a set of reporting and analysis programs (VMAS, or Virtual Machine Application System, SAS**, Freelance*) are used to produce the reports and graphs.

PTR (test defects) arrivals over time. Defect tracking during the testing phase is highly recommended as a standard practice for any software testing. At IBM Rochester, defect tracking is done via the Problem Tracking Report (PTR) tool. With regard to metrics, what is recommended here is tracking the test defect arrival pattern over time, in addition to tracking by test phase. Overall defect density during test, or for a particular test, is a summary indicator, but not really an in-process indicator. The pattern of defect arrivals over time gives more information. Even with the same overall defect rate during testing, different patterns of defect arrivals imply different quality levels in the field. We recommend the following for this metric:

● Always include data for a comparable release or project in the tracking chart if possible. If there is no baseline data or model data, at the minimum, some expected level of PTR arrivals at key points of the project schedule should be set when tracking starts (e.g., system test entry, cut-off date for code integration for GA driver).
● The unit for the x-axis is weeks before GA.
● The unit for the y-axis is the number of PTR arrivals for the week, or related measures.

Figure 5 is an example of this metric for several AS/400 releases. For this example, release-to-release comparison at the system level is the main goal. The metric is also often used for defect arrivals for specific tests, i.e., functional test, product level test, or system integration test.

The figure has been simplified for presentation. The actual graph has much more information, including vertical lines depicting the key dates of the development cycle and system schedules such as last new function integration, development test completion, start of system test, etc. There are also variations of the metric: total PTR arrivals, high-severity PTRs, PTRs normalized to the size of the release (new and changed code plus a partial weight for ported code), raw PTR arrivals versus valid PTRs, and number of PTRs closed each week. The main chart, and the most useful one, is the total number of PTR arrivals. However, we also include a high-severity PTR chart and a normalized view as mainstays of our tracking. The normalized PTR chart can help to eliminate some of the visual guesswork of comparing current progress to historical data. In conjunction with the high-severity PTR chart, a chart that displays the relative percentage of high-severity PTRs per week can be useful. Our experience indicated that the percentage of high-severity problems normally increases as the release moves toward GA, while the total number decreases to a very low level. Unusual swings in the percentage of high-severity problems could signal serious problems and should be investigated.
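The following sketch illustrates how the weekly views described above (total arrivals, arrivals normalized to release size, and the percentage of high-severity PTRs) could be derived from a list of PTR records; the record layout, the severity convention (1-2 treated as high severity), and the numbers are assumptions for the example, not the actual PTR tool's format.

```python
# Illustrative derivation of weekly PTR-arrival views: total, normalized, high-severity share.

from collections import Counter

ptrs = [
    # (week index relative to GA, severity 1..4 with 1 the most severe, valid flag)
    (-20, 2, True), (-20, 3, True), (-19, 1, True), (-19, 4, False),
    (-18, 2, True), (-18, 2, True), (-18, 1, True), (-17, 3, True),
]
new_changed_kloc = 2500  # e.g., 2.5 million lines of new and changed code

arrivals       = Counter(week for week, _, _ in ptrs)
valid_arrivals = Counter(week for week, _, valid in ptrs if valid)
high_severity  = Counter(week for week, sev, _ in ptrs if sev <= 2)

for week in sorted(arrivals):
    total = arrivals[week]
    print(f"week {week:4d}: {total} PTRs, "
          f"{total / new_changed_kloc:.4f} per KLOC, "
          f"{100.0 * high_severity.get(week, 0) / total:.0f}% high severity, "
          f"{valid_arrivals.get(week, 0)} valid")
```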


When do the PTR arrivals peak relative to time to GA and previous releases? How high do they peak? Do they decline to a low and stable level before GA? Questions such as these are crucial to the PTR arrivals metric, which has significant quality implications for the product in the field. The ideal pattern of PTR arrivals is one with higher arrivals earlier, an earlier peak (relative to previous releases or the baseline), and a decline to a lower level earlier before GA. The latter part of the curve is especially important since it is strongly correlated with quality in the field, as long as the decline and the low PTR levels before GA are not attributable to an artificial reduction or even to the halt of testing activities. High PTR activity before GA is more often than not a sign of quality problems. Myers3 has a famous counterintuitive principle in software testing, which states that the more defects found during testing, the more that remain to be found later. The reason underlying the principle is that finding more defects is an indication of high error injection during the development process, if testing effectiveness has not improved drastically. Therefore, this metric can be interpreted in the following ways (a minimal encoding of this logic is sketched after the list):

● If the defect rate is the same or lower than that of the previous release (or a model), then ask: Is the testing for the current release less effective?
  – If the answer is no, the quality perspective is positive.
  – If the answer is yes, extra testing is needed. And beyond immediate actions for the current project, process improvement in development and testing should be sought.
● If the defect rate is substantially higher than that of the previous release (or a model), then ask: Did we plan for and actually improve testing effectiveness significantly?
  – If the answer is no, the quality perspective is negative. Ironically, at this stage of the development cycle, when the “designing-in quality” phases are past (i.e., at test phase and under the PTR change control process), any remedial actions (additional inspections, re-reviews, early customer programs, and most likely more testing) will yield higher PTR rates.
  – If the answer is yes, then the quality perspective is the same or positive.
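A minimal encoding of this interpretation logic, assuming the two questions are answered as simple yes/no inputs, might look as follows; the function name and return strings are illustrative.

```python
# Illustrative encoding of the effort/outcome reasoning for interpreting PTR arrivals.

def interpret_ptr_arrivals(defect_rate_higher_than_baseline: bool,
                           testing_less_effective: bool,
                           testing_effectiveness_improved: bool) -> str:
    if not defect_rate_higher_than_baseline:
        # Same or lower defect rate than the previous release or model.
        if testing_less_effective:
            return "negative: extra testing needed; seek process improvement"
        return "positive"
    # Substantially higher defect rate.
    if testing_effectiveness_improved:
        return "same or positive"
    return "negative: remedial actions (inspections, re-reviews, more testing) will yield higher PTR rates"

print(interpret_ptr_arrivals(False, False, False))  # positive
print(interpret_ptr_arrivals(True, False, False))   # negative: remedial actions ...
```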

The above approach to interpretation of the PTR arrival metric is related to the framework of our effort/outcome paradigm for in-process metrics quality management. We discuss the framework later in this paper.

The data underlying the PTR arrival curve can be further displayed and analyzed in numerous ways to gain additional understanding of the testing progress and product quality. Some of the views that we have found useful include the pattern of PTR arrivals by build and integration and the pattern of activity source over time. The latter charts can be used in conjunction with testing progress to help assess quality as indicated by the effort/outcome paradigm.

Figure 5 Testing defect arrivals metric (y-axis: number of problems; x-axis: weeks to GA, -71 to -1; curves: V4R1, V4R2, V4R3; arrow indicating the "better" direction)

Most software development organizations have test defect tracking mechanisms in place. For organizations with more mature development processes, the problem tracking system also serves as a tool for change control, i.e., code changes are tracked and controlled by PTRs during the formal test phases. Such change control is very important in achieving stability for large and complex software projects. For AS/400, data for the above PTR metric are from the PTR process that is a part of the IDSS/DEV2000 development environment. The PTR process is also the change control process after code integration into the system library. Integration of any code changes to the system library to fix defects from testing requires a PTR number.

PTR (test defects) backlog over time. We define the number of PTRs remaining at any given time as the "PTR backlog." Simply put, the PTR backlog is the accumulated difference between PTR arrivals and PTRs that were closed. PTR backlog tracking and management are important from the perspective of both testing progress and customer rediscoveries. A large number of outstanding defects during the development cycle will likely impede testing progress, and when the product is about to go to GA, will directly affect field quality performance of the product. For software organizations that have separate teams conducting development testing and fixing defects, defects in the backlog should be kept at the lowest possible level at all times. For those software organizations in which the same teams are responsible for design, code, development testing, and fixing defects, however, the highest-priority focus may vary across timing windows in the development cycle.
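Since the backlog is just the accumulated difference between arrivals and closures, it can be derived directly from the two weekly series; the sketch below uses invented numbers.

```python
# Sketch of PTR backlog tracking: the backlog is the cumulative difference between
# weekly PTR arrivals and weekly PTR closures. Data are illustrative.

weekly_arrivals = [40, 55, 70, 65, 50, 35, 20]
weekly_closures = [25, 40, 60, 70, 60, 45, 30]

backlog = []
outstanding = 0
for arrived, closed in zip(weekly_arrivals, weekly_closures):
    outstanding += arrived - closed
    backlog.append(outstanding)

print(backlog)  # [15, 30, 40, 35, 25, 15, 5]
```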

Although the PTR backlog should be managed at a reasonable level at all times, it should not be the highest priority during a period when making headway in functional testing is the most important development activity. During the prime time for development testing, the focus should be on test effectiveness and test execution, and defect discovery should be encouraged to the maximum possible extent. Focusing too early on overall PTR reduction may run into a conflict with these objectives and may lead to more latent defects not being discovered or being discovered later at subsequent testing phases (e.g., during the independent testing time frame), or PTRs not opened to record those defects that are discovered. The focus should be on the fix turnaround of the critical defects that impede testing progress instead of the entire backlog. Of course, when testing is approaching completion, there should be a strong focus on drastic PTR backlog reductions.

For software development projects that build on existing systems, a large backlog of "aged" problems can develop over time. These aged PTRs often represent fixes or enhancements that developers believe would legitimately improve the product, but which are passed over during development because of resource or design constraints. They may also represent problems that have already been fixed or are obsolete because of other changes. Without a concerted effort, this aged backlog can build over time. This area of the PTR backlog is one which can and should be focused on earlier in the release cycle, even prior to the start of development testing.

Figure 6 shows an example of the PTR backlog metric for AS/400. Again, release-to-release comparisons and actuals versus targets are the main objectives. The specific targets in the graph are associated with key dates in the development schedule, for example, cut-off dates for fix integration control and for the GA driver.

It should be noted that for this metric, solely focusing on the numbers is not sufficient. In addition to the overall reduction, what types and which specific defects should be fixed first play a very important role in terms of achieving system stability early. In this regard, the expertise and ownership of the development and test teams are crucial.

Unlike PTR arrivals, which should not be controlled artificially, the PTR backlog is completely under the control of the development organization. For the three metrics we have discussed so far, the overall project management approach should be as follows:

● When a test plan is in place and its effectiveness evaluated and accepted, manage test progress to achieve early and fast ramp-up.

● Monitor PTR arrivals and analyze the problems (e.g., defect cause analysis and Pareto analysis of problem areas of the product) to gain a better understanding of the test and the quality of the product, but do not artificially manage or control PTR arrivals, which are a function of test effectiveness, test progress, and the intrinsic quality of the code (the amount of defects latent in the code). (Pareto analysis is based on the 80-20 principle that states that normally 20 percent of the causes accounts for 80 percent of the problems. Its purpose is to identify the few vital causes or areas that can be targeted for improvement. For examples of Pareto analysis of software problems, see Chapter 5 in Kan.4)
● Strongly manage PTR backlog reduction, especially with regard to the types of problems being fixed early and achieving predetermined targets that are normally associated with the fix integration dates of the most important drivers.

The three metrics discussed so far are obviously interrelated, and therefore special insights can be gained by viewing them together. We have already discussed the use of the effort/outcome paradigm for analyzing PTR arrivals and testing progress. Other examples include analyzing testing progress and the PTR backlog together to determine areas of concern and examining PTR arrivals versus backlog reduction to project future progress. Components or development groups that are significantly behind in their testing and have a large PTR backlog to overcome warrant special analysis and corrective actions. A large backlog can mire progress and, when testing progress is already lagging, prevent rudimentary recovery actions from being effective. Combined analysis of testing progress and the backlog is discussed in more detail in a later section. Progress against the backlog, when viewed in conjunction with the ongoing PTR arrival rate, gives an indication of the total capacity for fixing problems in a given period of time. This maximum capacity can then be used to help project when target backlog levels will be achieved.
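A simple projection of when a target backlog level would be reached, based on the recent net burn-down rate, might be sketched as follows; the function and the figures are illustrative and are not part of the AS/400 tooling.

```python
# Sketch of projecting when a target backlog level will be reached, using the recent
# net burn-down rate (closures minus arrivals per week). Numbers are illustrative.

def weeks_to_target(current_backlog: int, target_backlog: int,
                    avg_weekly_closures: float, avg_weekly_arrivals: float) -> float:
    net_reduction_per_week = avg_weekly_closures - avg_weekly_arrivals
    if net_reduction_per_week <= 0:
        return float("inf")  # backlog is not shrinking at the current fix capacity
    return max(current_backlog - target_backlog, 0) / net_reduction_per_week

print(weeks_to_target(current_backlog=400, target_backlog=100,
                      avg_weekly_closures=120, avg_weekly_arrivals=70))  # 6.0 weeks
```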

Product/release size over time. Lines of code or another indicator of product size or function (e.g., function points) can also be tracked as a gauge of the "effort" side of the development equation. During development there is a tendency toward product growth as requirements and designs are fleshed out. Functions may also continue to be added because of late requirements or a developer's own desire to make improvements. A project size indicator, tracked over time, can serve as an explanatory factor for testing progress, PTR arrivals, and PTR backlog. It can also relate the measurement of total defect volume to per unit improvement or deterioration.

Figure 6 Testing defect backlog tracking (y-axis: number of problems; x-axis: weeks before GA; step targets X and Y tied to key schedule dates)


CPU utilization during test. For computer systems or software products for which a high level of stability is required in order to meet customers' needs (for example, systems or products for mission-critical applications), it is important that the product performs well under stress. In testing software during the development process, the level of CPU utilization is an indicator of the extent that the system is being stressed.

To ensure that our software tests are effective, the AS/400 development laboratory sets CPU utilization targets for the software stress test, the system test acceptance test (STAT), and the system integration test (the RAISE test). The stress test starts at the middle of component test and may run into the RAISE time frame with the purpose of stressing the system in order to uncover latent defects that cause system crashes and hangs that are not easily discovered in normal testing environments. It is conducted with a network of systems. RAISE test is the final system integration test, with a customer-like environment. It is conducted in an environment that simulates an enterprise environment with a network of systems including CPU-intensive applications and interactive computing. During test execution, the environment is run as though it were a 24-hour-a-day, seven-day-a-week operation. STAT is used as a preparatory test prior to the formal RAISE test entry.

The example in Figure 7 demonstrates the tracking of CPU utilization over time for the software stress test. We have a two-phase target as represented by the step line in the chart. The original target was set at 16 CPU hours per system per day on the average, with the following rationale:

● The stress test runs 20 hours per day, with four hours of system maintenance.
● The CPU utilization target is 80% or higher (0.8 × 20 hours = 16 CPU hours per system per day).

For the last couple of releases, the team felt that the bar should be raised when the development cycle is approaching GA. Therefore, the target from three months to GA is now 18 CPU hours per system per day. As the chart shows, a key element of this metric, in addition to actual versus target, is release-to-release comparison. One can observe that the V4R2 curve had more data points in the early development cycle, which were at higher CPU utilization levels. This results from conducting pretest runs prior to when the new release content became available. As mentioned before, the formal stress test starts at the middle of the component test cycle, as the V3R7 and V4R1 curves indicate. The CPU utilization metric is used together with the system crashes and unplanned IPL (initial program load, or "reboot") metric. We will discuss this relationship in the next subsection.

Figure 7 CPU utilization metric (y-axis: CPU hours per system per day; x-axis: weeks before GA; curves: V3R7, V4R1, V4R2, and the step target line)
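A check of daily stress-test CPU hours against these targets can be sketched as follows; the system names and readings are invented, and the 16- and 18-hour thresholds are the targets quoted above.

```python
# Sketch of checking daily stress-test CPU usage against the 16 CPU-hour-per-system-per-day
# target (raised to 18 within three months of GA). System names and readings are illustrative.

def meets_cpu_target(cpu_hours_by_system: dict, target_hours: float = 16.0) -> bool:
    average = sum(cpu_hours_by_system.values()) / len(cpu_hours_by_system)
    return average >= target_hours

daily_cpu_hours = {"testsys1": 17.2, "testsys2": 15.8, "testsys3": 16.5}
print(meets_cpu_target(daily_cpu_hours))        # True: average is 16.5 CPU hours
print(meets_cpu_target(daily_cpu_hours, 18.0))  # False against the tightened pre-GA target
```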

To collect actual CPU utilization data, we rely on a performance monitor tool that runs continuously (a 24-hour-a-day, seven-day-a-week operation) on each test system. Through the communication network, the data from the test systems are sent to an AS/400 nontest system on a real-time basis, and by means of a Lotus Notes** database application, the final data can be easily tallied, displayed, and monitored.

System crashes and unplanned IPLs. Going hand in hand with the CPU utilization metric is the metric for system crashes and unplanned IPLs. For software tests whose purpose is to improve the stability of the system, we need to ensure that the system is stressed and testing is conducted effectively to uncover the latent defects leading to system crashes and hangs or, in general, any IPLs that are unplanned. When such defects are discovered and fixed, stability of the system improves over time. Therefore, the metrics of CPU utilization (stress level) and unplanned IPLs describe the effort aspect and the outcome (findings) aspect of the effectiveness of the test, respectively.

Figure 8 shows an example of the system crashes and hangs metric for the stress test for several AS/400 releases. The target curve, which was derived after V3R7, was developed by applying statistical techniques (for example, a simple exponential model or a specific software reliability model) to the data points from several previous releases including V3R7.

Figure 8 System crashes/hangs metric (y-axis: number of unplanned IPLs; x-axis: weeks before GA; curves: V3R7, V4R1, V4R2, and target)

In terms of data collection, when a system crash or hang occurs and the tester re-IPLs the system, the performance monitor and IPL tracking tool will produce a screen prompt and request information about the last IPL. The tester can ignore the prompt temporarily, but it will reappear regularly after a certain time until the questions are answered. One of the screen images of this prompt is shown in Figure 9. Information elicited via this tool includes: test system, network identifier, tester name, IPL code and reason (and additional comments), system reference code (SRC) (if available), date and time the system went down, release, driver, PTR number (the defect that causes the system crash or hang), and the name of the product. The IPL reason code consists of the following categories:

● 001 hardware problem (unplanned)
● 002 software problem (unplanned)
● 003 other problem (unplanned)
● 004 load fix (planned)
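As an illustration only, the information elicited by the prompt could be modeled as a record like the following; the field names and types are assumptions, not the actual tool's layout.

```python
# Illustrative model of an IPL tracking record, using the fields and reason codes listed above.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

REASON_CODES = {
    "001": "hardware problem (unplanned)",
    "002": "software problem (unplanned)",
    "003": "other problem (unplanned)",
    "004": "load fix (planned)",
}

@dataclass
class IplRecord:
    test_system: str
    network_id: str
    tester: str
    reason_code: str            # one of REASON_CODES
    comments: str
    src: Optional[str]          # system reference code, if available
    went_down_at: datetime
    release: str
    driver: str
    ptr_number: Optional[str]   # defect that caused the crash or hang
    product: str

    @property
    def unplanned(self) -> bool:
        return self.reason_code in ("001", "002", "003")
```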

Because the volume and trend of system crashes and hangs are germane to the stability performance of the product in the field, we highly recommend this in-process metric for any software for which stability is considered an important attribute. These data should also be used to make release-to-release comparisons in similar time frames prior to GA, and can eventually be used as leading indicators to GA readiness. Although CPU utilization tracking definitely requires a tool, tracking of system crashes and hangs can start with pencil and paper if a proper process is in place.

Mean time to IPL (MTI). Mean time to failure (MTTF) or mean time between failures (MTBF) is the standard measurement in the reliability literature.5 In the software reliability literature, this metric and various models associated with it have also been discussed extensively. The discussions and use of this metric are predominantly related to academic research or specific-purpose software systems. To the authors' knowledge, implementation of this metric is rare in organizations that develop commercial systems. This may be due to several reasons, including issues related to single-system versus multiple-systems testing, the definition of a failure, the feasibility and cost in tracking all failures and detailed time-related data (note: failures are different from defects or faults because a single defect can cause failures multiple times and in different machines) in large commercial projects during testing, and the value and return on investment of such tracking.

System crashes and hangs and therefore unplanned IPLs (or rebooting) are the more severe forms of failures. Such failures are more clear-cut and easier to track, and metrics based on such data are more meaningful. Therefore, for the AS/400, we use mean time to unplanned IPL (MTI) as the key software reliability metric. This metric is used only during the STAT/RAISE tests time period, which, as mentioned before, is a customer-like system integration test prior to GA. Using this metric for other tests earlier in the development cycle is possible but will not be as meaningful since all the components of the system cannot be addressed collectively until the final system integration test. To calculate MTI, one can simply take the total number of CPU run hours for each week (H_i) and divide it by the number of unplanned IPLs plus 1 (I_i + 1). For example, if the total CPU run hours from all systems for a specific week was 320 CPU hours and there was one unplanned IPL caused by a system crash, the MTI for that week would be 320/(1 + 1) = 160 CPU hours. In the AS/400 implementation, we also apply a set of weighting factors that were derived based on results from prior baseline releases. The purpose of using the weighting factors is to take the outcome from the prior weeks into account so that at the end of the RAISE test (with a duration of 10 weeks), the MTI represents an entire RAISE statement. It is up to the practitioner whether or not to use weighting or how to distribute the weights heuristically. Deciding factors may include: type of products and systems under test, test cycle duration, or even how the testing period is planned and managed.

Weekly MTI_n = \sum_{i=1}^{n} W_i \left( \frac{H_i}{I_i + 1} \right)

where n = the number of weeks that testing has been performed on the current driver (i.e., the current week of test), H = total of weekly CPU run hours, W = weighting factor, and I = number of weekly (unique) unplanned IPLs (due to software failures).
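The weekly MTI formula above translates directly into code; in this sketch the weighting factors are illustrative placeholders, whereas the paper derives its weights from prior baseline releases.

```python
# A minimal sketch of the weighted weekly MTI calculation defined above.
# The weights here are illustrative placeholders, not the AS/400 baseline-derived values.

def weekly_mti(cpu_hours_by_week, unplanned_ipls_by_week, weights):
    """Weighted sum of H_i / (I_i + 1) over the weeks of the current driver."""
    return sum(w * (h / (i + 1))
               for w, h, i in zip(weights, cpu_hours_by_week, unplanned_ipls_by_week))

# Example: three weeks of testing with equal weights.
cpu_hours = [320, 350, 400]   # total CPU run hours from all test systems, per week
ipls      = [1, 0, 2]         # unique unplanned IPLs per week
weights   = [1 / 3, 1 / 3, 1 / 3]

print(weekly_mti(cpu_hours, ipls, weights))  # (160 + 350 + 133.33) / 3, about 214.4 CPU hours
```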

Figure 9 Screen image of an IPL tracking tool

Figure 10 is a real-life example of the MTI metric for the STAT and RAISE tests for a recent AS/400 release. The x-axis represents the number of weeks to GA. The y-axis on the right side is MTI and on the left side is the number of unplanned IPLs. Inside the chart, the shaded areas represent the actual number of unique unplanned IPLs (crashes and hangs) encountered. From the start of the STAT test, the MTI metric is shown tracking to plan until week 10 (before GA) when three system crashes occurred during a one-week period. From the significant drop of the MTI, it was evident that with the original test plan, there would not be enough burn-in time for the system to reach the MTI target. Since this lack of burn-in time typically results in undetected critical problems at GA, additional testing was added, and the RAISE test was lengthened by three weeks with the product GA date unchanged. (Note that in the original plan, the duration for STAT and RAISE was three weeks and ten weeks, respectively. The actual test cycle was 17 weeks.) As in the case above, discrepancies between actual and targeted MTI should be used to trigger early, proactive decisions to adjust testing plans and schedules to make sure that GA criteria for burn-in can be achieved, or at a minimum, that the risks are well understood and a mitigation plan is in place.

Additionally, MTI can also be used to provide a “heads-up” approach for assessing weekly in-process testing effectiveness. If weekly MTI results are significantly below planned targets (i.e., 10–15 percent or more), but unplanned IPLs are running at or below expected levels, this situation can be caused by other problems affecting testing progress (just not those creating system failures). In this case, low MTI tends to highlight how the overall effect of these individual product or functional problems (possibly being uncovered by different test teams using the same system) is affecting overall burn-in. This is especially true if it occurs near the end of the testing cycle. Without an MTI measurement and planned target, these problems and their individual action plans might unintentionally be assessed as acceptable to exit testing. But to reflect the lack of burn-in time, seen as lower CPU utilization numbers, MTI targets are likely to not yet be met. Final action plans might include:

● Extending testing duration or adding resources, or both
● Providing for a more exhaustive regression testing period if one were planned
● Adding a regression test if one were not planned
● Taking additional actions to intensify problem resolution and fix turnaround time (assuming that there is still enough elapsed time available until the testing cycle is planned to end)

Figure 10 MTI (mean time to unplanned IPL) metric (left y-axis: unplanned IPLs; right y-axis: MTI in hours; x-axis: weeks to GA, -21 to 0; STAT (3 weeks) followed by RAISE; 3 weeks were added and MTI was restarted for the remaining weeks of RAISE due to early instability on a new driver; shaded areas represent unplanned outages; target and actual curves shown)
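The "heads-up" condition described above (MTI well below target while unplanned IPLs are at or below expected levels) can be captured in a simple check; the 10 percent trigger and the sample numbers are illustrative.

```python
# Sketch of the weekly "heads-up" check: flag weeks where MTI falls well short of target
# even though unplanned IPLs are at or below expected levels. Thresholds follow the
# 10-15 percent range mentioned in the text; names and numbers are illustrative.

def burn_in_concern(actual_mti: float, target_mti: float,
                    actual_ipls: int, expected_ipls: int,
                    shortfall_trigger: float = 0.10) -> bool:
    mti_shortfall = (target_mti - actual_mti) / target_mti
    return mti_shortfall >= shortfall_trigger and actual_ipls <= expected_ipls

print(burn_in_concern(actual_mti=150, target_mti=200, actual_ipls=1, expected_ipls=2))  # True
print(burn_in_concern(actual_mti=195, target_mti=200, actual_ipls=1, expected_ipls=2))  # False
```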

Note that MTI targets and results, because they represent stability, are only as reliable as the conditions under which the system(s) was tested. MTI does not substitute for a poorly planned or executed test, but instead tends to highlight the need for improvements in such areas as stress testing capabilities (i.e., numbers of real or simulated users, new automation technologies, network bandwidth, understanding system and testing limitations, etc.) and providing for enough burn-in time—with all testing running at “full throttle”—for test exit criteria to be satisfied.

Another positive side effect from the visibility of reporting weekly MTI results is on the core system components that are used to support the performance monitoring and CPU data collection. These tools also perform an indirect usage test of many base system functions, since MTI data are collected or retrieved by a central system at fixed hourly intervals, seven days a week. This requires those key interfaces of the system(s) and network not only to stabilize early but also to remain stable for the duration of the testing cycle. Problems in those core areas are quickly noticed, identified, and resolved because of the timing and possible effect on the validity of MTI data.

Critical problems—show stoppers. The show stopper parameter is very important because the severity and impact of software defects vary. Regardless of the volume of total defect arrivals, it only takes a few show stoppers to render the product dysfunctional. Compared to the metrics discussed previously, this metric is more qualitative. There are two aspects to this metric. The first is the number of critical problems over time, with release-to-release comparison. This dimension is as quantitative as all other metrics. More importantly, the second dimension is concerned with the types of critical problems and the analysis and resolution of each of them.

The AS/400 implementation of this tracking and focus is based on the general criterion that any problem that will impede the overall progress of the project or that will have significant impact on a customer's business (if not fixed) belongs to such a list. The tracking normally starts at the middle of component test when a critical problem meeting is held weekly by the project management team (with representatives from all functional areas). When it comes close to system test and GA time, the focus intensifies and daily meetings take place. The objective is to facilitate cross-functional teamwork to resolve the problems swiftly. Although there is no formal set of criteria, problems on the critical problem list tend to be related to install, system stability, security, data corruption, etc. Before GA, all problems on the list must be resolved—either fixed or a fix underway with a clear target date for the fix to be available.

In-process metrics and quality management

On the basis of the previous discussions of specific metrics, we have the following recommendations for implementing in-process metrics in general:

● Whenever possible, it is preferable to use calendar time as the measurement unit for in-process metrics, versus using phases of the development process. There are some phase-based defect models or defect cause or type analysis methods available, which were also extensively used in AS/400. However, in-process metrics and models based on calendar time provide a direct statement on the status of the project with regard to whether it can achieve on-time delivery with desirable quality. As appropriate, a combination of time-based metrics and phase-based metrics is desirable.

● For time-based metrics, use the date of product shipment as the reference point for the x-axis and use weeks as the unit of measurement. By referencing the GA date, the metric portrays the true in-process status and conveys a "marching toward completion" message. In terms of time units, we found that data at the daily level proved to have too much fluctuation, and data at the monthly level lost their timeliness; neither can provide a trend that can be spotted easily. Weekly data proved to be optimal in terms of both measurement trends and cycles for actions. Of course, when the project is approaching the back-end critical time period, some metrics need to be monitored and actions taken at the daily level. Examples are the status of critical problems near GA and the PTR backlog a couple of weeks prior to the cut-off date for the GA driver.

● Metrics should be able to tell what is "good" or "bad" in terms of quality or schedule in order to be useful. To achieve these objectives, historical comparisons or comparisons to a model are often needed.

● Some metrics are subject to strong management actions; for others there should be no intervention. Both types should be able to provide meaningful indications of the health of the project. For example, defect arrivals, which are driven by testing progress and other factors, should not be artificially controlled, whereas the defect backlog is completely subject to management and control.

● Finally, the metrics should be able to drive improvements. The ultimate question for metric work is what kind and how much improvement will be made and to what extent the final product quality will be influenced.

With regard to the last item in the above list, it is noted that to drive specific improvement actions, sometimes the metrics have to be analyzed at a more granular level. As a real-life example, in the case of the testing progress and PTR backlog metrics, the following was done (analysis and guideline for action) for an AS/400 release near the end of development testing (Component Test, or CT, equivalent to Functional Verification Test [FVT] in other IBM laboratories).

Components that were behind in CT were identified using the following methods:

1. Sorting all components by "percentage attempted of total test cases" and selecting those that are less than 65 percent. In other words, with less than three weeks to completion of development testing, these components have more than one-third of CT left.

2. Sorting all components by "number of planned cases not attempted" and selecting those that are 100 or larger, then adding these components to those from 1. In other words, these several additional components may be on track or not seriously behind percentage-wise, but because of the large number of test cases they have, there is a large amount of work left to be done.

Because the unit (test case, or test variation) is not of the same weight across components, 1 was used as the major criterion, supplemented by 2. A sketch of this selection logic follows.
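As a minimal illustration, this selection can be automated against the weekly test-tracking data. The record fields and the sample numbers below are hypothetical; only the two criteria and their thresholds (65 percent attempted, 100 planned cases not attempted) come from the analysis above.

# Sketch: flag components that are behind in Component Test (CT).
# Assumes hypothetical records with 'name', 'planned', and 'attempted' fields;
# thresholds follow the two criteria in the text.

def components_behind_in_ct(components, pct_threshold=0.65, backlog_threshold=100):
    """Return names of components that are behind, per criteria 1 and 2."""
    behind = []
    for c in components:
        planned = c["planned"]
        attempted = c["attempted"]
        pct_attempted = attempted / planned if planned else 1.0
        not_attempted = planned - attempted

        # Criterion 1 (major): less than 65 percent of planned test cases attempted.
        criterion1 = pct_attempted < pct_threshold
        # Criterion 2 (supplementary): 100 or more planned cases not yet attempted.
        criterion2 = not_attempted >= backlog_threshold

        if criterion1 or criterion2:
            behind.append(c["name"])
    return behind

# Illustrative data only, not real AS/400 figures.
sample = [
    {"name": "compA", "planned": 400, "attempted": 220},    # 55% attempted -> behind
    {"name": "compB", "planned": 1500, "attempted": 1350},  # 90%, but 150 cases left -> behind
    {"name": "compC", "planned": 200, "attempted": 180},    # 90%, 20 cases left -> on track
]
print(components_behind_in_ct(sample))  # ['compA', 'compB']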

Components with double-digit PTR backlogs were identified. Guidelines for actions, sketched in code after the list, are:

1. If CT is way behind and the PTR backlog is not too bad, the first priority is to really ramp up CT.

2. If CT is on track and the PTR backlog is high, the key focus is on PTR backlog reduction.

3. If CT is way behind and the PTR backlog is high, these components are really in trouble. Get help (e.g., extra resources, temporary help from other components, teams, or areas).

4. For the rest of the components, continue to keep a strong focus on both finishing up CT and PTR backlog reduction.
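These four guidelines amount to a simple decision table on two conditions, CT status and PTR backlog. The sketch below is illustrative only; the Boolean inputs and the wording of the recommendations are assumptions, not part of the AS/400 tooling.

# Sketch: map the two in-process conditions to the recommended focus.
# 'ct_way_behind' and 'high_ptr_backlog' are hypothetical Boolean assessments
# derived from the testing progress and PTR backlog metrics.

def recommended_action(ct_way_behind: bool, high_ptr_backlog: bool) -> str:
    if ct_way_behind and not high_ptr_backlog:
        return "Ramp up CT as the first priority."
    if not ct_way_behind and high_ptr_backlog:
        return "Focus on PTR backlog reduction."
    if ct_way_behind and high_ptr_backlog:
        return "Component in trouble: get help (extra resources, other teams or areas)."
    return "Keep a strong focus on both finishing CT and reducing the PTR backlog."

print(recommended_action(ct_way_behind=True, high_ptr_backlog=True))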

Furthermore, analysis of defect cause, symptoms, defect origin (in terms of development phase), and where found can provide more information for possible improvement actions. For examples of such analyses and the discussion on phase defect removal effectiveness (design reviews, code inspections, and test), see Kan.4

Metrics are a tool for project and quality management. In software development, as in many other types of projects, commitment by the teams is very important. Experienced project managers know, however, that subjective commitment is not enough. Is there a commitment to the system schedules and quality goals? Will delivery be on time with desirable quality? Even with eye-to-eye commitment, these objectives are often not met, for a whole host of reasons, right or wrong. In-process metrics provide the added value of objective indication. It is the combination of subjective commitments and objective measurement that makes the project successful.

To successfully manage in-process quality (and therefore the quality of the final deliverables), in-process metrics must be used effectively. We recommend an integrated approach to project and quality management vis-à-vis these metrics, in which quality is given the same importance as other factors such as schedule, cost, and content. Quality should always be an integral part of the project status report and checkpoint reviews. Indeed, many of the examples we give in this paper are metrics for both quality and schedule (those "weeks to GA" measurements), as the two parameters are often intertwined.

One common observation with regard to metrics in software development is that project teams often explain away the negative signs indicated by the metrics. There are two key reasons for this. First, in practice there are many poor metrics, which are further complicated by abundant abuses and misuses. Second, the project managers and teams do not take ownership of quality and are not action-oriented. Therefore, the effectiveness, reliability, and validity of metrics are far more important than the quantity of metrics. And once a negative trend is observed, an early urgency approach should be taken in order to prevent schedule slips and quality deterioration. Such an approach can be supported via in-process quality targets, i.e., once the measurements fall below a predetermined target, special actions will be triggered.

Effort/outcome paradigm. From the discussions thus far, it is clear that some metrics are often used together in order to provide adequate interpretation of the in-process quality status. For example, testing progress and PTR arrivals, and CPU utilization and the number of system crashes and hangs, are two obvious pairs. If we take a closer look at the metrics, we can classify them into two groups: those that measure the testing effectiveness or testing effort, and those that indicate the outcome of the test expressed in terms of quality. We call the two groups the effort indicators (e.g., test effectiveness assessment, test progress S curve, CPU utilization during test) and the outcome indicators (PTR arrivals—total number and arrivals pattern, number of system crashes and hangs, mean time to IPL), respectively.

To achieve good test management, useful metrics, and effective in-process quality management, the effort/outcome paradigm should be utilized. The effort/outcome paradigm for in-process metrics was developed based on our experience with AS/400 software development and was first introduced by Kan.4

Here we provide further details. We recommend the 2 × 2 matrix approach such as the example shown in Figure 11.

For testing effectiveness and the number of defects found, cell 2 is the best-case scenario because it is an indication of the good intrinsic quality of the design and code of the software—low error injection during the development process. Cell 1 is also a good scenario; it represents the situation that latent defects were found via effective testing. Cell 3 is the worst case because it indicates buggy code and probably problematic designs—high error injection during the development process. Cell 4 is the unsure scenario because one cannot ascertain whether the lower defect rate is a result of good code quality or ineffective testing. In general, if the test effectiveness does not deteriorate substantially, lower defects are a good sign.

It should be noted that in the matrix, the better/worse and higher/lower designations should be carefully determined based on release-to-release or actual-versus-model comparisons. This effort/outcome approach also provides an explanation of Myers' counterintuitive principle of software testing as discussed earlier.
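The 2 × 2 logic can be expressed compactly in code. Because Figure 11 is not reproduced here, the mapping of conditions to cell numbers below is inferred from the cell descriptions above; the function name and the Boolean inputs are illustrative assumptions, not a published layout.

# Sketch: the effort/outcome 2x2 matrix as a lookup.
# 'effort_good' means the effort indicator (e.g., test progress, CPU utilization)
# is as good as or better than the baseline (previous release or model);
# 'outcome_low' means the outcome indicator (e.g., PTR arrivals) is lower.

def effort_outcome_cell(effort_good: bool, outcome_low: bool) -> str:
    matrix = {
        (True, True): "Cell 2 (best case): good testing, lower defects -> "
                      "low error injection, good intrinsic quality.",
        (True, False): "Cell 1 (good): good testing, higher defects -> "
                       "latent defects are being found by effective testing.",
        (False, False): "Cell 3 (worst case): poor testing, higher defects -> "
                        "high error injection; buggy code, possibly problematic design.",
        (False, True): "Cell 4 (unsure): poor testing, lower defects -> "
                       "cannot distinguish good quality from ineffective testing.",
    }
    return matrix[(effort_good, outcome_low)]

print(effort_outcome_cell(effort_good=False, outcome_low=True))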

The above framework can be applied to pairs of specific metrics. For example, if we use testing progress as the effort indicator and the PTR arrivals pattern as the outcome indicator, we obtain the following scenarios:

Positive scenarios:

● Testing progress is on or ahead of schedule, and PTR arrivals are lower throughout the testing cycle (compared with the previous release).

● Testing progress is on or ahead of schedule, and PTR arrivals are higher in the early part of the curve—chances are the PTR arrivals will peak earlier and decline to a lower level (compared with the previous releases).

Negative scenarios:

● Testing progress is significantly behind schedule, and PTR arrivals are higher—chances are the PTR arrivals will peak later and the problem of late-cycle defect arrivals will emerge.

● Testing progress is behind schedule, and PTR arrivals are lower in the early part of the curve—this is an unsure scenario, but from the testing effort point of view it is not acceptable.

The interpretation of the pair of metrics for system stability (CPU utilization and system crashes and hangs) is similar.


Generally speaking, outcome indicators are more common, whereas effort indicators are more difficult to establish. Furthermore, different types of software and tests may need different effort indicators. Nonetheless, the effort/outcome paradigm forces one to establish appropriate effort measurements, which, in turn, drive the improvements in testing. For example, the metric of CPU utilization is a good effort indicator for operating systems. In order to achieve a certain level of CPU utilization, a stress environment needs to be established. Such effort increases the effectiveness of the test.

For integration-type software, where a set of vendor software is integrated together with new products to form an offering, effort indicators other than CPU stress level may be more meaningful. One could look into a test coverage-based metric, including the major dimensions of testing such as:

● Setup
● Install
● Minimum/maximum configuration
● Concurrence
● Error recovery
● Cross-product interoperability
● Cross-release compatibility
● Usability
● DBCS

A five-point score (one being the least effective and five being the most rigorous testing) can be assigned for each dimension, and the sum total can represent an overall coverage score. Alternatively, the scoring approach can include the "should be" level of testing for each dimension and the "actual" level of testing per the current testing plan, based on independent assessment by experts. Then a "gap score" can be used to drive release-to-release or project-to-project improvement in testing. For example, assume the testing strategy for an offering calls for the following dimensions to be tested, each with a certain sufficiency level: setup (5), install (5), cross-product interoperability (4), cross-release compatibility (5), usability (4), and DBCS (3). Based on expert assessment of the current testing plan, the sufficiency levels of testing are: setup (4), install (3), cross-product interoperability (2), cross-release compatibility (5), usability (3), and DBCS (3). Therefore, the "should be" level of testing would be 26 and the "actual" level of testing would be 20, with a gap score of 6. This approach may be somewhat subjective, but it also involves the experts in the assessment process—those who can make the difference. Although it would not be easy in a real-life implementation, the point here is that the effort/outcome paradigm and the focus on effort metrics have a direct linkage to improvements in testing. In AS/400, there was only limited experience with this approach while we were improving our product-level test. Further research in this area or implementation experience will be useful.
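The gap-score arithmetic in the worked example is straightforward to reproduce. The sketch below uses the "should be" and "actual" sufficiency levels quoted above on the five-point scale described in the text; everything else (variable names, the per-dimension breakdown) is illustrative.

# Sketch: compute the coverage "gap score" from the worked example in the text.
# Scores are on a 1-5 scale per test dimension (1 = least effective, 5 = most rigorous).

should_be = {"setup": 5, "install": 5, "cross-product interoperability": 4,
             "cross-release compatibility": 5, "usability": 4, "DBCS": 3}
actual    = {"setup": 4, "install": 3, "cross-product interoperability": 2,
             "cross-release compatibility": 5, "usability": 3, "DBCS": 3}

should_be_total = sum(should_be.values())   # 26
actual_total = sum(actual.values())         # 20
gap_score = should_be_total - actual_total  # 6

# Per-dimension gaps show where the current plan falls short of the strategy.
per_dimension_gap = {d: should_be[d] - actual[d] for d in should_be}

print(should_be_total, actual_total, gap_score)  # 26 20 6
print(per_dimension_gap)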

How you know your product is good enough to ship

Determining when a product is good enough to ship is a complex issue. It involves the types of products (for example, a shrink-wrap application versus an operating system), the business strategy related to the product, market opportunities and timing, customers' requirements, and many more factors. The discussion here pertains to the scenario that quality is an important consideration and that on-time delivery with desirable quality is the major project goal.

A simplistic view is that a target is established for one or several in-process metrics, and when the target is not met, the product is not shipped per schedule. We all know that this rarely happens in real life, and for legitimate reasons. Quality measurements, regardless of their maturity levels, are never as black and white as meeting or not meeting a GA date. Furthermore, there are situations where some metrics are meeting targets and others are not. There is also the question: If targets are not being met, how bad is the situation? Nonetheless, these challenges do not diminish the value of in-process measurements; they are also the reason for improving the maturity level of software quality metrics.

When various metrics are indicating a consistent negative message, it is then evident that the product may not be good enough to ship. In our experience, indicators from all of the following dimensions should be considered together to get an adequate picture of the quality of the product to be designated GA (a simple roll-up sketch follows the list):

● System stability/reliability/availability
● Defect volume
● Outstanding critical problems
● Feedback from early customer programs
● Other important parameters specific to a particular product (ease of use, performance, install, etc.)
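As a purely illustrative sketch, a simple roll-up of release-to-release assessments across these dimensions can make the overall picture explicit. The dimension names follow the list above; the three-valued status and the decision rule are assumptions, not the AS/400 assessment process.

# Sketch: roll up per-dimension assessments into an overall GA readiness view.
# Statuses are hypothetical: "better", "equal", or "worse" versus the previous release.

assessments = {
    "system stability/reliability/availability": "better",
    "defect volume": "equal",
    "outstanding critical problems": "better",
    "early customer program feedback": "equal",
    "product-specific parameters (ease of use, performance, install)": "worse",
}

worse = [dim for dim, status in assessments.items() if status == "worse"]

# Illustrative rule: any dimension assessed "worse" warrants explicit action items
# before GA; a consistently negative picture calls GA readiness into question.
if worse:
    print("Dimensions needing action before GA:", ", ".join(worse))
else:
    print("No dimension assessed worse than the previous release.")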

In Figure 12, we present a real-life example of an assessment of in-process quality of a recent AS/400 release when it was near GA. The summary tables outline the parameters or indicators used (column 1), a brief description of the status (column 2), release-to-release comparisons (columns 3 and 4), and an assessment (column 5). Although some of the indicators and assessments are based on subjective information, for many parameters we have formal metrics and data in place to substantiate the assessment. The assessment was done about two months before GA, and as usual, a series of actions was taken through GA. The release now has been in the field for over two years and has demonstrated excellent field quality.

Conclusion

In this paper we discuss a set of in-process metrics for the testing phases of the software development process. We provide real-life examples based on implementation experiences of the IBM Rochester AS/400 development laboratory. We also describe the effort/outcome paradigm as a framework for establishing and using in-process metrics for quality management.

There are certainly many more in-process metrics for software testing that we did not include, nor is it our intent to provide comprehensive coverage. Even for those that we discussed here, not each and every one is applicable universally. However, we contend that the several metrics that are basic to software testing (such as the testing progress curve, defect arrival density, and critical problems before GA) should be an integral part of any software testing.

It can never be overstated that it is the effectiveness of the metrics that matters, not the number of metrics available. There is a strong temptation for quality or metrics practitioners to establish more and more metrics. However, in the practice of metrics and measurements in the software engineering industry, abuses and misuses abound. It is ironic that software engineering relies on the measurement approach to elevate its maturity as an engineering discipline, yet measurement practices, when improperly used, pose a big obstacle to such an advance and create confusion and controversy. Ill-founded metrics are not only useless, they are actually counterproductive, adding extra costs to the organization and doing a disservice to the software engineering and quality disciplines. Therefore, we must take a serious approach to metrics. Each metric to be used should be subjected to examination against the basic principles of measurement theory. For example, the concept, the operational definition, the measurement scale, and validity and reliability issues should be well thought out. At a macro level, an overall framework should be used to avoid an ad hoc approach. We discuss the effort/outcome framework in this paper, which is particularly relevant for in-process metrics. We also recommend the Goal/Question/Metric (GQM) approach in general for any metrics.6-8

At the same time, to enhance success, one should take a dynamic and flexible approach, for example, tailoring the metrics to meet the needs of a specific team, product, and organization. There must be "buy-in" by the team (development and testing) in order for the metrics to be effective. Metrics are a means to an end—the success of the project—not an end in themselves. The project team should have intellectual control and a thorough understanding of the metrics and data that they use, so that they can make the right decisions. Although good metrics can serve as a useful tool for software development and project management, they do not automatically lead to improvement in testing and in quality. They do foster data-based decision making and provide objective criteria for actions. Proper use and continued refinement by those involved (for example, the project team, the testing community, and the development teams) are therefore crucial.

Acknowledgment

This paper is based on a white paper written for the IBM Software Test Council. We are grateful to Carl Chamberlin, Gary Davidson, Brian McCulloch, Dave Nelson, Brock Peterson, Mike Santivenere, Mike Tappon, and Bill Woodworth for their reviews and helpful comments on an earlier draft of the white paper. We wish to thank the three anonymous reviewers for their helpful comments on the revised version for the IBM Systems Journal. We wish to thank the entire AS/400 software team, especially the release managers, the extended development team members, and the various test teams, who made the metrics described in this paper a state of practice instead of just a theoretical discussion. A special thanks goes to Al Hopkins, who implemented the performance monitor tool on AS/400 to collect the CPU utilization and unplanned IPL data.

*Trademark or registered trademark of International Business Machines Corporation.

**Trademark or registered trademark of Lotus Development Corporation or SAS Institute, Inc.

Cited references

1. M. C. Paulk, C. V. Weber, B. Curtis, and M. B. Chrissis, The Capability Maturity Model: Guidelines for Improving the Software Process, Addison-Wesley Longman, Inc., Reading, MA (1994).

2. W. S. Humphrey, A Discipline for Software Engineering, Addison-Wesley Longman, Inc., Reading, MA (1995).

3. G. J. Myers, The Art of Software Testing, John Wiley & Sons, Inc., New York (1979).

4. S. H. Kan, Metrics and Models in Software Quality Engineering, Addison-Wesley Longman, Inc., Reading, MA (1995).

5. P. A. Tobias and D. C. Trindade, Applied Reliability, Van Nostrand Reinhold Company, New York (1986).

6. V. R. Basili, "Software Development: A Paradigm for the Future," Proceedings of the 13th International Computer Software and Applications Conference (COMPSAC), keynote address, Orlando, FL (September 1989).

7. Materials in the Software Measurement Workshop conducted by Professor V. R. Basili, University of Maryland, College Park, MD (1995).

8. N. E. Fenton and S. L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, 2nd edition, PWS Publishing Company, Boston (1997).

Accepted for publication September 15, 2000.

Stephen H. Kan IBM AS/400 Division, 3605 Highway 52 N, Rochester, Minnesota 55901 (electronic mail: [email protected]). Dr. Kan is a Senior Technical Staff Member and a technical manager in programming at IBM in Rochester, Minnesota. He is responsible for the Quality Management Process in AS/400 software development. His responsibility covers all aspects of quality, ranging from quality goal setting, supplier quality requirements, quality plans, in-process metrics, quality assessments, and CI105 compliance, to reliability projections, field quality tracking, and customer satisfaction. Dr. Kan has been the software quality focal point for the software system of the AS/400 since its initial release in 1988. He is the author of the book Metrics and Models in Software Quality Engineering, numerous technical reports, and articles and chapters in the IBM Systems Journal, Encyclopedia of Computer Science and Technology, Encyclopedia of Microcomputers, and other professional journals. Dr. Kan is also a faculty member of the University of Minnesota Master of Science in Software Engineering (MSSE) program.

Jerry Parrish IBM AS/400 Division, 3605 Highway 52 N, Rochester, Minnesota 55901 (electronic mail: [email protected]). Mr. Parrish is an advisory software engineer and has been a member of the IBM Rochester AS/400 System Test team since the late 1980s. He is currently the test lead for the AS/400 Software Platform Integration Test (also known as RAISE). As part of the overall focus of the AS/400 on system quality, he has applied his expertise since the early 1990s to focus on the areas of test process methodology and applied models and metrics. Over this period, he has helped transform the System Test group in Rochester into a significant partner in IBM Rochester's total quality commitment. Mr. Parrish holds an undergraduate degree in computer science. He has been employed by IBM since 1980 with experience in software development, software system testing, and product assurance.

Diane Manlove IBM AS/400 Division, 3605 Highway 52 N, Rochester, Minnesota 55901 (electronic mail: [email protected]). Ms. Manlove is an advisory software quality engineer for IBM in Rochester. Her responsibilities include management of release quality during product development, system quality improvement, and product quality trend analysis and projections. Ms. Manlove is certified by the American Society for Quality as a Software Quality Engineer, a Quality Manager, and a Reliability Engineer. She holds a master's degree in reliability and an undergraduate degree in engineering. She has been employed by IBM since 1984 and has experience in test, manufacturing quality, and product assurance.
