real time system 08 philip a lapalante 2nd edition


11

Reliability, Testing, and Fault Tolerance

KEY POINTS OF THE CHAPTER

1. Reliability is a subjective measure.
2. In the absence of "real" metrics, use the weight of the printout, code size, McCabe's metric, Halstead's metrics: anything that is better than guessing.
3. Reliability, by most definitions, can be increased through fault-tolerant design and rigorous testing.
4. Testing should be performed throughout the software life cycle by using a mixed bag of techniques.

Reliable software is a direct result of a solid design process, good software engineering practice, and rigorous system testing [43]. In this chapter we discuss reliability, and methods for increasing it, through system testing and fault tolerance. For a complete discussion of this topic see, for example, [119] or [120].

11.1 FAULTS, FAILURES, BUGS, AND DEFECTS

Software engineers know there is a world of difference in the following terms: fault, failure, bug, and defect. The term bug is considered taboo since it somehow implies that an error crept into the program through no one's action. The preferred term for an error in requirement, design, or code is defect. The appearance of this defect during the operation of the software system is called a fault. A fault that causes the software system to fail to meet one of its requirements is a failure. In this text, however, we use these terms somewhat cavalierly because their colloquial interchangeability is implied.

11.2 RELIABILITY

A reliable software system in general can be defined informally in a number of ways. For example, one definition might be "a system which a user can depend on" [50]. Other loose characterizations of reliable software systems include the following:

• It "stands the test of time."
• Downtime is below a certain threshold.
• There is an absence of known catastrophic errors, that is, errors that render the system useless.
• Results are predictable (i.e., it is a deterministic system).
• It offers robustness: the ability to recover "gracefully" from errors.

For real-time systems, other informal characterizations of reliability are

• Event determinism
• Temporal determinism
• "Reasonable" time-loading
• "Reasonable" memory-loading

These characteristics are desirable in a reliable real-time software system, but because some characteristics are difficult to measure, we need a more rigorous definition.

11.2.1 Formal Definition

Rather than define the loose concept of "reliable software," we define the more precise measure of software reliability. Let $S$ be a software system, and let $T$ be the time of system failure. Then the reliability of $S$ at time $t$, denoted $r(t)$, is the probability that $T$ is greater than $t$; that is,

$$r(t) = P(T > t) \qquad (11.1)$$

In words, this is the probability that a software system will operate without failure for a specified period of time. Thus, a system with reliability function $r(t) = 1$ will never fail. For example, NASA has suggested that computers used in civilian fly-by-wire aircraft have a failure probability of no more than $10^{-9}$ per hour [85], which represents a reliability function of $r(t) = (0.999999999)^t$ with $t$ in hours (still, as $t \to \infty$, $r(t) \to 0$).

We will suggest two models for software reliability based on the system failure function, that is, the probability that the system fails at time $t$.

The first and more standard failure function uses an exponential distribution where the abscissa is time and the ordinate represents the failure intensity at that time. Here the failure intensity is initially high, as would be expected in a new piece of software, and decreases with time, presumably as errors are uncovered and repaired.

The second, less standard model (and the one to which the author subscribes) is the general bathtub curve given in Figure 11.1. Brooks [19] notes that this curve is often used to describe the failure function of hardware components, and has been used to describe the number of errors found in a certain release of a software product. If we suppose that the probability of system failure is linearly related to the number of errors detected in the system, then we have the failure function defined by Figure 11.1. The interpretation of this failure function is easy for hardware: a certain number of pieces of a particular component will fail early due to manufacturing defects, and a large number will fail late in life due to aging.

The applicability of the bathtub curve to software failure requires a further stretch. Although a large number of errors will be found in a particular software product early (during testing and development), it is unclear why a large number of errors appear later in the life of the software product. These later errors might be due to the effects of patching or to the increased stress on the software by expert users [19].
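As a rough numerical illustration of the fly-by-wire figure quoted above, the following Python sketch evaluates $r(t) = (1 - 10^{-9})^t$ for a few operating periods. The function name and the chosen periods are assumptions made only for this illustration; they do not come from the text.

```python
import math

# Failure probability per hour quoted in the text for fly-by-wire computers.
FAILURE_PROB_PER_HOUR = 1e-9


def reliability(t_hours, p_fail=FAILURE_PROB_PER_HOUR):
    """Return r(t) = (1 - p_fail) ** t_hours.

    log1p is used so that the tiny per-hour failure probability is not
    lost to floating-point rounding.
    """
    return math.exp(t_hours * math.log1p(-p_fail))


# Hypothetical operating periods, in hours, chosen only for illustration.
for hours in (1, 10_000, 1_000_000_000):
    print(hours, reliability(hours))
```

Even with so small a failure probability, the computed reliability falls steadily as the operating period grows, which is the behavior noted in the parenthetical remark about $t \to \infty$.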

Figure 11.1 A typical failure function (horizontal axis: time).



11.2.2 Calculating System Reliability

Several simple techniques for approximating system reliability have been developed, including the process block model, McCabe's metric, and Halstead's metrics.

11.2.2.1 Process Block Model

Overall system reliability for a system comprised of a series of subsystems, called process blocks, connected in parallel and series, can be calculated using simple rules of probability. It is assumed that the failure function of each process block is independent of the others; that is, if one fails, it does not necessarily imply that any other must fail. This might indeed be an oversimplification if, for example, two software modules are sharing data.

Suppose two subsystems have reliability functions $r_1(t)$ and $r_2(t)$, respectively. If they are connected in parallel as in Figure 11.2, then the composite subsystem will fail only if both blocks fail. That is, the failure function of the equivalent system, $1 - r_{eq}(t)$, is the product of the failure functions of the parallel process blocks. Thus

$$1 - r_{eq}(t) = (1 - r_1(t))(1 - r_2(t))$$

Solving for $r_{eq}(t)$ yields

$$r_{eq}(t) = r_1(t) + r_2(t) - r_1(t)\,r_2(t) \qquad (11.2)$$

Since the reliability functions are always between 0 and 1, the reliability for parallel process blocks is always greater than or equal to either of the individual reliabilities.

Figure 11.2 Parallel subsystems and their equivalent reliability.


Figure 11.3 Series subsystems and their equivalent reliability.

For series connection, as depicted in Figure 11.3, the equivalent system fails if one or both of the process blocks fails. That is, writing $p_i(t) = 1 - r_i(t)$ for the failure function of block $i$,

$$p_{eq}(t) = p_1(t) + p_2(t) - p_1(t)\,p_2(t)$$

or

$$1 - r_{eq}(t) = (1 - r_1(t)) + (1 - r_2(t)) - (1 - r_1(t))(1 - r_2(t))$$

Solving for $r_{eq}(t)$ yields

$$r_{eq}(t) = r_1(t)\,r_2(t) \qquad (11.3)$$

Again noting that the reliability function is between 0 and 1, we see that connecting systems in series will decrease the reliability of the equivalent overall system.

To further illustrate the point, we examine a more complicated system.

EXAMPLE 11.1

A software system can be broken up into subsystems that interact as shown in Figure 11.4. The reliability function for the series subsystems 1 and 2 is, from equation 11.3,

$$r_1(t)\,r_2(t)$$

The composite subsystem is in parallel with subsystem 3, which, from equation 11.2, yields a reliability function of

$$r_1(t)\,r_2(t) + r_3(t) - r_1(t)\,r_2(t)\,r_3(t)$$

The new composite subsystem in series with subsystem 4 yields, from equation 11.3, an overall reliability function of

$$r_{eq}(t) = \bigl(r_1(t)\,r_2(t) + r_3(t) - r_1(t)\,r_2(t)\,r_3(t)\bigr)\,r_4(t)$$

Figure 11.4 System broken into subsystems with associated reliabilities.
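The composition rules of equations 11.2 and 11.3 can be sketched in a few lines of Python. This is only an illustration under the model's independence assumption; the function names and the numeric reliabilities are hypothetical and stand for the values $r_i(t)$ at one fixed time $t$.

```python
def parallel(r1, r2):
    """Equivalent reliability of two independent blocks in parallel (equation 11.2)."""
    return r1 + r2 - r1 * r2


def series(r1, r2):
    """Equivalent reliability of two independent blocks in series (equation 11.3)."""
    return r1 * r2


def example_11_1(r1, r2, r3, r4):
    """Topology of Example 11.1: blocks 1 and 2 in series, that pair in
    parallel with block 3, and the result in series with block 4."""
    return series(parallel(series(r1, r2), r3), r4)


# Hypothetical block reliabilities at some fixed time t.
print(example_11_1(0.9, 0.9, 0.8, 0.95))
```

Nesting the two helper functions mirrors the way the example collapses the block diagram one pair at a time.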


11.2.2.2 McCabe's Metric

Some experts believe that the reliability of the software can be determined a priori from characteristics of the source code. Although such metrics are far from mainstream, it is interesting to discuss them and their possible application to real-time systems here.

One such metric, developed by McCabe [112], is based on the complexity of the flow-of-control in the system. Suppose we could depict a software system as a directed graph (or graphs), where each block of sequential code represents a node and each synchronous change in flow-of-control is depicted as a directed arc or edge. No provision is made for asynchronous flow-of-control.

In a multitasking or multiprocessing system, each separate task would be represented by a separate flow graph. Interestingly, these graphs could be determined directly from the flow charts, dataflow diagrams, Petri nets, or finite state automata used to model the system.

In any case, the number of edges, $e$, and nodes, $n$, are counted. The number of separate flow graphs or tasks, $p$, also enters into the relationship. We then form the cyclomatic complexity, $C$, as follows:

$$C = e - n + 2p \qquad (11.4)$$

McCabe asserts that the complexity of the control flow graph, $C$, reflects the difficulty in understanding, testing, and maintaining the software. He further contends that well-structured modules have a cyclomatic complexity in the range of $3 < C < 7$. As judged by empirical evidence, $C = 10$ is an upper limit for the reliability of a single module.

EXAMPLE 11.2

Consider the code in Example 6.4. It would have a flow graph as seen in Figure 11.5. Here $p = 1$ since there is only 1 task, and the graph has four nodes and three edges. Thus

$$C = 3 - 4 + 2 \cdot 1 = 1$$

According to McCabe, we can therefore expect the reliability of this code to be high.

Although the McCabe metric provides some nonabstract measure of system reliability, it cannot depict asynchronous changes in flow-of-control, and thus it has limited utility in a real-time system. Readers interested in forming their own opinions are referred to [1], [127].
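Equation 11.4 is simple enough to compute mechanically once a flow graph is available. The sketch below is a minimal illustration, assuming the flow graph is stored as a Python dictionary that maps each node to the nodes it can transfer control to. The four-node chain is a hypothetical graph chosen only because it has the edge and node counts of Example 11.2; it is not claimed to be the graph in Figure 11.5.

```python
def cyclomatic_complexity(edges, nodes, graphs=1):
    """McCabe's metric C = e - n + 2p (equation 11.4)."""
    return edges - nodes + 2 * graphs


def complexity_of(flow_graph):
    """Compute C for a single flow graph given as {node: [successor, ...]}."""
    nodes = len(flow_graph)
    edges = sum(len(successors) for successors in flow_graph.values())
    return cyclomatic_complexity(edges, nodes, graphs=1)


# Hypothetical graph with four nodes and three edges, as counted in Example 11.2.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(complexity_of(chain))  # prints 1
```

Each additional decision point adds an edge without adding a node, so the metric grows by one per branch.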

Figure 11.5 Flow graph for the example.


11.2.2.3 Halstead's Metrics

1. $\eta_1$: the number of distinct begin-end pairs and GOTO statements (or their analogies in other languages), known as "operators."
2. $\eta_2$: the number of distinct lines terminated by semicolons in C or statements in Pascal (or their analogies in other languages), known as "operands."
3. $N_1$: the total number of occurrences of operators in the program.
4. $N_2$: the total number of occurrences of operands in the program.

EXAMPLE 11.3

Although the determination of the Halstead numbers is largely intuitive, in Example 6.4, since there are four distinct begin-end pairs and six unique statements, the following numbers seem appropriate: $\eta_1 = 4$, $\eta_2 = 6$, $N_1 = 4$, $N_2 = 6$. (Individuals will obtain different numbers depending on how they count statements.)

Halstead has defined a number of concrete measures related to these characteristics, which can be used to determine system reliability.

Definition The program vocabulary, $\eta$, is defined as

$$\eta = \eta_1 + \eta_2 \qquad (11.5)$$

Definition The program length, $N$, is defined as

$$N = N_1 + N_2 \qquad (11.6)$$

Definition The program volume, $V$, is defined as

$$V = N \log_2 \eta \qquad (11.7)$$

Definition The potential volume, $V^*$, is defined as

$$V^* = (2 + \eta_2) \log_2 (2 + \eta_2) \qquad (11.8)$$

Definition The program level, $L$, is defined as

$$L = V^* / V \qquad (11.9)$$

$L$ is an attempt to measure the level of abstraction of the program. It is believed that increasing this number will increase system reliability.


Another Halstead metric attempts to measure the amount of mental effort required in the development of the code.

Definition The effort, $E$, is defined as

$$E = V / L \qquad (11.10)$$

Again, decreasing the effort level is believed to increase reliability as well as ease of implementation.

Halstead applied his metrics to a number of programs written in several languages and, for intuitive comparison, to the English used in the novel Moby Dick [58].

Language     Level (L)   Effort (E)
English      2.16        1.00
PL/I         1.53        2.00
Algol-58     1.21        3.19
FORTRAN      1.14        3.59
Assembler    0.88        6.02

Note the relative advantage of FORTRAN over assembler and PL/I over [...] referred to [51], [58], or [128].

While it may sound facetious, there are no viable metrics for measuring real-time reliability. However, many organizations swear by them. NASA uses McCabe's metric for much of its software, and it is now possible to buy commercial tools to determine McCabe's and Halstead's metrics. However, in truth, any metric is better than nothing. For example, do you believe that a milli[...] McCabe's and Halstead's take this into account.
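The Halstead definitions (11.5) through (11.10) chain together, so a short Python sketch may help in following the arithmetic. It assumes the usual reading of those equations ($V = N \log_2 \eta$, $L = V^*/V$, $E = V/L$) and applies them to the counts suggested in Example 11.3. The function name and the dictionary keys are choices made for this illustration only.

```python
import math


def halstead(eta1, eta2, n1, n2):
    """Compute Halstead's measures from the four basic counts."""
    vocabulary = eta1 + eta2                             # (11.5)
    length = n1 + n2                                     # (11.6)
    volume = length * math.log2(vocabulary)              # (11.7)
    potential_volume = (2 + eta2) * math.log2(2 + eta2)  # (11.8)
    level = potential_volume / volume                    # (11.9)
    effort = volume / level                              # (11.10)
    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "potential_volume": potential_volume,
        "level": level,
        "effort": effort,
    }


# Counts suggested in Example 11.3.
for name, value in halstead(eta1=4, eta2=6, n1=4, n2=6).items():
    print(name, round(value, 3))
```

Because the counts themselves are a matter of judgment, as the example notes, the resulting numbers are better used for comparing modules counted the same way than as absolute values.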



11.2.2.4 Function Points

Function points are a widely used metric set in nonembedded environments, and they form the basis of many commercial software analysis packages. Function points measure the number of interfaces between modules and subsystems in programs or systems. These interfaces are defined by measuring the following software characteristics for each module, subsystem, or system:

• number of inputs (I)
• number of outputs (O)
• number of queries (Q)
• number of files used (F)

A weighted sum is then computed, giving the function point (FP) metric, for example:

$$FP = 4I + 4O + 5Q + 10F$$

The weights can be adjusted to compensate for factors such as application domain and software developer experience. Intuitively, the higher the FP, the more difficult the system is to implement. A great advantage of the function point metric is that it can be computed before any coding occurs.

From the definition, it is easy to see why the function point metric is highly suited for business processing, but not necessarily appropriate for embedded systems. However, there is increasing interest in the use of function points in real-time embedded systems, especially in large-scale real-time databases, multimedia, and Internet support. These systems are data driven and often behave like the large-scale transaction-based systems for which function points were developed.
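The weighted sum is easy to tabulate before any code exists. The following Python sketch uses the example weights from the formula above as defaults and lets them be overridden, reflecting the remark that weights can be adjusted. The module counts are hypothetical, as are the function and key names.

```python
# Example weights from the formula FP = 4I + 4O + 5Q + 10F.
DEFAULT_WEIGHTS = {"inputs": 4, "outputs": 4, "queries": 5, "files": 10}


def function_points(counts, weights=DEFAULT_WEIGHTS):
    """Return the weighted sum of the counted software characteristics."""
    return sum(weights[name] * counts[name] for name in weights)


# Hypothetical counts for one module.
module = {"inputs": 3, "outputs": 2, "queries": 1, "files": 1}
print(function_points(module))  # prints 35
```

Passing a different `weights` dictionary is one way to model the adjustment for application domain or developer experience.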

11.2.2.5 Other Metrics

Many other metrics are used in the measurement of software quality and reliability. For a neat review of many of these metrics, along with some new ones, and their application, see [128].

11.3 TESTING

Although testing will flush out errors, this is not the goal of testing. Long ago, software testing was thought of in this way; however, testing can only detect the presence of errors, not the absence of them. This is insufficient, particularly in real-time systems. Instead, the goal of testing is "to ensure that the software meets its requirements." This places emphasis on solid design techniques and a well-developed requirements document. A formal test plan must be developed that provides criteria used in deciding whether the system has satisfied the requirements document.



The test plan should follow the requirements document item by item, providing criteria that are used to judge whether the required item has been met. A set of test cases is then written which is used to measure the criteria set out in the test plan. (This can be extremely difficult when a user interface is part of the requirements.)

The test plan includes criteria for testing the software on a module-by-module or unit level, and on a system or subsystem level; both should be incorporated in a good testing scheme. The system-level testing provides criteria for the hardware/software integration process (see Chapter 13). We follow with a discussion of a variety of test techniques. A more thorough treatment can be found in [65].

11.3.1 Unit Level Testing

Several methods can be used to test individual modules or units. These techniques can be used by the unit author and by the independent test team to exercise each unit in the system. These techniques can also be applied to subsystems (collections of modules related to the same function). The techniques to be discussed include black box and white box testing.

11.3.1.1 Black Box Testing

In black box testing, only the inputs and outputs of the unit are considered; how the outputs are generated based on a particular set of inputs is ignored. Such a technique, being independent of the implementation of the module, can be applied to any number of modules with the same functionality. But this technique does not provide insight into the programmer's skill in implementing the module. In addition, dead or unreachable code cannot be detected.

For each module a number of test cases needs to be generated. This number depends on the functionality of the module, the number of inputs, and so on. If a module fails to pass a single module-level test, then the error must be repaired and all previous module-level test cases rerun and passed, to prevent the repair from causing other errors.

For black box testing, how do you obtain the test cases? There are a number of techniques.

1. Exhaustion (brute force): All possible combinations of module inputs are tried. For a digital computer this is always a finite, though potentially large, number of test cases.
2. Corner cases: For example, minimum, maximum, or average values given for each input to the module are tested.
3. Pathological cases: These are unusual combinations of input values that may lead to errors. Such combinations are often difficult to identify.
4. Statistically based testing: Random test cases or inputs are based on underlying probability distribution functions. The shortcomings of this technique are discussed shortly.

When choosing the technique to be used for black box testing, brute force testing should be used if feasible for small modules. For larger modules, some combination of the above methods is needed. For example, in manned or critical systems exhaustive testing is still desirable but might not be feasible. An acceptable substitute might be corner case testing followed by pathological case testing. For user interfaces, statistically based testing appears to be reasonable. The point is that the test mix depends on the application and resources available to do the testing.

EXAMPLE 11.4

A software module used for a built-in test is passed a 16-bit status word, STATUS. If any of the first three (least significant) bits are set, then an error is indicated by returning the Boolean flag ERROR as TRUE. If, however, all three bits are set, then it is assumed that the hardware diagnostic device is in error, and ERROR is set as FALSE. The other 13 bits are to be ignored. It is decided that the input patterns given in Table 11.1 (in hexadecimal) and corresponding expected outputs will be used to test the module. This test case seems sufficient to test the module. Exhaustive testing would have required $2^{16}$ test cases.
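The module described in Example 11.4 is small enough to sketch directly. The Python version below is one possible reading of the stated requirements, not the book's implementation. The names `bits_error` and `ERROR_BITS_MASK` are hypothetical, and the test vectors are the four black box patterns of Table 11.1.

```python
ERROR_BITS_MASK = 0x0007  # the three least significant bits of STATUS


def bits_error(status):
    """Return the ERROR flag for a 16-bit STATUS word.

    Any of the three low bits set means an error, unless all three are set,
    in which case the diagnostic device itself is assumed faulty and the
    flag is FALSE. The upper 13 bits are masked off and ignored.
    """
    low_bits = status & ERROR_BITS_MASK
    if low_bits == ERROR_BITS_MASK:
        return False
    return low_bits != 0


# Black box test patterns (hexadecimal) and expected outputs from Table 11.1.
BLACK_BOX_CASES = {0x0001: True, 0x0002: True, 0x0004: True, 0x0007: False}

for status, expected in BLACK_BOX_CASES.items():
    assert bits_error(status) == expected
print("all black box cases passed")
```

Because the input is only 16 bits wide, a loop over `range(2 ** 16)` against an independent reference would also be practical here, which is the exhaustive alternative the example mentions.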

TABLE 11.1 Black Box Test Cases for BITS Code

Status   Error
0001     True
0002     True
0004     True
0007     False

An important aspect of using black box testing techniques is that clearly defined interfaces to the modules are required. This places additional emphasis on the application of Parnas partitioning principles to module design.

11.3.1.2 White Box Testing

As we just mentioned, one disadvantage of black box testing is that it can often bypass unreachable or dead code. In addition, it may not test all of the flow-of-control paths through the module. This can lead to latent errors. A technique called white box or clear box testing can be used to solve this problem.



White box tests are designed to exercise all paths in the module and thus are logic driven. (Black box tests are data driven.) The goal is to try to test every line of code.

EXAMPLE 11.5

Consider the previous example. The set of test cases we generated tested the "corner cases" for the module, that is, one or all error bits set. A white box test set would seek to exercise the path in the module where only two error bits might be set. In addition it would exercise any paths where the "don't care" bits are inadvertently tested. It is therefore decided that the white box input patterns (in hexadecimal) and corresponding expected outputs given in Table 11.2 will be used to test the module. This test scheme is more robust than the "corner case" test picked in Example 11.4.

TABLE 11.2 White Box Test Cases for BITS Code

Status   Error
0001     True
0002     True
0003     True
0004     True
0005     True
0006     True
FFF8     False

11.3.1.3 Group Walkthroughs

Group walkthroughs or code inspections are a kind of white box testing in which a number of persons inspect the code, line by line, with the unit author. The author presents each line of code to a review group, thus rethinking the logic and often finding errors in the process. The persons in the review group include users of the modules under review, specification writers, testers, and peer programmers. This technique provides excellent control of the coding techniques used and permits a good check on logical errors that may have been introduced. In addition, any unreachable code can be detected, and more elegant or faster code is sometimes suggested. Group walkthroughs are recommended for use in any testing strategy. An excellent discussion of code inspection techniques can be found in [41]. IEEE 1028-1988, Software Reviews and Audits, provides a step-by-step recipe for inspections.

11.3.1.4 Formal Program Proving

Formal program proving is a kind of white box testing that treats the specification and code as a formal theorem to be proved. We will not discuss this test here for three reasons. First, some experts are skeptical of its viability in large systems [32], [45], [96]. Second, formal program proving for real-time systems requires the use of methods including temporal logic or process algebra, both of which are beyond the scope of this text. Finally, few commercial tools are available to facilitate this kind of testing. The interested reader can see [98], [102], or [112] for a discussion of some of these instruments.

11.3.2 System-Level Testing

Once individual modules have been tested, then subsystems or the entire system needs to be tested. In larger systems, the process can be broken down into a series of subsystem tests and then a test of the overall system.

System testing views the entire system as a black box so that one or more of the black box testing techniques can be applied. System-level testing always occurs after all modules pass their module-level test. At this point the coding team hands the software over to the test team for validation. If an error occurs during system-level testing, the error must be repaired. Ideally, every test case involving the changed module must be rerun, and all previous system-level tests must be passed in succession. The collection of system test cases is often called a system test suite.

Burn-in testing is a type of system-level testing done in the factory, which seeks to flush out those failures appearing early in the life of the system, and thus to improve the reliability of the delivered product.

System-level testing is usually followed by alpha testing, which is a type of validation consisting of internal distribution and exercise of the software. This testing is followed by beta testing, where preliminary versions of validated software are distributed to friendly customers who test the software under actual use. Later in the life cycle of the software, if corrections or enhancements are added, then regression testing is performed. Regression testing (which can also be performed at the module level) is used to validate the updated software against the old set of test cases that have already been passed. Any new test cases needed for the enhancements are then added to the test suite, and the software is validated as if it were a new product.

11.3.3 Statistically Based Testing

A technique useful for both unit and system-level tests is statistically based testing. This kind of testing uses an underlying probability distribution function for each system input to generate random test cases. This simulates execution of the software under realistic conditions. The statistics are usually collected by expert users of similar systems or, if none exist, by educated guessing. The theory is that system reliability will be enhanced if prolonged usage of the system can be simulated in a controlled environment.

The major drawback of such a technique is that the underlying probability distribution functions for the input variables may be unavailable or incorrect. In addition, randomly generated test cases are likely to miss conditions with low probability of occurrence. Precisely this kind of condition is usually overlooked in the design of the module. Failing to test these scenarios is an invitation to disaster.
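A minimal Python sketch of the idea follows. It draws test inputs from an assumed usage profile and then reports which input conditions the random sample never produced. The profile, its probabilities, and the condition names are all hypothetical; a real profile would come from usage statistics of the kind described above.

```python
import random


def generate_test_cases(distribution, count, seed=0):
    """Draw `count` test inputs according to a {condition: probability} profile.

    A fixed seed keeps the generated suite repeatable between runs.
    """
    rng = random.Random(seed)
    conditions = list(distribution)
    weights = [distribution[c] for c in conditions]
    return rng.choices(conditions, weights=weights, k=count)


# Hypothetical usage profile for one system input.
profile = {
    "normal_reading": 0.98,
    "sensor_dropout": 0.019,
    "all_error_bits_set": 0.001,
}

cases = generate_test_cases(profile, count=100)
never_generated = set(profile) - set(cases)
print("conditions missed by this sample:", never_generated)
```

With a small sample, a condition of very low probability can easily be absent from the generated cases, which is the drawback the text warns about. Adding such cases by hand is one way to compensate.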

11.3.4 Cleanroom Testing

Some current research [139] focuses on a "cleanroom" software development technique to eliminate software errors and reduce system testing. In this approach, the development team is not allowed to test code as it is being developed. Rather, syntax checkers, code walkthroughs, group inspections, and formal verifications are used to ensure product integrity. Statistically based testing is then applied at various stages of product development by a separate test team. This technique reportedly produces documentation and code that are more reliable and maintainable and easier to test than other development methods.

The principal tenet of cleanroom software development is that, given sufficient time and with care, error-free software can be written. Cleanroom software development relies heavily on group walkthroughs, code inspections, code reading by stepwise abstraction, and formal program validation. It is taken for granted that software specifications exist that are sufficient to completely describe the system.

The program is developed by slowly "growing" features into the code, starting with some baseline of functionality. At this first and subsequent milestones, an independent test team checks the code against a set of randomly generated test cases based on a set of statistics describing the frequency of use of each feature specified in the requirements. This group tests the code incrementally at predetermined milestones, and either accepts or returns it to the development team for correction. Once a functional milestone has been reached, the development team adds to the "clean" code, using the same techniques as before. Thus, like an onion skin, new layers of functionality are added to the software system until it has completely satisfied the requirements.

The programmers are not allowed to test any of their code on a computer other than to do syntax checking. Certain aspects of this technique have been used successfully, and several projects have been developed in this way, in both academic and industrial environments [139].

This approach is experimental, and it does have several problems [96]. The main problem is the lack of studies indicating the amount of overhead required for the cleanroom approach. It is surmised that personnel requirements must be increased and schedules attenuated to accommodate this technique.



11.3.5 Stress Testing

In another type of testing, stress testing, the system is subjected to a large disturbance in the inputs (for example, a large burst of interrupts), followed by smaller disturbances spread out over a longer period of time.

11.4 FAULT TOLERANCE

Fault tolerance is the ability of the system to continue to function in the presence of hardware or software failures [51]. In real-time systems, fault tolerance includes design choices that transform hard real-time deadlines into soft ones. These are often encountered in interrupt driven systems, which can provide for detecting and reacting to a missed deadline.

Fault tolerance designed to increase reliability in real-time systems can be classified in two varieties, spatial and temporal [85]. Spatial fault tolerance includes methods involving redundant hardware or software, whereas temporal fault tolerance involves techniques that allow for tolerating missed deadlines. Of the two, temporal fault tolerance is the more difficult to achieve because it requires careful algorithm design. We discuss variations of both techniques in the next several sections.

11.4.1 General Problem Handling

The reliability of most hardware can be increased using spatial fault tolerance: three or more redundant devices connected via a majority rule voting scheme. Another popular scheme uses two or more pairs of redundant hardware devices. Each pair compares its output to its companion. If the results are unequal, the pair declares itself in error and the outputs are ignored. In either case, the penalty is increased cost, space, and power requirements.

Voting schemes can also be used in software to increase algorithm robustness. Often information is processed from more than one source and reduced to some sort of best estimate of the actual value. For example, an aircraft's position can be determined via information from satellite positioning systems, inertial navigation data, and ground information. A composite of these readings is made using a mathematical construct called a Kalman filter. Design and analysis of Kalman filters is complex and beyond the scope of this text.
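The two redundancy schemes described above can be sketched in software terms as follows. This is a minimal Python illustration, assuming each redundant device reports a directly comparable value and that at least one reading is supplied. The function names, and the use of `None` to mean "no trustworthy output", are choices made for this sketch rather than anything prescribed by the text.

```python
from collections import Counter


def majority_vote(readings):
    """Return the value reported by a strict majority of the redundant
    devices, or None when no value has a majority."""
    value, votes = Counter(readings).most_common(1)[0]
    return value if votes > len(readings) / 2 else None


def pair_output(a, b):
    """Compare-pair scheme: the pair's output is used only when both halves
    agree; otherwise the pair declares itself in error (None)."""
    return a if a == b else None


print(majority_vote([5, 5, 7]))  # prints 5
print(pair_output(4, 5))         # prints None
```

Real sensor values rarely match exactly, so a practical voter would compare within a tolerance; exact equality is used here only to keep the sketch short.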

11.4.1.1 Checkpoints

At fixed locations in code, intermediate results can be written to files, printers, or memory for diagnostic purposes. These locations, called checkpoints, can be used during system operation and during system verification (see Figure 11.6). If the checkpoints are used only during testing, then this code is known as a test probe. Test probes can introduce subtle timing errors in the code which are difficult to diagnose (see the discussion on the software Heisenberg uncertainty principle in Chapter 13).

Figure 11.6 Checkpoint implementation.

11.4.1.2 Recovery Block Approach

Fault tolerance can be further increased by using checkpoints in conjunction with predetermined reset points in software. These reset points mark recovery blocks in the software. At the end of each recovery block, the checkpoints are tested for "reasonableness." If the results are reasonable, then flow-of-control passes to the next recovery block. If the results are not reasonable, then processing resumes at the beginning of that recovery block (or some other previous one) (see Figure 11.7). The point, of course, is that some hardware device (or another process that is independent of the one in question) has provided faulty inputs to the block. By repeating the processing in the block, with presumably valid data, the error will not be repeated.

In the process block model, each recovery block represents a redundant parallel process to the block being tested. Equation 11.2 demonstrates why this method increases reliability. Unfortunately, although this strategy increases system reliability, it can have a severe impact on real-time performance because of the overhead added by the checkpoint and repetition of the processing in a block.

Figure 11.7 Recovery block implementation.
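The recovery block idea can be sketched as a retry loop around a unit of processing. The Python version below is an illustration under stated assumptions: the block's inputs are re-read on every attempt, the reasonableness test is supplied by the caller, and the number of attempts is bounded (the bound is an addition made here so that the loop cannot run forever; the text does not specify one). All names are hypothetical.

```python
def run_recovery_block(process, read_inputs, is_reasonable, max_attempts=3):
    """Run one recovery block, restarting it while its result is unreasonable.

    Returns (result, attempts_used). Raises RuntimeError if no attempt
    passes the reasonableness test.
    """
    for attempt in range(1, max_attempts + 1):
        result = process(read_inputs())   # processing inside the block
        if is_reasonable(result):         # checkpoint test at the end of the block
            return result, attempt
    raise RuntimeError("recovery block failed the reasonableness test on every attempt")


# Hypothetical scenario: the first input is corrupted, the second is valid.
readings = iter([9999, 100])
result, attempts = run_recovery_block(
    process=lambda x: x * 2,
    read_inputs=lambda: next(readings),
    is_reasonable=lambda r: 0 <= r <= 500,
)
print(result, attempts)  # prints 200 2
```

The repeated processing is where the time-loading penalty mentioned above comes from: in the worst case the block runs `max_attempts` times.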



11.4.2 N-Version Programming

In any system, a state can be entered where the system is rendered ineffective or locks up. This is usually due to some untested flow-of-control in the software for which there is no escape. In systems terminology we would say that event determinism has been violated.

In order to reduce the likelihood of this sort of catastrophic error, redundant processors are added to the system. These processors are coded to the same specifications but by different programming teams. It is therefore highly unlikely that more than one of the systems can lock up under the same circumstances. Since each of the systems usually resets a watchdog timer, it quickly becomes obvious when one of them is locked up, because it fails to reset its timer. The other processors in the system can then ignore this processor, and the overall system continues to function. This technique is called N-version programming, and it has been used successfully in a number of projects including the space shuttle general purpose computer (GPC).

The redundant processors can use a voting scheme to decide on outputs, or, more likely, there are two processors: master and slave. The master processor is on-line and produces the actual outputs to the system under control, whereas the slave processor shadows the master off-line. If the slave detects that the master has become hung up, then the slave goes on-line.

11.4.3 Built-In-Test Software

Built-in-test software, or BITS (also called built-in-software test or BIST), can enhance fault tolerance by providing ongoing diagnostics of the underlying hardware for processing by the software. BITS is especially important in embedded systems. For example, if an I/O channel is functioning incorrectly as determined by its on-board circuitry, the software may be able to shut off the channel and redirect the I/O.

Although BITS is an important part of embedded systems, it adds significantly to the worst case time-loading analysis. This must be considered when selecting BITS and when interpreting the time-loading numbers that result from the additional software. In the next subsections we discuss built-in testing for a variety of hardware components.
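Returning to the master and slave arrangement described under N-version programming, the watchdog mechanism can be sketched as a simple simulation. The Python code below models time as integer ticks passed in by the caller; in a real system the timer would be a hardware device. The class and function names, and the timeout value, are assumptions made for this illustration.

```python
class WatchdogTimer:
    """Simulated watchdog: it expires if not reset within `timeout` ticks."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_reset = 0

    def reset(self, now):
        """Called periodically by a healthy processor."""
        self.last_reset = now

    def expired(self, now):
        return (now - self.last_reset) > self.timeout


def processor_on_line(master_watchdog, now):
    """The slave goes on-line once the master's watchdog has expired."""
    return "slave" if master_watchdog.expired(now) else "master"


watchdog = WatchdogTimer(timeout=3)
for tick in range(1, 6):
    watchdog.reset(tick)  # the master is healthy on ticks 1 through 5
print(processor_on_line(watchdog, now=6))   # prints master
print(processor_on_line(watchdog, now=10))  # prints slave
```

After tick 5 the simulated master stops resetting its timer, so by tick 10 the timeout has been exceeded and control passes to the slave.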

11.4.4 CPU Testing

It is probably more important that the health of the CPU be checked than any other component of the system. A set of carefully constructed tests can be performed to test the efficacy of its instruction set in all addressing modes. Such a test suite will be time consuming and thus should be relegated to background processing. Interrupts should also be disabled during each sub-test to protect the data being used.



There is a catch-22 involved in using the CPU to test itself. If, for example, the CPU detects an error in its instruction set, can it be believed? If the CPU does not detect an error that is actually present, then this, too, is a paradox. These problems should not be cause for omitting the CPU instruction set test.

11.4.5 Memory Testing

All types of memory, including nonvolatile memories, can be corrupted via electrostatic discharge, power surging, vibration, or other means. This damage can manifest itself either as a permutation of data stored in memory cells or as permanent damage to the cell itself. Corruption of both RAM and ROM by randomly encountered charged particles is a particular problem in space. These single-event upsets do not usually happen on earth because either the earth's magnetosphere deflects the offending particle or the mean free path of the particle is not sufficient to bring it to the surface. A discussion of this problem, and how to deal with it via software, can be found in [84]. Many of these concepts are discussed here.

Damage to the contents of memory is called a soft error, whereas damage to the cell itself is called a hard error. In Chapter 2 we discussed some of the characteristics of memory devices, and referred to their tolerance to upset. We are particularly interested in techniques that can detect an upset to a memory cell and then correct it.

11.4.5.1 ROM The contents of ROM memory are often checked by comparing a known checksum with a current checksum. The known checksum, which is usually a simple binary addition of all program-code memory locations, is computed at link time and stored in a specific location in ROM. The new checksum can be recomputed in a slow cycle or background processing, and compared against the original checksum. Any deviation can be reported as a memory error. Checksums are not a very desirable form of error checking because errors to an even number of locations can result in error cancellation. For example, an error to bit 12 of two different memory locations may cancel out in the overall checksum, resulting in no error being detected. In addition, although an error may be reported, the location of the error in memory is unknown.

A reliable method for checking ROM memory uses a cyclic redundancy code (CRC). The CRC treats the contents of memory as a stream of bits and each of these bits as the binary coefficient of a message polynomial (an extremely long one) (see Figure 11.8). A second binary polynomial of much lower order (for example, 16 for the CCITT or CRC-16 standards) called the generator polynomial is divided (modulo-2) into the message, producing a quotient and a remainder. Before dividing, the message polynomial is appended with one 0 bit for each order of the generator (sixteen 0s for the 16th-order generators given below). The remainder from the modulo-2 division of the padded message is the CRC check value. The quotient is discarded.

Figure 11.8 Cyclic redundancy code implementation. [Figure: the memory contents followed by the appended bits form the message, running from the high-order bit of the polynomial to the lowest-order bit of the polynomial; the division produces the remainder.]

NOTE 11.1 The CCITT generator polynomial is

$$X^{16} + X^{12} + X^{5} + 1 \tag{11.11}$$

whereas the CRC-16 generator polynomial is

$$X^{16} + X^{15} + X^{2} + 1 \tag{11.12}$$

A CRC can detect all 1-bit errors and virtually all multiple-bit errors. The source of the error, however, cannot be pinpointed.

EXAMPLE 11.6
ROM consists of 64 kilobytes of 16-bit memory. CRC-16 is to be employed to check the validity of the memory contents. The memory contents represent a polynomial of at most order 65,536 · 16 = 1,048,576. (Whether the polynomial starts from high or low memory does not matter as long as you are consistent.) After appending the polynomial with sixteen 0s, the polynomial is at most of order 1,048,592. This so-called message polynomial is then divided by the generator polynomial $X^{16} + X^{15} + X^{2} + 1$, producing a quotient, which is discarded, and the remainder, which is the desired CRC check value.
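The modulo-2 division and the checksum cancellation problem can both be illustrated with a short sketch. It is not taken from the programs cited in the text: the function names and parameters are hypothetical, the memory is modeled as a Python sequence of integers, and the division is done one bit at a time for clarity rather than speed.

```python
def crc_remainder(units, unit_bits=16, generator=0x1021, order=16):
    """Remainder of the modulo-2 division of the zero-padded message.

    `generator` holds the low `order` coefficients of the generator
    polynomial; the leading X**order term is implied. 0x1021 encodes
    X^16 + X^12 + X^5 + 1 (CCITT) and 0x8005 encodes X^16 + X^15 + X^2 + 1
    (CRC-16).
    """
    mask = (1 << order) - 1
    top = 1 << (order - 1)
    remainder = 0

    def shift_in(bit):
        nonlocal remainder
        carry = remainder & top                  # coefficient about to leave the register
        remainder = ((remainder << 1) & mask) | bit
        if carry:
            remainder ^= generator               # modulo-2 subtraction of the generator

    for unit in units:                           # message bits, high-order bit first
        for i in range(unit_bits - 1, -1, -1):
            shift_in((unit >> i) & 1)
    for _ in range(order):                       # the appended 0 bits
        shift_in(0)
    return remainder


def additive_checksum(words, width=16):
    """Simple binary addition of all locations, truncated to `width` bits."""
    return sum(words) & ((1 << width) - 1)


# Bit 12 is wrong in two locations: set in word 0, cleared in word 1.
rom = [0x0000, 0x1000, 0x1234, 0xABCD]
corrupted = [0x1000, 0x0000, 0x1234, 0xABCD]
checksum_misses = additive_checksum(rom) == additive_checksum(corrupted)
crc_catches = crc_remainder(rom) != crc_remainder(corrupted)
```

Here `checksum_misses` and `crc_catches` both come out true: the two bit-12 errors cancel in the sum but not in the CRC. With the default arguments (zero initial remainder, no bit reflection) the routine corresponds to the parameterization commonly catalogued as CRC-16/XMODEM; passing `generator=0x8005` selects the CRC-16 polynomial of equation 11.12.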

In addition to checking memory, the CRC can be employed to perform nonvisual validation of screens by comparing a CRC of the actual output with the CRC of the desired output. The CRC of the screen memory is called a screen signature.

The CRC calculation is CPU-intensive, and should only be performed in background or at extremely slow rates. An excellent set of programs in C for computing CRCs can be found in [21].

11.4.5.2 RAM Because of the dynamic nature of RAM, checksums and CRCs are not viable. One way of protecting against errors to memory is to equip it with extra bits used to implement a Hamming code. Depending on the number of extra bits, known as the syndrome, errors to one or more bits can be detected and corrected. A good discussion of the Hamming code from a theoretical perspective can be found in [107], and a more practical discussion is given in [117]. Such coding schemes can be used to protect ROM memory as well.

Chips that implement Hamming code error detection and correction (EDC chips) are available commercially. Their operation is of some interest. During a normal fetch or store, the data must pass through the chip before going into or out of memory. The chip compares the data against the check bits and makes corrections if necessary (see Figure 11.9). The chip also sets a readable flag, which indicates that either a single or multiple bit error was found. Realize, however, that the error is not corrected in memory during a read cycle, so if the same erroneous data are fetched again, they must be corrected again. When data are stored in memory, however, the correct check bits for the data are computed and stored along with the word, fixing any errors, a process called RAM scrubbing.

Figure 11.9 Implementation of Hamming code error detection and correction.


In RAM scrubbing, the contents of a RAM location are simply read and written back. The error detection and correction occurs on the bus, and the corrected data are loaded into a register. Upon writing the data back to the memory location, the correct data and syndrome are stored. Thus, the error is corrected in memory as well as on the bus. RAM scrubbing is used in the space shuttle inertial measurement unit (IMU) computer [90].

The EDC device significantly reduces the number of soft errors, which will be removed upon rewriting to the cell, and hard errors, which are caused by stuck bits or permanent physical damage to the memory. The disadvantages of error detection and correction are as follows.

Additional memory is needed for the scheme (6 bits for every 16 bits, a 37.5% increase).
Additional power is required.
Acreage requirements are imposed.
An access time penalty of about 50 nanoseconds per access is incurred if an error correction is made.
Multiple bit errors cannot be corrected.
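The read, correct, and write-back cycle of RAM scrubbing can be sketched in software. This is a model only, since in the scheme described above the work is done by the EDC chip on the bus. It assumes a standard single-error-correcting, double-error-detecting Hamming layout (check bits at power-of-two positions plus one overall parity bit), which happens to give 6 extra bits for a 16-bit word; the names `edc_encode`, `edc_decode`, and `scrub` are hypothetical.

```python
def _is_check_position(pos):
    """Positions 1, 2, 4, 8, ... hold check bits."""
    return (pos & (pos - 1)) == 0


def edc_encode(data, data_bits=16):
    """Return the stored word as a list of bits (index 0 = overall parity)."""
    r = 0
    while (1 << r) < data_bits + r + 1:       # smallest r with 2**r >= data_bits + r + 1
        r += 1
    n = data_bits + r
    code = [0] * (n + 1)
    d = 0
    for pos in range(1, n + 1):               # data bits fill the non-check positions
        if not _is_check_position(pos):
            code[pos] = (data >> d) & 1
            d += 1
    for i in range(r):                        # each check bit makes its group's parity even
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) % 2               # overall parity, used to spot double errors
    return code


def edc_decode(code):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos                   # XOR of the positions holding a 1
    parity_error = sum(code) % 2 == 1
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error and syndrome <= n:
        code[syndrome] ^= 1                   # single error: the syndrome is its position
        status = "corrected"
    else:
        status = "uncorrectable"              # multiple-bit error: detected only
    data, d = 0, 0
    for pos in range(1, n + 1):
        if not _is_check_position(pos):
            data |= code[pos] << d
            d += 1
    return data, status


def scrub(memory, data_bits=16):
    """Read every word through the decoder and write corrected words back."""
    fixed = 0
    for addr, word in enumerate(memory):
        data, status = edc_decode(word)
        if status == "corrected":
            memory[addr] = edc_encode(data, data_bits)   # store correct data and check bits
            fixed += 1
    return fixed
```

A read through `edc_decode` alone leaves the stored word unchanged, which mirrors the point that a read cycle does not repair memory; only the write-back in `scrub` does. Multiple-bit errors are reported as uncorrectable and left in place.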

In the absence of error detecting and correcting hardware, basic techniques can be used to test the integrity of RAM memory. These tests are usually run upon initialization, but they can also be implemented in slow cycles if interrupts are appropriately disabled.

EXAMPLE 11.7
Suppose a computer system has 8-bit data and address buses to write to 8-bit memory locations. We wish to exercise the address and data buses as well as the memory cells. This is accomplished by writing and then reading back certain bit patterns to every memory location. Traditionally, the following hexadecimal bit patterns are used:

[Table: hexadecimal bit patterns]

The bit patterns are selected so that any cross-talk between wires can be detected. Bus wires are not always laid out consecutively, however, so that other cross-talk situations can arise. For instance, the above bit patterns do not check for coupling between odd-numbered wires. The following test set does:

[Table: hexadecimal bit patterns]

This test set, however, does not isolate the problem to the offending wire (bit). For complete coverage of 8 bits we need

$$7 + 6 + 5 + 4 + 3 + 2 + 1 = 28$$

combinations of 2 bits at a time. Since we have 8-bit words, we can test four of these combinations per test. Thus, we need

$$\frac{28}{4} = 7$$

8-bit patterns. These are given in the following table:

[Table: 8-bit patterns]

In general, for n-bit data and address buses writing to n-bit memory, where n is a power of 2, a total of $n(n-1)/2$ combinations of 2 are needed, which can be implemented in $n - 1$ patterns of n bits each.
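Two parts of this example lend themselves to a short sketch: the write-and-read-back test itself, and the counting argument that $n(n-1)/2$ pairs fit into $n - 1$ groups of $n/2$ pairs. Everything here is an assumption made for illustration. The default patterns 0xAA, 0x55, 0x00, and 0xFF are commonly used memory-test values, not necessarily those of the example's table; `SimulatedRam` is a hypothetical stand-in for real memory; and `pair_rounds` only shows one way to group the pairs (a round-robin schedule), not how each group is encoded as a bit pattern.

```python
class SimulatedRam:
    """8-bit RAM model; bits in stuck_mask read back as 1 at faulty_addr."""

    def __init__(self, size, faulty_addr=None, stuck_mask=0):
        self.cells = [0] * size
        self.faulty_addr = faulty_addr
        self.stuck_mask = stuck_mask

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        if addr == self.faulty_addr:
            value |= self.stuck_mask          # simulate stuck-at-1 bits
        return value


def pattern_test(ram, size, patterns=(0xAA, 0x55, 0x00, 0xFF)):
    """Write each pattern to every location, read it back, and list mismatches."""
    failures = []
    for pattern in patterns:
        for addr in range(size):
            ram.write(addr, pattern)
        for addr in range(size):
            got = ram.read(addr)
            if got != pattern:
                failures.append((addr, pattern, got))
    return failures


def pair_rounds(n):
    """Split all n*(n-1)/2 pairs of bit positions into n-1 rounds of n/2 disjoint pairs."""
    others = list(range(n - 1))               # position n-1 stays fixed, the rest rotate
    rounds = []
    for r in range(n - 1):
        rot = others[r:] + others[:r]
        pairs = [(rot[0], n - 1)]
        pairs += [(rot[i], rot[-i]) for i in range(1, n // 2)]
        rounds.append(sorted(tuple(sorted(p)) for p in pairs))
    return rounds
```

For `n = 8`, `pair_rounds` yields 7 rounds of 4 pairs covering all 28 pairs once, matching the count above. Note that `pattern_test` overwrites memory, so on a live system the contents of each location would have to be saved and restored with interrupts disabled.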

11.4.5.3 Other Devices Devices such as A/D converters, D/A converters, MUXs, I/O cards, and the like need to be tested continually. Many of these devices have built-in watchdog timer circuitry to indicate that the device is still on-line. The software can check for watchdog timer overflows and either reset the device or indicate failure.

In addition, the BITS can rely on the individual built-in tests of the devices in the system. Typically, these devices will send a status word via DMA to indicate their health. The software should check this status word and indicate failures as required.

11.4.6 Spurious and Missed Interrupts

Extraneous and unwanted interrupts not due to time-loading are called spurious interrupts. Spurious interrupts can destroy algorithmic integrity and cause run-time stack overflows or system crashes. Spurious interrupts can be caused by noisy hardware, power surges, electrostatic discharges, or single-event upsets. Missed interrupts can be caused in a similar way. In either case, hard real-time deadlines can be compromised, leading to system failure. It is the goal, then, to transform these hard errors into some kind of tolerable soft error.

11.4.6.1 Handling Spurious Interrupts Spurious interrupts can be tolerated by using redundant interrupt hardware in conjunction with a voting scheme. Similarly, the device issuing the interrupt can issue a redundant check, such as using DMA to send a confirming flag. Upon receiving the interrupt, the handler routine checks the redundant flag. If the flag is set, the interrupt is legitimate. The handler should then clear the flag. If the flag is not set, then the interrupt is bogus and the handler routine should exit quickly and in an orderly fashion. The additional overhead of checking the redundant flag is minimal relative to the benefit derived. Of course, extra stack space should be allocated to allow for at least one spurious interrupt per cycle to avoid stack overflow. Stack overflow caused by repeated spurious interrupts is called a death spiral.

Missed interrupts are more difficult to deal with. Software watchdog timers can be constructed that must be set or reset by the routine in question. Routines running at higher priority or at a faster rate can check these memory locations to ensure that they are being accessed properly. If not, the dead task can be restarted or an error indicated. The surest method for sustaining integrity in the face of missed interrupts is through the design of robust algorithms.

11.4.7 Dealing with Bit Failures

Unwanted flipped bits in memory, device registers, the CPU, and so forth can be the source of many types of system problems ranging from performance degradation to total system failure. These failures are due to a variety of sources including hardware failures, charged particle collisions, and radiation effects. A thorough discussion of dealing with these types of problems can be found in [90]. However, the main findings of this work and the relative costs of the remedies are summarized in Tables 11.3 and 11.4.

At the time of this writing, actual data on the relative efficiency of these techniques were unavailable, but it will be interesting to note which technique fares best.

TABLE 11.3 SEU Protection Mechanisms

Adverse Effect                 Remedy
Corruption of RAM data         EDC chip, RAM scrubbing
Corruption of ROM data         EDC chip
Corruption of PC               None
CPU latch-up                   Watchdog timer
I/O circuitry                  None
Spurious interrupts            Confirmation flags
Missed interrupts              Watchdog timer, counters
Mis-prioritized interrupts     Double check status register

TABLE 11.4 Costs of SEU Protection Mechanisms

Remedy                         Cost
RAM scrubbing                  None
EDC chip                       Increased memory access times
Watchdog timer                 Increased power, space, weight
Confirmation flags             Increased interrupt response times
Double check status register   Increased interrupt response times
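Two of the remedies in Table 11.3, confirmation flags for spurious interrupts and watchdog counters for missed interrupts, can be sketched as a simulation. The class and method names are hypothetical, and the interrupt and the DMA-written flag are modeled as ordinary method calls and attributes rather than real hardware.

```python
class InterruptChannel:
    """Interrupt line paired with a redundant confirmation flag."""

    def __init__(self):
        self.confirm_flag = False   # assumed to be set by the device, e.g. via DMA
        self.serviced = 0           # legitimate interrupts handled
        self.spurious = 0           # bogus interrupts rejected

    def device_interrupt(self):
        """Legitimate interrupt: the device sets the flag, then interrupts."""
        self.confirm_flag = True
        self.handler()

    def noise_interrupt(self):
        """Spurious interrupt: the handler runs although no flag was set."""
        self.handler()

    def handler(self):
        if not self.confirm_flag:
            self.spurious += 1      # bogus: exit quickly and in an orderly fashion
            return
        self.confirm_flag = False   # legitimate: clear the flag, then do the work
        self.serviced += 1


class SoftwareWatchdog:
    """Counter that a task must advance and that a monitoring routine checks."""

    def __init__(self):
        self.counter = 0
        self._last_seen = 0

    def kick(self):
        """Called by the monitored routine every time it runs."""
        self.counter += 1

    def task_alive(self):
        """Called by the monitor; False means no run since the previous check."""
        alive = self.counter != self._last_seen
        self._last_seen = self.counter
        return alive
```

In this sketch the monitor should call `task_alive` no more often than the monitored task is expected to run, otherwise a healthy task is reported as dead. What to do on a `False` result (restart the task or raise an error) is left to the caller.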

11.5 EXERCISES

1. Draw the subsystem configuration for a system with four subsystems and an overall reliability function given by

$$r_1(t)\,r_2(t) + r_3(t)\,r_4(t) - r_1(t)\,r_2(t)\,r_3(t)\,r_4(t)$$

2. For the system depicted in Figure 11.10, calculate the overall system reliability function.

Figure 11.10 System divided into subsystems with associated reliabilities.

3. For the system depicted in Figure 11.10, calculate the overall system failure function.

4. For the Pascal code depicted in Example 6.3, draw the flow graph and calculate McCabe's metric for this code.

5. Derive $r_{eq}(t)$ in equation 11.2.

6. Derive $r_{eq}(t)$ in equation 11.3.

7. For the Pascal code depicted in Example 6.5, draw the flow graph if the number of states is 5 and the alphabet size is 6. Calculate McCabe's metric for this code. What happens to McCabe's metric as the number of states increases? What happens as the alphabet size increases? Can you draw any conclusions about the reliability of table-driven code?

8. Calculate the cyclomatic complexity for all the Pascal code fragments in Chapters 7 and 8.

9. Calculate the Halstead metrics for Example 6.4 using the numbers calculated in 11.3.

10. Calculate the Halstead metric for all the Pascal code fragments in Chapters 7 and 8. How do the values for the level of abstraction, L, compare to the cyclomatic complexity, C, of the McCabe metric?

11. A software module is to take as inputs four signed 16-bit integers and produce two outputs, the sum and average. How many test cases would be needed for a brute-force testing scheme? How many would be needed if the minimum, maximum, and average values for each input were to be used?

12. A real-time system has a fixed number of resources of types A, B, and C. There are five tasks in the system, and the maximum amount of resources A, B, and C needed for each task is known. Implement a banker's algorithm scheme in the language of your choice.

13. Describe the effect of the following BITS and reliability schemes without appropriately disabling interrupts. How should interrupts be disabled?
(a) RAM scrubbing
(b) CRC calculation
(c) RAM pattern tests
(d) CPU instruction set test

14. Suppose a computer system has 16-bit data and address buses. What test patterns are necessary and sufficient to test the address and data lines and the memory cells?

15. Write a module in the language of your choice that generates a CRC checkword for a range of 16-bit memory. The module should take as input the starting and ending addresses of the range, and output the 16-bit check word. Use either CCITT or CRC-16 as generator polynomials.

16. In N-version programming, the different programming teams code from the same set of specifications. Discuss the disadvantages of this (if any).
