
IEEE TRANSACTIONS ON RELIABILITY, VOL. 47, NO. 3-SP 1998 SEPTEMBER SP-333

Communications Reliability: A Historical Perspective

Henry A. Malec, Senior Member IEEE 3Com Corporation, Mount Prospect

Key Words - Communications reliability, Down-time, Availability, History, ‘$L9000.

net and World Wide

continuous service.

It discusses the notable contribu-

sion.” (Reizeman 199

MTBF mean time between failures
PABX Private Automatic Branch Exchange

Units

In the 1950s, the d reference for communications was the “AT&T Principles of Electricity applied to Telephone and Telegraph [a]. However, there was no

1The singular & plural of an acronym are always spelled the same.

2. THE 1960s

In the early 1960s, the US Military Communications

were analyzed; the main reliability metrics were MTBF and the Up-Time Ratio (UTR). In 1962, typical prediction values for communication-system equipment were an MTBF of 90 hr, a repair time of 30 min, and a resulting UTR of 0.9942 for a simplex configuration, and an MTBF of 4000 hr and a resulting UTR of 0.9999 for a system configuration with standby capability.
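The up-time ratios quoted above can be approximated with the usual steady-state ratio MTBF / (MTBF + repair time). The sketch below assumes that definition, which the text does not state explicitly; the function name is illustrative, and applying the 30-min repair time to the standby case is also an assumption. With the simplex figures the ratio comes out near 0.9945, close to but not identical with the quoted 0.9942, so the 1962 predictions may have used a slightly different definition or rounding.

```python
def up_time_ratio(mtbf_hr: float, repair_hr: float) -> float:
    """Steady-state up-time ratio, assuming UTR = MTBF / (MTBF + repair time)."""
    if mtbf_hr <= 0 or repair_hr < 0:
        raise ValueError("MTBF must be positive and repair time non-negative")
    return mtbf_hr / (mtbf_hr + repair_hr)


# Figures quoted in the text: 90 hr MTBF with a 30-min repair time (simplex),
# and 4000 hr MTBF for the standby configuration (repair time assumed equal).
simplex_utr = up_time_ratio(90.0, 0.5)
standby_utr = up_time_ratio(4000.0, 0.5)

print(f"simplex UTR: {simplex_utr:.4f}")
print(f"standby UTR: {standby_utr:.4f}")
```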

The US Air Force sponsored an important reliability program that addressed “System Aspects of Reliable Communications” [4], which resulted in the issuance of a final report in 1965. This was one of the first approaches aimed at predicting system reliability with mean downtime and availability considered as important reliability measures. Unavailability formulas were presented for n equipments with 1 repairman for series, parallel and standby configurations2. The optimization of repeater-spacing for yielding the highest availability was also presented. Communication systems studied included Dew Drop, Dew Line, Dew East, White Alice, and the TD-2 stretching from Iceland to Thule through to Alaska. The study set the standard for detailed reliability analysis, and presented findings such as “the unavailability time for tropospheric scatter equipment is from 10 to 30 times that for TD-2 line-of-sight equipment.”
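As an illustration of the kind of unavailability formula described above, the following sketch computes steady-state unavailability for n identical equipments served by a single repairman. It is not taken from the 1965 report [4]: it assumes constant failure and repair rates (exponential times), a birth-death Markov model in which every unfailed unit keeps operating (also in the series case), and illustrative function names and rate values.

```python
from math import prod


def state_probabilities(n: int, failure_rate: float, repair_rate: float) -> list[float]:
    """Steady-state P(k units failed), k = 0..n, for n identical units and 1 repairman.

    Assumed birth-death chain: with k units failed, the next failure occurs at
    rate (n - k) * failure_rate, and the single repairman restores one unit at
    repair_rate.
    """
    rho = failure_rate / repair_rate
    # Unnormalized weight of state k is the product of (n - i) * rho for i < k.
    weights = [prod((n - i) * rho for i in range(k)) for k in range(n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def parallel_unavailability(n: int, failure_rate: float, repair_rate: float) -> float:
    """Parallel in the logic-diagram sense: down only when all n units are failed."""
    return state_probabilities(n, failure_rate, repair_rate)[n]


def series_unavailability(n: int, failure_rate: float, repair_rate: float) -> float:
    """Series in the logic-diagram sense: down whenever any unit is failed."""
    return 1.0 - state_probabilities(n, failure_rate, repair_rate)[0]


# Hypothetical rates: one failure per 90 hr per unit, repairs completed at 2 per hr.
lam, mu = 1.0 / 90.0, 2.0
print(f"1 unit           : {parallel_unavailability(1, lam, mu):.6f}")
print(f"2 units, series  : {series_unavailability(2, lam, mu):.6f}")
print(f"2 units, parallel: {parallel_unavailability(2, lam, mu):.6f}")
```

Under these assumptions a single unit reduces to the familiar ratio of failure rate to the sum of failure and repair rates, which is a quick sanity check on the model.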

The year 1965 was also important for the reliability community with the launching of the first commercial satellite [5], Early Bird, into a geosynchronous orbit 22240 miles above the earth. The satellite had a design life of 18 months and could provide 240 2-way voice circuits between Europe and the USA. This satellite was later called INTELSAT I.

During the 1960s, commercial telephone-switching systems dominated system reliability advances in the communication industry, following the leading studies of the military. Concerns were categorized [6] into 5 basic schools:

1. system outage, or downtime, over 40 yr,
2. system outage and degradation of service,
3. faults per line per calendar unit,
4. MTBF for service or equipment,
5. equipment duplication at specific levels.

Many of these schools still exist; the most influential one is the Bell Telephone [7] objective: ‘no more than 2 hr downtime in 40 yr’.

2The terms series & parallel are used in their logic-diagram sense, irrespective of the schematic-diagram or physical-layout.

0018-9529/98/$10.00 ©1998 IEEE


The reliability objective of ‘no more than 2 hr downtime in 40 yr’, which converts into ‘no more than 3 min/yr’ and applies to voice communication, has been followed for the past 35 yr and continues even today. In 1964, the L.M. Ericsson Company [8] of Sweden stated the reliability goals for the AKE telephone switching series in terms of MTBF with: ‘The average time of 40 yr between service interruptions’. In 1967 April the L.M. Ericsson Company [9] modified their requirements to compare with Bell’s and then stated: “Reliability should be such that not more than 1-2 hr downtime should be tolerated over a period of 40 yr”.
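The conversion from ‘2 hr downtime in 40 yr’ to ‘3 min/yr’ is simple arithmetic, sketched below together with the availability it implies. The function names and the 365.25-day year are assumptions made for the example; only the 2 hr and 40 yr figures come from the text.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average 365.25-day year


def downtime_min_per_year(downtime_hr: float, period_yr: float) -> float:
    """Spread a total downtime allowance (hours) evenly over a period (years)."""
    return downtime_hr * 60.0 / period_yr


def availability(downtime_min_yr: float) -> float:
    """Fraction of the year in service for a given downtime in minutes per year."""
    return 1.0 - downtime_min_yr / MINUTES_PER_YEAR


bell_objective = downtime_min_per_year(2, 40)
print(f"downtime: {bell_objective} min/yr")
print(f"implied availability: {availability(bell_objective):.6f}")
```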

During the 1960s, the majority of telephone switching-system manufacturers adopted the system-outage format for their reliability objectives. In 1967, the Stromberg-Carlson Corporation [10] quoted reliability objectives for their electronic systems as: “Catastrophic failure, which means no service, shall not be more than once in 40 yr”. Northern Electric of Canada [11] joined this school of reliability for their SP-1 System in 1969 with: “RELIABILITY: A downtime of less than 2 hr in 40 yr”.

In the majority of the communication switching systems that carried the ‘no more than 2 hr downtime in 40 yr’ type of reliability objective, there were usually additional reliability performance goals. Often the additional goals specified the allowed number of mishandled or lost calls in terms of 1, 2, 3, 5, or 10 per 10^4 unit calls. For example, Bell Telephone [7] stated, “The calls handled incorrectly should not exceed 0.02%”.

In 1967, GTE - Automatic Electric Laboratories defined school #2 of system outage and degradation of service. In addition to system-outage goals and specifications for the allowed number of mishandled calls, this school assigned goals to levels of degradation of service in terms of average units per terminal per year. Toll systems and end offices would have differing specific goals; however, both would address the elements of school #2. For example, the goal structure defined for the GTE-AE No. 2 EAX [12] was:

1. average 2 hr or less of extreme loss of service over a 40-yr period with an average of 5 years or greater between such occurrences,

2. average of 1 hr or less per year, of loss of service, or serious degradation of service, per line or incoming trunk, averaged over all subscriber-lines and incoming-trunks,

3. no more than 1 in 10^4 of all locally originating and incoming calls are misdirected, unsuccessfully terminated, or prematurely disconnected as a result of equipment malfunction and/or failures,

4. no more than 1 in 10^4 of all locally originating and incoming calls are misdirected, unsuccessfully terminated, or prematurely disconnected as a result of all other causes (eg, transients, noise, or design deficiencies).

School #3 concentrated on the frequency-of-failures or faults-per-line per calendar unit in order to determine the reliability of communication systems. In 1967, ITT - Standard Elektrik Lorenz AG of Stuttgart, Germany [13], reported measuring switching-system reliability in terms of “Failures per 100 lines per month”. This school further evolved in the 1960s, and reported results in the early 1970s. For example, a paper in the BHG Telecommunications Review from Budapest, Hungary [14] reported on the reliability of Quasi-Electronic exchanges in terms of errors per 100 subscribers per year. Erroneous operation included:
• number of errors reported by subscribers,
• number of errors experienced during test calls,
• metered error rates,
• errors revealed & eliminated,
• a quantity derived from disturbances, error maintenance, or idling time which considered inconvenience resulting from erroneous operation (in 1971, L.M. Ericsson used a similar measure for the ARK 522 rural exchange [15] with a reliability measure of “Faults per 1000 subscribers per year”).

School #4 defined terms simply with the measurement of MTBF. In 1969, the British Post Office Department of Telecommunications [16] specified 4 categories of reliability requirements:

1. service to individual customers in terms of MTBF,
2. service to groups of customers - x% in terms of MTBF,
3. MTBF for complete exchange failure,
4. reduction in grade of service - various reductions in terms of MTBF.

The US military joined this school #4 with the Overseas AUTOVON Switch - 490-L Communications System with requirements:
• the anticipated mean time to switching-center outage shall be no less than 1 yr,
• the mean time between part failures shall be no less than 8 hr.

School #5 required that the system-design incorporate equipment duplication at various levels. In 1967, ITT - Bell Telephone in Antwerp, Belgium [17] specified that in order to achieve high reliability, the system would have, “Duplication of all circuits serving more than 64 subscribers”. Also in 1967, another ITT unit, Standard Elektrik Lorenz AG specified [13]: “All control equipment serving more than 100 lines have been subdivided into individual units and duplicated”.

In the late 1960s, there were many modifications & expansions to the various schools of reliability. Bell Laboratories [18] published reliability results on the No. 1 ESS in terms of the:
• maintenance interrupts per switch per week,
• number of customer complaints per 100 telephones per week.

Later [19], they discussed reliability results in terms of the:
• number of maintenance interrupts per month,
• number of customer reports per 10^4 main stations per month.


Table 1: Communication Goal-Categories
[SysOut ≡ system outage] [DOS ≡ degradation of service]

Product or System: Telephone Instrument; Electronic Key System; PABX; Traffic Service Position System; Class 5, 4, or 3 Office

Goal Categories: MTBF; a. Complete LOS; b. Major LOS; c. Minor LOS; d. Mishandled calls; e. SysOut; f. LOS; g. DOS

3. THE 1970s

ommunication industry, which reliability-performance stratifica-

tion for the dec come. The following years ex-

signs reliability go the telecommunica

various products & systems in ena [20]. Telephone reliability is

measured the ‘Complete Loss of

. Mishandled calls.

necessary to address the market on a cost-effective basis [20]. Some of the variables of the 4 product types to meet this wide offering were:
• 3 systems used solid-state digital networks and 1 used a magnetic-reed analog network,
• 3 different central processor configurations - simplex, hot-standby, and micro-synchronized,
• 2 different processor equipment types: microprocessor and mini-processor.

The reliability efforts in the 1970s were very focused on what reliability level of service the customer/user would obtain. However, the reliability planning, design, and measurement areas had a definite hardware performance focus.

The emphasis on hardware reliability resulted in the Cost-Effectiveness Parameter Tree in figure 1. Figure 1 presents the relationship [20] of the parameters with detailed reliability goal-categories.

This cost-effectiveness parameter tree was implemented in the 1970s and many telecommunication systems by various manufacturers could be mapped into the reliability performance model [20], including the No. 2 EAX, No. 3 EAX, Metaconta L, AKE, and No. 1 ESS. Each communications market-area applied the cost-effectiveness parameter tree to their specific market focus, and determined the detailed reliability specifications for the projected markets/applications. For example, table 2 shows the detailed PABX reliability specifications for a wide range of product sizes.

Reliability objectives, such as those in table 2, were subsequently allocated to specific hardware or system units (subsystems) using a set of design goals based on the Tree in figure 1, with the development of a market matrix similar to table 1. For example, the 3 detailed system-reliability design-goals of a communication system are:

1. The average loss of service, or serious degradation of service, experienced by lines or incoming trunks connected to the system is ≤ 1 hr/yr, averaged over all of the lines and incoming trunks.

2. No more than 1 in 10^4 of all locally originating and incoming calls will be misdirected, unsuccessfully terminated, or prematurely disconnected, as a result of equipment malfunction or failure.

3. There will be 2 hours or less of system outage averaged over a period of 40 yr, for system outage > 1 min.

Table 3 shows the initial allocation of subsystem objectives [20] for goals #1 & #3. This method for subsystem allocation has continued to be the preferred method for reliability allocation. However, allocations for software & procedural issues were generally added in the 1980s & 1990s, but without an agreed-upon method of modeling either software reliability or procedural issues.
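The service-denied allocations listed in Table 3 can be checked against goal #3 (2 hr or less of system outage in 40 yr). The sketch below assumes that subsystem unavailabilities simply add, which is reasonable for small values in a series arrangement; the variable names are illustrative and only the Table 3 figures come from the text.

```python
# Service-denied (ServDenied) allocations from Table 3, in minutes per 20 yr.
service_denied_min_per_20yr = {
    "Control & Memory": 40,
    "Main Power": 5,
    "Network": 10,
    "Peripheral Matrix": 5,
}

# Assumption: small unavailabilities of subsystems in series add directly.
total_min_per_20yr = sum(service_denied_min_per_20yr.values())
total_hr_per_40yr = total_min_per_20yr * 2 / 60

goal_hr_per_40yr = 2.0  # goal #3: 2 hr or less of system outage in 40 yr
meets_goal = total_hr_per_40yr <= goal_hr_per_40yr

print(f"allocated outage: {total_hr_per_40yr} hr per 40 yr")
print(f"within goal #3: {meets_goal}")
```

Under that additive assumption, the four allocations consume the whole 2-hr budget, leaving no margin for unallocated causes.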

Table 2: PABX Reliability Specifications

Lines: < 120, 200, 400, 600, 800, 1200, 3000, 5000

Common Control Performance: Catastrophic Failure - MTBF; SysOut Time per 20 yr; SysOut Frequency
Service Level: Complete LOS - MTBF; Major LOS - MTBF; Minor LOS - MTBF; DOS - Time/Year

10 yr 1 hr 1 hr 1 hr
≥5 yr ≥5 yr >5 yr 5 yr 10 yr 40 yr 40 yr 40 yr
200 d 400 d 300 d 200 d 150 d 1 yr 1 yr 60 d 60 d 50 d 40 d 30 d 30 d 15 d
1 hr

Table 3: Subsystem Reliability Objectives
[Misc ≡ Line Circuits, Trunks, Junctors, Service Circuits] [Serv ≡ Service]

Subsystem            Unavailability   Condition
Control & Memory     40 min/20 yr     ServDenied
Main Power           5 min/20 yr      ServDenied
Network              10 min/20 yr     ServDenied
Network              1 hr/yr          ServDegraded
Peripheral Matrix    5 min/20 yr      ServDenied
Misc                 1 hr/yr          ServDegraded

The question was often asked in the 1970s about how the plethora of switching systems with their many diverse reliability goals were performing against their stringent reliability design goals. In 1972, Bell Laboratories [21] reported that the No. 1 ESS was averaging about 12 hr downtime in 40 yr, which is 6 times the specified downtime. In 1973, they [22] stated: “the reliability has been improved to the point where less than 10 hr of total downtime is projected over a 40-yr operating span”. The majority of manufacturers in the 1970s used downtime criteria as the main measure of communication reliability. Northern Electric [23] specified that their Toll SP-1 “will have an anticipated downtime of less than 2 hr per 40 yr”. However, some of the manufacturers were continuing on with more specific reliability criteria that reflected service to both individual customers and groups of customers. L.M. Ericsson [24] published figures on the number of faults per 100 subscribers per year. Performance in France [25] was reported in similar terms: 6 faults per 10^4 calls, with additional information on people interventions for repairs/month. IBM France [26] followed the system outage reliability school and stated for the IBM 3750 PBX: “The system reliability is first of all characterized by the average interval of time between two successive system interruptions, so called Mean Time Between Failure. To be competitive with existing systems, the design objective of MTBF for the 3750 was 20 yr”.

4. THE 1980s

In the 1980s, software reliability was not only the major concern, but it dominated the communication-system reliability activities. The foundation was laid in the 1970s; table 4 presents the 1980 Generic Quality Metrics [27] for a large complex system with built-in fault tolerance. These quality metrics include reliability objectives and reflect the individual design/life cycle phases for a communication-system product.

Table 4 shows that field performance can see a lower level of Q&R performance than observed in either the Lab Test or in Field Performance. It was important that customer usage after field cut-over and the characteristic mix of their service requests could not be simulated either in the Lab Test or in initial field trials. Software Flood-Testing was performed with the goal of Breaking-the-Software; however, success was limited in the 1970s and is still limited today.

The amount of software code being written for systems such as shown in table 4 often exceeded 10^6 lines of developed non-comment code. It was stated [28] that for the system whose Q&R performance was detailed in table 4, it was not “technically or economically feasible to detect or fix all software problems in a system as large as the No. 4 ESS. Consequently, a strong emphasis has been placed on making it sufficiently tolerant of software errors to provide a successful operation and fault recovery in an environment containing software problems.” This issue continues today and will continue into the next millennium with the customers of communication systems requiring new complex features with delivery the next day, even though the software is not fully tested. In the late 1990s, customers began to demand engineering releases as the first operational shipments of a new product.

Figure 1: Cost-Effectiveness Parameter Tree [node labels: System Cost, System Effectiveness, Cost Effectiveness, Service Level]

Table 4: Subsystem Reliability Objectives
[In order to enter the Design Phase from the Requirements Phase, no Open-Questions are allowed]

Project Phases: Design, Lab, System, Field, Field Performance
Problems Fixed: 1/500 words, 1/1000 words, 1/1000 words
Problems Open: 1/5000 words, 1/5000 words, 1/2000 words, 1/2000 words
Interrupts, Audits, Service-Affecting Incidents, Re-initialization, Cut-Off Calls, Denied Calls: 0 < 10/day < 10/day < 25/day; 0 lB/office-month; 0 1/month; 0 < 0.2/10^4; 0 < 0.7/10^4

A major contribution to software reliability was the text, Software Reliability - Measurement, Prediction, Application in 1987 [29]. The main author had been involved in software reliability since 1973 at AT&T Bell Laboratories with data gathered on over 35 communication systems software projects. The main reliability predictors were software-failure-intensity, defect-density, and the estimate of the number-of-defects-remaining in the shipped product. This intense mathematical/theoretical approach was contrary to the approach in table 4 which was also generated in AT&T Bell Laboratories but had the premise that flood-testing and software-overload conditions in the testing phases would root out all important software bugs. On work in the 1980s on measuring software, it was concluded that “new models for forecasting and estimating reliability should be drawn up for approaching reality as it occurs in industry, both as regards the data that it is realistic to be able to collect and the activities implemented during the terminal phases of a software’s life cycle” [30].

The majority of software models developed in the 1980s & 1990s focused on the viewpoint of the developer and not the end user/customer. The development-oriented software reliability models asked: “What could be improved in the development process to produce software with fewer bugs?” An approach to looking at software reliability from the system-customer viewpoint is addressed in table 5 [31] and is based upon what reaction an individual bug caused in the operational system in the field. The software reliability model in table 5 considers that a given software bug (defect), in general, has various manifestation rates and their resultant frequency is characteristic of the level of criticality. If a bug causes a transient condition (reboot/reconfigure) or is catastrophic (requires manual/remote intervention), the results can be characterized from errors coming & going through the system being inoperable. This model can be structured into predicting reliability performance from the users/customers viewpoint of service performance.

Software reliability is often controlled in communication-system design by establishing a software-reliability design-process. The final measure of such a development is the System Test which includes the:
• number of outstanding priority problems,
• performance of the system as defined by audits, interrupts, re-initialization and other measures in table 4.

However, neither this process & testing approach nor the developmental approach of predicting the remaining defects will yield the desired prediction of field operation from the viewpoint of the customer. A model3 based on table 5 has recently proved successful in prediction in a data-over-voice system which included redundancy.

In 1981, a combined hardware/software modeling technique was established for the ITT 1240 System [32]. Figure 2 is the example model presented in a block diagram for line-to-line call-processing showing the hardware & software involved for this fully distributed system. The system was quite different from the usual large central-controlled communications and used distributed processors for call-control and met the objective of 1 hr downtime in 20 yr. The model also considered reload times and the availability of processors that were pooled.

In 1985, the Quality Assurance Management Committee (QAMC) of the IEEE Communications Society initiated a series of workshops to:
• assess the state-of-the-art of Q&R in the communication industry,
• promote informal discussions and research on various aspects of Q&R in communications.

Participants in these international workshops were technology leaders and/or senior management in the communication industry. The QAMC had its beginning in discussions among members of the Board of Governors early in 1983. The view emerged from those discussions that it was important, even urgent, to raise the consciousness of the Q&R of telecommunications products, quality of networks, and quality of service for the communication engineering community. This committee is still viable in 1998 with a name change to the IEEE Communications Society (COMSOC) Technical Committee on Communications Quality and Reliability (CQR), which reflects the continual increasing importance of reliability.

The QAMC efforts were documented with 4 special issues [33 - 36] during the 1980s to 1990s of the IEEE J. Selected Areas in Communications (JSAC). The first IEEE JSAC sponsored by the QAMC was issued in 1986 [33] and focused on:
• reliability field-results for communication systems, reliability case studies,
• performance of networks,
• evolving telecommunications.

3By the author of this paper.

Figure 2: Line-to-Line Call Processing Model [software blocks: Calling Line: Line Module; Calling Line: Call Control; Service Circuits Module; Called Line: Line Call Control; Called Line: Line Module]

This issue contained 30 papers and 6 editorials, and set the direction for the QAMC for the year to come.

Since the launch of the Early Bird in 1965, more than 100 commercial communication satellites [5] had been placed into geosynchronous orbit by 1985. The reliability design-life had grown from 18 months to an estimated 15 yr, with part increases from 3.5k to about 70k, and with a circuit capacity increase of 125 times. The Intelsat space segment had achieved from 1971 through 1986, a continuity of service of greater than 0.99996 - primarily because of the redundancy within each satellite and the use of contingency satellites within the space segment [5].

In 1986, ‘procedural’ problems were indicated to be a major concern for system outages, in an analysis of stored program control switching systems [37]. The fraction of downtime that was related to maintenance & administrative-procedural problems was 42% and the recovery software was accountable for 30%. Thus, 72% of the downtime, as reported in the study conducted by Bell Communications Research on various exchanges in Bell Operating Companies, was not due to the design or operation of the basic stored program control switching-system. This analysis concluded that the operating-system soft-


Table 5: Software Reliability Criticality Index

Level of Criticality   Defect Removal   Failure Type                Failure Characteristic
5                      1/month          Transient                   Errors come & go
4                      1/week           Transient                   Errors are replicated
3                      1/month          Transient or Catastrophic   Service is affected
2                      2/yr             Transient or Catastrophic   Service is partially down
1                      1/yr             Catastrophic                System stops

Bug Manifestation: 4/day, 2/week, 1/month, 0.5/yr

was recommended for

to reduce the level ntenance & administrative prob- lems which occur growth of the offices. The reli- ability community ucceeded in achieving reliability

liability prediction a mature technique.

dicatecl: “Some st A critical assess network-reliability models in-

day, because most lity models consider 100% of the

then, many have appeared us-

measurement of tel unication systems/services per- formance. Italtel st their reliability database con- struction [40] in 197 completed the Fourth Phase

information on parts, nd systems, the database included both steady- mance and growth trends

removal rates for com led to the correlation

. This database activity and prediction models.

for modeling ISDN C 1980s. Petri-Net mot

allelism between functions and describing precisely both the synchronization and dynamics of processes; communications between entities, in message mode, and examination of problems related to the conflicts, collisions, losses, message duplication, and loss of message sequencing”. This effort made in modular structuring and formal modeling provided the reliability analyst with an easy-to-use tool for reliability-enhancement and the overall improvement of reliability.

At this time VSAT (Very Small Aperture Terminals) in the satellite area began to offer cost-effective solutions for corporate networks for voice & data applications. The VSAT technology provided similar services to the Integrated Services Digital Networks (ISDN). One of the advantages for the reliability of VSAT over ISDN was that one did not have to deal with various vendors for leased-lines, including fiber optics [42]. The availability specifications were in terms of:
• channel availability,
• circuit availability,
• total user availability.

Channel availability concerns itself with radio-wave propagation where the degrading effect of precipitation in the transmission path is a major concern. Additional variables for VSAT channel availability are the [42]:
• specific climatic region,
• local geography,
• elevation angle of the ground station to the satellite.

Circuit availability specifies a metric for performance as seen by a single subscriber for the ability to complete an end-to-end connection. Total user availability of a VSAT Unit is concerned with the performance of the VSAT Network over the long term and includes spares availability/supply.

Internationally, the term reliability was being replaced by dependability in many countries in the 1980s. System Dependability was defined:

1. in terms of a design target of system down-time of 3 min/yr at each office, with individual line downtimes of 30 min/line/yr (depending upon the size of the office),
2. in terms of measures for the continuation of service,
3. for characteristics of file-protection and easy-maintenance (based on advanced fault-tolerance techniques) [43].

The formal definitions, based upon the telecommunication industry international standard CCITT Recommendation G.106 [44], were formulated in the late 1980s with Dependability [45] being the collective term used to describe the availability performance and its influencing factors:
• reliability,
• maintainability,
• maintenance support
as presented in figure 3 [45]. In figure 3,
• Availability is: the ability of an item to perform a required function at a given instant of time or at any instant of time within a given time interval, assuming that the external resources, if required, are provided.
• Reliability is: the ability of an item to perform a required function under given conditions for a given time interval.
• Maintainability is: the ability of an item, under stated conditions of use, to be retained in, or restored to, a state in which it can perform a required function, when maintenance is performed under given conditions, and using stated procedures and resources.
• Maintenance Support is: the ability of a maintenance organization, under given conditions, to provide upon demand the resources required to maintain an item, under a given maintenance policy.

In the USA, Bellcore published the RQMS - Reliability and Quality Measurements for Telecommunications Systems [46] in 1989 June. It did not follow the international change to Dependability, but retained the traditional Reliability concept developed by the US military agencies/services. This divergence of Reliability and Dependability is still prevalent. The RQMS shifted the measurements of Q&R to be customer-oriented. The measurements applied [47] to network switching elements (local switch, tandem, signal transfer point, service control point, packet switch, adjunct, etc), operations systems, and transport systems [47]. The scope covered system, software, hardware, firmware, and product support, and was directed toward the system test, first office application (FOA), and general availability life-cycle stages [47].

5. THE 1990s

In the 1990s, the measurement & control of the reliability of product [48] purchased in the telecommunication transmission category shifted in one case to product returns from a specific manufacturer. The measurements were:
• overall annualized field return rate,
• short-term or 4-quarter rolling annualized field return rate,
• rate of re-return.

A major purchaser of communication equipment used this concept, and ranked all of its suppliers of switching systems and transmission equipment. The intent was to apply the customer-viewpoint to reliability, because the extensive lack of correlation of reliability predictions to field performance left the industry in a quandary. Thus, reliability predictions were not needed for LRU (Least Replaceable Units). For example, at this time the customer could purchase line-cards for its switching system from many manufacturers, and instead of requiring MTBF, they measured the 3 return-rates and ranked the suppliers, with the intention of dropping the ‘suppliers with the highest return rates’ from the approved vendor lists. Thus, a unique method of driving the reliability of LRU continuously upward resulted.
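The 3 return-rate measurements and the supplier ranking described above might be computed along the following lines. This is only a sketch: the text does not give the purchaser's formulas, so the definitions used here (returns per unit-year in service, and re-returns as a fraction of all returns), the class and function names, and the supplier data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SupplierHistory:
    name: str
    quarterly_returns: list[int]    # units returned in each quarter, oldest first
    quarterly_installed: list[int]  # units in service during each quarter
    re_returns: int                 # returns of units that had already been returned once


def annualized_return_rate(returns: list[int], installed: list[int]) -> float:
    """Assumed definition: returns per unit-year (each quarter is 1/4 unit-year per unit)."""
    unit_years = sum(installed) / 4.0
    return sum(returns) / unit_years


def overall_rate(s: SupplierHistory) -> float:
    return annualized_return_rate(s.quarterly_returns, s.quarterly_installed)


def rolling_rate(s: SupplierHistory) -> float:
    """4-quarter rolling rate: same ratio over the most recent 4 quarters only."""
    return annualized_return_rate(s.quarterly_returns[-4:], s.quarterly_installed[-4:])


def re_return_rate(s: SupplierHistory) -> float:
    """Assumed definition: fraction of all returns that were re-returns."""
    return s.re_returns / sum(s.quarterly_returns)


def rank_suppliers(suppliers: list[SupplierHistory]) -> list[SupplierHistory]:
    """Best supplier first; ranking by the rolling rate is a design choice of this sketch."""
    return sorted(suppliers, key=rolling_rate)


# Hypothetical data for two line-card suppliers over 6 quarters.
supplier_a = SupplierHistory("A", [10, 10, 10, 10, 10, 10], [1000] * 6, re_returns=6)
supplier_b = SupplierHistory("B", [30, 30, 20, 20, 10, 10], [1000] * 6, re_returns=24)

ranked = rank_suppliers([supplier_b, supplier_a])
for s in ranked:
    print(s.name, f"{overall_rate(s):.3f}", f"{rolling_rate(s):.3f}", f"{re_return_rate(s):.3f}")
```

Ranking on the rolling rate rewards recent improvement, while the overall rate keeps the full history in view; a purchaser could equally well weight the three measures together.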

In 1990, software-services reliability models [49] were formalized. Figure 4 presents a software-services reliability model based on adapting a hardware-reliability diagram to software services. The events occur in a serial fashion; thus, the use of a flowchart was adopted. This model was integral of the shift in emphasis from development-oriented reliability measures to customer or end-user oriented reli- ability measures.

Software reliability studies were numerous around 1990, and dominated the telecommunication reliability-develop- ment activities. Emphasis was placed on the evaluation of software reliability during validation [30] with the phi- losophy: “Software reliability is also a part of the quality aspect a client is entitled to expect of his supplier”. At this stage in 1990, the industry was still looking for new models for software. Even in the late 199Os, the Bell- core RQGR series of standards [38] still stated: “These requirements will complement those in TR-NWT-000332, Relzabalzty Predactzon Procedure for Electronic Equzpment, which addresses reliability of hardware elements. However, unlike that document, no single prediction procedure is re- quired. The state of the art in software reliability predic- tion has not advanced to the level of hardware reliability prediction. No single model for software reliability based on failure data has been accepted universally”.

In the 1990s, one of the unique approaches that produced a high-reliability product for the communication industry was the effort of US Robotics engineers who assumed [50]: “there is no such thing as a telephone network in the United States. There are lots of networks and they vary tremendously”. US Robotics set up a BBS and equipped 1400 beta testers with modems at a wide variety of locations. With this approach, new versions of code could be tested each day and their performance analyzed until the best version was obtained for the entire test bed. This method differs from using an established set of data transmission standards that are written for a theoretical world that does not exist in reality. Their testing was extended internationally to countries [50] “with truly crippled telephone networks”. The testing included many Internet Service Providers (ISP). This successful method of implementing a leading-edge technology, recognizing that a uniform telephone network does not really exist, lays the foundation for assuring reliability in the next millennium by using world-wide empirical testing methods.

In the late 1990s, communication technologies entered a revolution. The term “Information Appliance” (IA) had

Figure 3: Dependability as a Collective Term

Figure 4: Software Services Reliability Flowchart

come of age, and “Em spawning many unexpe to smart refrigerators) tocols to communicate Standards & protocol tween dissimilar device origin started acting a trol, error correction, part of this revolution liability studies in the 1 movement to wireless, 1

sent & received from la out knowledge of the a space dimension as i dimensions of a reliabi vations complicate an to be performed upon reliability predictions fi industry automation w specifics of Middlewar sion reliability, of whic networks.

Much effort was expi ating in a reliable manr network. Mathematics “Given a communicat switching nodes in a rt topology of a telecomn given nodes and to se capacity for each phys tion demands are satis fails” [52]. In this case the importance of the 1

LAN and Internet pro- ost any network [51]”.

to provide flow con-

large amounts of data can be or controlled processes with-

These technology inno- fficult reliability analysis

the 1990s to address oper- amorphous communication

make a reactive decision. Cost, quality, and operational performance are all ingredients in a communication network that need to be balanced in order to achieve communication efficiency. Thus, the ability to provide services with specific guaranteed survivability via diversification & reservation [52] led to network topologies that differ in transmission costs and the effort to manage those networks.

Efforts were started in the late 1990s to address communication systems of mission-critical information technology (IT) applications across the entire environment [53]. This included the hardware, operating system, database, application, and network with an end-to-end availability goal of “5nines:5minutes” [53], interpreted as an ‘availability of 0.99999’, or ‘5 minutes of system-outage/year’. This appears to be a step backwards from the 30-yr old system outage goals of ≤ 3 min/yr of the telecommunication industry and not as good as the historical communication availability requirements for the US Federal Banking System. The “5nines:5minutes” concept uses many known reliability-design techniques [54] such as:
• software mirroring,
• RAID V architectures,
• hot swapping,
• rolling upgrades,

and addresses outages; however, lost and misdirected or wrongly billed calls or data transfers, which are traditional measures in the telecommunication world, are not addressed in the reliability specifications. The present operating result of the current service is 0.9995 availability, or about 4.4 hr/yr (263 min/yr) of downtime [55]. A growing issue that will carry over into the next millennium is the redefining of standard reliability/dependability definitions and concepts for terms such as reliability, availability, and serviceability. As one example, traditional Diagnostic Capability has been renamed Serviceability, a term which previously characterized the human-product/service ease of interaction [56]. Reliability has been defined historically from the equipment-provider and designer viewpoints by individual nations, standards, and professional organizations, and in the Telecommunication Industry by Bellcore [57]. Thus, it is necessary to approach reliability from the customer/user viewpoint (actual field performance).
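The availability figures quoted in this discussion can be compared on a common scale by converting each to expected downtime per year. The following is a minimal sketch under stated assumptions, not part of the cited specifications: it assumes a 365-day year and steady-state availability, and the function names are hypothetical.

```python
# Illustrative conversion between steady-state availability and yearly downtime.
# Assumption: a 365-day year (leap years ignored); names are hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Expected outage minutes per year for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_per_year: float) -> float:
    """Availability implied by a yearly outage budget in minutes."""
    if not 0.0 <= minutes_per_year <= MINUTES_PER_YEAR:
        raise ValueError("downtime must lie within one year")
    return 1.0 - minutes_per_year / MINUTES_PER_YEAR


if __name__ == "__main__":
    # "5nines" goal and the reported operating result, in minutes per year.
    for label, a in [("0.99999 goal", 0.99999), ("0.9995 reported", 0.9995)]:
        print(f"{label}: {downtime_minutes_per_year(a):.1f} min/yr")
    # Availability implied by the telecommunication goal of 3 min/yr.
    print(f"3 min/yr goal: availability {availability_from_downtime(3.0):.6f}")
```

Under these assumptions, 0.99999 corresponds to roughly 5.3 min/yr, 0.9995 to roughly 263 min/yr, and an outage budget of 3 min/yr to an availability of about 0.999994, which is why the “5nines:5minutes” goal is slightly looser than the older telecommunication goal.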

Areas of complexity dominated the communication area in the late 1990s. The area of real-time voice over packet-switched networks, such as the Internet, was a subject of concern for reliability and Quality of Service (QoS) performance. The benefit of this service is that both business & home need only buy a single line to the outside world. The challenge is to find “the best combination of codec, access technology, and end-to-end architecture” [58]. Internet telephony promises to combine the separate data and voice networks into a single transport mechanism; however, there is a limit for QoS: “we may be able to build faster hardware in the future, but we cannot increase the speed of light” [58].

As we conclude the 1990s, the Internet and individual company Intranets have changed the way reliability data are gathered & accessed. The problem of too little reliability information⁴ has not only disappeared but has been replaced with: “Information is no longer scarce. Indeed, there is far too much of it for any one person to review, let alone organize. Instead of being starved for information, we find ourselves overloaded” [59]. At 3Com, it is possible to obtain real-time reliability statistics for any production line, current results from on-going reliability testing, failure rates from all divisions in a common database, accelerated reliability testing results, field failure rates, etc.

⁴ The distinction between raw-data and useful information is not addressed here.

Self-healing structures have been studied in detail to assure that a survivable network is achievable for service availability between any 2 nodes [60]. The concern remains that it might not be possible to model accurately the software, hardware, and human elements all operating together in an environment that always has some equipment malfunctioning or out of service [39]. In addition, after years of stressing the reliability & availability for communication services, one of the largest breakdowns occurred in 1998 April. “The breakdown in AT&T’s vaunted ‘frame-relay’ network, used exclusively for high-speed transmission of data between computers, affected thousands of corporate customers nationwide” [61]. This breakdown demonstrated that self-healing networks are still vulnerable to system outage. The lesson learned is that there might be a need “to use more than one networking vendor in case their primary network provider fails” [61]. Thus, reliability modeling in actuality needs to be performed by the end user who implements two or more service providers and can take into account the human elements and the actual operational environment. The final challenge is determining the level of the Internet and its connections; currently, the number of Hosts attached to the Net is doubling every 6 months [62]. In 1993 June there were 130 Web Hosts and in late 1997, there were 650k Hosts. Given the growth trends, a projection has been recently made that the level of 100 million Internet Hosts will be reached in the year 2001 [63].
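The growth figures above can be related to a doubling time with simple exponential arithmetic. The sketch below is illustrative only: it assumes purely exponential growth, it takes “late 1997” to mean 1997 December (54 months after 1993 June), and the function name is hypothetical.

```python
import math


def doubling_time_months(n_start: float, n_end: float, months_elapsed: float) -> float:
    """Doubling time implied by growth from n_start to n_end.

    Assumes purely exponential growth, so that
    n_end = n_start * 2 ** (months_elapsed / doubling_time).
    """
    if n_start <= 0 or n_end <= n_start or months_elapsed <= 0:
        raise ValueError("requires 0 < n_start < n_end and months_elapsed > 0")
    return months_elapsed * math.log(2) / math.log(n_end / n_start)


if __name__ == "__main__":
    # 130 Web Hosts (1993 June) to 650k Hosts; the 54-month span is an assumption.
    months = doubling_time_months(130, 650_000, 54)
    print(f"implied doubling time: {months:.1f} months")
```

With the assumed 54-month span, the two host counts imply a doubling time of roughly 4.4 months, the same order of magnitude as the quoted doubling every 6 months. The result is sensitive to the assumed end date, and to whether the two counts measure the same kind of host.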

6. THE FUTURE

Looking forward, there are many considerations. One is, “there is no such thing as a telephone network in the United States” [50]. This idea supported a growing concern in 1998 that the “worldwide communication industry is rampaging ahead and metamorphosing so rapidly as to defy comprehension” [1]. The rapidity of dramatic change was forecast in 1998 April [64]: “Thirty percent of corporations will be taking advantage of converged data, video and voice networks by the Millennium. This convergence of networks will generate up to $10 billion in annual savings that can be reinvested in customer service, new business opportunities, and new technologies”. The trend toward network-convergence did accelerate strategic-alignment of telecommunication vendors and networking companies. Some of the necessary transitions to implementing next-generation networks include [64]:
• moving the public access network from analog to digital subscriber lines and cable technologies,
• creating intelligent network access points,
• further deregulation of the telecommunication industry,
• developing a new generation of networked applications adapted to converged network services,
• making networks more accessible and affordable to the business & consumer markets.

With this convergence of networks, the reliability community has a challenge to contribute to complex issues in complex systems. Perhaps a new paradigm is necessary for reliability/dependability analysis & standards, and performance criteria. Over the past 40 yr, reliability analysts have been modeling static systems from the reliability viewpoint. With the introduction of converging networks and a world of dynamic configurations/connections changing in real-time, possibly conventional reliability standards & analysis will yield to design-capability simulation-analysis and be measured in terms of service & value and not MTBF & Availability. “Even before the second generation networks have been fully deployed, a third generation (3G) of wireless networks is already being defined” [65]. By the year 2000, the standards will be formed for building 3G systems, with standards under development in the USA, Europe, and Japan and with the fear of fragmentation [65]. There was no indication that a major concern for the design of the 3G was reliability performance. The major current issues are the ownership of patents and possible patent pools for the competing technologies for 3G [66].


REFERENCES

[1] M. J. Riezeman, “Communications - Technology 1998 analysis & forecast”, IEEE Spectrum, 1998 Jan, pp 29 - 36.

[2] AT&T Principles of Electricity applied to Telephone and Telegraph Work, Long Lines Department, American Telephone and Telegraph Company, 1953 & 1961.

[3] I. Welber, H. Evans, G. Pulles, “Protection of service in the TD-2 radio relay system by automatic channel switching”, Bell System Technical J, 1955 May.

[4] “System aspects of reliable communications”, ESD-TRD-65-464, 1965 Jun.

The Call for Partic Workshop sponsored 1: ciety (COMSOC) Ted tions Quality and Relia damentals for the Neu stated why they were the telecommunication: market forces are drivi ferings, network infrast The New Millennium I: nications and new fron

But some things will The role of Quality an ments, international si limitations, best practic The ‘fundamentals’ for be settled. Understand alogue and consensus (

generates powerful levt well as global telecomn

QUEST (Quality Ex munications) forum is i trade association that quality requirements w supply chain. This has document to be based L

communication industr in 1998 in the formula networks, switching sys ment which must oper; information to end cus in this era of rapid dat ware, software and serv support IS0 9001:2000 when issued in the ye pendability concept wk ity performance, relial: nance support. In sun eliminating the presen in the communication ternational reliability r timated 1000 tier-1 su cation suppliers world1

s in both local as

pliers of Telecom-

the highly reliable hard- sential. This activity will in which, IS0 9001:2000

will include the IEC De-

[5] I.A. Feigenbaum, “Reliability of commercial communications satellite systems”, IEEE J. Selected Areas in Communications, vol 4, 1986 Oct, pp 1034 - 1038.

[6] H.A. Malec, “Telephone switching system reliability - past, present, and future”, Nat’l Telecommunications Conf. Proc, 1975 Dec, sec 5, pp 14 - 19; New Orleans.

[7] R.W. Downing, J.S. Nowak, L.S. Tuomenoksa, “No. 1 ESS maintenance plan”, The Bell System Technical J, vol 43, #5, part 1, 1964 Sep.

[8] “Automatic telephone systems with code switches, stored program controlled system AKE, general description”, Publication No. 17470, 1964 Jul; L.M. Ericsson.

[9] “Switching systems with stored program control, General description”, Publication No. 17580, 1967 Apr; L.M. Ericsson.

[10] W.P. Karas, “Reliability and field experience of electronic switching systems”, IEEE Trans. Communication Technology, 1967 Dec.

[11] J.C. Kennedy, O. Pedde, “Service and technical objectives for the SP-1 electronic switching system”, Northern Electric TELESIS, vol 1, 1969 Jan.

[12] H.A. Malec, “Designing for reliability in pcm switching systems”, Nat’l Communications Conf. Record, 1973 Nov, pp 146.1 - 146.6.

[13] H. Willrett, “Field experience with quasi-electronic telephone switching systems”, IEEE Int’l Conf. Digest, 1967 Mar.

[14] G. Gosztony, J. Wirth, “Determination of the reliability of BHG private branch exchanges, type RA and CA, by statistical observations”, Budavox Telecommunications Review, #3 - 4, 1970.

[15] G. Petersen, K. Vestergard, “Jutland Telephone Company’s experience of L.M. Ericsson rural exchanges ARK 522”, Ericsson Review, vol 48, #2, 1971.

[16] T.F.A. Urben, “Maintenance and reliability requirements”, Switching Techniques for Telecommunications Networks Conference Publication, 1969 Apr.

[17] H.H. Adelar, J.L. Masure, “Semiconductor reed crosspoint telephone switching system”, ITT Electrical Communication, vol 42, #1, 1967.

[18] G. Haugk, H.N. Seckler, “Evaluating the first no. 1 ESS offices”, Bell Laboratories Record, 1967 Dec.

[19] S.H. Tsiang, G. Haugk, H.N. Seckler, “Maintenance of a large electronic switching system”, IEEE Trans. Communication Technology, 1969 Feb.

[20] H.A. Malec, “Reliability optimization in telephone switching systems design”, IEEE Trans. Reliability: Special Issue on Reliability Optimization, vol R-26, 1977 Aug, pp 203 - 208.

[21] (An Interview with H.C. Higgins), “Major trends in switching”, Bell Laboratories Record, 1972 Oct.

[22] R.W. Ketchledge, “Designing phone systems of the future”, Telephone Engineer & Management Magazine, 1973 Jun 1.

[23] “Point to point”, Communications Design Magazine, 1973 Jun 1, p 33.


[24] J.A. Hamers, “Six years of corrective maintenance (CMM) in the Rotterdam telephone district”, Ericsson Review, vol 49, #3, 1972.

[25] A.E. Pinet, “Introduction of integrated PCM switching in the French telecommunication network”, Int’l Switching Symp. Record, 1972 Jun.

[26] R. Leblanc, “Reliability achievement in the IBM 3750”, Int’l Switching Symp. Record, 1974.

[27] P.K. Giloth, J.R. Witskin, “No. 4 ESS - Design and performance of reliable switching software”, ISS’81 CIC, 1981 Sep, pp 33A1/1 - 9; Montreal.

[28] E.A. Davis, P.K. Giloth, “No. 4 ESS: Performance objectives and service experience”, Bell System Tech. J, vol 60, num 6, 1981.

[29] J. Musa, A. Iannino, K. Okumoto, Software Reliability - Measurement, Prediction, Application, 1987; McGraw-Hill.

[30] G.L. Gall, M-F Adam, H. Derriennic, et al, “Studies on measuring software”, IEEE J. Selected Areas in Communications, vol 8, num 2, 1990 Feb, pp 234 - 246.

[31] H. Malec, “Software bugs and communications systems performance”, Proc. 5th Symp. Reliability in Electronics, 1982 Oct, pp 1 - 11; Budapest.

[32] J. Dutt, H.A. Malec, “ITT 1240 Digital Exchange System effectiveness”, Electrical Communication, vol 56, num 2/3, 1981, pp 198 - 206.

[33] “Quality assurance for the communications community”, IEEE J. Selected Areas in Communications, vol 4, 1986 Oct, pp 997 - 1183.

[34] “Quality after the sale for the communications community”, IEEE J. Selected Areas in Communications, vol 6, 1988 Oct, pp 1281 - 1448.

[35] “Telecommunications software quality and productivity”, IEEE J. Selected Areas in Communications, vol 8, 1990 Feb, pp 161 - 308.

[36] “Quality of telecommunications services, networks, and products”, IEEE J. Selected Areas in Communications, vol 12, 1994 Feb, pp 217 - 374.

[37] S.R. Ali, “Analysis of total outage data for stored program control switching systems”, IEEE J. Selected Areas in Communications, vol 4, 1986 Oct, pp 1044 - 1046.

[38] “Generic requirements for software reliability prediction”, Bellcore, GR-2813-CORE, Issue 1, 1993 Dec, p 1-1.

[39] J.D. Spragins, J.C. Sinclair, Y.J. Kang, H. Jafari, “Current telecommunications network reliability models: A critical assessment”, IEEE J. Selected Areas in Communications, vol 4, 1986 Oct, pp 1168 - 1173.

[40] G. Pirovano, G. Turconi, “Telecommunications reliability databank: from components to systems”, IEEE J. Selected Areas in Communications, vol 6, 1988 Oct, pp 1364 - 1370.

[41] J-F. Bars, B. Loyer, “Reliability and evolvability: Requirements for the design of ISDN communication software”, IEEE J. Selected Areas in Communications, vol 6, 1988 Oct, pp 1405 - 1413.

[42] M. Sharifi, L. Wolter, “Design and operational issues of VSAT application in ISDN-type networks”, IEEE J. Selected Areas in Communications, vol 6, 1988 Oct, pp 1422 - 1430.

[43] Y. Itoh, H. Kawashima, Y. Shimojo, K. Kawase, “Fault tolerance techniques in the NEAX61 digital switching system and its field performance”, IEEE J. Selected Areas in Communications, vol 6, num 8, 1988 Oct, pp 1414 - 1421.

[44] CCITT Recommendation G.106, Terms and Definitions Related to Quality-of-Service, Availability and Reliability, 1985; Red Book.

[45] K. Strandberg, “Field dependability evaluation principles”, IEEE J. Selected Areas in Communications, vol 6, 1988 Oct, pp 1330 - 1337.

[46] Reliability and Quality Measurements for Telecommunications Systems (RQMS), TA-TSY-000929, Issue 1, 1989 Jun; Bellcore.

[47] R. Erickson, D. Saxena, G. Brush, “A view of reliability and quality measurements for telecommunications systems”, IEEE J. Selected Areas in Communications, vol 8, 1990 Feb, pp 219 - 223.

[48] M.S. Shen, “On the measurement of field return rates”, Quality and Reliability Eng’g Int’l, vol 8, 1992 Sep/Oct, pp 501 - 510.

[49] J.P. Hudepohl, “Measurement of software service for large telecommunications systems”, IEEE J. Selected Areas in Communications, vol 8, 1990 Feb, pp 210 - 218.

[50] J. Rickard, “The 56K modem battle”, http://www.boardwatch.com, 1998 Mar.

[51] L. Goldberg, “Information appliances: From web phones to smart refrigerators”, Electronic Design, 1998 Mar 23, pp 69 - 84.

[52] D. Alevaras, M. Grotschel, P. Jonas, et al, “Survivable mobile phone network architectures: Models and solution methods”, IEEE Communication Magazine, 1998 Mar, pp 88 - 93.

[53] “HP, CISCO and ORACLE collaborate to provide end-to-end availability”, http://www.oracle.com/partners/news/hp-main.html, 1998 Feb 02.

[54] “Key elements of high availability when using MC/ServiceGuard”, http://www.hp.com/esy/software-applications/high-availability/overview/key-elements.html, 1997 Jun.

[55] “Hewlett-Packard high availability solutions”, http://www.hp.com/esy/software-applications/high-availability/overview/index.html, 1998 Jan.

[56] “An overview of the ORACLE8™ high availability architecture”, http://www.oracle.com/st/08collateral/html/xpsh3twp.html, 1997 Jun.

[57] “RQGR - Reliability and quality generic requirements”, BR FR-796 ISS97 (set of 33 documents), 1997; Bellcore.

[58] T. Kostas, M. Borella, I. Sidhu, et al, “Real-time voice over packet-switched networks”, IEEE Network, 1998 Jan/Feb, pp 18 - 26.

[59] M. Wilson, “The quantitative impact of survivable network architectures on service availability”, IEEE Communications Magazine, 1998 May, pp 122 - 126.

[60] A. Borchers, J. Herlocker, J. Konstan, J. Reidl, “Ganging up on information overload”, IEEE Computer, 1998 Apr, pp 106 - 108.

[61] S.N. Mehta, “AT&T is seeking cause of big outage in data network used by corporations”, The Wall Street Journal, Wed, 1998 Apr 15, p B15.

[62] L. Lance, “The Internet - Technology 1998 analysis & forecast”, IEEE Spectrum, 1998 Jan, pp 37 - 42.

[63] “Dimensioning the Internet”, IEEE Internet Computing, 1998 Mar-Apr, pp [illegible] - 10.

[64] “3Com CEO Eric Benhamou predicts cost saving of converged networks during keynote to Red Herring Conference”, http://3com.com/news/releases/aprO698~.html.

[65] L. Goldberg, “Next-generation cellular technologies: Convergence or chaos?”, Electronic Design, 1998 Apr 20, pp 69 - 84.

[66] P. Fletcher, “A European perspective on 3rd-generation wireless technology and politics”, Electronic Design, 1998 Apr 20, pp 72 - 74.

[67] “Fundamentals for the new millennium”, IEEE Communications Society (COMSOC) Technical Committee on Communications Quality and Reliability Call for Papers, 1998.

[68] E.R. Larson, “New quality standard for telecommunications industry”, Quality Digest, 1998 May, p 9.

AUTHOR Henry A. Malec; 3Com - Carrier Systems Business Unit;

1800 West Central Rd; Mount Prospect, Illinois 60056 USA. Internet (e-mail):

Henry A. Malec (M’65, SM’77) is a consulting engineer in Research and Development at the 3Com Carrier Systems Business Unit in Mount Prospect, where he chairs the 3Com corporate-wide International Reliability Council. He received the BS in both Electrical Engineering and Mathematics from the Illinois Institute of Technology, Chicago. He has 30 years of international experience in both reliability & quality for both hardware & software with GTE, ITT, Siemens, and DEC. He was chair’n of the IEEE Communications Society Quality Assurance Management Committee for 1986-1988 and has served 3 terms on the IEEE Reliability Society AdCom. He is the author of over 60 published technical papers and the Deputy Technical Advisor for the USNC IEC TC56 Dependability standards group. Henry Malec is a member of the Board of Examiners for the Malcolm Baldrige National Quality Award in 1998 and has also served in 1994, 1995, 1996 in this capacity. He is a Registered Professional Engineer in the State of Illinois and a Chief Editor for Quality and Reliability Engineering Int’l Journal.

Publisher Item Identifier S 0018-9529(98)07830-0