
Computers and Chemical Engineering 25 (2001) 1313–1339

Fault diagnosis of multivariate systems using pattern recognition and multisensor data analysis technique

F. Akbaryan, P.R. Bishnoi *

Department of Chemical and Petroleum Engineering, The University of Calgary, 2500 University Drive NW, Calgary, Alta., Canada T2N 1N4

Received 19 October 2000; received in revised form 19 March 2001; accepted 21 March 2001

Abstract

A pattern recognition-based methodology is proposed for fault diagnosis of multivariate and dynamic systems. Noisy input patterns, belonging to a class of event(s), are first scaled to unit variance and zero mean. This step, termed a harmonizing step, reduces the magnitude difference amongst patterns belonging to a class of event(s). The existence of low frequency segments, such as ramp-type trends, in a pattern hampers the efficacy of the harmonizing step. In this work, a digital band-pass filter is designed to eliminate the ramp-type segments and decrease the noise intensity. Then, the Principal Component Analysis (PCA) technique is applied in order to describe the information space by a set of uncorrelated and fictitious data sources. A wavelet-based methodology is employed for each new sensor to extract pattern features. A binary decision tree is used to classify the extracted features. The outputs of each decision tree are: (1) the a posteriori probabilities that an unlabeled input pattern belongs to different classes of events; and (2) the probability confidence limits that the input pattern may be classified to any of the known classes. As the last step, either of two consensus theory-based techniques or evidence theory is utilized to combine the outputs of the decision trees and find the best classes of events describing the system behavior. The performance of the proposed technique is examined by diagnosis of simulated faulty behavior for the Tennessee Eastman Process. © 2001 Elsevier Science Ltd. All rights reserved.

Keywords: Fault diagnosis; Pattern recognition; Multisensor data analysis

www.elsevier.com/locate/compchemeng

1. Introduction

A system operates in the faulty condition when its behavior deviates considerably from normal and predefined operating strategies. Equipment failure, sensor degradation, set point change, and disturbances in the input streams are instances of faulty states for a system. The first group of faults, known as deterministic faults, is generated by a fixed magnitude cause and is usually damped by using a robust control strategy. Various magnitudes of a deterministic cause, even at different operating points, produce faults with similar trends. The second group, known as stochastic faults, results from causes whose magnitudes change randomly with time; moreover, the controlling scheme cannot drive the system back to a steady state operating condition. A stochastic fault, even at the same initial operating point, could have different patterns.

Fault diagnosis is an important part of the process supervisory routines that determines the state of the system (faulty or normal) as well as the type of faults. Analytical model-based methods (Isermann, 1984; Kramer, 1987; Frank, 1990; King & Gilles, 1990; Petti et al., 1990), causal analysis (Kramer & Palowitch, 1987; Mohindra & Clark, 1993) and pattern recognition (Fan et al., 1993; Bakshi & Stephanopoulos, 1994; Watanabe et al., 1994; Rengaswamy & Venkatasubramanian, 1995; Kassidas, 1997; Oh et al., 1997) are the main groups of fault diagnosis approaches. Chemical processes are often characterized by nonlinear behavior, noisy inputs and unknown parameters. Thus a model describing the system behavior, either mathematically or qualitatively, will be quite complicated (Frank, 1990; King & Gilles, 1990; Isermann, 1984). However, computer-based pattern recognition extracts a wealth of information from the large amount of process data quite satisfactorily without concern about the nature of the system. Some of the fault diagnosis methods assume that the fault occurrence drives the system to a new steady state condition. Then the system characteristics at two different operating points are used for diagnosis purposes (Kramer, 1987; Petti et al., 1990; Fan et al., 1993; Watanabe et al., 1994). If the system happens to reach its initial steady state condition, these methods will not be useful for fault diagnosis. As another shortcoming, these approaches cannot deal with stochastic faults, because the system cannot reach a steady state point. If transient trends of system variables are used as patterns, the fault diagnosis method will be free from considering the steady state conditions (Bakshi & Stephanopoulos, 1994; Rengaswamy & Venkatasubramanian, 1995; Kassidas, 1997; Oh et al., 1997). This implies that the diagnosis method is equally applicable for any type of fault and final system condition.

* Corresponding author. E-mail address: [email protected] (P.R. Bishnoi).

0098-1354/01/$ - see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0098-1354(01)00701-3

In a complex chemical plant, fault diagnosis techniques cannot rely on information received from only a single source. The system disturbances affect some variables more than others. Furthermore, the amount of irrelevant information, such as noise, varies according to the type of the variable as well as the type of the disturbance. Akbaryan (2000) proposed a fault diagnosis technique that classifies system behavior based on a single system variable. He showed that the diagnosis results vary considerably according to the type of the variable.

In the present work, the single-variate fault diagnosis methodology (Akbaryan, 2000) is extended for classifying the faulty behaviors of a multivariate and dynamic system. The patterns are transient trends of process variables resulting from disturbances in the system. The magnitudes of measured variables, disturbed by a deterministic fault, change directly according to the size of the fault. Different magnitudes of a fault result in qualitatively similar trends for a measured variable; however, these trends are quantitatively unlike. The proposed pattern recognition technique relies quantitatively on system variables; therefore, any size discrepancy amongst patterns for a class of event(s) should be eliminated. Scaling each system variable, so that it has a zero mean and unit variance, can resolve this problem. When a variable exhibits a step-type or ramp-type trend, the scaling of the variable depends on the number of data points sampled before and after the occurrence of the step or ramp. In this work, a band-pass recursive filter is used to remove the step- or ramp-type trends, as well as the noise. A suitable band-pass digital filter is an important factor in this work, as it assures that the information content of a pattern is preserved. Akbaryan and Bishnoi (2000) proposed a wavelet-based denoising method that was superior to the Fourier-based low-pass filters. However, in the present work it is found that the wavelet-based high-pass filtering methods do not perform satisfactorily. Using the Principal Component Analysis (PCA), the proposed technique transforms the original, correlated and noisy information space into a lower dimensional, uncorrelated and less noisy space. Principal components are the uncorrelated and fictitious variables that characterize the new information space. The PCA method discards irrelevant information, such as noise and uniform variables, by projecting it onto the less variant principal components. The single-variate pattern recognizer (Akbaryan, 2000) is used to classify the system behavior based on the information provided by each new variable individually. Then the classification results of the new variables are integrated in order to obtain the system behavior. The proposed technique allows the reliability of each new system variable to be incorporated into the final diagnosis result. The more reliable a sensor is, the more influence it has on the final outcome. Choosing a correct measure of reliability for a sensor is a challenging problem that affects the performance of the fault classification technique.

2. Pattern harmonization

A pattern recognizer comprises two main parts: (1) a feature extractor; and (2) a feature classifier. Measured data exhibit the effects of different events such as process dynamics, sensor noise, faults and external loads. The feature extractor separates these events, known as features, quantitatively and/or qualitatively, and then sends them to the feature classifier. Having used a quantitative feature extractor, an unlabeled pattern will be admitted into a class of events by a feature classifier if its extracted features have magnitudes similar to those of the other patterns in the class. A system variable will exhibit qualitatively similar but quantitatively different trends if the system is disturbed by a deterministic fault with different magnitudes (Kassidas, 1997). Moreover, if a system deviates from different steady state conditions by a deterministic fault of fixed size, the resulting trends for a variable will have different magnitudes. Although the resulting patterns belong to the same class of event(s), they may be classified into other classes by the proposed classification methodology. Thus patterns of a variable, within a class of event(s), should be harmonized quantitatively in order to decrease the misclassification rate. The first step is to subtract the first sample from all the samples so that a pattern, regardless of disturbance type, starts from zero. In the second step, the pattern is scaled so that it has unit variance and zero mean. This technique removes the magnitude difference amongst a group of patterns Y.


$$\bar{Y} = \frac{Y - m_Y}{\sigma_Y} \tag{1}$$

where $m_Y$ and $\sigma_Y$ represent the mean and standard deviation of the vector Y. The scaling of the pattern also eliminates the effect of measurement units that may exist amongst variables used to describe a class of events. When a pattern exhibits a ramp- or step-type trend, the variance and mean of the pattern depend on the number of data points collected after and before the step or ramp occurrence. This fact hampers the efficacy of the pattern scaling technique, because the pattern classification technique depends on the length of data (Kassidas, 1997). Furthermore, the mean and variance of an ensemble of data with step- or ramp-type trends lack statistical meaning. Since ramp- and step-type segments exist within low frequency ranges, filtering the pattern by a high-pass filter helps remove these segments (Kassidas, 1997). The resulting pattern certainly has a different structure from the original pattern.
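As a minimal illustration of the two harmonization steps and Eq. (1), the sketch below assumes a pattern stored as a NumPy vector; the function name harmonize is illustrative, not from the paper:

```python
import numpy as np

def harmonize(y):
    """Harmonize a single pattern: shift it to start from zero,
    then scale it to zero mean and unit variance (Eq. (1))."""
    y = np.asarray(y, dtype=float)
    y = y - y[0]                       # step 1: pattern starts from zero
    return (y - y.mean()) / y.std()    # step 2: (Y - m_Y) / sigma_Y
```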

Designing a suitable high-pass digital filter is a major step that considerably affects the accuracy of the proposed methodology. According to Akbaryan and Bishnoi (2000), one may design different types of filters in order to accomplish the defined requirements. Different types of high-pass filters, such as Butterworth, Chebyshev, Bessel, Short Time Fourier Transform (STFT) using a Hamming window, STFT using a Kaiser window, Wavelet Packet Transform (WPT) using the Lifting-Scheme (LS) wavelet filters, and WPT using the Symmlet8 wavelet filters, are examined in this work for filtering the low frequency segments. In spite of the superiority of wavelet packet-based filtering methods for denoising purposes (Akbaryan & Bishnoi, 2000), it is found that these approaches are not very successful in filtering the low frequency parts. Among the other filters, which are based on the Fourier transform, the Butterworth and Chebyshev filters are observed to show the best performance. For a given filter order n ≥ 2, the Chebyshev filter performs better than the Butterworth filter in separating the passband and stopband regions by narrowing the transition area (Lynn & Fuerst, 1989). By increasing the order of these filters, the output signal contains more high frequencies, which cause more oscillation in the signal trend. It is noted that the presence of a large number of high frequency bands deteriorates the identities of deterministic patterns, so that the misclassification rate increases. In this work, traces of low frequency bands are allowed to pass through the filtering process; this point is also addressed by Kassidas (1997). Having chosen a fairly low cutoff frequency fc, the low-order Butterworth and Chebyshev filters are found to be reliable for smooth filtering of the low frequency components. For low values of fc, however, the Butterworth filter not only outputs a smoother signal than the Chebyshev filter, but also follows the ideal passband frequency response satisfactorily. The performances of these two filters are highly sensitive to the values of the filter elements, so that slight changes in their values significantly change the output. The data sequences are often corrupted by noise, which degrades the efficiency of the data processing techniques. By harmonizing the noisy trends, the effect of noise is magnified so that the true trend is masked more by the irrelevant information. A low-pass filter can significantly reduce the negative effect of noise. In this work, it is preferred to use a band-pass digital filter so that the effects of the low and high frequency bands are removed simultaneously. Details of selecting the band-pass filter can be found in Akbaryan (2000).
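A hedged sketch of the band-pass filtering step using SciPy; the second-order Butterworth design and the normalized cutoffs (0.01, 0.15) are the values reported later in Section 6, and filtfilt reproduces the forward-reverse (double) filtering described there:

```python
from scipy import signal

# Second-order Butterworth band-pass filter; the cutoffs are
# normalized to the Nyquist frequency, as in Section 6.
b, a = signal.butter(2, [0.01, 0.15], btype='bandpass')

def bandpass(y):
    """Remove low frequency (step/ramp) segments and part of the noise.
    filtfilt filters forward and then in reverse, cancelling the
    Butterworth filter's nonlinear phase distortion."""
    return signal.filtfilt(b, a, y)
```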

3. Single-variate pattern classification

Akbaryan (2000) proposed a pattern classifier that can be used for dynamic, single-variate fault diagnosis problems. First, the most important features F of a pattern Ȳ are extracted according to a formulation described by:

$$f : \bar{Y} \rightarrow F, \qquad f = \theta(k) \cdot \Psi \tag{2}$$

The raw pattern is first decomposed into a set of orthogonal coordinates shown by the matrix Ψ, and then the selection rule θ(k) reduces the dimension of the generated feature space by choosing the most important k coordinates from the n coordinates. The matrix Ψ can be determined by using a wavelet-based technique. The WPT approach is preferred to the Wavelet Transform (WT) and STFT methods, because it can fully exploit the information content in the time as well as frequency domains.

A set of wavelet coordinates that discriminates the classes of events most efficiently amongst the other wavelet coordinates is determined according to the Linear Discriminant Basis (LDB) method (Saito, 1994) and the PCA technique. The original LDB technique decomposes the entire data sequence into a single wavelet packet tree without exploring the class-discrimination performances of other groups of pattern windows. The proposed algorithm (Akbaryan, 2000) couples the LDB method with the Double Wavelet Packet Tree (DWPT) (Bakshi & Stephanopoulos, 1996) in order to determine the best configuration of pattern windows causing the most discrimination among classes. To reduce the size of the feature space, the wavelet coordinates are projected into a new low-dimensional space, using the PCA technique, where minimum correlation exists among the new space variables.
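The LDB/DWPT selection itself is beyond a short example, but the sketch below shows the general shape of the extractor using PyWavelets: decompose the pattern into wavelet packet coordinates (the matrix Ψ) and keep the k most important ones (the selection rule θ(k)). Selecting by coefficient magnitude is only an assumed, simplified proxy for the class-discrimination criterion of the LDB method:

```python
import numpy as np
import pywt

def wpt_features(y, k=16, wavelet='db10', level=3):
    """Decompose a harmonized pattern into wavelet packet coordinates
    and keep the k largest-magnitude coefficients as features."""
    wp = pywt.WaveletPacket(data=y, wavelet=wavelet, maxlevel=level)
    coeffs = np.concatenate(
        [node.data for node in wp.get_level(level, order='freq')])
    idx = np.argsort(np.abs(coeffs))[::-1][:k]   # selection rule theta(k)
    return coeffs[idx]
```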

The selected features, forming an 'instance', are fed into a binary decision tree. The Incremental Tree Induction (ITI) methodology (Utgoff et al., 1997) is the framework of the tree classifier in the technique (Akbaryan, 2000). The Minimum Description Length (MDL)-based information criterion (Rissanen, 1983) is found to be a reliable choice for pruning the suggested decision tree (Utgoff et al., 1997). The ITI binary tree induction method considers an unlabeled instance as a member of only one known class. This crisp classification approach is unable to address a problem where two or more classes of events overlap each other. In multiple fault diagnosis problems, for example, some patterns belong to regions covered by more than one class of events. The presence of noise also intensifies the effect of this problem on the classification results. The technique (Akbaryan, 2000) incorporates a soft thresholding approach, proposed by Quinlan (1993), into the original ITI-based decision tree for dealing with the uncertainty in classification problems. The tree determines the a posteriori probabilities that an instance may belong to different classes.

4. Soft classification

In practice, the upper cutoff frequency for the band-pass filter is selected such that the true signal features existing in the high frequency bands can pass through the filtering process. This results in an output signal corrupted by traces of the input noise. The PCA technique, used to produce a set of uncorrelated variables, helps reduce the effect of noise by transferring it to the less important principal components (Englehart, 1998). Moreover, the wavelet-based feature extractor transforms uncorrelated data, such as noise, into wavelet coefficients with small magnitudes that are likely to be discarded by the best basis selection algorithm, such as the one used by the modified LDB method (Akbaryan, 2000). Although these techniques largely suppress noise, the extracted features can still be affected by noise, especially if the noise intensity in the raw input signal becomes high. The corrupted features may cause inaccurate classification results. The problem is dealt with by incorporating soft thresholds at each decision node of the tree, together with the concept of uncertainty, into the final classification results.

The soft thresholding approach defines two subsidiary cutpoints, t+ and t−, such that the crisp cutpoint t lies between them. By using the soft thresholds at each decision node, the unlabeled instance is sorted down the tree according to the rules defined by Eq. (3). As an instance is sorted down, each decision node on the path determines the probability P of sending the input instance toward the left branch, and 1 − P for the right subtree. Consider that feature A is selected as the most informative feature amongst the other features for a decision node; the probability P is then computed by:

$$P = \begin{cases}
1 & A_i \le t^- \\[4pt]
1 - \dfrac{A_i - t^-}{2(t - t^-)} & t^- < A_i \le t \quad (0.5 \le P \le 1) \\[4pt]
\dfrac{1}{2} - \dfrac{A_i - t}{2(t^+ - t)} & t < A_i \le t^+ \quad (0 \le P \le 0.5) \\[4pt]
0 & A_i > t^+
\end{cases} \tag{3}$$

where $A_i$ shows the i-th value of feature A. Because of the soft thresholds at a decision node, an instance may reach several terminal nodes that may have different distributions of classes. Consider that there are NC known classes provided by the training instances; the probability that an instance is assigned to the c-th class of terminal node l is calculated by:

$$P_c^l = \frac{N_{c,l}}{N_l} \times \prod_{j=1}^{D} P_j \tag{4}$$

where D denotes the number of decision nodes visited by the instance in each classification path starting at the tree root and ending at terminal node l. $P_j$ represents $P_{\mathrm{left}}$ or $P_{\mathrm{right}}$ for the j-th decision node, depending on whether the pattern is sent to the left or right branch. The number of training instances within the c-th class of terminal node l is shown by $N_{c,l}$ in Eq. (4), and $N_l$ stands for the total number of instances assigned to the l-th terminal node. The most probable class in the l-th terminal node is the one with the most instances in the node. If an instance reaches L terminal nodes, the probability that the instance belongs to the c-th class, $P_c$, is defined as:

$$P_c = \sum_{l=1}^{L} P_c^l \tag{5}$$

Eq. (5) is used by Akbaryan (2000) to roughly estimate the a posteriori probability for each class.
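The sketch below ties Eqs. (3)-(5) together, assuming a hypothetical node structure (attr, t_minus, t, t_plus, left, right, counts, is_leaf) for the decision tree; it is a sketch of the idea, not the paper's implementation:

```python
import numpy as np

def soft_prob(a, t_minus, t, t_plus):
    """Eq. (3): probability of sending the instance to the left branch."""
    if a <= t_minus:
        return 1.0
    if a <= t:
        return 1.0 - (a - t_minus) / (2.0 * (t - t_minus))  # 0.5 <= P <= 1
    if a <= t_plus:
        return 0.5 - (a - t) / (2.0 * (t_plus - t))         # 0 <= P <= 0.5
    return 0.0

def class_probs(node, x, n_classes, path_p=1.0):
    """Eqs. (4)-(5): product of branch probabilities along each path,
    weighted by the leaf's class distribution N_{c,l}/N_l and summed
    over every terminal node the instance can reach."""
    if node.is_leaf:
        return path_p * node.counts / node.counts.sum()
    p = np.zeros(n_classes)
    p_left = soft_prob(x[node.attr], node.t_minus, node.t, node.t_plus)
    if p_left > 0.0:
        p += class_probs(node.left, x, n_classes, path_p * p_left)
    if p_left < 1.0:
        p += class_probs(node.right, x, n_classes, path_p * (1.0 - p_left))
    return p
```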

Consider that for the l-th terminal node, $E_l$ of the $N_l$ training instances do not belong to the most dominant class in the node. The parameter $e_S = E_l/N_l$ represents the classification error in the node: if the $N_l$ instances are reclassified by the tree, the fraction $e_S$ of them is assigned by the node to the most dominant class incorrectly. $e_S$ is termed the node sample error because the $N_l$ observations are sampled from the space covered by the node. $e_S$ not only depends on the number of sampled instances but also may change as another group of instances is classified by the node. The true classification error, denoted $e_T$, is defined as the probability of misclassifying an instance chosen randomly from the entire population of instances covered by the terminal node. This error can be applied for predicting the misclassification of an unseen instance. To calculate the true error, the probability of misclassification within the node's information space must be obtained. This probability cannot be calculated exactly, but an interval where the true probability lies can be determined by using the sample error. The binomial distribution of $e_S$ provides the formulations for calculating the interval. The confidence interval, within which the probability of misclassifying a pattern into the most dominant class falls S% of the time, can be described by (Mitchell, 1997):

$$\frac{E}{N} \pm Z_S \sqrt{\frac{\dfrac{E}{N}\left(1 - \dfrac{E}{N}\right)}{N}} \tag{6}$$

The parameter S is usually termed the Confidence Limit (CF), and $Z_S$, based on the desired CF, is given by statistical tables. As S decreases, the length of the confidence interval is reduced and so is the value of $Z_S$. A more robust algorithm calculates the Upper Limit of the Confidence Interval (ULCI) (Quinlan, 1993) through the estimation of the standard deviation σ for the binomial distribution.

$$\mathrm{ULCI}(E,N) = \begin{cases}
1 - CF^{1/N} & E = 0 \\[6pt]
\dfrac{0.67(N-E) + E}{N} & E \ge N - 0.5 \\[6pt]
\dfrac{b + \dfrac{\sigma^2}{2} + \sigma\sqrt{b\left(1 - \dfrac{b}{N}\right) + \dfrac{\sigma^2}{4}}}{N + \sigma^2}, \quad b = E + 0.5 & \text{otherwise}
\end{cases} \tag{7}$$

The certainty of classifying an instance in the most dominant class f of the terminal node is measured by $1 - \mathrm{ULCI}_f$. If the f-th class becomes the most populated class in M terminal nodes, the certainty $C_f$ of classifying the instance in this class is determined as:

$$C_f = \sum_{l=1}^{M} P_f^l \times \left(1 - \mathrm{ULCI}_f(E_l, N_l)\right) \tag{8}$$

$E_l$ and $N_l$ denote the number of instances classified into class f incorrectly and the total number of sampled instances belonging to the l-th terminal node, respectively. If a class is not nominated by any of the terminal nodes as the most populated one, its $C_f$ will be equal to zero. The probability limits of classifying an instance in the c-th class can then be defined as:

$$C_c \le P_c \le 1 - \sum_{\substack{j=1 \\ j \ne c}}^{NC} C_j \tag{9}$$
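A sketch of Eqs. (7)-(8) as reconstructed above; the confidence level cf = 0.25 and its deviate z ≈ 0.674 are illustrative (C4.5-style) defaults, not values fixed by the paper:

```python
import numpy as np

def ulci(E, N, cf=0.25, z=0.674):
    """Upper limit of the confidence interval, Eq. (7)."""
    if E == 0:
        return 1.0 - cf ** (1.0 / N)
    if E >= N - 0.5:
        return (0.67 * (N - E) + E) / N
    b = E + 0.5
    num = b + z ** 2 / 2.0 + z * np.sqrt(b * (1.0 - b / N) + z ** 2 / 4.0)
    return num / (N + z ** 2)

def certainty(p_f, errors, totals):
    """Eq. (8): certainty C_f over the M leaves where class f dominates;
    p_f[l] are the P_f^l values produced by Eqs. (3)-(4)."""
    return sum(p * (1.0 - ulci(e, n))
               for p, e, n in zip(p_f, errors, totals))
```

The resulting certainties then bound the class probability through the interval of Eq. (9).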

5. Multisource data analysis

In practical pattern recognition problems, the number of information sources is seldom singular, because multiple sources provide a much better understanding of the system's behavior and enhance the robustness and reliability of the final classification results. When a system operates at the steady state condition, the data collected from each source of data (sensor) would be a single and constant value. The pattern, which is a collection of these measurements, could be recognized by a single classifier. However, if the objective is to follow the system's behavior dynamically, each sensor will provide a series of measurements during a specified time period. A matrix, where the collection of each sensor is shown by a single row, represents all the measurements. A pattern could be identified either as a row or a column of this matrix. For the first choice, each pattern represents the activity of each sensor individually, whereas in the second approach, the pattern is the collection of the sensors' activities at a specified time. Nevertheless, there are some conceivable problems in using the latter choice of pattern configuration. First, since each sensor may have different measurement scales, a formulation for the unification of measurement units is required. Second, the sensors are not equally reliable, and each sensor provides a different degree of support for an observation. Thus the information regarding the degree of reliability of each source of data must somehow be included in the classification results. Third, a sensor might produce discrete or continuous types of information, so the constructed pattern vector could include different types of variables. Multisource Data Analysis (MSDA) is the basis of various techniques that investigate each sensor separately, and then combine the classification results of each sensor to provide the final decision.

A number of approaches have been proposed to analyze data received from different sources of information. The simplest approach is to form an extended data vector that contains information from all sources in a system at a certain time. This approach is a reliable choice when all information sources are equally reliable. On the other hand, statistically-based methods, such as consensus theory and evidence theory, regard each data source independently, so that the system mode is obtained by fusing the information from the different sources.

Consensus theory is found to be a suitable probabilistic framework that determines the consensus among members of a group of experts in order to estimate the probability of events in a particular σ-field (Benediktsson & Swain, 1992). The objective is to produce a consensus rule Cs that summarizes the various probability estimations, determined by the experts, into a single probability function. The Linear Opinion Pool (LIOP) is the most commonly used approach; it computes the consensus rule by the weighted summation of the probabilities given by each expert:

$$C(w_j|y) = \sum_{i=1}^{NS} \lambda_i P(w_j|y_i) \tag{10}$$

where $y_i$ represents the measurement of the i-th source, $w_j$ is the j-th class, and $\lambda_i$ is the weighting factor for the i-th source, with $\sum \lambda_i = 1$. The Global Membership Function (GMF) (Benediktsson & Swain, 1992; Lee et al., 1987) is another approach with desirable properties for generating consensus rules. The formulation of this method is defined by:

$$C(w_j|y) = P(w_j) \prod_{i=1}^{NS} \left[\frac{P(w_j|y_i)}{P(w_j)}\right]^{\lambda_i} \tag{11}$$

where $P(w_j)$ is the a priori class distribution, and the weights are selected in the interval [0, 1]. When a data source is completely unreliable, its weighting factor would be equal to zero.
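A minimal sketch of the two consensus rules, Eqs. (10) and (11), with NumPy; the variable names and the example numbers are illustrative only:

```python
import numpy as np

def liop(P, w):
    """Linear opinion pool, Eq. (10): P is (NS, NC) with one row of
    class probabilities per sensor, w the reliability weights (sum 1)."""
    return w @ P

def gmf(P, prior, w):
    """Global membership function, Eq. (11) as reconstructed here:
    prior times the weighted product of likelihood ratios."""
    return prior * np.prod((P / prior) ** w[:, None], axis=0)

# Example: two sensors, three classes
P = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.3, 0.2]])
w = np.array([0.6, 0.4])
prior = np.full(3, 1.0 / 3.0)
print(liop(P, w), gmf(P, prior, w))
```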

Evidence theory is a technique that is able to represent the outcomes of statistical experiments by interval-valued probabilities (Kim & Swain, 1995). The evidence theory also considers ignorance and missing information by estimating the imprecision and conflicts of results among different data sources. Denoting the space of hypotheses as Θ, the total number of possible hypotheses is $2^{|\Theta|}$. The plausibility (Pls) and belief (Bel) functions are the two main definitions employed to represent the imprecision and uncertainty in the decision-making process. The probability mass function (m) is the probability that can be assigned to each element of $2^{\Theta}$. This parameter varies within the [0, 1] interval, and

$$\sum_{H \in 2^{\Theta}} m(H) = 1, \qquad m(\varnothing) = 0 \tag{12}$$

where H represents a hypothesis. The mass function for Θ can be given a nonzero value to represent global ignorance. Pls and Bel, also known as the upper and lower probability functions respectively, are computed from the mass function as follows:

$$\mathrm{Bel}(H) = \sum_{H' \subseteq H} m(H'), \qquad \mathrm{Pls}(H) = \sum_{H' \cap H \ne \varnothing} m(H') \tag{13}$$

The magnitude of imprecision of hypothesis H can be shown as the interval [Bel(H), Pls(H)], which is known as the belief interval. Consider the case when there is more than one source of information, each providing a set of hypotheses with known mass values. The evidence theory provides a formulation, usually termed Dempster's rule of combination, to evaluate the final mass function resulting from the combination of entirely distinct bodies of evidence.

$$m(H) = \frac{\displaystyle\sum_{H_1' \cap \cdots \cap H_p' = H} \; \prod_{1 \le i \le p} m_i(H_i')}{1 - K}, \qquad K = \sum_{H_1' \cap \cdots \cap H_p' = \varnothing} \; \prod_{1 \le i \le p} m_i(H_i') \tag{14}$$

where p denotes the number of sources. The parameter K indicates the measure of conflict among the different sources and varies within [0, 1]. The evidence theory considers the relative reliability of separate data sources through 'discounting' of the belief functions. If a body of evidence is given a degree of reliability α, where 0 ≤ α ≤ 1, the basic probability of every hypothesis H of Θ is reduced from m(H) to α·m(H), while the basic probability of Θ increases correspondingly so that the masses still sum to one.

After calculating the basic probabilities for all possible hypotheses by using Eq. (14), the best hypothesis fitting the system conditions can be selected. Because the evidence theory provides interval-valued probabilities, several alternatives exist for expressing a decision rule. Kim and Swain (1995) proposed three formulations for a decision rule in order to select the most probable hypothesis:

Minimum upper expected loss rule:
$$\hat{H} = H_i \quad \text{if } \mathrm{Pls}(H_i) \ge \mathrm{Pls}(H_j), \quad j = 1,\ldots,n \tag{15}$$

Minimum lower expected loss rule:
$$\hat{H} = H_i \quad \text{if } \mathrm{Bel}(H_i) \ge \mathrm{Bel}(H_j), \quad j = 1,\ldots,n \tag{16}$$

Minimum average expected loss rule:
$$\hat{H} = H_i \quad \text{if } \frac{\mathrm{Pls}(H_i) + \mathrm{Bel}(H_i)}{2} \ge \frac{\mathrm{Pls}(H_j) + \mathrm{Bel}(H_j)}{2}, \quad j = 1,\ldots,n \tag{17}$$

where n represents the number of hypotheses.
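A minimal sketch of Dempster's rule (Eq. (14)) for the simple frame used later in this paper: singleton class hypotheses plus the full set Θ (represented by the illustrative key 'THETA'); mass products landing on an empty intersection accumulate the conflict K:

```python
def dempster(m1, m2):
    """Combine two mass functions over singleton hypotheses plus THETA
    (Eq. (14)); assumes the conflict K is strictly less than one."""
    combined, K = {}, 0.0
    for h1, v1 in m1.items():
        for h2, v2 in m2.items():
            if h1 == h2 or h2 == 'THETA':
                h = h1                    # intersection is h1
            elif h1 == 'THETA':
                h = h2                    # intersection is h2
            else:
                K += v1 * v2              # distinct singletons: conflict
                continue
            combined[h] = combined.get(h, 0.0) + v1 * v2
    return {h: v / (1.0 - K) for h, v in combined.items()}

# The decision rules (15)-(17) then select the hypothesis with the
# largest plausibility, belief, or average of the two, respectively.
```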

6. Proposed methodology

A pattern classifier is proposed that can be used for dynamic and multivariate fault diagnosis problems. The training steps of the proposed methodology are shown in Fig. 1. The flow diagram of classifying a new pattern by the proposed methodology is illustrated in Fig. 2.

The input data are the trends of system variables resulting from disturbances (faults) in the system. Training data for the c-th class of events are described by the matrix $T_c$(NP × 2L), where NP is the number of training patterns in the class. The information space covered by the entire set of training patterns is shown by the matrix $T = [T_1 T_2 \ldots T_{NS}]$, where NS represents the number of information sources. The wavelet-based feature extractor proposed by Akbaryan (2000) requires the input pattern to have a fixed and dyadic length. The trends can be corrupted by noise, and the noise intensity may vary with the type of variable and the magnitudes of the faults. Classes of event(s) may also be represented by different numbers of patterns. By using a band-pass Butterworth filter in the second step of the proposed methodology, the possible low frequency components as well as part of the undesired noise are eliminated. The proposed methodology does not intend to remove noise completely, because it is found that the presence of low-magnitude noise helps extract the signal features more effectively; Saito (1994) also addresses this fact in his work. Since the goal is to remove noise partially, employing the classical frequency-based band-pass digital filters does not deteriorate the efficiency of the proposed methodology. The second order band-pass Butterworth filter is found to be the best selection for the filtering of the signals. The main defect of this type of filter, however, is its highly nonlinear phase distortion, which changes the output signal structure significantly. The Butterworth filter is a recursive digital filter that is formulated by a series of difference equations. Having sampled the input signal F(x), the output signal at the n-th sample point can be calculated by:

$$\begin{aligned}
\bar{F}(y_n) &= b_1 F(x_n) + z_1(n-1) \\
z_1(n) &= b_2 F(x_n) + z_2(n-1) - a_2 \bar{F}(y_n) \\
&\;\;\vdots \\
z_{m-2}(n) &= b_{m-1} F(x_n) + z_{m-1}(n-1) - a_{m-1} \bar{F}(y_n) \\
z_{m-1}(n) &= b_m F(x_n) - a_m \bar{F}(y_n)
\end{aligned} \tag{18}$$

The vectors b and a represent the filter coefficients. The parameter m represents the length of the filter coefficients, and n is an index for the sample points. By using simulated signals obtained from the TEP case study, the best normalized cutoff frequencies for the band-pass digital filter are found to be 0.01 and 0.15. These values are obtained by employing the spectrum command provided in the MATLAB® signal processing toolbox. The filter coefficients are listed in Table 1.

Because a finite length input signal is used, the filtering algorithm results in transient effects near the boundaries of the output signal. These undesired effects are avoided if the vector z is initialized properly. The phase distortion of the output signal is eliminated completely if the input signal is filtered in the forward direction and the output signal is then filtered in reverse (Akbaryan & Bishnoi, 2000). It is found that if the dynamic trend of a variable starts coincidentally with the fault occurrence, the Butterworth filter, coupled with the double filtering method, does not always produce a satisfactory output. However, if the trend is preceded by steady state behavior of the variable, the output trend is stripped of the low frequency segments quite satisfactorily.
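For reference, a direct transcription of the difference equations (18), assuming a normalized filter (first denominator coefficient equal to one); with a zero initial state it behaves like scipy.signal.lfilter, while the forward-reverse double filtering described above corresponds to scipy.signal.filtfilt:

```python
import numpy as np

def iir_filter(b, a, x):
    """Recursive filtering per the difference equations (18),
    with internal state vector z (zero-initialized)."""
    m = len(b)
    z = np.zeros(m - 1)
    y = np.zeros(len(x))
    for n, xn in enumerate(x):
        y[n] = b[0] * xn + z[0]
        for k in range(m - 2):
            z[k] = b[k + 1] * xn + z[k + 1] - a[k + 1] * y[n]
        z[m - 2] = b[m - 1] * xn - a[m - 1] * y[n]
    return y
```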

Fig. 1. Flow diagram for training of the proposed multivariate and real-time fault diagnosis methodology.


Fig. 2. Flow diagram for classifying a new pattern by the proposed methodology.

Table 1
Filter coefficients of the second order Butterworth band-pass filter

b                   a
0.03657483584393    1.0
0.0                 −3.36541414066347
−0.07314967168785   4.27514382446722
0.0                 −2.44675896272196
0.03657483584393    0.53719462480110

In Step 4, the selected principal components transform the input patterns into a new and uncorrelated data space. The principal components, which are the variables of the new space, are calculated using the PCA method given the matrix V(NC × 2L, NS) representing the input training data space. NS denotes the number of correlated sensors and NC is the number of classes. To construct the j-th row of the matrix V, a trend of the j-th correlated sensor is selected from each class of event(s), and then these patterns are appended together. The harmonized patterns within a class of deterministic events differ only by the presence of noise; thus choosing any one of them as a prototype for the class does not affect the final PCA results considerably. However, a class of stochastic event(s) can be represented by different trends, so the PCA outputs may change when different training patterns are selected. In order to avoid the discontinuity that may arise by attaching training patterns of two different classes, the patterns should start from and end at the same value. The band-pass filtering of the patterns, i.e. Step 2, is a useful technique to achieve this requirement. In the proposed methodology, the eigenvectors of the correlation matrix for the input data space, i.e. matrix V, serve as the principal components. The principal components are ranked according to their importance. The classes of events are discriminated most efficiently in the new data space if the most dominant principal component is used to transform the input patterns into the new space. In other words, the separability among the classes in the new space deteriorates as the rank of a principal component decreases. The PCA ranks each principal component according to the magnitude of its corresponding eigenvalue, so that the best one has the largest eigenvalue. In this work, the eigenvalue of each principal component, considered as a new and fictitious variable, is normalized according to:

$$\Lambda_i = \frac{\lambda_i}{\sum_{j=1}^{N} \lambda_j} \tag{19}$$

$\Lambda_i$ is used as the measure of reliability for the i-th new variable.
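A sketch of Step 4 under the stated construction: eigen-decompose the correlation matrix of the training data space V (rows are appended class prototypes, columns are the correlated sensors) and use the normalized eigenvalues of Eq. (19) as reliability scores; the function name is illustrative:

```python
import numpy as np

def pca_reliability(V):
    """Return the principal components (eigenvectors of the correlation
    matrix of V) and the normalized eigenvalues of Eq. (19)."""
    R = np.corrcoef(V, rowvar=False)       # sensor-by-sensor correlations
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]      # rank by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs, eigvals / eigvals.sum()
```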

In Step 5, the single-variate wavelet-based feature extractor proposed by Akbaryan (2000) determines the best set of features of an unlabeled trend for every new variable. Two types of wavelet filters, i.e. LS wavelet filters and Classical Wavelet Filters (CWF), and two types of information cost functions for choosing the best wavelet packets, i.e. Sum of Packet Elements (SOE) and Shannon's Entropy (SHE), are employed in the LDB technique coupled with the DWPT method. These four feature extractors are abbreviated as LSFSOE, LSFSHE, CWFSOE and CWFSHE (Akbaryan, 2000). The tree induction methodology (Akbaryan, 2000) is extended in this work. The Direct Metric Tree Induction (DMTI) technique (Utgoff, Berkman & Clouse, 1997) is used to modify an existing tree when a new training pattern is introduced to the tree. The decision tree for the j-th new sensor results in a probability distribution vector $\{P_{j,c} \mid c = 1,\ldots,NC\}$ and a certainty distribution vector $C_j = \{C_{j,c} \mid c = 1,\ldots,NC\}$, calculated by Eq. (5) and Eq. (8) respectively. The consensus theory-based methods simply use the probability distribution matrix $P = \{P_j \mid j = 1,\ldots,NS\}$, combine the columns of each row and produce a final probability distribution vector $P^F = \{P_c^F \mid c = 1,\ldots,NC\}$. The c-th element of $P^F$ shows the probability that the input pattern belongs to the c-th class. In order to use the evidence theory for fusing the classification results of the new sensors, the probability mass functions must be determined first; however, the decision trees only output the certainty factor for each class. If the classes of events are considered independent from each other, i.e. non-overlapping classes, the belief and plausibility that an unlabeled pattern belongs to the c-th class are computed as:

$$m_{j,c} = \mathrm{Bel}_{j,c} = C_{j,c}, \qquad \mathrm{Pls}_{j,c} = 1 - \sum_{\substack{k=1 \\ k \ne c}}^{NC} C_{j,k}, \quad \text{with } \{\forall m \ne n \mid C_n \cap C_m = \varnothing\} \tag{20}$$

For overlapping classes, however, the calculation of $m_{j,c}$ would be more complicated than the above equation. In this case, the plausibility of each class depends not only on the $m_{j,c}$ of the c-th class but also on the probability mass functions of the intersections between the c-th class and the other classes. This situation affects the final result of the multi-sensor fusion algorithm and causes a significant increase in the computation time. In this work, therefore, it is assumed that the classes of events are separated from each other in order to simplify the calculation procedure and decrease the computation time. When the system condition exists within the boundaries of more than one class of events, as in multiple fault diagnosis, the decision rule is expected to score these classes equally. This implies that the intersection of the classes is the best region to locate the event.
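Under the non-overlapping assumption, Eq. (20) turns each tree's certainty vector into a mass function directly. In the sketch below, assigning the leftover mass to Θ as ignorance is an assumed completion (in the spirit of the discounting described in Section 5); the per-sensor masses can then be fused pairwise with the dempster sketch given earlier:

```python
def masses_from_certainties(C_j):
    """Eq. (20) for non-overlapping classes: the certainty C_{j,c} of
    sensor j's tree becomes m_{j,c}; leftover mass goes to THETA."""
    m = {c: v for c, v in enumerate(C_j) if v > 0.0}
    m['THETA'] = max(0.0, 1.0 - sum(m.values()))
    return m
```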

7. Case studies

To present the efficacy of the proposed methodology for fault diagnosis of multivariate systems, three different case studies are examined in this section. The Tennessee Eastman Process (TEP) (Downs & Vogel, 1993) is considered in all the case studies. The TEP is regarded as a reliable benchmark for testing research on control strategies (McAvoy & Ye, 1994; Ricker & Lee, 1995), process optimization (Ricker, 1995), and fault diagnosis (Kassidas, 1997; Oh, Mo, Yoon & Yoon, 1997). The flowsheet of the process, provided by Downs and Vogel (1993), is reproduced as Fig. 3. The reactants A, C, D, E, and inert B enter the process through four feed streams. Three feed streams provide pure reactants A, D, and E. The fourth stream is a mixture of A, C, and B. The plant outlets contain a portion of non-reacted feed components and the products G and H. The process contains four main unit operations: an exothermic two-phase reactor, a flash drum, a compressor, and a reboiled stripper.

The evaluation of process control and pattern classification techniques can be done by the set point changes and load changes listed by Downs and Vogel (1993). These disturbances (faults) cause the measured variables to follow either deterministic or stochastic trends. A plant-wide control scheme (McAvoy & Ye, 1994) is utilized in this work for keeping the system variables close to their settings. The training and test data sets are simulated by using the TEP simulator of Downs and Vogel (1993). The length of the patterns in the training and test sets is selected as 256; 100 observations are simulated for each class of events for training the proposed methodology. The parameters for the LS-based wavelet filters, i.e. N and Ñ (Akbaryan & Bishnoi, 2000), are both selected as 4. The Daubechies10 (D10) wavelet filters are used in the CWF-based feature extractors.

Fig. 3. The Tennessee Eastman challenge process.

7.1. Case 1

In this case study, seven deterministic step-type faults, as listed in Table 2, disturb the steady-state condition of the TEP individually. These faults are caused by step changes in plant inputs or by faulty valves. The fault D4 results in changes not only in the reactor feed flow rate but also in the flow rate of the cooling water used in the condenser. The magnitude of all the faults, except the fault D4, could be either positive or negative, whereas the magnitude of the D4 is always negative. Therefore, 13 different faults are considered in this case study. The simulation of the TEP is carried out for 60 h, and a fault occurs after 30 h of steady state condition is sampled for each measured and manipulated variable.

Table 2
Selected faults in the TEP used for examining the proposed methodology

Fault ID  Fault description
D1        A/C feed ratio high, B composition constant (Stream 4)
D2        B composition high, A/C ratio constant (Stream 4)
D3        C header pressure loss (Stream 4)
D4        Recycle flow (Stream 8); control valve stuck low
D5        E feed (Stream 3); control valve stuck high
D6        A feed (Stream 1); control valve stuck high
D7        Purge gas (Stream 9); control valve stuck high


Table 3
Selected measured variables for monitoring the TEP

No.  Variable name                    Unit
1    A feed (Stream 1)                kscmh
2    D feed (Stream 2)                kg h−1
3    E feed (Stream 3)                kg h−1
4    A and C feed (Stream 4)          kscmh
5    Recycle flow (Stream 8)          kscmh
6    Reactor feed rate (Stream 6)     kscmh
7    Reactor pressure                 kPa
8    Reactor temperature              °C
9    Purge rate (Stream 9)            kscmh
10   Product separator temperature    °C
11   Product separator pressure       kPa
12   Product separator underflow      m3 h−1
13   Stripper pressure                kPa
14   Stripper underflow               m3 h−1
15   Stripper temperature             °C
16   Stripper steam flow              kg h−1
17   Compressor work                  kW
18   Reactor C.W. temperature         °C
19   Separator C.W. temperature       °C
20   B in purge gas                   mol%
21   G in Stream 11                   mol%
22   H in Stream 11                   mol%
23   D feed valve                     %
24   E feed valve                     %
25   A feed valve                     %
26   A and C feed valve               %
27   Purge valve                      %
28   Stripper liquid product valve    %
29   Stripper steam valve             %
30   Reactor cooling water valve      %

Fig. 4 illustrates the trends of some uncorrelated variables when the first fault D1, with a magnitude of 0.6, occurs in the TEP. The leading new variables, i.e. var1 and var3, show smooth deterministic trends in which the fault occurrence is clearly detectable. However, the less important new variables, i.e. var6, var9 and var12, exhibit noisy patterns, so the fault effect is masked significantly by noise. This observation complies with the fact that the PCA method transfers noise to the less important new coordinates.

Thirteen classes of system behavior resulting from the 13 single faults are studied in this case study. Table 5 shows the identification of these classes along with the type and sign of the corresponding faults. A class resulting from a fault with a positive sign is represented with a '+' sign. For example, class C2− results from fault D2 with a negative magnitude.

To simulate the training patterns for the 13 classes of faults, the magnitudes of their corresponding faults are selected according to Table 6.

The four feature extraction methods are utilized for each new variable. The configuration of the most discriminant pattern windows is different for each new variable; moreover, it changes with the type of feature extractor. The configurations of the DWPT for the second new variable, for instance, are presented in Fig. 5, where the gray rectangles are the best packets within each selected window.

The PCA method is able to transform the selected features, i.e. wavelet packet coordinates, into a new set of features. The number of new features varies with the type of wavelet filters and the information cost function used in the feature extractor method. For the first new variable, the LSFSOE feature extractor results in seven new features, while the LSFSHE method describes the feature space by 25 variables. If the CWFSOE and CWFSHE techniques are used for extracting features, the new space for the first new sensor is described by 20 and 21 coordinates respectively. The fairly low number of features for the first new variable illustrates that the true features are easily detectable. However, the less important new variables require more features for describing the known classes of events. For instance, the LSFSOE extracts 17 and 33 features in order to characterize the third and sixth new variables. The size of a decision tree depends on the efficacy of the extracted features for separating the different classes of events.

Thirty manipulated and measured variables are utilized in order to have enough resolution about the system behavior. These variables, listed in Table 3, are accompanied by noise of different intensity. The noise amplitude for each variable varies with the type and magnitude of the fault. By decreasing the absolute magnitude of a fault, the main features of the signals become more masked by noise.

The band-pass Butterworth filter is employed to suppress the noise and step-like trends of the patterns. Having standardized the patterns, the PCA transforms the redundant information space into a new uncorrelated space. It is found that 95% of the variation in the data set can be captured by 17 principal components. Table 4 lists the normalized eigenvalues associated with the selected new variables.

Table 4
Normalized scores of selected new variables

Variable ID        var1   var2   var3   var4   var5   var6   var7   var8   var9
Variable score, Λ  0.412  0.139  0.081  0.059  0.049  0.036  0.032  0.031  0.029

Variable ID        var10  var11  var12  var13  var14  var15  var16  var17
Variable score, Λ  0.023  0.020  0.017  0.016  0.014  0.013  0.012  0.011


Fig. 4. Trends of five uncorrelated variables when the D1 occurs in the TEP.

The first new variable is the most reliable source of information, so its decision tree has the lowest number of decision nodes. The more irrelevant information is introduced by a new variable, the more decision nodes are required to separate the different classes of events. Therefore, the decision trees of the less important new variables contain more nodes than those of the leading new variables. Fig. 6 and Table 7 illustrate the structure of the decision tree related to var1 when the LSFSOE feature extractor provides the instances for the tree. By using the LSFSOE feature extractor, the decision trees of the third, sixth, ninth and 12th new variables have 29, 465, 581 and 593 decision nodes respectively. The decision tree for var1, when the LSFSHE, CWFSOE and CWFSHE methods are used, has 25, 45 and 35 nodes respectively.

Table 7 shows that the decision tree mostly uses the first and the second attributes, i.e. extracted features, as the best attributes for the decision nodes. Fig. 7(A) illustrates the distributions of these attributes within the 13 classes of events when all of the training instances are used. Each class is clearly separated from the others, so the uncertainty of classification is negligible.


Table 5
Definition of the thirteen classes of the TEP behavior

Class ID  C1+  C1−  C2+  C2−  C3+  C3−  C4
Fault ID  D1   D1   D2   D2   D3   D3   D4
Sign      +    −    +    −    +    −    −

Class ID  C5+  C5−  C6+  C6−  C7+  C7−
Fault ID  D5   D5   D6   D6   D7   D7
Sign      +    −    +    −    +    −

Fig. 5. The configuration of the four DWPTs resulting from the four feature extractors.

Fig. 7(B) shows the distribution of the first and second features, resulting from the LSFSHE method, in the 13 classes of events given by the first new variable. It is noted that the LSFSHE feature extractor achieves similar distributions for the first and second attributes as the LSFSOE method does. The same number of nodes for the LSFSOE-based and the LSFSHE-based decision trees used for the first new variable verifies this observation. The CWFSOE-based decision tree for the first new variable mostly uses the first three features to establish the class boundaries. In contrast to Fig. 7(A) and (B), the distributions of these features in Fig. 7(C) and (D) are not completely isolated for each class, so the tree requires more decision nodes to separate the classes. The same situation is also noted for the feature distributions when the CWFSHE feature extractor is applied to find the most discriminant features.

As the test patterns, 25 signals of the same length as the training patterns are simulated for each test set. Each test pattern depicts the occurrence of a single fault. Table 8 lists the type and magnitude of the fault for the 53 test sets.

Having obtained the classification results from the decision tree of each new variable, the consensus theory or the evidence theory is employed to provide the final classification results. The influence of each decision tree on the final output changes with the reliability score of the corresponding sensor. Thus the decision tree for var1 has the most dominant effect on the final results. Figs. 8 and 9 present the performance of the LIOP and GMF multisensor data fusion methods, respectively, when each of the 53 test sets is examined.

The number of misclassified patterns for each test set is shown in these figures. Although the LSFSHE-, CWFSOE- and CWFSHE-based pattern classification methods show satisfactory performance, the LSFSOE-based approach performs better than the other classification methods. It is noted that the computation time for the LS-based fault diagnosis techniques is much less than that of the D10-based diagnosis.

Fig. 10 illustrates the results of the fault classification when the evidence theory is used for combining the multi-variable information. The accuracy of the outputs is as satisfactory as that of the consensus theory-based techniques. The LSFSOE-based pattern classification method again shows better performance than the other methods. A Pentium II® 466 MHz processor is used for the computations. The required time for diagnosis of all 53 test sets, based on any of the new variables, is 55 min when any of the LS-based feature extractors is used, whereas the computation time with any of the D10-based feature extractors is found to be 170 min. Thus the fault diagnosis technique based on the LS filters is almost three times faster than that based on the D10 wavelet filters; the same is found in the training phase.

Table 6
Magnitude of the single faults in the 13 classes of faults

Class ID    C1+  C1−    C2+  C2−    C3+  C3−   C4     C5+  C5−   C6+  C6−   C7+  C7−
Fault size  0.6  −0.25  0.6  −0.25  0.6  −0.3  −0.07  0.7  −0.5  0.5  −0.4  0.7  −0.5


Fig. 6. Structure of the tree for the first new sensor and the LSFSOE feature extractor.

The noteworthy deficiency of the evidence theory is the required processing time, which is significantly larger than those of the LIOP- and GMF-based approaches. To determine a value for Eq. (14), $(NC+1)^{NS}$ intersections between the classes must be computed; for this case study that is $14^{17}$ intersections, which is computationally prohibitive. Using the Pentium II® 466 MHz processor, the required time is 3 h for combining the classification results obtained from the first six leading new variables.

7.2. Case 2

This case study aims to demonstrate the ability of the proposed methodology for diagnosing simultaneous deterministic faults. Since the measured variables are affected by a group of faults, the resulting trends may be quite different from those caused by each fault separately. To have a robust multiple fault classification technique, the training sets should contain a wealth of information about the system behavior caused by different sets of faults. This goal is hard to achieve because every combination of faults, based on fault magnitudes and/or types, results in a different system behavior. This case study selects four deterministic faults, noted as D1, D2, D3 and D4 in Table 2. The magnitudes of the D1, D2 and D3 could be positive or negative, whereas the magnitude of the D4 is always negative. Therefore, seven different single faults, and hence seven classes of system behavior, are considered in this case study. The notation of these classes, along with the type and sign of the corresponding fault, is given in Table 9.


Table 7
Specification of the tree shown by Fig. 6

Node ID  Attribute  t−      T       t+      Class ID  No. of instances
1        1          −19.32  −17.35  −16.65
2                                           C3−       100
3        1          −11.94  −10.12  −9.38
4                                           C5−       100
5        1          −1.91   −1.89   −1.88
6                   −5.86   −4.32   −3.80
7        2          0.005   2.91    4.07
8        1          3.96    4.57    4.66
9        1          −6.24   −4.72   −4.20
10       2          2.62    4.74    5.55
11       1          8.66    10.45   11.15
12                                          C6+       100
13                                          C7−       100
14                                          C2−       100
15                                          C4        100
16       4          −1.57   0.64    1.31
17                                          C7+       100
18       2          1.15    2.55    3.10
19       1          14.82   18.46   19.75
20                                          C3+       100
21                                          C1−       100
22                                          C1+       100
23                                          C6−       100
24                                          C5+       100
25                                          C2+       100


Fig. 7. Distribution of two leading features in the 13 classes of events if (A) LSFSOE is used to extract the features, (B) the LSFSHE feature extractor is used, (C) CWFSHE is used to separate the features; (D) distribution of the second and third features if the CWFSHE feature extractor is used.


To simulate the training patterns, the magnitudes of the faults corresponding to classes C1+, C2+, C3+, C1−, C2−, C3− and C4− are 0.5, 0.5, 0.5, −0.2, −0.2, −0.2 and −0.1, respectively. Both LS filter parameters are set equal to 4. The band-pass Butterworth filter is employed to remove noise and step-like trends from the input patterns. Having harmonized the output patterns, PCA transforms the correlated information space into a new uncorrelated space. It is found that 16 new variables capture 95% of the variation in the data set. Table 10 lists the normalized eigenvalues associated with the selected new variables.
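A minimal sketch of this PCA step, assuming a harmonized pattern matrix X (rows are observations) and the 95% variance threshold; the random data below merely stand in for the TEP patterns:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 33))        # stand-in for harmonized patterns
X = (X - X.mean(axis=0)) / X.std(axis=0)  # zero mean, unit variance

# Eigen-decomposition of the correlation matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]         # descending eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Normalized eigenvalues serve as the reliability scores
# (the lambda values reported in Tables 10 and 15).
scores = eigvals / eigvals.sum()
n_keep = int(np.searchsorted(np.cumsum(scores), 0.95) + 1)

new_vars = X @ eigvecs[:, :n_keep]        # uncorrelated "new sensors"
print(n_keep, scores[:n_keep].sum())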

The new variables with low reliability scores carry more irrelevant information in their output data streams. The four feature extractors are used for multiple fault diagnosis based on the trends of the 16 new variables. For the first new variable, the dimension of the feature space defined by the LSFSOE-, LSFSHE-, CWFSOE- and CWFSHE-based feature extraction methods is found to be 19, 37, 25 and 20, respectively. Fig. 11(A) and (B) illustrates the distributions of the first two leading features for each class of events when the LSFSOE and CWFSOE feature extractors are used, respectively. These features are mostly used to construct the corresponding decision trees. Based on the values of the first two leading features, the CWFSOE feature extractor, as seen in the figures, separates the seven classes better than the LSFSOE method; however, the LSFSOE-based decision tree uses the fifth leading feature to resolve this problem. The LSFSOE-based decision tree consists of 13 nodes, and the CWFSOE-based tree has 17 decision nodes.


Table 11 lists the 16 test sets in which multiple faults with different magnitudes occur in the TEP; the magnitudes and types of the contributing faults are shown for each set. Each test set contains 15 faulty patterns.

It is desired that a fault diagnosis method not only identifies the types of faults correctly, but also ranks them according to their importance (magnitudes). Table 12 illustrates the results of fault diagnosis when the LIOP approach is employed for multi-sensor data fusion.


Table 8
Selected magnitudes of single faults for simulating test cases in Case 1

Test  Fault  Size   Test  Fault  Size    Test  Fault  Size    Test  Fault  Size
1     C1+    0.3    15    C3+    0.3     29    C3−    −0.35   43    C6+    0.6
2     C1+    0.5    16    C3+    0.5     30    C3−    −0.2    44    C6−    −0.5
3     C1+    0.8    17    C3+    0.8     31    C4     −0.15   45    C6−    −0.3
4     C1+    0.9    18    C3+    0.9     32    C4     −0.09   46    C6−    −0.2
5     C1+    1.1    19    C3+    1.1     33    C4     −0.02   47    C7+    0.3
6     C1+    1.3    20    C3+    1.3     34    C5+    0.3     48    C7+    0.5
7     C1+    1.5    21    C3+    1.5     35    C5+    0.5     49    C7+    0.9
8     C2+    0.3    22    C1−    −0.6    36    C5+    0.9     50    C7+    1.2
9     C2+    0.5    23    C1−    −0.35   37    C5+    1.2     51    C7−    −0.75
10    C2+    0.8    24    C1−    −0.2    38    C5−    −0.75   52    C7−    −0.6
11    C2+    0.9    25    C2−    −0.6    39    C5−    −0.6    53    C7−    −0.3
12    C2+    1.1    26    C2−    −0.35   40    C5−    −0.3
13    C2+    1.3    27    C2−    −0.2    41    C6+    0.2
14    C2+    1.5    28    C3−    −0.6    42    C6+    0.4


Fig. 8. Fault diagnosis results, shown by No. of Misclassified Patterns (MCP), for 53 test sets based on the LIOP data fusion method and (A) LSFSOE, (B) LSFSHE, (C) CWFSOE, (D) CWFSHE feature extractors.


If more than one class is the cause of the faulty behavior, the classes are sorted from left to right according to their importance. For some test sets, the methodology fails to propose a unique diagnosis; these cases are marked 'Fail' in Table 12. The table shows that the LSFSOE-based fault diagnosis method performs much better than the other diagnosis methods. For all of the test sets, this method is able to find at least one of the contributing faults. Moreover, it ranks the faults according to their magnitudes correctly for most of the cases. The performances of the other diagnosis techniques are also acceptable, since in most cases they propose the desired combination of faults for the system.

7.3. Case 3

Two stochastic faults, which are listed in Table 13, disturb the steady-state conditions of the TEP.

Kassidas (1997) used Stoc1 and a slow drift in the reaction kinetics as stochastic faults for assessing the performance of his statistics-based fault diagnosis method. The simulated patterns for the latter fault are structurally similar to those produced by the Stoc2 fault. In his model, the auto-correlation and cross-correlation coefficients of a pattern are considered as the extracted features.


A similarity assessment scheme, based on the dynamic time warping technique, is the feature classifier. This approach assumes that: (1) a stochastic fault results in a consistent correlation pattern in the process variables; and (2) patterns in a class of stochastic events show similar correlation patterns. Since the validity of these assumptions is questionable for the TEP, this method fails to provide satisfactory diagnosis results for the TEP (Kassidas, 1997).
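For contrast with the wavelet-based features used in this work, the following Python sketch illustrates correlation-type features of the kind Kassidas employed: auto- and cross-correlation coefficients of a multivariate pattern at a few lags. The lag set and the normalization are illustrative assumptions, not his exact recipe:

import numpy as np

def correlation_features(pattern, lags=(1, 2, 5, 10)):
    """Auto- and cross-correlation coefficients of a multivariate
    pattern (n_samples x n_vars) at a few lags, stacked as a vector."""
    x = (pattern - pattern.mean(axis=0)) / pattern.std(axis=0)
    n, m = x.shape
    feats = []
    for lag in lags:
        a, b = x[:-lag], x[lag:]
        # (m x m) matrix of cross-correlations at this lag; the diagonal
        # holds the auto-correlation coefficients.
        feats.append((a.T @ b) / (n - lag))
    return np.concatenate([f.ravel() for f in feats])

demo = np.cumsum(np.random.default_rng(1).standard_normal((500, 3)), axis=0)
print(correlation_features(demo).shape)   # lags * m * m = (36,)

Such features only characterize a fault well if its correlation structure is reproducible from pattern to pattern, which is exactly the assumption that breaks down for the TEP.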

Two classes of system behavior, denoted C1 and C2, result from Stoc1 and Stoc2, respectively. The magnitudes of the faults are considered to be always positive. To simulate the training patterns, the magnitudes of the faults Stoc1 and Stoc2 are both set equal to 1.0. The operating time for simulating the input patterns is 60 h, and the stochastic fault occurs after 30 h of steady-state operation has been sampled. In this case study, two examples are investigated. In Example 3.1, the feature extractor and classifier for a variable are trained on 100 noisy training patterns for each class of events. Example 3.2 constructs the feature extractor and classifier of a variable from 20 noisy training patterns for each class; the feature classifier is then restructured with 30 new training patterns for each class. As the test patterns, 50 signals are simulated for each test set. Table 14 illustrates the type and magnitude of fault for the ten test sets used in Examples 3.1 and 3.2. The training and test sets are corrupted by noise whose intensity varies with the type of fault and the measured process variable.

Fig. 9. Fault diagnosis results, shown by No. of Misclassified Patterns (MCP), for 53 test sets based on the GDF data fusion method and (A) LSFSOE, (B) LSFSHE, (C) CWFSOE, (D) CWFSHE feature extractors.


Fig. 10. Fault diagnosis results, shown by No. of Misclassified Patterns (MCP), for 53 test sets based on the evidence theory and (A) LSFSOE, (B) LSFSHE, (C) CWFSOE, (D) CWFSHE feature extractors.

Table 9
Definition of seven classes of the TEP behavior

Class ID  C1+  C1−  C2+  C2−  C3+  C3−  C4−
Fault ID  D1   D1   D2   D2   D3   D3   D4
Sign      +    −    +    −    +    −    −



Table 10
Normalized scores of selected new variables

Variable ID        var1   var2   var3   var4   var5   var6   var7   var8
Variable score, λ  0.415  0.143  0.078  0.061  0.051  0.038  0.036  0.031
Variable ID        var9   var10  var11  var12  var13  var14  var15  var16
Variable score, λ  0.027  0.021  0.020  0.017  0.016  0.015  0.012  0.012


7.3.1. Example 3.1

The noisy input patterns are passed through the second-order Butterworth filter in order to decrease the noise intensity and filter out ramp-like trends. The output patterns for each class are then scaled to unit variance and zero mean. Fourteen uncorrelated new variables capture 95% of the total variation of the training data set. The reliability scores of these new variables, i.e. their normalized eigenvalues λ, are listed in Table 15.
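A sketch of this filtering step using SciPy's Butterworth design; the second-order band-pass structure follows the text, while the sampling interval (3 min) and the cutoff frequencies are illustrative assumptions:

import numpy as np
from scipy.signal import butter, filtfilt

fs = 1.0 / 180.0            # assumed sampling rate (one sample per 3 min), Hz
low, high = 1e-5, 5e-4      # illustrative cutoffs: suppress ramps and noise

# Second-order band-pass Butterworth filter; Wn is normalized to Nyquist.
b, a = butter(N=2, Wn=[low / (fs / 2), high / (fs / 2)], btype="bandpass")

t = np.arange(1200) / fs    # 1200 samples covering roughly 60 h
ramp = 0.001 * t
oscillation = np.sin(2 * np.pi * 2e-4 * t)
noise = 0.3 * np.random.default_rng(2).standard_normal(t.size)
signal = ramp + oscillation + noise

filtered = filtfilt(b, a, signal)   # zero-phase filtering of one pattern

The low cutoff removes the slow ramp-like drift that would otherwise defeat the unit-variance/zero-mean harmonizing step, and the high cutoff attenuates the measurement noise.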

The four feature extractors explore the feature space to find the best set of discriminant features. The number of selected features varies with the type of feature extractor; for example, the stochastic trends of the first new variable are described by 46, 45, 37 and 41 features when the LSFSOE, LSFSHE, CWFSOE and CWFSHE methods are used, respectively. Fig. 12(A), (B), (C) and (D) illustrates the distributions of the first two leading features, obtained by the four feature extractors, within the two classes of stochastic events. These features (Feature 1, Feature 2) account for (12.5%, 10.5%), (12.6%, 11.3%), (48%, 3.6%) and (24%, 12.5%) of the total variation for the LSFSOE, LSFSHE, CWFSOE and CWFSHE methods, respectively.
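A compact illustration of wavelet-packet candidate features in Python with PyWavelets: the energy of each terminal packet node serves as a candidate coordinate, and a Fisher-like score, used here only as a stand-in for the paper's entropy-based discriminant measures, ranks them:

import numpy as np
import pywt

def packet_energies(signal, wavelet="db10", level=3):
    """Energy of each terminal wavelet-packet node: one candidate
    feature per node (frequency-ordered)."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="freq")
    return np.array([np.sum(node.data ** 2) for node in nodes])

# Illustrative two-class training data: 20 patterns per class.
rng = np.random.default_rng(3)
class_a = [np.sin(0.2 * np.arange(256)) + rng.standard_normal(256)
           for _ in range(20)]
class_b = [np.sin(0.8 * np.arange(256)) + rng.standard_normal(256)
           for _ in range(20)]

fa = np.array([packet_energies(s) for s in class_a])
fb = np.array([packet_energies(s) for s in class_b])

# Rank candidate coordinates by a Fisher-like separation score.
score = (fa.mean(0) - fb.mean(0)) ** 2 / (fa.var(0) + fb.var(0) + 1e-12)
best = np.argsort(score)[::-1][:5]
print("most discriminant packet nodes:", best)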

Contrary to the deterministic fault diagnosis problems, even the first two leading features are not well separated, which may cause poor performance of the feature classifier. Having established a decision tree for each new system variable, the test sets are introduced to the proposed pattern classification algorithm. According to the specifications of the selected wavelet coordinates for each new variable, the features are extracted and classified into one of the two classes. The LIOP, GDF and evidence theory are employed to combine the classification results of the individual uncorrelated variables. Fig. 13 illustrates the results of applying these methods.

The LS-based classification methods, i.e. LSFSOE and LSFSHE, show better overall performance than the CWF-based approaches, especially for the last five test sets, where Stoc2 is the source of the faulty behavior. The consensus theory-based methods give more favorable results with LSFSHE than with the LSFSOE technique; the evidence theory, however, improves the performance of the LSFSOE approach. The CWF-based methods coupled with the evidence theory show better performance than the CWF-based methods coupled with the consensus theory-based techniques.

7.3.2. Example 3.2

Two stochastic patterns may have quite different structures, so that, even more than in the deterministic fault diagnosis problems, a wealth of training patterns is needed to ensure a reliable and efficient pattern classifier. For some cases, however, the history of the system cannot provide enough information to train a pattern classifier efficiently. This example simulates a scenario in which the number of initial training patterns is limited to 20 observations for each class of events.

Fig. 11. (A) Distributions of the first two features calculated by the LSFSOE method; (B) distributions of the first two features calculated by the CWFSOE method.


Table 11
Magnitude and type of faults occurring simultaneously in Case 2

Test  Fault ID  Size   Fault ID  Size     Test  Fault ID  Size   Fault ID  Size
1     D1        1.0    D2        0.8      9     D1        −0.6   D3        −0.3
2     D1        0.7    D2        1.2      10    D2        −0.3   D3        −0.6
3     D1        1.0    D3        0.8      11    D2        −0.6   D3        −0.3
4     D2        1.0    D3        0.8      12    D1        −0.6   D3        1.0
5     D1        −0.1   D2        −0.3     13    D2        −0.6   D3        1.2
6     D1        −0.3   D2        −0.1     14    D2        1.2    D4        −0.1
7     D1        −0.3   D2        −0.6     15    D1        −0.6   D4        −0.1
8     D1        −0.3   D3        −0.6     16    D1        −0.9   D2        0.1
                                                D3        1.0    D4        −0.25

Table 12
Results of multiple fault diagnosis, shown by the number of correctly classified patterns, using the four feature extraction methods. The most probable classes of events are shown within brackets.

Test  LSFSOE                   LSFSHE           CWFSOE           CWFSHE
1     15 (C1+, C2+)            15 (C1+, C2+)    15 (C1+, C2+)    15 (C1+, C2+)
2     15 (C2+, C1+)            15 (C1+, C2+)    14 (C1+, C2+)    Fail
3     15 (C1+, C3+)            Fail             15 (C1+, C3+)    12 (C1+, C3+)
4     15 (C2+, C3+)            15 (C3+, C2+)    15 (C3+)         14 (C2+, C3+)
5     15 (C2−)                 Fail             Fail             13 (C2−, C1−)
6     15 (C1−, C2−)            15 (C1−, C2−)    15 (C1−)         14 (C1−, C2−)
7     15 (C2−, C1−)            15 (C2−, C1−)    15 (C2−)         10 (C1−, C2−)
8     15 (C1−, C3−)            15 (C1−, C3−)    14 (C3−, C1−)    13 (C1−, C3−)
9     15 (C1−, C3−)            13 (C1−)         14 (C1−)         15 (C1−)
10    15 (C2−, C3−)            12 (C2−, C3−)    15 (C3−)         14 (C3−)
11    15 (C2−)                 Fail             14 (C3−, C2−)    Fail
12    15 (C1−, C3+)            15 (C3+)         15 (C3+)         Fail
13    15 (C3+, C2−)            15 (C3+)         13 (C3+)         11 (C3+, C2−)
14    15 (C4−)                 Fail             15 (C4−)         15 (C4−)
15    15 (C4−, C1−)            Fail             15 (C4−)         Fail
16    15 (C1+, C4−, C3+, C2+)  Fail             15 (C4−)         11 (C4−)

Having constructed the feature extractor and classifier, they are used for fault diagnosis. Since more training data can be collected as the process operation continues, the fault diagnosis technique can benefit from the new data. Because the proposed binary tree classifier can implement the ITI or DMTI methodology to partition the information space, the new training data can be incorporated into the tree structure. The feature extractor, in contrast, lacks the ability to be updated dynamically: a new setting for the feature extractor has to be defined from all the accumulated data. In this example, therefore, only the decision tree for each variable is updated, whereas the feature extractors remain those constructed with the initial training data set. The number of new variables and their normalized scores are the same as those defined in the previous example.
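Since ITI/DMTI incremental induction (Utgoff et al., 1997) is not available in common libraries, the following Python stand-in reaches the same end state by accumulating the labeled feature vectors and rebuilding the tree; it lacks ITI's efficient in-place restructuring, and sklearn's CART is used in place of the paper's classifier:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class AccumulatingTree:
    """Rebuild the decision tree as labeled feature vectors accumulate:
    a simple stand-in for ITI/DMTI-style incremental restructuring."""

    def __init__(self):
        self.X = None
        self.y = None
        self.tree = DecisionTreeClassifier()

    def update(self, X_new, y_new):
        # Append the new labeled features and retrain on everything seen.
        if self.X is None:
            self.X, self.y = np.asarray(X_new), np.asarray(y_new)
        else:
            self.X = np.vstack([self.X, X_new])
            self.y = np.concatenate([self.y, y_new])
        self.tree.fit(self.X, self.y)
        return self

# Illustrative data: 20 initial patterns per class, then 30 more per class.
rng = np.random.default_rng(4)
X0 = rng.standard_normal((40, 5)); X0[20:] += 2.0
y0 = np.repeat([0, 1], 20)
X1 = rng.standard_normal((60, 5)); X1[30:] += 2.0
y1 = np.repeat([0, 1], 30)

clf = AccumulatingTree().update(X0, y0).update(X1, y1)
print(clf.tree.tree_.node_count, "nodes after restructuring")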

The distributions of the first two features for the classes of events are shown in Fig. 14. The extracted features, as shown in Fig. 14, cover smaller regions of the classes than the features shown in Fig. 12.

Table 13
Description of two stochastic faults used in Case 3

Fault ID  Description
Stoc1     Random variation of composition in Stream 4
Stoc2     Random variation of reactor cooling water inlet temperature

Table 14
Specification of 10 test sets for Examples 3.1 and 3.2

Test  Fault ID  Size   Test  Fault ID  Size
1     Stoc1     0.5    6     Stoc2     0.5
2     Stoc1     0.8    7     Stoc2     0.8
3     Stoc1     0.4    8     Stoc2     0.4
4     Stoc1     1.2    9     Stoc2     1.2
5     Stoc1     1.4    10    Stoc2     1.4


Table 15
Normalized scores of selected new variables

Variable ID        var1   var2   var3   var4   var5   var6   var7
Variable score, λ  0.384  0.090  0.078  0.045  0.039  0.034  0.030
Variable ID        var8   var9   var10  var11  var12  var13  var14
Variable score, λ  0.025  0.021  0.019  0.015  0.014  0.012  0.010

Fig. 12. Distribution of the first two features within the two classes of faults when: (A) LSFSOE is used, (B) LSFSHE is used, (C) CWFSOE is used, (D) CWFSHE is used.

Using the first set of training patterns, the decision trees for the new variables are constructed. For the first new variable, the pruned decision tree is found to have 5, 5, 9 and 5 nodes when the LSFSOE, LSFSHE, CWFSOE and CWFSHE feature extractors are employed, respectively. Since the feature space is not completely populated, a few partitions, i.e. decision nodes, can cover all of the information space.

Fig. 15(A), (B) and (C) describes the performance of the proposed pattern classifier when LIOP, GDF and the evidence theory, respectively, are used for multisensor data fusion. The numbers of misclassified patterns for the test sets related to Stoc1 are lower than those for the Stoc2 test sets, which indicates that the decision trees favor class C1 over class C2. It is noted that the type of feature classifier has a considerable effect on the final classification results. Also, the differences between the first and second five test sets are more pronounced when the evidence theory is employed for data fusion.

Consider that 30 new training observations are collected during plant operation. Using the wavelet coordinates obtained from the initial training set, the best features of these new patterns are extracted and fed into the decision trees constructed for each new variable.


By use of the DMTI methodology, each decision tree is restructured dynamically, so that new sets of decision nodes are obtained. The MDL-based direct metric (Utgoff et al., 1997) is used for selecting the most informative attribute and its corresponding test at each decision node. The decision tree for the first variable is found to have 13, 15, 11 and 9 nodes when the LSFSOE, LSFSHE, CWFSOE and CWFSHE feature extractors are used, respectively. Using the new decision trees, the results of fault classification for the ten test sets based on the LIOP, GDF and evidence theory methods are illustrated in Fig. 16(A), (B) and (C), respectively. It is noted that the difference in the number of misclassified patterns between the first and the second five test sets decreases, because the decision trees acquire more knowledge about the distribution of patterns in the two classes of events. The performance of the CWFSHE-based fault classifier is better than that of the other fault classifiers when the consensus theory-based methods are used for fusing the multi-source classification results. When the evidence theory is used, i.e. Fig. 16(C), the performance of CWFSHE is good, although not the best, compared with the other methods.

Although the feature extractors do not take the new information into account, the performance of all four fault classifiers improves for all the multi-source data fusion techniques. This observation verifies that the decision tree technique increases the performance of a pattern classifier by discarding worthless information.

Fig. 13. Classification results, shown by No. of Misclassified Patterns (MCP), for 10 test sets of single stochastic faults if (A) LIOP is used, (B) GDF is used, (C) evidence theory is used.


Fig. 14. Distribution of the first two features within the two classes of faults when: (A) LSFSOE is used, (B) LSFSHE is used, (C) CWFSOE is used, (D) CWFSHE is used.

8. Conclusions

The proposed pattern recognition methodology succeeds satisfactorily in identifying the source(s) of faulty behavior in a complex chemical plant. The input data set comprises noisy variations of process variables disturbed by deterministic or stochastic fault(s). The method extracts the most valuable information in two consecutive steps. The measured process variables, represented by sensors, are usually correlated with each other; thus, in the first step, PCA discards redundant information by transforming the correlated sensors into a set of new and uncorrelated sensors. The results show that PCA almost halves the number of system variables; the number of output variables depends on the complexity of the input data space and the noise intensity. Second, the set of wavelet coordinates that discriminates the classes of events most efficiently amongst the wavelet coordinates is extracted by a wavelet packet-based technique. It is found that the number of extracted features is much less than the dimension of the data space. The feature extractor is able to determine the best configuration of pattern windows so that the maximum resolution of the system behavior can be achieved. The results of single fault diagnosis, based on any of the four feature extractors, are within acceptable ranges. Moreover, it is noted that the type of multi-sensor data fusion technique affects the final diagnosis results.


Comparison of the classification results based on the four different feature extraction algorithms indicates that no feature extraction method can be rejected in favor of the others. However, the computation time of the LS-based feature extraction methods is much less than that of the Daubechies10-based techniques. Inferences about multiple simultaneous faults can be generated from training data for single faults only. The methodology is able to find at least one of the contributing faults; its performance depends on the magnitude and type of the faults, and the technique of feature extraction also affects the diagnosis results. It is noted that the evidence theory-based method, even after some simplifications, demands much more computation time than the consensus theory-based techniques.

Acknowledgements

The authors acknowledge the financial support provided by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Ministry of Education of Iran.

Fig. 15. Classification results, shown by No. of Misclassified Patterns (MCP), for 10 test sets of single stochastic faults when 20 observations are used for each class and if (A) LIOP, (B) GDF, or (C) evidence theory is used for multi-source data fusion.


Fig. 16. Classification results, shown by No. of Misclassified Patterns (MCP), for 10 test sets of single stochastic faults after the decision trees are restructured with 30 additional observations per class and if (A) LIOP, (B) GDF, or (C) evidence theory is used for multi-source data fusion.

References

Akbaryan, F., & Bishnoi, P. R. (2000). Smooth representation of trends by a wavelet-based technique. Comput. Chem. Eng., 24, 1913–1943.

Akbaryan, F. (2000). Fault diagnosis of dynamic multi-variate chemical processes using pattern recognition, and smooth representation of trends by a wavelet-based technique. Doctoral dissertation. University of Calgary, Canada: Department of Chemical & Petroleum Engineering.

Bakshi, B. R., & Stephanopoulos, G. (1994). Representation of process trends. IV. Induction of real-time patterns from operating data for diagnosis and supervisory control. Comput. Chem. Eng., 18, 303–332.

Bakshi, B. R., & Stephanopoulos, G. (1996). Compression of chemical process data by functional approximation and feature extraction. Am. Inst. Chem. Eng. J., 42, 477–492.

Benediktsson, J. A., & Swain, P. H. (1992). Consensus theoretic classification methods. IEEE Trans. Syst. Man Cybern., 22, 688–704.

Downs, J. J., & Vogel, E. F. (1993). A plant-wide industrial process control problem. Comput. Chem. Eng., 17, 245–255.

Englehart, K. (1998). Signal representation for classification of the transient myoelectric signal. Doctoral dissertation. The University of New Brunswick, Canada: Department of Electrical and Computer Engineering.

Fan, J. Y., Nikolaou, M., & White, R. E. (1993). An approach to fault diagnosis of chemical processes via neural networks. Am. Inst. Chem. Eng. J., 39, 82–88.

Frank, P. M. (1990). Fault diagnosis in dynamic systems using analytical and knowledge-based redundancy: a survey and some new results. Automatica, 26, 459–474.

Isermann, R. (1984). Process fault detection based on modeling and estimation methods: a survey. Automatica, 20, 387–404.

Kassidas, A. (1997). Fault diagnosis using speech recognition methods. Doctoral dissertation. McMaster University, Canada: Department of Chemical Engineering.

Kim, H., & Swain, P. H. (1995). Evidential reasoning approach to multisource-data classification in remote sensing. IEEE Trans. Syst. Man Cybern., 25, 1257–1265.

King, R., & Gilles, E. D. (1990). Multiple filter methods for detection of hazardous states in an industrial plant. Am. Inst. Chem. Eng. J., 36, 1697–1706.

Kramer, M. A. (1987). Malfunction diagnosis using quantitative models with non-Boolean reasoning in expert systems. Am. Inst. Chem. Eng. J., 33, 130–140.

Kramer, M. A., & Palowitch, B. L. (1987). A rule-based approach to fault diagnosis using the signed directed graph. Am. Inst. Chem. Eng. J., 33, 1067–1078.

Lee, T., Richards, J. A., & Swain, P. H. (1987). Probabilistic and evidential approaches for multisource data analysis. IEEE Trans. Geosci. Remote Sens., GE-25, 283–292.

Lynn, P. A., & Fuerst, W. (1989). Introductory digital signal processing with computer applications. Tiptree, Essex, UK: John Wiley & Sons.

McAvoy, T. J., & Ye, N. (1994). Base control for the Tennessee Eastman problem. Comput. Chem. Eng., 18, 383–413.

Mitchell, T. (1997). Machine learning. New York, USA: McGraw-Hill.

Mohindra, S., & Clark, P. A. (1993). A distributed fault diagnosis method based on digraph models: steady-state analysis. Comput. Chem. Eng., 17, 193–209.

Oh, S. O., Mo, K. J., Yoon, E. S., & Yoon, J. H. (1997). Fault diagnosis based on weighted symptom tree and pattern matching. Ind. Eng. Chem. Res., 36, 2672–2678.

Petti, T. F., Klein, J., & Dhurjati, P. S. (1990). Diagnostic model processor: using deep knowledge for process fault diagnosis. Am. Inst. Chem. Eng. J., 36, 565–575.

Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, USA: Morgan Kaufmann Publishers.

Rengaswamy, R., & Venkatasubramanian, V. (1995). A syntactic pattern-recognition approach for process monitoring and fault diagnosis. Eng. Appl. Artificial Intell., 8, 35–51.

Ricker, N. L. (1995). Optimal steady-state operation of the Tennessee Eastman challenge process. Comput. Chem. Eng., 19, 949–959.

Ricker, N. L., & Lee, J. H. (1995). Nonlinear model predictive control of the Tennessee Eastman challenge process. Comput. Chem. Eng., 19, 961–981.

Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Stat., 11(2), 416–431.

Saito, N. (1994). Local feature extraction and its applications using a library of bases. Doctoral dissertation. Yale University, USA: Department of Mathematics.

Utgoff, P. E., Berkman, N. C., & Clouse, J. A. (1997). Decision tree induction based on efficient tree restructuring. Mach. Learn., 29, 5–44.

Watanabe, K., Hirota, S., Hou, L., & Himmelblau, D. M. (1994). Diagnosis of multiple simultaneous fault via hierarchical artificial neural networks. Am. Inst. Chem. Eng. J., 40, 839–848.