
Fault Diagnosis of Single-Variate Systems Using a Wavelet-Based Pattern Recognition Technique

Fardin Akbaryan and P. R. Bishnoi*

Department of Chemical and Petroleum Engineering, The University of Calgary, 2500 University Drive NW, Calgary, Alberta T2N 1N4, Canada

A pattern recognition-based methodology is presented for fault diagnosis of a single-variate and dynamic system. A group of wavelet coordinates discriminating the classes of events most efficiently among other wavelet coordinates are determined according to the linear discriminant basis (LDB) method and a principal component analysis (PCA) technique. The proposed feature extractor couples the LDB method with the double wavelet packet tree in order to determine the best configuration of pattern windows causing the most discrimination among classes. The lifting scheme-based wavelet filters are used so that the required computation time is reduced significantly without degrading the robustness of the method. To reduce the size of the feature space, the wavelet coordinates are projected into a new low-dimensional space, by using a PCA technique, where minimum correlation exists among the new space variables. The tuning of some parameters, which affect the performance of the approach, is also discussed. The feature classifier is a binary decision tree that employs a soft-thresholding scheme for recognition of a noisy input pattern. The performance of the proposed technique is examined by a classification benchmark problem, and the faults classification problems for the Tennessee Eastman process. It is observed that the proposed pattern recognition methodology classifies the noisy input pattern into the known classes of events satisfactorily.

1. Introduction

A system operates in the faulty condition when its behavior deviates considerably from normal and predefined operating strategies. Equipment failure, sensor degradation, set-point change, and disturbances in the input streams are instances of faulty states for a system. The first group of faults, known as deterministic faults, is generated by a fixed magnitude cause and is usually damped by using a robust control strategy. Various magnitudes of a deterministic cause, even at different operating points, produce faults with similar trends. The second group, known as stochastic faults, results from causes whose magnitude changes randomly with time. Besides, the controlling scheme cannot drive back the system to a steady-state operating condition. A stochastic fault, even at the same initial operating point, could have different patterns.

Fault diagnosis is an important part of the process supervisory routines that determines the states of the system (faulty or normal) as well as the types of faults. The analytical model-based,1-5 causal analysis,6,7 and pattern recognition8-13 are the main groups of fault diagnosis approaches. Chemical processes are often characterized by nonlinear behavior, noisy inputs, and unknown parameters. Thus, a model describing the system behavior, either mathematically or qualitatively, will be quite complicated.1-3 However, computer-based pattern recognition extracts a wealth of information from the large amount of process data quite satisfactorily without concern about the nature of a system. Some of the fault diagnosis methods assume that the fault occurrence drives the system to a new steady-state condition. Then the system characteristics at two different operating points are used for diagnosis purposes.4,5,11,12 If the system happens to reach its initial steady-state condition, these methods will not be useful for fault diagnoses. As another shortcoming, these approaches cannot deal with the stochastic faults because the system cannot reach a steady-state point. If transient trends of system variables are used as patterns, the fault diagnosis method will be free from considering the steady-state conditions.8-10,13 This implies that the diagnosis method is applicable equally for any type of fault and final system condition.

In the present work, we propose a supervised pattern recognition methodology for fault diagnosis of single-variate and dynamic systems. The patterns are transient trends of a process variable resulting from a disturbance in the system. The technique assesses the similarity of new, unknown patterns with the prototypes of each class, and the most similar class is considered as the source of faulty behavior. Because transient trends contain valuable information scattered within the time and frequency domain, the feature extractor must be efficient equally for these domains. A multiscale wavelet-based transform serves in this work as the feature extractor. The Fourier transform (FT) suffers from an inability to extract information of the time domain.14 Although the short-time FT (STFT) is able to process temporal features, its performance is inferior to the wavelet transform (WT) especially for short-lived segments of a pattern.14 The linear discriminant basis (LDB) method15 is modified in this work and used as the basis of the proposed feature extractor. In addition to the whole-line wavelet filters used in the original LDB method, the lifting scheme (LS)-based wavelet filters14 are employed by the feature extractor. As the main advantage, the LS filters require less computation time than the whole-line wavelet filters. The information content of a signal depends on the length of the data sequence. A large window of data gives more information about the general trend of a pattern, whereas small-size windows focus more on the local structure of a pattern. The proposed feature extractor is able to choose the best set of nonoverlapping windows adaptively so that the selected features for each class are maximally discriminated. This helps the feature classifier define a more robust decision scheme. The proposed feature classifier is based on the binary decision tree (DT) approach implemented in many classification routines. A DT-based classifier is trained easily and needs few a priori assumptions. The proposed tree classifies the extracted features according to a soft-thresholding technique. The tree determines the a posteriori probabilities that extracted features may belong to different classes.

* To whom correspondence should be addressed. Tel: (403) 220-6695. Fax: (403) 282-3945. E-mail: [email protected].

Ind. Eng. Chem. Res. 2001, 40, 3612-3622

10.1021/ie000779l CCC: $20.00 © 2001 American Chemical Society. Published on Web 07/13/2001

For ease of understanding, the frameworks of multiscale feature extraction, the wavelet-based transforms, and the induction technique for the DT classifier are introduced. The proposed methodology for feature extraction and classification, given a set of noisy data, is then described. To demonstrate the efficacy of the proposed algorithm, it is applied to simulated data.

2. Background on Pattern Recognition

Pattern recognition is an algorithm that determines the most appropriate class of event(s) for the given unlabeled input pattern. To recognize an unknown pattern, two main steps are usually followed: (1) feature extraction and (2) feature classification.

2.1. Feature Extraction. The transient behavior of a chemical process shows the effects of different physicochemical events such as process dynamics, sensor noise, faults, and external loads. These events, known as features, can be observed over different time and frequency ranges. Filtering is a conventional technique for extracting the features. A filtering technique, which explores the entire range of frequency and time domain simultaneously, is more reliable for extracting the features. Multiresolution analysis (MRA) of a pattern is considered as a reliable basis for filtering the pattern in time and frequency domains.10,14,16 Linear time-frequency representation (TFR) of a pattern aims to present a pattern y(t) in terms of a weighted summation of some basis functions:

where ψi(t) stands for the basis function, ci is the weighting factor, and N is the number of sample points. The MRA of a pattern combines the pattern’s TFR at different sampling rates, i.e., resolution, so that fine details and a general trend of the pattern with a desirable accuracy can be achieved.14

The WT14 is considered as a highly efficient approach for the MRA. The WT tiles the time-frequency plane effectively such that the main features of a pattern, located at various frequencies and times, are extracted with minimum redundancy. The wavelet basis functions are localized well in time and frequency domains.

2.1.1. Wavelet Packet. The wavelet packet transform (WPT) is a generalized version of WT that decomposes even the high-frequency bands kept intact in the WT. Unlike the WT, the WPT decomposes the pattern into more different frequency bands at each time scale so that a set of overcomplete wavelet coefficients would be generated.14 By WPT, any subspace regardless of its type is decomposed into two coarser subspaces.

where m represents the scale and f denotes the frequency counter for subspaces at each scale. Each subspace is called a packet, and the binary packet tree is an ensemble of all of these subspaces. Thus, a technique must be implemented to choose the best set of basis functions from the group of redundant subspaces. The formulation for selecting the best packets, for classification purposes, is discussed below.
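As an illustration of the packet tree described above, the following is a minimal sketch that builds a full wavelet packet tree with orthonormal Haar filters. The Haar filters, the function names, and the indexing by decomposition depth (counted from the raw pattern, unlike the scale index m used in the text) are assumptions made for this example, not the filters or notation of the paper.

```python
import math

def haar_split(x):
    """One orthonormal Haar step: return (low-pass, high-pass) halves of x."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high

def wavelet_packet_tree(x, depth):
    """Return tree[j][f]: coefficients of packet f at decomposition depth j.

    Depth 0 holds the raw pattern. Every packet, low- or high-frequency,
    is split again at the next depth, which is what distinguishes the
    WPT from the plain WT.
    """
    tree = [[list(x)]]
    for _ in range(depth):
        next_level = []
        for packet in tree[-1]:
            low, high = haar_split(packet)
            next_level.extend([low, high])
        tree.append(next_level)
    return tree

# Hypothetical pattern of dyadic length 8, decomposed to the maximum depth.
tree = wavelet_packet_tree([4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0], depth=3)
for j, level in enumerate(tree):
    print(j, len(level), "packets of length", len(level[0]))
```

Because each depth of the tree is by itself a complete orthonormal representation of the pattern, the tree as a whole is redundant, which is why a best-basis selection step is needed.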

Best Basis Selection. Saito15 proposed the LDB methodology that determines a set of best packets that maximize the discrimination among different classes of data. In this technique, the importance of each packet is measured quantitatively by a statistical distance-based criterion termed the discriminant information function (DIF) D(p,q). The p and q are two nonnegative vectors with $\sum_k p_k = \sum_k q_k = 1$. The DIF could be modeled by the J-divergence criterion:

The LDB method uses the time-frequency energy maps of classes in order to compute the function D. The time-frequency energy map of class c is a table of positive real values, denoted by the indexes m, f, and k.

where Nc is the number of patterns for the cth class. The $d^{i}_{m,f,k}$ is the kth wavelet coefficient located in the fth packet of the mth scale. The coefficient is obtained by transforming the ith pattern into a wavelet packet tree. The DIF for each packet is defined by

NC is the number of classes. The details of computation steps for the LDB method can be found in work by Saito.15 The LDB reduces the complexity of the classification algorithm by retaining the most discriminant features. When the information content is dispersed throughout the entire time-frequency plane, retaining only a selected group of features may be far from the optimal solution. Englehart16 used the PCA method for seeking the best combination of all of the features in a lower dimensional space.
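The packet scoring used by the LDB method (eqs 3-5) can be sketched for a single packet as follows. This is only an illustration: the function names, the small constant `eps` that guards against log(0), and the toy data (in which the coefficients are taken equal to the raw patterns) are assumptions of this example and are not taken from Saito's algorithm.

```python
import math

def energy_map(coeffs, patterns):
    """Normalized energy per coefficient position for one class (cf. eq 4).

    coeffs[i][k] : k-th coefficient of one packet for the i-th pattern
    patterns[i]  : the i-th raw pattern of the same class
    """
    total = sum(v * v for y in patterns for v in y)
    n_k = len(coeffs[0])
    return [sum(c[k] ** 2 for c in coeffs) / total for k in range(n_k)]

def j_divergence(p, q, eps=1e-12):
    """Symmetric relative entropy (cf. eq 3); eps is an assumed guard for log(0)."""
    return sum(
        (pi + eps) * math.log((pi + eps) / (qi + eps))
        + (qi + eps) * math.log((qi + eps) / (pi + eps))
        for pi, qi in zip(p, q)
    )

def packet_score(maps):
    """Sum of pairwise J-divergences between class energy maps (cf. eq 5)."""
    score = 0.0
    for i in range(len(maps) - 1):
        for j in range(i + 1, len(maps)):
            score += j_divergence(maps[i], maps[j])
    return score

# Toy data: two classes whose energy sits in different coefficients.
class_a = [[1.0, 0.0], [1.0, 0.0]]
class_b = [[0.0, 1.0], [0.0, 1.0]]
map_a = energy_map(class_a, class_a)
map_b = energy_map(class_b, class_b)
print(packet_score([map_a, map_b]), packet_score([map_a, map_a]))
```

A packet whose class energy maps coincide scores zero, so it carries no discriminant information under this measure.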

$$y(t) = \sum_{i=1}^{N} c_i \psi_i(t) \tag{1}$$

$$\Omega_{m,f} = \Omega_{m-1,2f} \oplus \Omega_{m-1,2f+1}, \qquad m = L, L-1, \ldots, 0; \quad f = 0, 1, \ldots, 2^{m} - 1 \tag{2}$$

$$D(p,q) = \sum_{i=1}^{n} \left( p_i \log\frac{p_i}{q_i} + q_i \log\frac{q_i}{p_i} \right) \tag{3}$$

$$\Gamma_c(m,f,k) = \frac{\sum_{i=1}^{N_c} \left( d^{i}_{m,f,k} \right)^2}{\sum_{i=1}^{N_c} \left\| y_i^{(c)} \right\|^2} \tag{4}$$

$$D\left( \{ \Gamma_c(m,f,\cdot) \}_{c=1}^{N_C} \right) = \sum_{k=1}^{2^{m}} \sum_{i=1}^{N_C - 1} \sum_{j=i+1}^{N_C} D\left( \Gamma_i(m,f,k), \Gamma_j(m,f,k) \right) \tag{5}$$

2.1.2. Dynamic WPT. Bakshi and Stephanopoulos17 proposed the time-varying wavelet packet analysis, which is utilized mainly for on-line compression of nonstationary signals. The WT and WPT require a minimum number of data points to decompose the given signal into the next coarser scale. The WPT based on Haar wavelet filters, for instance, needs two data points to construct a two-level packet tree. As the number of samples increases to four, two more packet trees would be added to the ensemble of packet trees. When packet trees of similar depth are arranged in a row, a double wavelet packet tree (DWPT) would be constructed. The nodes of this tree are the single packet trees that are established online during the sample collection. Configuration of DWPT depends on the types of wavelet filters and the length of signal. The best packet selection algorithms could be applied to find not only the best packets of each single tree but also the best set of packet trees within the double trees.

2.2. Feature Classification. DT is a supervised classifier, with some desirable properties,18 that has been employed in a broad range of classification tasks. The outputs of DT are as accurate as those of the other classification algorithms such as artificial neural networks.19,20 A DT consists of a series of decision and terminal nodes. At each decision node a specified test is performed on a selected element (attribute) of the input pattern. Depending on the test result, the pattern descends to another node until the pattern reaches a terminal node (leaf). A DT is constructed by recursive partitioning of data space, represented by training patterns, until stopping criteria are met at each of the terminal nodes. The binary trees are preferred to nonbinary ones because the former is not biased in favor of attributes with many outcomes.21 Moreover, the binary trees split the data space into more regions than the nonbinary trees do.

The classification error depends largely on the selection of appropriate attributes for each decision node. A test on an attribute that divides the data set nontrivially is considered as a potential candidate for categorizing the input instance. The incremental tree induction (ITI) model21 employs a form of Kolmogorov-Smirnov distance (KSD) to score each test for partitioning a set of examples at every decision node. For the continuous attributes, the suggested KSD would be

The matrix z represents samples whose attribute A does not retain Ak as its value, whereas matrix z1 denotes the samples whose attribute A only has values that are smaller than Ak. If two tests have similar KSDs, the ITI method breaks the tie in favor of a test whose associated attribute is lexically or numerically lower than the other test. To find the best test for a decision node, the most informative value is determined first for each attribute. This value is used as a criterion for the test. The best test is the one whose criterion has the maximum score. For a continuous attribute with m distinct values, given by the training set, the best informative value exists in one of m - 1 disjoint intervals. The best interval to locate the test’s threshold would be the one whose upper limit has the maximum score. Although there are infinite choices for a threshold within the interval, the midpoint of the interval is usually taken as the threshold.12
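The midpoint rule for continuous attributes can be illustrated with a short sketch. The function name and the sample values are hypothetical; scoring each candidate with the KSD of eq 6 is left out of this example.

```python
def candidate_thresholds(values):
    """Midpoints of the m - 1 intervals between the m distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]

# Five training values of one attribute, four of them distinct,
# give three candidate thresholds.
thresholds = candidate_thresholds([0.2, 0.8, 0.2, 0.5, 1.4])
print(thresholds)
```

Each candidate would then be scored, and the interval with the highest score supplies the threshold of the decision node.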

If the data are corrupted by noise or the size of training data is too small to represent the true discriminant function, the DT overfits the training data. Pruning the DT after it has been grown completely is often used for avoiding such problems. Utgoff et al.21 utilized the minimum description length (MDL)22 for pruning the undesired subtrees. If the MDL of a node as a leaf is smaller than that as a decision node, the node will be denoted as a leaf because it needs fewer bits for encoding.

3. Proposed Methodology

We propose a pattern classifier that can be used for dynamic and single-variate fault diagnosis problems. First, the most important features F (NP × NF) of noisy patterns Y (NP × N) are extracted according to a formulation described by

NP stands for the number of patterns, and NF shows the number of extracted features. The raw pattern is decomposed into a set of orthogonal coordinates shown by matrix Ψ, and then the selection rule Θ(k) reduces the dimension of generated feature space by choosing the most important k coordinates from n coordinates. The matrix Ψ could be determined using a wavelet-based MRA technique. The WPT method is employed in this work for transforming the input patterns into a set of wavelet coefficients. The parameter N in eq 7 is dyadic, i.e., $N = 2^L$. The LDB method is modified in this work to select a set of wavelet coefficients that discriminate the classes of event(s) most efficiently among other sets of wavelet coefficients.

The original LDB technique decomposes the entire data sequence into a single packet tree without exploring the class-discrimination performances of other groups of pattern windows. Most of the pattern analysis tasks employ a windowing process that divides the pattern into disjointed or overlapped windows, where the segmented data are processed locally. The information obtained from each window is not reliable individually; therefore, a method is required to combine the outputs of each segment. The size of each window has a major impact on the robustness and reliability of the method. The common choice is to divide the pattern into a number of identical segments. When the intensity of the information is not uniform throughout the data space, the use of variable-size windows is a more convenient choice. The proposed algorithm adopts the DWPT in order to determine the best configuration of pattern windows causing the most discrimination among classes. To reduce the size of the feature space, the wavelet coordinates are projected into a new low-dimensional space where minimum correlation exists among the new variables.

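$$
\begin{aligned}
\mathrm{KSD}(T, A_k) ={}& \max_{1 \le C < N_c} \left| \frac{\sum_{j=1}^{C} \mathrm{freq}(w_j, z_1)}{\sum_{j=1}^{C} \mathrm{freq}(w_j, y)} - \frac{\sum_{j=C+1}^{N_c} \mathrm{freq}(w_j, z_1)}{\sum_{j=C+1}^{N_c} \mathrm{freq}(w_j, y)} \right| \\
&+ \max_{1 \le C < N_c} \left| \frac{\sum_{j=1}^{C} \mathrm{freq}(w_j, z)}{\sum_{j=1}^{C} \mathrm{freq}(w_j, y)} - \frac{\sum_{j=C+1}^{N_c} \mathrm{freq}(w_j, z)}{\sum_{j=C+1}^{N_c} \mathrm{freq}(w_j, y)} \right|
\end{aligned} \tag{6}
$$

$$f\colon Y \rightarrow F, \qquad f = \Theta(k) \cdot \Psi \tag{7}$$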


The selected features (attributes) form an “instance”. They are fed into a binary DT. The ITI methodology is the framework of the proposed tree classifier in this work. The MDL-based information criterion is found as a reliable choice for pruning the suggested DT. In this work, a soft-thresholding technique is coupled with the tree induction algorithm. The tree determines the a posteriori probabilities that a pattern may belong to different classes of event(s).

3.1. Feature Extractor. The following steps are proposed to extract the best features from a given set of noisy training patterns:

1. Choose the type of wavelet filters, the maximum depth of decomposition, the form of discriminant function, and the information cost function required by the WPT.

2. Decompose each of the given patterns, i.e., Y (NP × $2^L$), into a DWPT.

3. Construct the time-frequency energy map for each of the classes.

4. Select the packets, which discriminate the classes of event(s) most efficiently, for each subtree separately using the LDB algorithm.

5. Select the subtrees (pattern windows), which separate the classes of event(s) most efficiently, using the LDB algorithm. The result is matrix W (NP × $2^L$) representing the wavelet packet coefficients.

6. Apply the PCA method to transform the set of selected features, i.e., W, to a new fictitious space, i.e., F (NP × NF).

In step 1, in addition to the sum of packet elements (SOE) cost function,15 used in the original LDB method, the suitability of the Shannon entropy (SHE) cost function for feature extraction is examined in this work. The SOE for a packet at mth scale and fth frequency is defined by

The SHE criterion14,15 is described by

and it is not an additive function. It is easy to show that minimizing the additive measure

implies the minimization of H in eq 9. The Fourier-based wavelet filters, termed classical wavelet filters (CWF), are used in the original LDB technique. In the proposed feature extractor, we also employ the LS-based wavelet filters. The LS-based wavelet filters are more understandable and demand less computation time than the CWF for constructing the WPTs. The algorithm for obtaining LS-based filters and advantages of using the filters are discussed by Akbaryan and Bishnoi.14
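The link between the entropy H of eq 9 and the additive measure of eq 10 can be checked numerically. Expanding the logarithm in eq 9 gives H(d) = SHE(d)/‖d‖² + log ‖d‖², so for a fixed total energy ‖d‖² a smaller SHE means a smaller H. The sketch below, with hypothetical function names and an arbitrary coefficient vector, evaluates both sides of this identity; skipping zero coefficients (treating 0·log 0 as 0) is an assumption of the example.

```python
import math

def soe(d):
    """Sum of packet elements (cf. eq 8)."""
    return sum(d)

def shannon_entropy(d):
    """Normalized, non-additive entropy H (cf. eq 9); zero terms contribute 0."""
    energy = sum(v * v for v in d)
    return -sum((v * v / energy) * math.log(v * v / energy) for v in d if v != 0.0)

def she(d):
    """Additive measure (cf. eq 10); zero terms contribute 0."""
    return -sum(v * v * math.log(v * v) for v in d if v != 0.0)

# Arbitrary packet coefficients used only to exercise the functions.
d = [0.5, -1.5, 2.0, 0.25]
energy = sum(v * v for v in d)

# Both sides of H = SHE / ||d||^2 + log ||d||^2
lhs = shannon_entropy(d)
rhs = she(d) / energy + math.log(energy)
print(lhs, rhs)
```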

In step 2, at the mth ($1 \le m \le 2^L/\mathrm{ML}$) level of a DWPT, the input pattern is split into m disjoint windows, and each segment is decomposed separately onto a separate packet tree. The depth of DWPT and the packet tree of each window are specified by the types of wavelet filters. The length of the smallest packet tree, i.e., ML, changes by the types of wavelet filters, so that when LS filters are used, it is equal to 2Nmax, while it would be equal to 2 for the whole-line wavelet filters. Nmax is defined by

It is assumed that the coefficients of the first level are equal to the values of the sample points. When the LS filters are employed, the coefficients of each packet, d, are determined as14

where m and f are the indexes of scale and frequency of a packet. At each scale, the predicting function P and updating function U must be reevaluated for coefficients near boundaries. The formulations of these functions are elaborated by Akbaryan and Bishnoi.14 The coarsest scale M achievable by the LS method is determined by

where $2^L$ is the dyadic length of the input pattern. The parameters N and Ñ affect the smoothness and localization of the wavelet basis functions extensively. High values of these parameters reduce the depth of the packet tree; therefore, fewer frequency bands could be investigated. Low values, on the other hand, reduce the smoothness of the basis functions. The M will be zero if whole-line wavelet filters are chosen for constructing the packet trees. Two types of wavelet filters, LS and CWF, and two types of information cost functions, SOE and SHE, are employed in the proposed feature extractor. These four feature extractors are abbreviated by LSFSOE, LSFSHE, CWFSOE, and CWFSHE.
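The split, predict, and update pattern of eq 12 can be illustrated with the simplest possible choice of P and U, which reproduces a Haar-like transform. This is a sketch under that assumption only: the predictors and updaters used in the paper depend on the LS parameters and need special handling near the boundaries, neither of which is reproduced here, and the function names are hypothetical.

```python
def lifting_forward(x):
    """One split / predict / update step with the simplest (Haar-like) P and U."""
    even, odd = x[0::2], x[1::2]                           # Split
    detail = [o - e for o, e in zip(odd, even)]            # subtract the prediction P
    smooth = [e + d / 2.0 for e, d in zip(even, detail)]   # add the update U
    return smooth, detail

def lifting_inverse(smooth, detail):
    """Undo the update, undo the prediction, then merge the two halves."""
    even = [s - d / 2.0 for s, d in zip(smooth, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    x = [0.0] * (2 * len(even))
    x[0::2], x[1::2] = even, odd
    return x

# Hypothetical pattern of dyadic length 8.
x = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0]
smooth, detail = lifting_forward(x)
print(smooth, detail)
print(lifting_inverse(smooth, detail) == x)
```

With this choice of P and U the smooth part is the running pair average, and the step is inverted simply by reversing the order and the signs of the operations. The computations are done on the two half-length sequences without convolution with a full filter, which is one reason lifting implementations are generally cheaper.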

The result of step 2 is a set of redundant wavelet packet subtrees. In step 3, a new time-frequency energy map of each class, Γc, is established in this work.

$$\mathrm{SOE}(d_{m,f}) = \sum_{k=1}^{2^{m}} d_{m,f,k} \tag{8}$$

$$H(d_{m,f}) = -\sum_{k=1}^{2^{m}} \frac{d_{m,f,k}^{2}}{\left\| d_{m,f} \right\|^{2}} \log \frac{d_{m,f,k}^{2}}{\left\| d_{m,f} \right\|^{2}} \tag{9}$$

$$\mathrm{SHE}(d_{m,f}) = -\sum_{k=1}^{2^{m}} d_{m,f,k}^{2} \log d_{m,f,k}^{2} \tag{10}$$

$$N_{\max} = \max(N, \tilde{N}) \tag{11}$$

$$
\begin{aligned}
[d_{m-1,2f-1},\ d_{m-1,2f}] &= \mathrm{Split}(d_{m,f}), \qquad m = L, \ldots, M; \quad f = 1, \ldots, 2^{L-m} \\
d_{m-1,2f} &= d_{m-1,2f} - P(d_{m-1,2f-1}) \\
d_{m-1,2f-1} &= d_{m-1,2f-1} + U(d_{m-1,2f})
\end{aligned} \tag{12}
$$

$$M = L - \left[ \log_2 \left( \frac{L - 1}{N_{\max} - 1} \right) \right] \tag{13}$$

$$\Gamma_c(l,p,m,f,k) = \frac{\sum_{i=1}^{N_c} \left( d^{c}_{l,p,m,f,k} \right)^2}{\sum_{i=1}^{N_c} \left\| y_i^{(c)} \right\|^2} \tag{14}$$

The indexes m, f, and k are the counters for the scale, frequency, and wavelet coefficients of each packet. The counters l and p are defined for the scale and the position of each packet subtree within the DWPT. The maximum limits of l and p change by the types of wavelet filters. The Γc is computed by accumulating the squares of wavelet coefficients at each position in the DWPT and then normalizing by the total energy of the input pattern. In steps 4 and 5, the selection of the best discriminant packets for each subtree is separately accomplished by using the LDB algorithm, and the summation of discriminant measures of the best packets is the score of the packet subtree. When all of the subtrees are scored, the LDB method is utilized again so that the most discriminant subtrees, i.e., pattern windows, within the DWPT will be determined.

In step 6, the best group of new discriminant features F is determined by

The matrix W represents the set of best original wavelet coefficients, and the columns of matrix U ($2^L$ × NF) are the principal components (PC) selected by the PCA method.

3.2. Features Classifier. The ITI binary tree induction method, used as the basis of our DT classifier, considers an unlabeled instance as a member of only one known class. The crisp classification approach is unable to address a problem when two or more classes of events overlap each other. In multiple fault diagnosis problems, for example, some patterns belong to regions covered by more than one class of event. The presence of noise also intensifies the effect of this problem on the classification results. Because of the presence of some of the true features in high-frequency bands, the denoising methods usually rectify only a certain amount of noise from the input signals. With a small variation in the values of the input features (attributes), the tree classifier can assign the selected features to a completely different class. In this work, we incorporate a soft-thresholding approach, proposed by Quinlan,23 into the original ITI-based DT for dealing with the uncertainty in classification problems. The soft-thresholding approach defines two subsidiary cutpoints, t+ and t-, for each decision node such that the node’s crisp threshold t lies between them. As an instance is sorted down, each decision node on the path determines the probability P of sending the input instance toward the left branch and 1 - P for the right subtree. Consider that attribute A is chosen as the most informative attribute for a decision node; the probability P would be computed by

where Ai denotes the value of A given by the ith instance. Suppose R represents a subset of training instances directed to the current decision node and E denotes the cases misclassified by the node when the node’s crisp threshold is employed. The standard deviation for the number of misclassified instances will be defined as

The t+ and t- are determined according to the rule that if the node threshold is set to either of them, the number of misclassified patterns will be equal to STD + |E|. Details of the algorithm can be found in Akbaryan.24
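The branch probability of eq 16 and the standard deviation of eq 17 can be written as two small functions. The sketch assumes t- < t < t+ and uses hypothetical function names and example values; the search that locates t- and t+ from the STD + |E| rule is not shown.

```python
import math

def left_probability(a, t_minus, t, t_plus):
    """Probability of sending attribute value a to the left branch (cf. eq 16)."""
    if a <= t_minus:
        return 1.0
    if a <= t:
        return 1.0 - (a - t_minus) / (2.0 * (t - t_minus))
    if a <= t_plus:
        return 0.5 - (a - t) / (2.0 * (t_plus - t))
    return 0.0

def error_std(n_errors, n_instances):
    """Standard deviation of the misclassification count (cf. eq 17)."""
    return math.sqrt((n_errors + 0.5) * (n_instances - n_errors - 0.5) / n_instances)

# Example node with cutpoints t- = 0, t = 1, t+ = 3.
for a in (-1.0, 0.5, 1.0, 2.0, 4.0):
    print(a, left_probability(a, 0.0, 1.0, 3.0))
print(error_std(4, 100))
```

The probability falls linearly from 1 at t- through 0.5 at the crisp threshold t to 0 at t+, so an attribute value close to t is sent down both branches with comparable weight instead of being forced into one of them.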

Because of the soft-thresholding approach at each decision node, an instance may reach several leaves with different distributions for each class. Consider that there are NC known classes; the probability that an instance is assigned to the cth class of terminal node l is calculated by

where D denotes the number of decision nodes visited by the instance in each classification path starting from tree root and ending at terminal node l. Pj represents the probability of either the right or left branch determined by eq 16. The number of training instances belonging to the cth class of terminal node l is shown by Nc,l, while Nl stands for the total number of instances assigned to the same terminal node. The most probable class in terminal node l is the most populated one. If an instance reaches S terminal nodes, the probability that the instance belongs to the cth class, Pc, would be defined as

Equation 19 is used to estimate roughly the a posteriori probability for each class.
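Equations 18 and 19 amount to weighting the class frequencies of every reached leaf by the product of the branch probabilities on its path and then summing over the leaves. A minimal sketch follows; the data layout, the function name, and the small example tree are all hypothetical.

```python
def class_probabilities(leaves, n_classes):
    """Combine soft paths into class probabilities (cf. eqs 18 and 19).

    Each leaf is (branch_probs, class_counts): the branch probabilities P_j
    met on the path from the root, and the training counts N_{c,l} per class.
    """
    totals = [0.0] * n_classes
    for branch_probs, class_counts in leaves:
        path = 1.0
        for p in branch_probs:
            path *= p                      # product over visited decision nodes
        n_leaf = sum(class_counts)         # N_l
        for c in range(n_classes):
            totals[c] += path * class_counts[c] / n_leaf
    return totals

# Hypothetical tree: the root sends the instance left with P = 0.7; the
# right child sends it left with P = 0.4.
leaves = [
    ([0.7], [8, 2]),         # left leaf
    ([0.3, 0.4], [1, 9]),    # right subtree, then left leaf
    ([0.3, 0.6], [5, 5]),    # right subtree, then right leaf
]
probs = class_probabilities(leaves, n_classes=2)
print(probs)
```

When every leaf of the tree is included, the branch probabilities along complementary paths sum to one, so the resulting class probabilities also sum to one.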

4. Case Studies

To demonstrate the ability of the proposed methodology, three classification experiments are conducted by using simulated trends. The Daubechies10 wavelet filters are used for the CWF-based feature extraction methods. The effects of the dyadic length of the trend and the number of training patterns of each class are also investigated by the first case study.

4.1. Case 1. This example is usually used as a benchmark for the evaluation of feature extraction and classification techniques.15,16,19 Three classes of simulated triangular waveforms, defined below, are studied in this case

where

u is the uniform random variable on the interval [0, 1], and ε(i) is the standard normal variate that represents noise in the waveforms. A test set of 1000 patterns for each class is chosen for examining the performance of the proposed methodology. The LS parameters are N = 4 and Ñ = 2. Except for experiment III, the number of training patterns is 100 for each class. For all of the experiments, except experiment IV, the pattern length is 32.
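The three waveform classes of eqs 20 and 21 are easy to simulate. The sketch below generates 100 training patterns of length 32 per class; the function names and the fixed seed are choices made for this example, and `random.random()` samples the half-open interval [0, 1) rather than the closed interval.

```python
import random

def h1(i):
    return max(6 - abs(i - 7), 0)

def h2(i):
    return h1(i - 8)

def h3(i):
    return h1(i - 4)

def waveform(cls, rng, length=32):
    """Draw one noisy pattern of class 1, 2, or 3 (cf. eqs 20 and 21)."""
    first, second = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}[cls]
    u = rng.random()                       # one mixing weight per pattern
    return [u * first(i) + (1 - u) * second(i) + rng.gauss(0.0, 1.0)
            for i in range(1, length + 1)]

rng = random.Random(0)                     # fixed seed, chosen only for repeatability
training = {c: [waveform(c, rng) for _ in range(100)] for c in (1, 2, 3)}
print(len(training[1]), len(training[1][0]))
```

The three triangular components peak at i = 7, 11, and 15, so every class is a random convex combination of two overlapping triangles, which is what makes the classes hard to separate in the presence of unit-variance noise.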

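$$F = W \cdot U \tag{15}$$

$$
\begin{aligned}
A_i \le t^- &\;\rightarrow\; P = 1 \\
t^- < A_i \le t &\;\rightarrow\; 0.5 \le P = 1 - \frac{A_i - t^-}{2(t - t^-)} < 1 \\
t < A_i \le t^+ &\;\rightarrow\; 0 \le P = \frac{1}{2} - \frac{A_i - t}{2(t^+ - t)} < 0.5 \\
A_i > t^+ &\;\rightarrow\; P = 0
\end{aligned} \tag{16}
$$

$$\mathrm{STD} = \sqrt{\frac{(|E| + 0.5)\,(|R| - |E| - 0.5)}{|R|}} \tag{17}$$

$$P_c^{l} = \frac{N_{c,l}}{N_l} \prod_{j=1}^{D} P_j \tag{18}$$

$$P_c = \sum_{l=1}^{S} P_c^{l} \tag{19}$$

$$
\begin{aligned}
f^{\mathrm{I}}(i) &= u\,h_1(i) + (1 - u)\,h_2(i) + \varepsilon(i) \\
f^{\mathrm{II}}(i) &= u\,h_1(i) + (1 - u)\,h_3(i) + \varepsilon(i) \\
f^{\mathrm{III}}(i) &= u\,h_2(i) + (1 - u)\,h_3(i) + \varepsilon(i)
\end{aligned} \qquad i = 1, \ldots, 32 \tag{20}
$$

$$h_1(i) = \max(6 - |i - 7|,\ 0), \qquad h_2(i) = h_1(i - 8), \qquad h_3(i) = h_1(i - 4) \tag{21}$$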


Table 1 illustrates the effect of wavelet filters, pattern length, the number of training patterns, and hard-thresholding classification on the performance of the proposed methodology. The performance is measured by using the misclassification rate, defined as the ratio of misclassified cases to the total cases for a class.

In the first experiment (expt I), LSFSHE is selected as the feature extractor. The second experiment employs LSFSOE for extracting the features. Comparing the results of LSFSHE with those of LSFSOE indicates that the SHE is a better information criterion for these patterns. In experiment III, the number of training observations reduces to 50 for each class while LSFSHE is the feature extractor. The misclassification rates increase for both training and test sets. This observation confirms the fact that a classifier requires a rich training data set in order to exploit the stochastic nature of the system. When the length of observation is changed to 128 for training and test sets in experiment IV, the misclassification rates increase considerably for both of the data sets. When the sampling rate is increased, noise propagates into more wavelet coefficients so that it degrades the performance of the LSFSHE feature extractor. The PCA approach, however, helps to reduce the noise effect to some extent, because the uncorrelated noisy coefficients affect only less important PCs. In the last experiment, the test patterns are classified by using the hard thresholding for each decision node. The results illustrate that the misclassification rate for the first class soars to 80%, while the misclassification rate for the other classes decreases considerably. The noteworthy fact is that the results of classification among the three classes are inconsistent despite the magnitude similarity among their patterns. The overall misclassification rate for the test set is much greater than that of the soft-thresholding approach. The original tree for the first experiment has 61 nodes, which reduces to 21 by using the MDL-based pruning algorithm. LSFSHE reduces the size of the original pattern to 27 by choosing 0.95 as the threshold for the PCA method. By using a full DT and five top discriminant features, Saito15 achieved average misclassification rates of 7.0% and 21.37% for training and test sets, respectively. The six-tap Coiflet wavelet filter was employed in his work for packet tree construction. It is noteworthy that the Bayes error, i.e., the lowest possible error, is found to be 14%.

4.2. Case 2. The Tennessee Eastman process (TEP)25 is considered in this case study. The TEP is regarded as a reliable benchmark for testing research on control strategies,26,27 process optimization,28 and fault diagnosis.8,13 The flowsheet of the process, provided by Downs and Vogel,25 is reproduced in Figure 1. The reactants A, C, D, E, and inert B enter the process through four feed streams. Three feed streams provide pure reactants A, D, and E. The fourth stream is a mixture of A, C, and B. The plant outlets contain a portion of nonreacted feed components and the products G and H. The process contains four main unit operations: an exothermic two-phase reactor, a flash drum, a compressor, and a reboiled stripper.

There are 41 measured variables and 12 manipulated variables. The evaluation of process control and pattern classification techniques can be done by set-point changes and load changes as listed by Downs and Vogel.25 These disturbances (faults) cause the measured variables to follow either deterministic or stochastic trends. A plant-wide control scheme27 is utilized in this work for keeping the system variables close to their settings. Four different step-type faults, listed in Table 2, are selected for this research. It is reported8 that these disturbances have similar effects on streams and units so that their discrimination would be a challenging problem.

The magnitudes of D1, D2, and D3 could be eitherpositive or negative, whereas the magnitude of D4 isalways negative. Therefore, seven different faults areconsidered in this case study. The simulation of TEP iscarried out for 60 h, and fault(s) occur(s) when 30 h ofsteady-state condition is sampled for each measured andmanipulated variable. All measurements are corruptedby noise, and each sensor exhibits different signal-to-noise ratios, which also changes by the fault type. Whenthe absolute magnitude of a fault is decreased, the mainfeatures of signals would be masked more by noise. Forthis case study, two measured variables, i.e., the flowrate of stream 2 and compressor work, are consideredin examples 2.1-2.3 for illustrating the performance ofthe proposed methodology. Examples 2.1 and 2.2 il-lustrate the diagnosis of single fault cases, whereasexample 2.3 is concerned with multiple fault diagnosis.Seven classes of system behavior resulting from the

Table 1. Results of Pattern Classification for Classes C1, C2, and C3 Based on Different Settings

misclassification rate (%)

training set for class (C1, C2, C3, average), followed by test set for class (C1, C2, C3, average)

expt I 18.0 3.0 1.0 7.3 24.1 12.3 15.3 17.23
expt II 16.0 30.0 37.0 83.0 54.3 58.6 66.6 60.0
expt III 12.0 6.0 6.0 8.0 39.0 11.1 21.5 23.9
expt IV 28.0 16.0 20.0 21.3 60.4 38.5 48.0 49.0
expt V 80.5 0.03 9.4 31.0

Figure 1. TEP.

Table 2. List of Step-Type Disturbances Employed for Testing the Fault Diagnosis Technique

fault ID   process variable
D1         A/C feed ratio, B composition constant (stream 4)
D2         B composition, A/C ratio constant (stream 4)
D3         C header pressure loss, reduced availability (stream 4)
D4         change in base value of recycle flow (stream 8)


seven single faults are studied in these examples. Table 3 identifies these classes along with the types and signs of the faults. A class resulting from a fault with a positive sign is denoted with a plus sign. For example, class C1+ results from fault D1 with a positive magnitude.
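The labeling convention of Table 3 can be written out as a small lookup. The following is a minimal sketch; the helper name `class_id` and the dictionary `ALLOWED_SIGNS` are illustrative assumptions and are not part of the original work.

```python
# Signs allowed for each fault, as stated in the text: D1-D3 may be positive
# or negative, whereas D4 is always negative.
ALLOWED_SIGNS = {"D1": "+-", "D2": "+-", "D3": "+-", "D4": "-"}


def class_id(fault_id: str, magnitude: float) -> str:
    """Return the class label (e.g. 'C1+') for a single fault (hypothetical helper)."""
    if fault_id not in ALLOWED_SIGNS:
        raise ValueError(f"unknown fault ID: {fault_id}")
    if magnitude == 0:
        raise ValueError("a fault needs a nonzero magnitude")
    sign = "+" if magnitude > 0 else "-"
    if sign not in ALLOWED_SIGNS[fault_id]:
        raise ValueError(f"{fault_id} is not defined with sign {sign}")
    return f"C{fault_id[1:]}{sign}"


# Enumerate the seven classes in the order used in Table 3.
ALL_CLASSES = [
    f"C{fault[1:]}{sign}"
    for sign in "+-"
    for fault, signs in ALLOWED_SIGNS.items()
    if sign in signs
]
print(ALL_CLASSES)
```

Listing the positive-sign classes first reproduces the column order of Table 3.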

The training sets consist of 100 patterns for each class of fault, and the length of patterns is 256. To simulate the training patterns, the magnitudes of the faults corresponding to classes C1+, C2+, C3+, C1-, C2-, C3-, and C4- are 0.5, 0.5, 0.5, -0.2, -0.2, -0.2, and -0.1, respectively. Both LS parameters, i.e., N and Ñ, are equal to 4. Because of the undesired effects of noise, the signals are first cleaned with a band-pass digital filter. This filter is a second-order Butterworth filter whose normalized band-pass frequencies are set to 0.01 and 0.15. It is found that most of the noise can be removed by setting the high cutoff frequency to 0.15 without seriously distorting the true signal features. Two similar patterns with different magnitudes are considered to be unalike because the proposed feature extractor processes a pattern quantitatively. The standardization of a variable, i.e., scaling to zero mean and unit variance, is used to harmonize two similar patterns with different magnitudes. The existence of a step-type segment in a signal is known to make the standardization technique inefficient for this purpose. The low cutoff frequency is therefore used in the proposed methodology to help remove the step- or ramplike trends from the signal. It is noted that the LS-based feature extraction requires much less computation time than the Daubechies-based ones in the following examples. The required computation time for the latter is 5 times longer than that of the former.
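The effect of standardization on two similar patterns with different magnitudes can be checked with a few lines of code. This is a minimal sketch using only the Python standard library; the two example patterns are hypothetical, the population standard deviation is assumed for the scaling, and the band-pass filtering step is not reproduced here.

```python
from statistics import fmean, pstdev


def standardize(pattern):
    """Scale a pattern to zero mean and unit (population) variance."""
    mu = fmean(pattern)
    sigma = pstdev(pattern)
    if sigma == 0:
        raise ValueError("a constant pattern cannot be standardized")
    return [(x - mu) / sigma for x in pattern]


# Two hypothetical patterns with the same shape but different magnitudes.
small = [0.0, 0.1, 0.3, 0.2, 0.0, -0.1]
large = [5.0 * x for x in small]

z_small = standardize(small)
z_large = standardize(large)

# After standardization the two patterns coincide up to rounding error.
max_gap = max(abs(a - b) for a, b in zip(z_small, z_large))
print(f"largest difference after standardization: {max_gap:.2e}")
```

Because the scaling divides by the spread of the whole pattern, a large step-type offset would dominate the variance, which is consistent with the remark that step-type segments make standardization inefficient.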

Example 2.1. The dynamic trends of the stream 2 flow rate due to the seven different faulty states are shown in Figure 2. The true signal features for the C1- and C2- classes are hardly detectable. The signals belonging to the C2+ and C2- classes exhibit a ramp-type pattern.

As test patterns, 25 signals of the same length as the training patterns are simulated for each test set. Table 4 lists the type and magnitude of the fault for 20 test sets.

The test sets consist of a variety of fault magnitudes such that for some signals the true features are mostly masked by noise, while for some others the noise intensity is fairly low. Figure 3 illustrates the performance of the four feature extractors.

A comparison of the classification results shows that the LSFSOE feature extractor exhibits a better overall performance than the other feature extraction methods. The classification results related to test set no. 19 have the highest misclassification rate because noise causes significant differences between test and training patterns. This implies that a classifier will define class boundaries more efficiently if it uses mildly noise-corrupted training patterns. Although the PCA method is able to reduce the dimension of the feature space considerably, its performance depends largely on the types of wavelet filters and the criteria for selecting the best packets. The feature spaces constructed by the LSFSOE method could be represented effectively by 27 variables, while this number would be 43 for the LSFSHE technique. The feature space is described by 48 variables if CWFSOE is used, while 66 variables are needed when CWFSHE is employed for the feature extraction. As these numbers show, the feature extraction methods using the SOE criterion are more effective for representing the data space than those utilizing SHE. The D10-based feature space is represented by more variables than the one based on the LS filters. The four feature extraction methods define the DWPT structure differently. The LSFSOE method chooses the entire length of the pattern as the best packet for time-frequency representation of each input signal. The configurations of DWPT, which are the best subtrees and their corresponding best packets, are illustrated in Figure 4 for the other feature extractors.

The structure of the DT classifier depends largely on the types of parameters selected for the feature extraction

Table 3. Definition of Seven Classes of TEP Behavior

class ID   C1+  C2+  C3+  C1-  C2-  C3-  C4-
fault ID   D1   D2   D3   D1   D2   D3   D4
sign       +    +    +    -    -    -    -

Figure 2. Deviation of the stream 2 flow rate from its steady-state set point caused by the seven single faults.

Table 4. Specification of 20 Test Sets for Examples 2.1 and 2.2

test  fault ID  size     test  fault ID  size     test  fault ID  size
1     D1        0.3      8     D2        1.5      15    D2        -0.35
2     D1        0.7      9     D3        0.3      16    D2        -0.6
3     D1        1.1      10    D3        0.7      17    D3        -0.35
4     D1        1.5      11    D3        1.1      18    D3        -0.6
5     D2        0.3      12    D3        1.5      19    D4        -0.02
6     D2        0.7      13    D1        -0.35    20    D4        -0.2
7     D2        1.1      14    D1        -0.6


step. The pruned DT contains 87 nodes if LSFSHE is chosen for extracting the signal features. The number of nodes for a tree classifier, when any one of LSFSOE, CWFSOE, and CWFSHE is used for extracting features, would be 89, 97, and 99, respectively. The large number of nodes required to separate the instances of different classes indicates that the training space for the classifier is complicated. The CWFSHE-based DT, for instance, uses mostly the 2nd, 3rd, and 5th leading features to screen the training patterns of each class. By use of the second attribute, as shown by "Feature2" in Figure 5, all of the classes, except C1- and C2-, are separated from each other satisfactorily. The third and fifth features, as shown by "Feature3" and "Feature5" in Figure 5, discriminate C1- from C2- more efficiently. This fact would help the DT to separate these groups of classes by nonoverlapping boundaries.

Example 2.2. In this example the proposed methodology is examined for the classification of trends produced by variations of the work required by the compressor installed for the recycle stream in TEP. Figure 6 illustrates an observation for each class of event listed in Table 3. As these plots show, noise affects the true signals less than it did in the previous example. The magnitudes of faults for the test sets are also similar to those listed in Table 4.

The four feature extraction methods, LSFSOE, LSFSHE, CWFSOE, and CWFSHE, are employed in this example. The classification results, shown by the number of misclassified patterns in Figure 7, reveal a better performance of CWFSOE compared to the other feature extraction methods. It must, however, be noted that the performances of the other two techniques, i.e., CWFSHE and LSFSHE, are satisfactory. The LSFSOE-based technique presents the worst results compared to the other methods.

Figure 4. Configurations of DWPTs obtained by three feature extraction methods.

Figure 3. Results of fault diagnosis for the stream 2 flow rate by use of the four feature extraction methods.

Figure 5. Distributions of four leading extracted features in the seven classes of events.


The number of best and uncorrelated features would change as different feature extraction methods are selected. The dimensions of feature spaces defined by LSFSOE-, CWFSHE-, LSFSHE-, and CWFSOE-based extraction methods would be 20, 36, 52, and 56, respectively. Figure 8 presents the configurations of DWPT defined by these four techniques. The LSFSOE method results in the lowest number of windows, whereas the CWFSHE splits the input pattern into the largest number of windows.

The DT based on the CWFSOE feature extraction method has the least number of nodes, i.e., 17, compared to the other tree classifiers. The number of nodes for CWFSHE-, LSFSOE-, and LSFSHE-based tree classifiers is 21, 21, and 27, respectively. Figure 9 and Table 5 show the configuration of a pruned CWFSOE-based classifier.

The DT, shown in Table 5, mostly employs the second feature in order to discriminate the classes of events. Figure 10 shows the distribution of the first four attributes for each class of event. The seven classes of faults are perfectly separated when only distributions of the first two leading features are considered, whereas the next two attributes fail to discriminate class boundaries satisfactorily.
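The columns t-, t, and t+ of Table 5 indicate that a decision node carries a threshold together with a lower and an upper bound. One way such bounds can be used for a noisy input is sketched below: a feature value well below the threshold is sent entirely to one branch, a value well above it to the other, and a value between the bounds is shared between both branches. The piecewise-linear weighting rule, the function names, and the numerical thresholds are assumptions made for illustration and may differ from the soft-thresholding scheme actually used in this work.

```python
def left_weight(x: float, t_lo: float, t: float, t_hi: float) -> float:
    """Fraction of a pattern sent to the 'feature <= t' branch.

    Assumed rule: 1 at or below t_lo, 0.5 exactly at t, 0 at or above t_hi,
    with linear interpolation in between.
    """
    if not (t_lo < t < t_hi):
        raise ValueError("expected t_lo < t < t_hi")
    if x <= t_lo:
        return 1.0
    if x >= t_hi:
        return 0.0
    if x <= t:
        return 1.0 - 0.5 * (x - t_lo) / (t - t_lo)
    return 0.5 * (t_hi - x) / (t_hi - t)


def soft_split(x: float, t_lo: float, t: float, t_hi: float) -> dict:
    """Return the weights passed to both child nodes of a soft decision node."""
    w = left_weight(x, t_lo, t, t_hi)
    return {"left": w, "right": 1.0 - w}


# Hypothetical thresholds (not taken from Table 5).
for value in (0.5, 1.5, 2.0, 3.0, 4.5):
    print(value, soft_split(value, t_lo=1.0, t=2.0, t_hi=4.0))
```

With a crisp threshold, a small amount of noise near t would flip the decision completely; with weights such as these, the contribution of both subtrees changes gradually as the feature value crosses the threshold.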

Example 2.3. Table 6 lists new test sets in which multiple faults with different magnitudes occur in the TEP. In this table, the magnitudes and types of contributing faults are shown for nine test sets. Each test set contains 15 faulty patterns.

Figure 7. Results of fault diagnosis for compressor work by use of the four feature extraction methods.

Figure 6. Deviation of compressor work from its steady-state set point caused by the seven single faults.

Figure 8. Configurations of DWPTs obtained by four feature extraction methods.


The four feature extractors are used for multiple fault diagnosis based on the trends produced by the variations in the work of the compressor installed in the recycle stream. The structures of the classification trees are similar to those constructed in example 2.2. Table 7 presents the number of correctly classified patterns and the most probable classes of events inside the brackets for each test set.

If there is more than one class as the cause for faulty behavior, the classes are sorted from left to right according to their importance. For some test sets, the methodology fails to propose a unique diagnosis. These cases are shown by "fail" in Table 7. It is observed that the four pattern classifiers pinpoint at least one of the contributing faults in most of the test cases satisfactorily. The classification results show that the four classifiers diagnose the system differently for some cases. For example, all of the patterns for the first case are classified into the first class according to the LSFSOE- and LSFSHE-based classification methods. However, the CWFSOE approach finds the second class as the dominant source of the fault, while the first class has a minor contribution to the behavior of the system. The CWFSHE, on the other hand, is more in favor of the first class than the second class.
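The ordering of classes "from left to right according to their importance" can be illustrated with a simple tally. How importance is computed is not detailed in this section, so the sketch below assumes that it is the share of patterns in a test set assigned to each class. The function name `rank_classes`, the cutoff `min_share`, and the example labels are hypothetical.

```python
from collections import Counter


def rank_classes(predicted, min_share=0.2):
    """Rank the predicted classes of one test set by how often they occur.

    `predicted` holds one class label per pattern; `min_share` is a
    hypothetical cutoff below which a class is not reported.
    """
    if not predicted:
        raise ValueError("the test set is empty")
    counts = Counter(predicted)
    total = len(predicted)
    # most_common() returns labels sorted by decreasing count.
    return [label for label, n in counts.most_common() if n / total >= min_share]


# Hypothetical labels for a test set of 15 patterns.
labels = ["C2+"] * 9 + ["C1+"] * 5 + ["C3-"]
print(rank_classes(labels))  # ['C2+', 'C1+']
```

In this made-up case the dominant class is listed first and the minor contributor second, which mirrors the way entries such as "(C2+, C1+)" are written in Table 7.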

5. Conclusions

It is observed that the proposed pattern recognition methodology succeeds satisfactorily to classify the noisy input pattern into the known classes of events. Although noise has a negative impact on the proposed technique, the number of misdiagnosed patterns is still within the acceptable ranges. By using a wavelet-based feature extractor, the PCA technique, and the soft-thresholding approach, the proposed technique is able to eliminate noise effects on the classification results. Comparison of the classification results based on the four different feature extraction algorithms indicates that one could

Figure 9. Pruned DT used with the CWFSOE-based feature extractor.

Table 5. Selected Attributes and Distribution of Classes for Pruned DT Node ID

class

node attribute t- t t+ C1+ C2+ C3+ C1- C2- C3- C4-

1 2 - - - -3.78 -2.49 -2.202 2 - - - -3.22 -1.70 -1.133 2 - 1.88 4.46 -1.84 1.88 4.464 1005 2 0.03 1.60 2.246 1 4.076 4.20 4.647 1 0.327 3.72 5.088 1009 100

10 5 1.84 2.93 4.5511 3 9112 10013 10014 83 115 9 - - -16 1 817 13

Figure 10. Distributions of the four leading features within seven classes of events.


not reject a feature extraction method in favor of others. However, the computation time of LS-based feature extraction methods is much less than that of the Daubechies10-based techniques. This will be quite important when the algorithm is applied for real-time fault diagnosis. The proposed method is able to determine the best configuration of pattern windows so that the maximum resolution of the system behavior could be achieved. When multiple faults occur in a system, the technique is able to pinpoint the sources quite satisfactorily.

Acknowledgment

The authors acknowledge the financial support provided by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Ministry of Education of Iran.

Literature Cited

(1) King, R.; Gilles, E. D. Multiple filter methods for detection of hazardous states in an industrial plant. AIChE J. 1990, 36, 1697.

(2) Isermann, R. Process fault detection based on modeling and estimation methods: a survey. Automatica 1984, 20, 387.

(3) Frank, P. M. Fault diagnosis in dynamic systems using analytical and knowledge-based redundancy: a survey and some new results. Automatica 1990, 26, 459.

(4) Kramer, M. A. Malfunction diagnosis using quantitative models with non-Boolean reasoning in expert systems. AIChE J. 1987, 33, 130.

(5) Petti, T. F.; Klein, J.; Dhurjati, P. S. Diagnostic model processor: using deep knowledge for process fault diagnosis. AIChE J. 1990, 36, 565.

(6) Kramer, M. A.; Palowitch, B. L. A rule-based approach to fault diagnosis using the signed directed graph. AIChE J. 1987, 33, 1067.

(7) Mohindra, S.; Clark, P. A. A distributed fault diagnosis method based on digraph models: steady-state analysis. Comput. Chem. Eng. 1993, 17, 193.

(8) Oh, S. O.; Mo, K. J.; Yoon, E. S.; Yoon, J. H. Fault diagnosis based on weighted symptom tree and pattern matching. Ind. Eng. Chem. Res. 1997, 36, 2672.

(9) Rengaswamy, R.; Venkatasubramanian, V. A syntactic pattern-recognition approach for process monitoring and fault diagnosis. Eng. Appl. Artif. Intell. 1995, 8, 35.

(10) Bakshi, B. R.; Stephanopoulos, G. Representation of process trends. IV. Induction of real-time patterns from operating data for diagnosis and supervisory control. Comput. Chem. Eng. 1994, 18, 303.

(11) Watanabe, K.; Hirota, S.; Hou, L.; Himmelblau, D. M. Diagnosis of multiple simultaneous fault via hierarchical artificial neural networks. AIChE J. 1994, 40, 839.

(12) Fan, J. Y.; Nikolaou, M.; White, R. E. An approach to fault diagnosis of chemical processes via neural networks. AIChE J. 1993, 39, 82.

(13) Kassidas, A. Fault diagnosis using speech recognition methods. Ph.D. Dissertation, McMaster University, Hamilton, Ontario, Canada, 1997.

(14) Akbaryan, F.; Bishnoi, P. R. Smooth representation of trends by a wavelet-based technique. Comput. Chem. Eng. 2000, 24, 1913.

(15) Saito, N. Local feature extraction and its applications using a library of bases. Ph.D. Dissertation, Yale University, New Haven, CT, 1994.

(16) Englehart, K. Signal representation for classification of the transient myoelectric signal. Ph.D. Dissertation, The University of New Brunswick, Fredericton, New Brunswick, Canada, 1998.

(17) Bakshi, B. R.; Stephanopoulos, G. Compression of chemical process data by functional approximation and feature extraction. AIChE J. 1996, 42, 477.

(18) Saraiva, P. M.; Stephanopoulos, G. Continuous process improvement through inductive and analogical learning. AIChE J. 1992, 38, 161.

(19) Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and regression trees; Wadsworth: Belmont, CA, 1984.

(20) Weiss, S. M.; Kulikowski, C. A. Computer systems that learn; Morgan Kaufmann: San Mateo, CA, 1991.

(21) Utgoff, P. E.; Berkman, N. C.; Clouse, J. A. Decision tree induction based on efficient tree restructuring. Mach. Learning 1997, 29, 5.

(22) Rissanen, J. A universal prior for integers and estimation by minimum description length. Ann. Statist. 1983, 30, 629.

(23) Akbaryan, F. Fault diagnosis of dynamic multi-variate chemical processes using pattern recognition and smooth representation of trends by a wavelet-based technique. Ph.D. Dissertation, The University of Calgary, Calgary, Alberta, Canada, 2000.

(24) Quinlan, J. R. C4.5: Programs for machine learning; Morgan Kaufmann Publishers: San Antonio, TX, 1993.

(25) Downs, J. J.; Vogel, E. F. A plant-wide industrial process control problem. Comput. Chem. Eng. 1993, 17, 245.

(26) Ricker, N. L. Optimal steady-state operation of the Tennessee Eastman challenge process. Comput. Chem. Eng. 1995, 19, 949.

(27) McAvoy, T. J.; Ye, N. Base control for the Tennessee Eastman problem. Comput. Chem. Eng. 1994, 18, 383.

(28) Ricker, N. L.; Lee, J. H. Nonlinear model predictive control of the Tennessee Eastman challenge process. Comput. Chem. Eng. 1995, 19, 961.

Received for review August 29, 2000

Revised manuscript received March 23, 2001

Accepted May 25, 2001

IE000779L

Table 6. Magnitude and Types of Faults Occurring Simultaneously in Example 2.3

test  fault ID  size    fault ID  size    fault ID  size    fault ID  size
1     D1        1.0     D2        0.8
2     D1        0.7     D2        1.2
3     D1        1.0     D3        0.8
4     D2        1.0     D3        0.8
5     D1        -0.1    D2        -0.3
6     D1        -0.3    D2        -0.1
7     D1        -0.3    D3        -0.1
8     D2        -0.3    D3        -0.1
9     D1        0.9     D2        1.0     D3        1.0     D4        -0.25

Table 7. Results of Multiple Fault Diagnosis, Shown by the Number of Correctly Classified Patterns, Using Four Feature Extraction Methods (a)

test  LSFSOE          LSFSHE          CWFSOE          CWFSHE
1     15 (C1+)        15 (C1+)        15 (C2+, C1+)   14 (C1+)
2     15 (C1+)        15 (C1+, C2+)   15 (C2+)        14 (C2+)
3     fail            15 (C1+)        15 (C3+)        15 (C1+, C3+)
4     15 (C3+)        15 (C3+)        15 (C3+)        15 (C3+)
5     11 (C1-, C2-)   15 (C2-)        15 (C2-)        14 (C1-)
6     15 (C1-)        13 (C1-)        15 (C1-)        15 (C1-)
7     13 (C1-)        15 (C1-)        15 (C1-)        15 (C1-)
8     11 (C2-)        fail            14 (C1-)        15 (C2-)
9     13 (C1+)        15 (C4-)        15 (C4-, C2+)   15 (C4-)

(a) The most probable classes of events are shown within brackets.

