
Fault Diagnosis Method based on Multiple Sparse Kernel Classifiers

Xiaogang Deng and Xuemin Tian

Abstract—Nonlinear fault diagnosis methods based on kernel functions have high computational complexity because all training samples are introduced in model training. This paper proposes a novel nonlinear fault diagnosis method based on multiple sparse kernel classifiers (MSKC). In the proposed method, fault diagnosis is viewed as a nonlinear classification problem between normal data and fault data. The kernel trick is applied to construct multiple nonlinear classifiers for different fault scenarios. In order to reduce the complexity of each kernel classifier and improve its generalization capability, a forward orthogonal selection procedure is applied to minimize the leave-one-out classification error. Lastly, the multiple sparse kernel classifiers are combined by a weighted voting technique to build a monitoring statistic. Simulation of a continuous stirred tank reactor system shows that the proposed method performs better than the kernel principal component analysis method in terms of fault detection performance and computational efficiency.

I. INTRODUCTION

The demands for improving product quality and ensuring process safety have stimulated the recent development of fault diagnosis techniques. As large amounts of process

data are available in modern industrial processes, data-driven methods have emerged in the fault diagnosis field during the last decades, including principal component analysis (PCA), partial least squares (PLS) and independent component analysis (ICA) [1]-[3]. However, these methods are based on a linear assumption, which may cause incorrect results for nonlinear industrial processes. To handle nonlinear systems, many improved methods have been studied, such as the principal curve, neural PCA and kernel PCA (KPCA). In [4], the principal curve method was first studied as a nonlinear generalization of linear PCA. In [5], an adaptive nonlinear PCA was introduced based on an improved input-training neural network. KPCA applies a kernel function to compute the principal components in a high-dimensional feature space [6], which is nonlinearly related to the input space. KPCA has been used in fault detection and diagnosis by [7]-[9]. The kernel trick has also been applied in other nonlinear methods such as kernel PLS (KPLS) and kernel ICA (KICA) [10][11].

Kernel methods have shown good performance and have been proven to be a promising approach to fault detection and diagnosis.

Manuscript received June 30, 2011. This work was supported by the Natural Science Foundation of Shandong Province of China (Y2007G49) and the Fundamental Research Funds for the Central Universities (No. 10CX04046A).

Xiaogang Deng is with the College of Information & Control Engineering, China University of Petroleum, Dongying, China (e-mail: [email protected]).

Xuemin Tian is with the College of Information & Control Engineering, China University of Petroleum, Dongying, China (telephone: 086-0546-8391904; e-mail: [email protected]).

However, the kernel methods KPCA, KICA and KPLS have some shortcomings. Firstly, the kernel matrix is not sparse and incurs high computational complexity. In the modelling stage of kernel methods, it is necessary to compute and store the kernel matrix, whose size is the square of the sample number. When the sample number becomes large, the eigenvalue decomposition and matrix inversion calculations become time-consuming. Secondly, the classification information in the fault data is not well utilized. Fault diagnosis is in nature a classification problem when fault data are available. KPCA, KICA and KPLS only analyze nominal data and omit fault data, which could also be very useful in detecting faults.

Motivated by the above analysis, this paper proposes a multiple sparse kernel classifiers (MSKC) method for fault diagnosis. In the proposed method, the fault diagnosis procedure is treated as a nonlinear classification problem between the normal class and the fault class, and multiple kernel classifiers are constructed to handle the nonlinear classification. In order to reduce the model complexity, a forward orthogonal selection technique is applied to make each classifier sparse. The rest of the paper is organized as follows. The concept of the sparse kernel classifier is introduced in Section II. Section III gives the formulation of the multiple sparse kernel classifiers based fault diagnosis strategy. In Section IV, an example of a continuous stirred tank reactor system is used to validate the proposed monitoring scheme. Finally, conclusions are presented in Section V.

II. SPARSE KERNEL CLASSIFIER

A. Problem Formulation

The goal of fault diagnosis is to determine whether abnormal process behavior has occurred on-line and to provide early warning of process upsets so that proper corrective actions can be taken. When normal data and some class of fault data are both available, the diagnosis problem can be considered as a two-class classification between the normal class and the fault class.

Given modelling data $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^{\mathrm{T}}$ and $\mathbf{y} = [y_1, \ldots, y_n]^{\mathrm{T}}$, where $\mathbf{X} \in \mathbb{R}^{n \times m}$ stands for the normal and fault data set with $n$ samples of the process vector $\mathbf{x}_i$, and $\mathbf{y} \in \mathbb{R}^{n \times 1}$ stands for $n$ samples of the class variable, with $y_i = 1$ for normal data and $y_i = -1$ for fault data. The training procedure of a two-class classifier is to find a classification hyperplane $f(\mathbf{x}) = 0$ based on the training data set. For linearly separable data, $f(\mathbf{x})$ can be formulated as

$$f(\mathbf{x}) = \mathbf{x}^{\mathrm{T}}\boldsymbol{\beta} = \sum_{j=1}^{n} \beta_j\, \mathbf{x}^{\mathrm{T}}\mathbf{x}_j \qquad (1)$$


where the vector $\boldsymbol{\beta}$ determines the position of the separating hyperplane. The function $f(\mathbf{x})$ is also called the decision function, whose value is close to 1 for normal operation and -1 for fault operation.

For a nonlinear classification task, a nonlinear function $\Phi(\cdot): \mathbf{x} \rightarrow \Phi(\mathbf{x})$ is applied to map the original data space onto a high-dimensional feature space, where linear classification is performed. The classification hyperplane is then transformed into

$$f(\mathbf{x}) = \Phi(\mathbf{x})^{\mathrm{T}}\boldsymbol{\omega} \qquad (2)$$

As the decision vector $\boldsymbol{\omega}$ can be spanned as $\boldsymbol{\omega} = \sum_{i=1}^{n} \theta_i \Phi(\mathbf{x}_i) = \Phi(\mathbf{X})^{\mathrm{T}}\boldsymbol{\theta}$, Equation (2) can be expressed as

$$f(\mathbf{x}) = \sum_{i=1}^{n} \theta_i\, \Phi(\mathbf{x})^{\mathrm{T}}\Phi(\mathbf{x}_i) \qquad (3)$$

where $\theta_i$ is the projection parameter. According to Mercer's theorem [6], we can avoid performing the nonlinear mappings and computing the dot products in the feature space explicitly by introducing a kernel function. With the application of the kernel trick, a dot product of the feature-space mappings of original data points is a kernel $K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x})^{\mathrm{T}}\Phi(\mathbf{y})$. So the decision function becomes

$$f(\mathbf{x}) = \sum_{i=1}^{n} \theta_i\, K(\mathbf{x}, \mathbf{x}_i) \qquad (4)$$

Representative kernel functions include the polynomial kernel, the sigmoid kernel and the Gaussian kernel. In this paper, the Gaussian kernel function is applied:

$$K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x} - \mathbf{y}\|^2 / c} \qquad (5)$$

where $c$ is a kernel parameter specified by the user.
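As a concrete illustration, the following NumPy sketch evaluates the Gaussian kernel of (5) and assembles the kernel matrix used later in (6). The function names and the default value of c are our own choices for illustration, not from the paper.

```python
import numpy as np

def gaussian_kernel(x, y, c=1.0):
    """Gaussian kernel of (5): K(x, y) = exp(-||x - y||^2 / c)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / c)

def kernel_matrix(X, c=1.0):
    """Kernel matrix with K[j, i] = K(x_j, x_i), rows of X being samples."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    return np.exp(-np.maximum(d2, 0.0) / c)          # clip tiny negatives from round-off
```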

According to [12], the classification problem can be solved in the framework of regression modelling, with the known class labels used as the desired output. By defining the modelling error $e_i = y_i - f(\mathbf{x}_i)$ for all modelling data, the classifier model can be expressed as

$$\mathbf{y} = f(\mathbf{X}) + \mathbf{e} = \mathbf{K}\boldsymbol{\theta} + \mathbf{e} \qquad (6)$$

where $\mathbf{K}$ is the kernel matrix with $\mathbf{K}_{ji} = K(\mathbf{x}_j, \mathbf{x}_i)$.

B. Orthogonal Forward Selection for Model Solving

Solving for the parameter $\boldsymbol{\theta}$ in (6) amounts to minimizing the objective function

$$\min J = \mathbf{e}^{\mathrm{T}}\mathbf{e} \qquad (7)$$

Equation (7) minimizes the mean-squared error (MSE) of the modelling data, which is a least squares (LS) problem whose solution is $\boldsymbol{\theta} = (\mathbf{K}^{\mathrm{T}}\mathbf{K})^{-1}\mathbf{K}^{\mathrm{T}}\mathbf{y}$. Because the size of $\mathbf{K}$ is the square of the sample number ($n \times n$), the computation becomes prohibitive when the number of training samples $n$ is very large. Moreover, as variable correlation exists among the columns of the training data, $\mathbf{K}^{\mathrm{T}}\mathbf{K}$ may be an ill-conditioned matrix, which leads to overfitting and low generalization capability of the final model parameters.

In order to avoid the above problems and obtain a sparse kernel classifier, a matrix orthogonal decomposition $\mathbf{K} = \mathbf{W}\mathbf{A}$ is used. The matrices $\mathbf{W}$ and $\mathbf{A}$ can be computed through the improved Gram-Schmidt orthogonalization method as follows:

$$\mathbf{A} = \begin{bmatrix} 1 & a_{1,2} & \cdots & a_{1,N} \\ 0 & 1 & \ddots & \vdots \\ \vdots & & \ddots & a_{N-1,N} \\ 0 & 0 & \cdots & 1 \end{bmatrix}, \qquad \mathbf{W} = [\mathbf{w}_1 \ \cdots \ \mathbf{w}_N], \quad \mathbf{w}_i^{\mathrm{T}}\mathbf{w}_j = 0, \ i \neq j \qquad (8)$$

According to (8), Equation (9) follows:

$$\mathbf{y} = \mathbf{W}\mathbf{g} + \mathbf{e} \qquad (9)$$

where $\mathbf{g} = \mathbf{A}\boldsymbol{\theta} = [g_1\ g_2\ \cdots\ g_n]^{\mathrm{T}}$.

In (9), $\mathbf{W}$ can be viewed as the model base vectors and $\mathbf{g}$ as the model parameters. Due to the correlation existing in the modelling data, only a part of the model basis vectors is needed to build a sparse model, which means selecting $n_s$ ($n_s \ll n$) uncorrelated model basis vectors $\mathbf{W}_s$ from $\mathbf{W}$ through an optimization method. This procedure is called forward subset selection [12][13]. The final model is given as

$$\mathbf{y} = \mathbf{W}_s\mathbf{g}_s + \mathbf{e} \qquad (10)$$

where $\mathbf{g}_s$ stands for the $n_s$ items of $\mathbf{g}$.

Equation (10) improves the model sparseness and has better generalization capability. In order to solve for the model parameters in (10), a regularization constraint on the model parameters is used, and the optimization objective function is given by (11) according to [12]:

$$J(\mathbf{g}, \boldsymbol{\lambda}) = \mathbf{e}^{\mathrm{T}}\mathbf{e} + \mathbf{g}^{\mathrm{T}}\boldsymbol{\Lambda}\mathbf{g} \qquad (11)$$

where $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ contains the regularization parameters, which represent the prior distribution of $\mathbf{g}$.

For the objective function in (11), $\mathbf{g}$ can be obtained through the following optimization procedure:

$$g_i = \mathbf{w}_i^{\mathrm{T}}\mathbf{y}^{(i-1)} / (\mathbf{w}_i^{\mathrm{T}}\mathbf{w}_i + \lambda_i) \qquad (12)$$

$$\mathbf{y}^{(i)} = \mathbf{y}^{(i-1)} - g_i\mathbf{w}_i \qquad (13)$$

where $\lambda_i$ is derived from the Bayesian evidence procedure. The specific updating formulas are given as follows:

$$\lambda_i^{\mathrm{new}} = \frac{\gamma_i^{\mathrm{old}}\,\mathbf{e}^{\mathrm{T}}\mathbf{e}}{(n - \gamma^{\mathrm{old}})\,g_i^2}, \quad 1 \le i \le n \qquad (14)$$

$$\gamma = \sum_{i=1}^{n} \gamma_i, \qquad \gamma_i = \frac{\mathbf{w}_i^{\mathrm{T}}\mathbf{w}_i}{\lambda_i + \mathbf{w}_i^{\mathrm{T}}\mathbf{w}_i} \qquad (15)$$
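The following sketch illustrates (8) and (12)-(15) in NumPy: a modified Gram-Schmidt decomposition K = WA, followed by regularized estimation of g with Bayesian evidence updates of the lambdas. The iteration count, the initial lambda value, and the small epsilon guarding the division are our own numerical choices for a sketch; see [12] for the full algorithm.

```python
import numpy as np

def mgs_decompose(K):
    """Improved (modified) Gram-Schmidt of (8): K = W A, where W has
    mutually orthogonal columns and A is unit upper triangular."""
    n = K.shape[1]
    W = K.astype(float).copy()
    A = np.eye(n)
    for i in range(n):
        wi = W[:, i]
        denom = wi @ wi                       # assumed nonzero for this sketch
        for j in range(i + 1, n):
            A[i, j] = (wi @ W[:, j]) / denom
            W[:, j] -= A[i, j] * wi
    return W, A

def regularized_ols(W, y, n_iter=10):
    """Estimate g by (12)-(13), updating lambda by the evidence rules (14)-(15)."""
    n = W.shape[1]
    lam = np.full(n, 1e-6)                    # initial regularization (our choice)
    wtw = np.einsum('ij,ij->j', W, W)         # w_i^T w_i for every column
    for _ in range(n_iter):
        g = np.zeros(n)
        r = y.astype(float).copy()            # y^(0)
        for i in range(n):
            g[i] = (W[:, i] @ r) / (wtw[i] + lam[i])   # (12)
            r = r - g[i] * W[:, i]                     # (13): r becomes y^(i)
        gamma_i = wtw / (lam + wtw)           # (15)
        gamma = gamma_i.sum()
        lam = gamma_i * (r @ r) / ((len(y) - gamma) * g**2 + 1e-12)  # (14)
    return g, lam
```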

If the mean squared error of the training data is used as $\mathbf{e}^{\mathrm{T}}\mathbf{e}$ in the optimization cost function (11), overfitting easily appears, which means high precision in fitting the training data but low generalized prediction capability. To improve the generalization capability of the model, an error measure with good generalization performance should be used, namely the MSE based on the leave-one-out method (LOO MSE). In the LOO MSE, for the model including $n$ bases, $e_k^{(n,-k)}$ represents the test error on $(\mathbf{x}_k, y_k)$ when this sample is excluded from the modelling data:

$$e_k^{(n,-k)} = y_k - \hat{f}^{(n,-k)}(\mathbf{x}_k) \qquad (16)$$

Optimization based on the leave-one-out principle is a very complex procedure, and existing research results show that it is not necessary to execute a LOO modelling test for each data point.


It is more practical to calculate the LOO MSE in an iterative way; detailed information can be found in [12] and [13].

According to (11)-(16), $\mathbf{g}_s$ can be obtained. Then we get a sparse kernel classifier model as

$$f(\mathbf{x}) = \sum_{l=1}^{n_s} \theta_l\, K(\mathbf{x}, \mathbf{x}_l) \qquad (17)$$

where $\{\mathbf{x}_l, \theta_l\}$, $l = 1, 2, \ldots, n_s$, are the sparse model base vectors and the sparse parameters.
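To make the selection step concrete, here is a simplified sketch of orthogonal forward selection that scores each candidate basis by an iteratively updated LOO MSE and stops when it no longer decreases, in the spirit of [12][13]. A fixed regularization lambda is assumed for brevity (the paper updates it via (14)-(15)), and the back-substitution through A that maps the selected g values to the sparse theta of (17) is omitted.

```python
import numpy as np

def ofs_loo_select(K, y, lam=1e-4):
    """Greedy orthogonal forward selection minimizing the LOO MSE.
    Returns the indices of the selected basis samples (a sketch)."""
    n = K.shape[1]
    P = K.astype(float).copy()      # candidate columns, orthogonalized in place
    e = y.astype(float).copy()      # current residual
    eta = np.ones(n)                # LOO scaling factors eta_k, start at 1
    selected, best_loo = [], np.inf
    while len(selected) < n:
        cand_loo, cand = np.inf, -1
        for j in range(n):          # score every unselected candidate
            if j in selected:
                continue
            w = P[:, j]
            wtw = w @ w + lam
            g = (w @ e) / wtw
            loo = np.mean(((e - g * w) / (eta - w**2 / wtw)) ** 2)
            if loo < cand_loo:
                cand_loo, cand = loo, j
        if cand_loo >= best_loo:    # LOO MSE stopped improving: terminate
            break
        best_loo = cand_loo
        w = P[:, cand]
        wtw = w @ w + lam
        e = e - ((w @ e) / wtw) * w
        eta = eta - w**2 / wtw
        selected.append(cand)
        for j in range(n):          # orthogonalize the rest against the new basis
            if j not in selected:
                P[:, j] -= ((w @ P[:, j]) / (w @ w)) * w
    return selected
```

Called as `ofs_loo_select(kernel_matrix(X, c), y)`, this would return the indices of the selected basis samples, whose count corresponds to $n_s$ and is typically much smaller than $n$.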

III. FAULT DIAGNOSIS BY MULTIPLE SPARSE KERNEL CLASSIFIERS

A. Multiple Sparse Kernel Classifiers

When there are two or more kinds of fault data available, the fault diagnosis problem becomes a multi-class classification problem. One sparse kernel classifier is then not enough, and multiple sparse kernel classifiers (MSKC) are needed. In MSKC, we construct one classifier for the normal data paired with each class of fault operation data. If there are $k$ classes of fault data, $k$ classifiers $\{f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_k(\mathbf{x})\}$ are built. As the orthogonal forward selection technique is applied, every classifier is trained quickly. Once the training of the multiple classifiers has finished, they can be applied online for fault detection. When a new sample vector is collected, each classifier gives its own conclusion about whether the process is normal, and the final classification result is given by weighted voting. The scheme of MSKC is shown in Fig. 1.

Fig. 1 The scheme of MSKC: real-time collected input data are fed in parallel to sparse kernel classifiers (1) to (k), whose outputs are combined to give the fault diagnosis result.

B. Monitoring Statistic

According to Section II, when the input data $\mathbf{x}$ come from the normal operation condition, the classifier output $f_i(\mathbf{x})$ should range around 1, so $(f_i(\mathbf{x}) - 1)$ should be a small number. For each classifier, we can therefore construct the monitoring index

$$f_i'(\mathbf{x}) = (f_i(\mathbf{x}) - 1)^2 \qquad (18)$$

The monitoring index $f_i'(\mathbf{x})$ indicates the process status by its value. Under normal operation, $f_i'(\mathbf{x})$ is a small positive number; otherwise it becomes a large positive number. Considering all classifiers, a combined monitoring statistic is built by weighted voting:

$$F(\mathbf{x}) = \sum_{i=1}^{k} \alpha_i f_i'(\mathbf{x}) = \sum_{i=1}^{k} \alpha_i (f_i(\mathbf{x}) - 1)^2 \qquad (19)$$

The next key problem is how to design the weighting parameter $\alpha_i$. Here we design the weighting coefficients according to the standard deviation of $f_i'(\mathbf{x})$ under the normal operation condition. If the classifier output $f_i(\mathbf{x})$ fluctuates only slightly around 1 for normal operation, the $i$th classifier has high credibility and should have a high voting coefficient. So the weighting coefficient is designed as

$$\alpha_i = \frac{1}{\mathrm{std}(f_i'(\mathbf{x}))} \qquad (20)$$
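A minimal sketch of (18)-(20) follows: the weights are estimated from the spread of each classifier's monitoring index over normal operating data, and the F statistic combines the k classifier outputs. The array shapes and function names are our own assumptions for illustration.

```python
import numpy as np

def voting_weights(f_normal):
    """alpha_i = 1 / std(f_i'(x)) over normal data, per (20).
    f_normal: array of shape (N, k) of classifier outputs f_i(x)."""
    f_prime = (f_normal - 1.0) ** 2            # monitoring index of (18)
    return 1.0 / np.std(f_prime, axis=0)       # assumes nonzero spread

def F_statistic(f_outputs, alpha):
    """Combined statistic of (19): F = sum_i alpha_i (f_i(x) - 1)^2."""
    return np.sum(alpha * (np.asarray(f_outputs) - 1.0) ** 2)
```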

C. Confidence Limit Based on Kernel Density Estimation

After the monitoring statistic F is obtained, we need to calculate the confidence limit to determine whether the process is in control. In PCA and KPCA, the confidence limit of the monitoring statistic is determined from the assumption that the probability density functions of the latent variables follow a multivariate Gaussian distribution. However, [14] reported that this assumption is often inaccurate. In this section, the confidence limit of the MSKC F statistic for nominal operating regions is determined by non-parametric density estimation, using kernel density estimation [15].

A univariate kernel estimator with kernel $K$ is defined by

$$\hat{p}(y) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left\{\frac{y - y_i}{h}\right\} \qquad (21)$$

where $y$ is the data point under consideration, $y_i$ is an observation value from the data set, $h$ is the smoothing parameter, $n$ is the number of observations, and $K$ is the kernel function. In practice, the Gaussian kernel is most commonly used.

The confidence limit of the F statistic can be obtained using kernel density estimation as follows. First, the F statistics of the normal operating data are computed. Then the univariate kernel density estimator is used to estimate the density function of the normal F values. Lastly, the confidence limit is obtained by finding the point below which 95% of the area of the density function lies.
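This computation can be sketched as below: a Gaussian-kernel version of (21) is evaluated on a grid, and the 95% point is read off the numerically integrated density. Silverman's rule for the bandwidth h is our own choice; the paper only requires a user-specified smoothing parameter.

```python
import numpy as np

def kde_confidence_limit(F_normal, conf=0.95, h=None):
    """Confidence limit of the F statistic via the univariate estimator (21)."""
    F_normal = np.asarray(F_normal, dtype=float)
    n = F_normal.size
    if h is None:                       # Silverman's rule of thumb (our choice)
        h = 1.06 * F_normal.std() * n ** (-1 / 5)
    grid = np.linspace(F_normal.min() - 3 * h, F_normal.max() + 3 * h, 2000)
    u = (grid[:, None] - F_normal[None, :]) / h
    p = np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))  # (21)
    cdf = np.cumsum(p)
    cdf /= cdf[-1]                      # normalize the numerically integrated area
    return grid[np.searchsorted(cdf, conf)]   # point holding `conf` of the area
```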

D. Fault Diagnosis Procedure Based on MSKC

The whole MSKC based fault diagnosis procedure includes two stages: a modelling stage and an online detection stage. In the modelling stage, the monitoring model is developed by MSKC and the confidence limit of the F chart is determined. During the online detection stage, newly observed data are collected and the online F statistic is calculated to determine whether the process is under the normal operation condition.


Part I: Modelling stage

(1) Acquire normal operating data and normalize the data using the mean and variance of each variable.
(2) Acquire the kth class of fault operating data and scale the data using the mean and variance from step (1).
(3) For the scaled normal and fault data, we have the training data $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ and $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$, where $y_i \in \{1, -1\}$ denotes the normal and fault class. Then a sparse kernel classifier model is built for normal operation and fault k.
(4) Repeat steps (1)-(3) to obtain all classifiers.
(5) Build the multiple sparse kernel classifiers and calculate the confidence limit for the F statistic.

Part II: Online detection stage (a sketch of this stage follows the list)

(6) Obtain new data and scale it with the mean and variance from step (1) of the modelling stage.
(7) Compute the outputs of the multiple sparse kernel classifiers for the newly collected data.
(8) Monitor whether the F statistic exceeds its confidence limit.
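As promised above, the online stage reduces to a few lines. In this sketch, `classifiers` is assumed to be a list of callables implementing the sparse model (17), and `mean`, `std`, `alpha` and `limit` are the quantities produced by the modelling stage; all names are our own.

```python
import numpy as np

def detect(x_new, mean, std, classifiers, alpha, limit):
    """Online detection steps (6)-(8): scale the new sample, evaluate all
    sparse classifiers, and compare the F statistic with its limit."""
    x = (np.asarray(x_new, dtype=float) - mean) / std        # step (6)
    f = np.array([clf(x) for clf in classifiers])            # step (7)
    F = np.sum(alpha * (f - 1.0) ** 2)                       # F statistic of (19)
    return F > limit, F                                      # step (8): alarm flag
```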

IV. SIMULATION STUDY

The proposed fault diagnosis strategy based on MSKC is tested with a simulated process, a non-isothermal continuous stirred tank reactor (CSTR) system. The practicability of MSKC for process monitoring and fault diagnosis is demonstrated as follows.

The CSTR with cooling jacket dynamics and variable liquid level is simulated for process monitoring. It is assumed that a classic first-order irreversible reaction takes place in the CSTR. The flow of solvent and reactant A into the reactor produces a single component B as the outlet stream. Heat from the exothermic reaction is removed through the cooling flow of the jacket. The reactor temperature is controlled to its set-point by manipulating the coolant flow, and the level is controlled by manipulating the outlet flow. A schematic diagram of the CSTR with its feedback control system is shown in Fig. 2.

The data of the normal operating condition and the faulty conditions are generated by simulating the CSTR process. Ten process variables are recorded, and Gaussian noise is added to all measurements in the simulation procedure. The simulation produces normal operating data and five kinds of fault pattern data. The applied fault patterns are listed in Table 1; these faults cover operating condition changes, process parameter changes and sensor bias. During the process simulation, data are sampled every 10 seconds and 720 samples are stored. For each fault pattern, the fault is introduced after the 240th sample.

In this section, PCA, KPCA and MSKC are applied to detect the faults. For convenience of comparison, each monitoring statistic is divided by its confidence limit when the monitoring charts are plotted, so that the alarm limit is the same (equal to 1) in all plots.

Fig. 2 A diagram of the CSTR system (feed streams, reactor with cooling jacket, and temperature, level and flow control loops).

TABLE 1 FAULT PATTERNS FOR THE CSTR SYSTEM

Fault | Description
F1    | The coolant feed temperature ramps down.
F2    | The feed concentration ramps up.
F3    | The heat transfer coefficient ramps down.
F4    | Catalyst deactivation.
F5    | The reactor temperature measurement has a bias.

Firstly, we compare the model complexity of KPCA and MSKC, which are both kernel methods. In the modelling stage, 200 normal samples are collected as modelling data. For KPCA, all samples are used to construct the kernel matrix, whose size is 200 x 200. When a new sample arrives online, the kernel function has to be computed 200 times. For MSKC, although five classifiers are built, every classifier is sparse, with 21, 14, 9, 6 and 5 modelling samples respectively. When a new sample is available, the kernel function is computed only 55 times. So MSKC has better sparsity than KPCA among kernel methods.

The monitoring results for faults F3 and F5 are illustrated to show the effectiveness of MSKC. In all monitoring charts, a fault is considered detected if 8 consecutive samples exceed the confidence limit, which is plotted as a dashed line. As shown in Fig. 3, the T2 and Q monitoring charts of the PCA method are plotted for fault F3, where the heat transfer coefficient ramps down. This fault cannot be detected by the T2 statistic before sample 486, whereas the Q statistic detects it at the 409th sample. When KPCA is applied, the Q and T2 charts in Fig. 4 detect fault F3 at sample 393, much earlier than the PCA method, owing to its nonlinear property. As shown in Fig. 5, the MSKC method indicates the presence of the abnormality at the 361st sample, which improves the monitoring performance obviously.

Fault F5 is associated with a reactor temperature sensor bias. For this fault, PCA performs poorly: its T2 statistic does not detect the fault until sampling instant 338, and its Q statistic does not exceed the alarm limit until about sampling instant 288 (see Fig. 6). KPCA behaves better than PCA, and both of its statistics show a clear alarm at about sampling instant 247 (see Fig. 7). MSKC has the same detection time as KPCA (see Fig. 8), but it has a higher fault alarm rate and exceeds the alarm limit more clearly. After fault F5 occurs, the fault alarm rate of MSKC is 0.9875, while the KPCA fault alarm rates of the Q and T2 statistics are 0.9396 and 0.8875 respectively. This example also demonstrates the advantages of MSKC in fault detection.


Fig. 3 PCA monitoring results for fault F3: (a) T2 chart, (b) Q chart.

Fig. 4 KPCA monitoring results for fault F3: (a) T2 chart, (b) Q chart.

Fig. 5 MSKC monitoring results (F statistic) for fault F3.

Fig. 6 PCA monitoring results for fault F5: (a) T2 chart, (b) Q chart.

Fig. 7 KPCA monitoring results for fault F5: (a) T2 chart, (b) Q chart.

Fig. 8 MSKC monitoring results (F statistic) for fault F5.


Table 2 lists the fault detection results of the five faults for the three methods. The values in Table 2 are the sample numbers at which the alarm is raised; a smaller number indicates quicker detection. From Table 2, MSKC performs better than PCA and KPCA in terms of fault detection performance.

TABLE 2 COMPARISON OF DETECTION SAMPLES OF THREE METHODS

Fault | PCA-Q | PCA-T2 | KPCA-Q | KPCA-T2 | MSKC
F1    | 322   | 355    | 316    | 322     | 303
F2    | 313   | 338    | 311    | 311     | 301
F3    | 409   | 486    | 393    | 393     | 361
F4    | 329   | 358    | 313    | 313     | 311
F5    | 288   | 338    | 247    | 247     | 247

V. CONCLUSIONS

In this paper, a new fault diagnosis method based on multiple sparse kernel classifiers has been formulated for supervising nonlinear processes. The proposed method uses multiple kernel classifiers to perform the nonlinear classification between normal and abnormal operation. To reduce the model complexity arising from the application of the kernel trick and to improve the generalization of the classification model, orthogonal forward selection and the LOO MSE are applied to select sparse basis vectors. Lastly, the F statistic is constructed to monitor the process. The application to the CSTR system shows that the proposed strategy works well.

REFERENCES

[1] J. F. MacGregor and T. Kourti, "Statistical process control of multivariate processes," Control Engineering Practice, vol. 3, pp. 403-414, March 1995.
[2] V. Venkatasubramanian, R. Rengaswamy, S. N. Kavuri and K. Yin, "A review of process fault detection and diagnosis Part III: process history based methods," Computers and Chemical Engineering, vol. 27, pp. 327-346, March 2003.
[3] P. P. Odiowei and Y. Cao, "State-space independent component analysis for nonlinear dynamic process monitoring," Chemometrics and Intelligent Laboratory Systems, vol. 103, pp. 59-65, August 2010.
[4] D. Dong and T. J. McAvoy, "Nonlinear principal component analysis based on principal curves and neural networks," Computers and Chemical Engineering, vol. 20, pp. 65-78, January 1996.
[5] Z. Q. Geng and Q. X. Zhu, "Multiscale nonlinear principal component analysis and its application for chemical process monitoring," Industrial and Engineering Chemistry Research, vol. 44, pp. 3585-3593, March 2005.
[6] B. Scholkopf, A. J. Smola and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, pp. 1299-1319, July 1998.
[7] J. M. Lee, C. K. Yoo, S. W. Choi, P. A. Vanrolleghem and I.-B. Lee, "Nonlinear process monitoring using kernel principal component analysis," Chemical Engineering Science, vol. 59, pp. 223-234, January 2004.
[8] J.-H. Cho, J.-M. Lee and S. W. Choi, "Fault identification for process monitoring using kernel principal component analysis," Chemical Engineering Science, vol. 60, pp. 279-288, January 2005.
[9] V. H. Nguyen and J. C. Golinval, "Fault detection based on kernel principal component analysis," Engineering Structures, vol. 32, pp. 3683-3691, November 2010.
[10] Y. Zhang and C. Ma, "Fault diagnosis of nonlinear processes using multiscale KPCA and multiscale KPLS," Chemical Engineering Science, vol. 66, pp. 64-72, January 2011.
[11] X. M. Tian, X. L. Zhang and X. G. Deng, "Multiway kernel independent component analysis based on feature samples for batch process monitoring," Neurocomputing, vol. 72, pp. 1584-1596, March 2009.
[12] S. Chen, X. Hong, B. L. Luk and C. J. Harris, "Orthogonal least squares regression: a unified approach for data modeling," Neurocomputing, vol. 72, pp. 2670-2681, June 2009.
[13] X. Hong, S. Chen and C. J. Harris, "Fast kernel classifier construction using orthogonal forward selection to minimize leave-one-out misclassification rate," International Journal of Systems Science, vol. 39, pp. 119-125, February 2008.
[14] E. B. Martin and A. J. Morris, "Non-parametric confidence bounds for process performance monitoring charts," Journal of Process Control, vol. 6, pp. 349-358, December 1996.
[15] J. M. Lee, C. K. Yoo and I. B. Lee, "Statistical process monitoring with independent component analysis," Journal of Process Control, vol. 14, pp. 467-485, August 2004.
