[ieee 2010 international conference on biomedical engineering and computer science (icbecs) - wuhan,...

A Wavelet Component Selection Method for Multivariate Calibration of Near-Infrared Spectra Based on Information Entropy Theory

Dan Peng, Xia Li, Kaina Dong

College of Grain Oil and Food Science, Henan University of Technology, Zhengzhou, China

[email protected], [email protected], [email protected]

Abstract—A new hybrid algorithm, entropy-based wavelet packet component selection (EWPCS), is proposed for selecting the wavelet packet components that carry the analyte variation and using them as the input data of a regression model. The algorithm is based on the wavelet packet transform (WPT) and information entropy theory. First, the WPT and its reconstruction algorithm are employed to split the raw spectra into frequency components at the maximum decomposition level. Then the information entropy of the difference between the raw spectra and each frequency component is calculated, which indicates the importance of each component. Finally, using an optimized threshold value determined by the performance of the regression model, the wavelet packet components representing the features of the analyte variation are selected according to the change in information entropy. To validate the EWPCS method, it was applied to measuring the oil content of corn from near-infrared spectra. The results show that the prediction ability and robustness of models built with EWPCS and partial least squares regression are significantly improved, with prediction errors decreasing by up to 43.2%, indicating that EWPCS is an effective preprocessing method for modeling near-infrared spectra.

Keywords-wavelet packet component; near-infrared spectra; oil content; feature extraction; preprocessing method

I. INTRODUCTION

Owing to its rapid, simple and non-destructive measurement, near-infrared (NIR) spectroscopy plays a very important role in measuring the composition of complex samples [1,2]. However, the useful information in NIR spectra varies only weakly, which makes it challenging to extract the sample-specific or property-specific information [3]. The efficient use of NIR methods therefore depends on chemometric methods, and the quality of a chemometric method depends to a great degree on the quality of the spectra. Typically, NIR spectra consist of analyte information and spectral interferences. Spectral interferences, including background and noise, can cause problems with instrument calibration and quantitation of the spectral information [4], and they often degrade the quality of NIR spectra. Thus, much more attention should be paid to feature selection in chemometric methods for NIR. It is widely accepted that a better NIR calibration model can be obtained by selecting the portions of the spectra that contain property-specific information instead of using the full spectra [5]. To this end, several methods have been developed, such as the standard normal variate (SNV) [6], multiplicative scatter correction (MSC) [7], orthogonal signal correction (OSC) [8] and so on.

In this work, a novel preprocessing algorithm that combines the wavelet packet transform (WPT) with information entropy computation is proposed for selecting the characteristic information from NIR spectral data; it is named EWPCS (entropy-based wavelet packet component selection). WPT has been found to be a very efficient tool for processing analytical signals. With the WPT technique, NIR spectra can be split into several non-overlapping frequency components. The EWPCS algorithm aims at selecting the wavelet packet components that contain the analyte information as the input data of a regression model. The algorithm consists of two parts: a frequency decomposition algorithm based on WPT and a component selection algorithm based on information entropy comparison. To validate the effectiveness of the EWPCS algorithm, a real NIR spectral dataset of corn was analyzed for oil content measurement. The results show that the EWPCS algorithm can significantly improve the performance of the calibration model.

II. PRINCIPLE AND METHOD

A. Wavelet Packet Transform

WPT decomposes a signal into localized contributions labeled by a scale and a position parameter, and the contributions at each scale represent the information of a different frequency contained in the original signal [9]. The decomposition is executed through a convolution of the signal with various scales and translations of a mother wavelet. As shown in Fig. 1, a signal denoted as S or $s_{0,0}$ is fully decomposed up to L levels, where $s_{j,i}$ represents the ith set of wavelet packet coefficients at the jth decomposition level. The wavelet packet decomposition can be described as

$$
\begin{cases}
s_{j+1,2i} = H \cdot s_{j,i} \\
s_{j+1,2i+1} = G \cdot s_{j,i}
\end{cases}
\tag{1}
$$

where H and G are the low-pass filter and the high-pass filter, j = 0, 1, …, L−1, i = 0, 1, …, $2^{j}-1$, and S can be split into $2^{L}$ frequency contributions.
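The recursion in (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the two-tap Haar filters for H and G because they are easy to write by hand (the experiments in this paper use the dmey wavelet), and the helper names `haar_split` and `wpt_decompose` are hypothetical. Nodes are stored in the natural tree order produced by (1), which is not necessarily the order of increasing frequency.

```python
import math

def haar_split(s):
    """One decomposition step on an even-length list: return (H s, G s)."""
    r = math.sqrt(2.0)
    half = len(s) // 2
    low = [(s[2 * t] + s[2 * t + 1]) / r for t in range(half)]   # low-pass H
    high = [(s[2 * t] - s[2 * t + 1]) / r for t in range(half)]  # high-pass G
    return low, high

def wpt_decompose(signal, levels):
    """Full wavelet packet tree: tree[j][i] holds the coefficients s_{j,i}."""
    tree = [[list(signal)]]
    for j in range(levels):
        next_level = []
        for node in tree[j]:
            low, high = haar_split(node)
            next_level.extend([low, high])  # s_{j+1,2i} and s_{j+1,2i+1}
        tree.append(next_level)
    return tree

# Toy signal of length 8 (illustrative numbers, not data from the paper).
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
tree = wpt_decompose(signal, 3)
print(len(tree[3]))  # 8 nodes at level L = 3, i.e. 2**L
```

Because the Haar filters are orthonormal, the sum of squared coefficients at the deepest level equals the sum of squared signal values, which is a quick sanity check on any implementation of (1).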

Supported by the National Natural Science Foundation of China (No. 30700168), the National Key Technology R&D Program in the 11th Five-Year Plan of China (No. 2006BAI03A03) and the Doctoral Foundation Program of Henan University of Technology (No. 2009BS008).

978-1-4244-5316-0/10/$26.00 ©2010 IEEE

In Fig. 1, the filter outputs $s_{j,i}$ denote both the detail and the approximation coefficients at level j. Unlike the wavelet transform, the high-frequency and low-frequency coefficients have the same resolution in WPT decomposition. Owing to the orthogonality of the wavelet packet filters and the linearity of the wavelet packet transform, the reconstruction (inverse WPT) is also a linear transformation. By using the inverse WPT, the wavelet packet coefficients can be converted back into the original domain as

$$
s_{j,i} = H^{*} s_{j+1,2i} + G^{*} s_{j+1,2i+1} \tag{2}
$$

where $H^{*}$ and $G^{*}$ are the pairing operators of H and G. Thus $p_{L,i}$, which represents the contribution of an individual frequency band, is calculated by executing (2) L times using only $s_{L,i}$, with all other level-L coefficients set to zero. The signal S can then be recomputed as

$$
S = \sum_{i=0}^{2^{L}-1} p_{L,i} \tag{3}
$$

It should be noted that these frequency contributions, in which the analyte information resides, do not overlap with each other. This is why the raw signal can be analyzed component by component with WPT.
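The single-band reconstruction and the additivity in (3) can be checked numerically. The sketch below again assumes Haar filters purely for illustration (not the dmey wavelet used in the experiments), and the function names `split`, `merge`, `leaves` and `component` are hypothetical. Each component is obtained by zeroing every deepest-level node except one and applying the inverse step L times.

```python
import math

R = math.sqrt(2.0)

def split(s):
    """Forward step of (1) with Haar filters: return (H s, G s)."""
    half = len(s) // 2
    low = [(s[2 * t] + s[2 * t + 1]) / R for t in range(half)]
    high = [(s[2 * t] - s[2 * t + 1]) / R for t in range(half)]
    return low, high

def merge(low, high):
    """Inverse step of (2): H* low + G* high, interleaved back to full length."""
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / R, (a - d) / R])
    return out

def leaves(signal, levels):
    """Deepest-level nodes s_{L,0} ... s_{L,2^L - 1} in natural tree order."""
    nodes = [list(signal)]
    for _ in range(levels):
        nodes = [child for node in nodes for child in split(node)]
    return nodes

def component(leaf_nodes, i):
    """p_{L,i}: reconstruct the signal-domain part carried by leaf i alone."""
    nodes = [node if k == i else [0.0] * len(node)
             for k, node in enumerate(leaf_nodes)]
    while len(nodes) > 1:
        nodes = [merge(nodes[k], nodes[k + 1]) for k in range(0, len(nodes), 2)]
    return nodes[0]

# Toy signal (illustrative numbers, not data from the paper).
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
leaf_nodes = leaves(signal, 3)
parts = [component(leaf_nodes, i) for i in range(len(leaf_nodes))]
total = [sum(p[t] for p in parts) for t in range(len(signal))]
max_error = max(abs(a - b) for a, b in zip(total, signal))
print(max_error < 1e-9)  # the components add back up to the signal, as in (3)
```

Every component has the same length as the original signal, so it can be subtracted from the raw spectrum directly.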

Figure 1. Diagram of wavelet packet decomposition

B. Information Entropy Contained in Spectra

The information entropy contained in spectra is an important parameter for evaluating the uncertainty with respect to the analyte to be determined. Let S be an m×n spectral matrix with n wavelengths in m samples, and Y be an m×p analyte concentration matrix with p calibration properties in m samples. According to [10], the information entropy contained in S can be computed as follows:

Step1. Build a linear regression model between S and Y as

$$
S = Y \cdot g + \varepsilon \tag{4}
$$

where g is the regression matrix and ε is the residual error matrix. Using the least squares method, the regression matrix g can be computed as

$$
g = (Y^{T} Y)^{-1} Y^{T} S \tag{5}
$$

Then the information matrix can be obtained as

$$
IM = g^{T} (g\, g^{T})^{-1} g \tag{6}
$$

Step2. Calculate the information entropy corresponding to Y as

$$
I = -\sum_{k=1}^{n} q_{i,k} \cdot \log(q_{i,k}) \tag{7}
$$

where $q_{i,k}$ is the kth diagonal element of matrix IM.
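A minimal sketch of the two steps above for a single analyte (p = 1) follows. It rests on several assumptions that are not stated in the text: the information matrix is taken as the n×n projection $g^{T}(g\,g^{T})^{-1}g$, whose kth diagonal element reduces to $g_k^2 / \sum_j g_j^2$ when p = 1; the logarithm is the natural logarithm; and terms with $q_k = 0$ are skipped. The function name `information_entropy` is hypothetical.

```python
import math

def information_entropy(S, y):
    """Entropy of (7) for an m x n spectral matrix S and m analyte values y."""
    m, n = len(S), len(S[0])
    yy = sum(v * v for v in y)  # Y^T Y is a scalar when p = 1
    # Regression vector g of (5): one coefficient per wavelength.
    g = [sum(y[r] * S[r][k] for r in range(m)) / yy for k in range(n)]
    gg = sum(v * v for v in g)  # g g^T is a scalar when p = 1
    if gg == 0.0:
        return 0.0  # assumption: no regression signal means zero entropy
    q = [v * v / gg for v in g]  # assumed diagonal of the information matrix
    return -sum(v * math.log(v) for v in q if v > 0.0)

# Toy examples (illustrative numbers, not data from the paper).
y = [1.0, 2.0]
flat = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]    # all wavelengths respond equally
single = [[1.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]  # only one wavelength responds
entropy_flat = information_entropy(flat, y)
entropy_single = information_entropy(single, y)
print(entropy_flat, entropy_single)
```

Under these assumptions the entropy is largest when the regression weight is spread evenly over all n wavelengths (it then equals log n) and is zero when a single wavelength carries all of it.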

C. EWPCS Algorithm

The EWPCS algorithm is proposed here to select the wavelet packet components containing the analyte information. It is developed by combining the WPT algorithm with information entropy theory. Because the analyte information resides in non-overlapping wavelet packet components, the EWPCS algorithm only needs to investigate each component's contribution to the entropy of the whole spectra. The algorithm therefore focuses on the wavelet packet components that have a significant influence on the entropy of the collected spectra. If a wavelet packet component greatly changes the entropy of the whole spectra, it must contain either much analyte information or much information irrelevant to the analyte. Conversely, if a wavelet packet component has little influence on the entropy of the whole spectra, it is probably composed of random noise.

With these criteria, the influence of each frequency component can be determined easily, and the problem of wavelet packet component selection becomes simple and feasible. The EWPCS algorithm can be summarized by the following steps:

Step1. According to (7), calculate the information entropy of the entire spectra $X_{raw}$ and denote it as $I_{raw}$.

Step2. Perform WPT decomposition on $X_{raw}$ to the maximum level $L_{max}$, and obtain the $2^{L_{max}}$ frequency components $\{x_{L_{max},i}\}$ according to (3).

Step3. For the current k (starting from k = 0), construct the new spectral matrix $g_k$ as

$$
g_k = X_{raw} - x_{L_{max},k} \tag{8}
$$

Then calculate the information entropy of $g_k$ as $I_k$ according to (7), and obtain the change in information entropy ($\Delta I_k$) as

$$
\Delta I_k = I_{raw} - I_k \tag{9}
$$

Step4. Compare $\Delta I_k$ with a threshold value δ. If $\Delta I_k$ is smaller than δ, the corresponding wavelet packet component $x_{L_{max},k}$ should be eliminated. Otherwise, $x_{L_{max},k}$ should be selected for building the regression model.

Step5. Increase k by one, and repeat Step3 and Step4 until k is larger than $2^{L_{max}}-1$. Then obtain the corrected spectra used as the model input as

$$
X_{sel} = X_{raw} - \sum_{i} x_{L_{max},i}, \quad \text{if } \Delta I_i < \delta \tag{10}
$$

To correct the prediction set, $X_{pre}$ should likewise be decomposed to $L_{max}$ levels, resulting in $2^{L_{max}}$ frequency components. These wavelet packet components should then be selected according to (10).
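The selection loop in Steps 1 to 5 can be sketched as follows. This is an illustration only, under stated assumptions: the components are passed in as ready-made m×n matrices that sum to the raw spectra (in practice they would come from the WPT reconstruction), a single analyte is used, and the entropy helper assumes the information matrix $g^{T}(g\,g^{T})^{-1}g$ with a natural logarithm. The names `entropy`, `subtract` and `ewpcs_select` are hypothetical.

```python
import math

def entropy(S, y):
    """Information entropy of (7) for one analyte (assumed form of IM)."""
    m, n = len(S), len(S[0])
    yy = sum(v * v for v in y)
    g = [sum(y[r] * S[r][k] for r in range(m)) / yy for k in range(n)]
    gg = sum(v * v for v in g)
    if gg == 0.0:
        return 0.0  # assumption: no regression signal means zero entropy
    return -sum((v * v / gg) * math.log(v * v / gg) for v in g if v != 0.0)

def subtract(A, B):
    """Element-wise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def ewpcs_select(X_raw, components, y, delta):
    """Return (X_sel, eliminated indices, list of Delta I_k)."""
    I_raw = entropy(X_raw, y)                     # Step 1
    X_sel = [row[:] for row in X_raw]
    eliminated, delta_I = [], []
    for k, comp in enumerate(components):         # Steps 3 to 5
        g_k = subtract(X_raw, comp)               # eq. (8)
        dI = I_raw - entropy(g_k, y)              # eq. (9)
        delta_I.append(dI)
        if dI < delta:                            # Step 4: eliminate
            eliminated.append(k)
            X_sel = subtract(X_sel, comp)         # eq. (10)
    return X_sel, eliminated, delta_I

# Toy data (illustrative, not from the paper): two samples, four wavelengths.
y = [1.0, 2.0]
analyte = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]     # scales with y
background = [[0.0, 0.0, 0.0, 5.0], [0.0, 0.0, 0.0, 5.0]]  # constant offset
X_raw = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(analyte, background)]
X_sel, eliminated, delta_I = ewpcs_select(X_raw, [analyte, background], y, 0.0)
print(eliminated)  # [1]: only the background component is removed
```

In this toy case, removing the background makes the regression weights more uniform, so its entropy change is negative and it is eliminated at δ = 0, while the analyte component has a positive entropy change and is kept.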

III. EXPERIMENTS

The NIR spectral data were obtained from Cargill Inc. and can also be downloaded from http://www.eigenvector.com/Data/Corn/index.html. At Cargill, 80 corn samples were measured on the m5, mp5 and mp6 instruments. The wavelength ranges from 1100 nm to 2498 nm at 2 nm resolution (Fig. 2). The objective of this analysis is to predict the oil content of the samples from the set of spectra collected by the mp5 instrument. The oil contents of these samples range from 3.088% to 3.832%. The EWPCS algorithm was applied to the collected NIR spectra for oil content analysis. For this study, the corn samples were split into a calibration set and a prediction set, each including 40 samples.

All computation and pretreatment were performed in Matlab v2007a (MathWorks, Natick, MA, USA) using the PLS Toolbox v4.2 (Eigenvector Technology) and Wavelet Toolbox v3.0. A leave-one-out cross-validation procedure was applied to the calibration set. The performance of the multivariate models was evaluated by the squared correlation coefficient ($R^2$), the root mean square error of calibration (RMSEC) and the root mean square error of prediction (RMSEP).
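The evaluation metrics named above can be written down compactly. The sketch below is a plain-Python illustration, not the PLS Toolbox implementation: it assumes the root mean square error divides by the number of samples (some packages subtract degrees of freedom for RMSEC), and that $R^2$ is the squared Pearson correlation between reference and predicted values. The function names and the toy values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error; RMSEC on calibration data, RMSEP on prediction data."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)

def squared_correlation(y_true, y_pred):
    """Squared Pearson correlation coefficient between reference and prediction."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mean_t) ** 2 for a in y_true)
    var_p = sum((b - mean_p) ** 2 for b in y_pred)
    return cov * cov / (var_t * var_p)

# Toy oil contents in percent (illustrative values only).
reference = [3.1, 3.4, 3.8]
predicted = [3.2, 3.4, 3.7]
error = rmse(reference, predicted)
r2 = squared_correlation(reference, predicted)
print(error, r2)
```

The same `rmse` function yields RMSEC or RMSEP depending only on whether calibration or prediction samples are passed in.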

Figure 2. The spectra of samples in calibration set

IV. RESULTS AND DISCUSSION

A. Profiles of Information Entropy of Each Component

Figure 3. Distribution profiles of each frequency component.

Due to the multiscale property of spectra, each wavelet packet component plays a different role in the whole spectra, which suggests that each component's contribution to the entropy of the spectra should also differ. This can be demonstrated by performing WPT decomposition and information entropy computation on each wavelet packet component. Using the discrete Meyer (dmey) mother wavelet, the spectra of the samples in the calibration set were decomposed to 9 levels (the maximum level), yielding 512 frequency components ($x_{9,0}$ to $x_{9,511}$). By subtracting each frequency component from the whole spectra and computing the entropy, the contribution of each component to the entropy was obtained as a function of the index k, as shown in Fig. 3. Fig. 3 shows that the effect of each component on the whole spectra is different, and only the components with lower k have a significant influence on the entropy of the collected spectra. The reasons are threefold. First, the background of spectra usually varies slowly and is concentrated in the lower frequency region. The entropy change $\Delta I_0$ is much smaller than zero, indicating that the spectra may have a large background variation, which can greatly increase the uncertainty. Second, the analyte information is mainly contained in the low-frequency and mid-frequency regions. Therefore, the entropy change $\Delta I_2$ is larger than zero, meaning that $x_{9,2}$ contains some analyte information. Without $x_{9,2}$, the entropy of the whole spectra must increase. Third, the spectral noise mainly resides in the high-frequency region. Thus, $\Delta I_k$ for larger k is almost equal to zero, indicating that the high-frequency components may contain small noise and little analyte information.

B. The Determination of Threshold Value δ

Although Fig. 3 shows how the spectral entropy changes for the different components, it is difficult to determine from it which wavelet packet components should be selected. According to (10), a component $x_{9,k}$ whose $\Delta I_k$ is smaller than δ is eliminated. Thus, the threshold value δ is a key parameter with a strong effect on the performance of the EWPCS algorithm. Fig. 4 depicts the number of frequency components eliminated as a function of δ. The number of eliminated components increases with δ. When δ is larger than zero, the number of eliminated components almost reaches its maximum value. This is because the components whose $\Delta I$ is larger than zero probably contain analyte information, and the analyte information resides in only a few wavelet packet components. In addition, the number of eliminated components changes sharply around zero. This also indicates that most components contain only noise or little useful analyte information and make up the detail of the spectral variation.
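The relationship between δ and the number of eliminated components can be illustrated with a short count. The entropy changes below are toy values chosen to mimic the pattern described (one strongly negative, one clearly positive, the rest near zero); they are not the values behind Fig. 4, and `count_eliminated` is a hypothetical helper name.

```python
def count_eliminated(delta_I, delta):
    """Number of components whose entropy change is below the threshold delta."""
    return sum(1 for d in delta_I if d < delta)

# Toy entropy changes (illustrative only).
toy_delta_I = [-0.5, 0.3, 0.0001, -0.00002, 0.0, 0.00003]
thresholds = [-1.0, -0.0001, 0.0, 0.001, 1.0]
counts = [count_eliminated(toy_delta_I, d) for d in thresholds]
print(counts)  # [0, 1, 2, 5, 6]
```

The count never decreases as δ grows, and it jumps around zero because most of the toy entropy changes cluster there.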

Figure 4. The number of eliminated components varies with δ.

Figure 5. The average of RMSECs varies with δ.

In the EWPCS algorithm, the optimum δ is determined by comparing the performance of regression models built with different threshold values. Here, the partial least squares (PLS) model was adopted, and the performance was evaluated by the average of the RMSECs obtained with the first ten latent variables for the calibration set. Fig. 5 depicts the average RMSEC as a function of δ. The average RMSEC decreases slowly at first (δ < 0) and then increases sharply to a stable value once δ exceeds 0.0008. When δ ranges from -0.0000504 to -0.000038, the average RMSEC reaches its minimum value, meaning the minimum uncertainty in the selected components. Therefore, the sum of the wavelet packet components whose entropy change meets the condition $\Delta I_k \geq -0.0000504$ is taken as the input spectra (corrected spectra) of the regression model. The input spectra and the removed spectra are illustrated in Fig. 6 and Fig. 7, respectively. These two figures show that the removed spectra are mainly concentrated in the low-frequency region with smooth changes, while the selected spectra are concentrated in the middle- and high-frequency regions with sharp changes. After preprocessing, the input spectral data of the regression model differ greatly from the raw spectra in Fig. 2.
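The threshold search described above amounts to scanning candidate values of δ and keeping the one with the lowest model error. The sketch below shows only that scanning logic. The `score` callback is a hypothetical stand-in for "average RMSEC of a PLS model over the first ten latent variables", which is not implemented here, and the entropy changes and toy scoring rule are invented for illustration.

```python
def choose_threshold(delta_I, candidates, score):
    """Return (best delta, its score, kept component indices)."""
    best = None
    for delta in candidates:
        kept = [k for k, d in enumerate(delta_I) if d >= delta]  # selected components
        value = score(kept)
        if best is None or value < best[1]:
            best = (delta, value, kept)
    return best

# Toy entropy changes: component 0 acts as background, component 1 as analyte.
toy_delta_I = [-0.5, 0.3, 0.0001, -0.00002]

def toy_score(kept):
    """Hypothetical error: penalize keeping background or dropping useful parts."""
    value = 0.0
    if 0 in kept:
        value += 1.0          # background left in the spectra
    if 1 not in kept:
        value += 1.0          # analyte component removed
    value += 0.1 * sum(1 for k in (2, 3) if k not in kept)  # fine detail removed
    return value

best_delta, best_score, best_kept = choose_threshold(
    toy_delta_I, [-1.0, -0.0001, 0.001, 1.0], toy_score)
print(best_delta, best_kept)  # -0.0001 [1, 2, 3]
```

In a real application the callback would rebuild the corrected spectra from the kept components, fit the regression model, and return the averaged calibration error.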

Figure 6. The summation of selected wavelet packet components.

Figure 7. The summation of deselected wavelet packet components.

C. Prediction Results

After the useful frequency components were selected with the EWPCS algorithm, PLS prediction models with different preprocessing algorithms were developed for oil content measurement. Fig. 8 shows the RMSEP curves of the different models as a function of the number of latent variables. The RMSEP curve of the EWPCS algorithm descends sharply and then flattens, with small jitter, once the number of latent variables exceeds 9. When the number of latent variables is beyond 10, the RMSEP curve of EWPCS reaches its minimum (0.042%). Compared with the PLS model using raw spectra, the EWPCS-based PLS model improves the RMSEP for the validation set by up to 43.2%. In addition, the EWPCS-based model outperforms the other models by a considerable margin at nearly every number of latent variables, indicating that the proposed algorithm can effectively select the components representing the analyte information.

Figure 8. Prediction results using different preprocessing methods.

V. CONCLUSION

In this paper, a new method named EWPCS is proposed to select the appropriate wavelet packet components, instead of the raw spectra, as the input data of a regression model. With EWPCS, the raw spectra are decomposed by WPT to the maximum level. Then a set of new spectra is obtained as the difference between the raw spectra and each of the wavelet packet components. After the information entropy calculation, the importance of each component to the quantitative model becomes visible. Through optimization of the threshold value, the wavelet packet components containing the informative features of the spectra can be extracted. The EWPCS algorithm was successfully applied to the spectra of corn samples for oil content measurement. Experimental results show that the EWPCS algorithm can effectively improve the prediction ability of the regression models, showing that it is a promising preprocessing tool for multivariate calibration.

REFERENCES

[1] L.W. Liang, B. Wang, Y. Guo, H. Ni, Y.L. Ren, “A support vector machine-based analysis method with wavelet denoised near-infrared spectroscopy,” Vib. Spectrosc., vol. 49, pp. 274–277, March 2009.

[2] M. Blanco, M. Alcalá, J.M. González, E. Torras, “Near infrared spectroscopy in the study of polymorphic transformations,” Anal. Chim. Acta, vol. 567, pp. 262–268, May 2006.

[3] Y.K. Li, X.G. Shao, W.S. Cai, “A consensus least squares support vector regression for analysis of near-infrared spectra of plant samples,” Talanta, vol. 72, pp. 217–222, April 2007.

[4] D. Chen, X.G. Shao, B. Hu, Q.D. Su, “A background and noise elimination method for quantitative calibration of near infrared spectra,” Anal. Chim. Acta, vol. 511, pp. 37–45, May 2004.

[5] Y. Wang, B.R. Xiang, “Radial basis function network calibration model for near-infrared spectra in wavelet domain using a genetic algorithm,” Anal. Chim. Acta, vol. 602, pp. 55–65, October 2007.

[6] I.S. Helland, T. Naes, T. Isaksson, “Related versions of the multiplicative scatter correction method for preprocessing spectroscopic data,” Chemom. Intell. Lab. Syst., vol. 29, pp. 233–241, October 1995.

[7] P. Geladi, D.M.H. Macdougall, H. Martens, “Linearization and scatter-correction for near-infrared reflectance spectra of meat,” Appl. Spectrosc., vol. 39, pp. 491–500, March 1985.

[8] T. Fearn, “On orthogonal signal correction,” Chemom. Intell. Lab. Syst., vol. 50, pp. 47–52, January 2000.

[9] B. Jawerth, W. Sweldens, “An overview of wavelet based multiresolution analyses,” SIAM Review, vol. 39, pp. 377–412, October 1994.

[10] C. Eckschlager, K. Danzer, Information Theory in Analytical Chemistry. Wiley-Interscience: New York, 1994.
