


Sparse kernel density estimations and its application in variable selection based on quadratic Renyi entropy

Min Han, Zhiping Liang, Decai Li

Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning 116023, China

Article info

Article history:

Received 28 August 2010

Received in revised form 18 January 2011

Accepted 28 January 2011

Communicated by C. Fyfe

Available online 21 March 2011

Keywords:

Sparse kernel density estimation

Sparse Bayesian learning

Random iterative dictionary learning

Quadratic Renyi entropy


Abstract

A novel sparse kernel density estimation method is proposed based on sparse Bayesian learning with random iterative dictionary preprocessing. Using the empirical cumulative distribution function as the response vector, the sparse weights of the density estimate are obtained by sparse Bayesian learning. The proposed iterative dictionary learning algorithm is used to reduce the number of kernel computations, which is an essential step of the sparse Bayesian learning. With the sparse kernel density estimation, a quadratic Renyi entropy based normalized mutual information feature selection method is proposed. Simulations on three examples demonstrate that the proposed method is comparable to the typical Parzen kernel density estimator, and compared with other state-of-the-art sparse kernel density estimators it also performs very well in terms of the number of kernels required in the density estimation. In the last example, the Friedman data and the Housing data are used to show the properties of the proposed feature variable selection method.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Probability density estimation from samples of unknown distributions plays an important part in machine learning. Many pattern recognition problems, such as classification and clustering [1,2], are based on probability densities, and probability density estimates are also of great use in practical applications. For example, density estimation is an important step in the calculation of mutual information [3–5], which can be applied in many areas such as feature selection. Therefore, density estimation is of great significance both in theory and in practice.

Generally speaking, both parametric and nonparametric methods can be used to estimate densities, but they are quite different. A parametric method assumes that the samples are drawn from a certain distribution whose unknown parameters are to be estimated; the accuracy therefore depends heavily on prior knowledge of the distribution, and such an assumption is often difficult to justify in practice. On the contrary, nonparametric methods introduce no prior assumption about the underlying density, whose characteristics are learned only from the training samples [6]. Thus, nonparametric density estimation has attracted much attention, as it can be used to estimate density functions with arbitrary shapes.

The kernel density estimate (KDE), also referred to as the Parzen window (PW) estimator, is probably the most common and simple nonparametric method with high accuracy. The classical KDE can be written as a weighted average over the sample distribution function [7], assigning equal weights to all the kernels. It has been widely applied in many fields [8,9].

However, the number of kernels in the KDE equals the size of the training data, which makes it time-consuming when the data set is extremely large. Much research has been done to improve the computational efficiency, and sparse weight coefficients have been used in the literature when developing kernel density estimates. Weston et al. [10] extended the support vector technique for solving linear operator equations to the problem of density estimation, in which sparsity is induced by the nature of the SVM. Girolami and He [11] presented a reduced set density estimator that employs only a small percentage of the available data samples by minimizing the integrated squared error. Similarly, Chen et al. [12,13] proposed an even faster sparse kernel density estimate using an orthogonal forward regression that incrementally minimizes the leave-one-out test score. However, almost none of these density estimation methods control the modeling complexity well.

Tipping [14] proposed the relevance vector machine, which shows generalization performance comparable to the SVM but with a much sparser solution, and it has received much attention [15]. Our research is based on sparse Bayesian learning applied to the regression of the empirical cumulative distribution function. Because of the large computational cost of the learning algorithm, we introduce the dictionary learning algorithm proposed by Engel as a preprocessing step, which is essentially an approximate form of kernel PCA [16]. To eliminate the influence of the bandwidth, random bandwidths are used to obtain a stable dictionary. Consequently, the computational efficiency of density estimation with sparse Bayesian learning is improved, and numerical simulations show that our approach is comparable to state-of-the-art density estimates.

As an application of sparse kernel density estimation, a quadratic Renyi entropy based NMIFS is proposed to select feature variables. According to the simulation results, the proposed feature variable selection method performs very well, which also illustrates the properties of our sparse kernel density estimation method.

In this contribution, the random iterative dictionary learning algorithm is proposed based on dictionary learning. Together with sparse Bayesian learning, a sparse kernel density estimate is proposed. Based on this sparse kernel density, an estimate of the quadratic Renyi entropy is derived to improve the NMIFS.

This paper is organized as follows. Section 2 briefly introduces density estimation based on sparse Bayesian learning, and Section 3 presents density estimation with dictionary preprocessing. The quadratic Renyi entropy based NMIFS is proposed in Section 4. Numerical examples and conclusions are given in the last two sections.

2. Density estimation based on sparse Bayesian learning

The RVM is a general Bayesian learning framework for kernel methods that obtains state-of-the-art sparse solutions to regression and classification tasks [17]. In this section, we briefly show how probability density estimation can be written as a regression problem solved by sparse Bayesian learning, as proposed by Yin and Hao [18].

Consider a set of data points $X = \{x_1, x_2, \ldots, x_N\}$, $x_k \in \mathbb{R}^m$, $1 \le k \le N$, where $N$ is the number of data points drawn from the distribution $p(x)$, $x \in \mathbb{R}^m$. The aim of kernel density estimation is to estimate the unknown density $p(x)$ in the form

$$\hat p(x) = \sum_{k=1}^{N} \beta_k K(x, x_k) \qquad (1)$$

where $\beta_k$ is the kernel weight, with the constraint $\sum_{k=1}^{N} \beta_k = 1$.

$K(\cdot,\cdot)$ is the kernel function; Gaussian kernels or polynomial kernels are usually employed. In this paper, we choose the Gaussian kernel shown as follows:

$$K(x, x_k) = \frac{1}{(2\pi h^2)^{m/2}} \exp\!\left(-\frac{\|x - x_k\|^2}{2h^2}\right) \qquad (2)$$

where $h$ is the kernel bandwidth, which controls the width of data included in the kernel smoothing. Usually it is obtained by cross validation or by rules of thumb.
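For concreteness, the following is a minimal sketch (not the authors' code) of the classical equal-weight Parzen estimator of Eqs. (1)-(2) with a Gaussian kernel; the function names and the bandwidth value in the usage lines are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, centers, h):
    """K(x, x_k) of Eq. (2): isotropic Gaussian kernel with bandwidth h."""
    x = np.atleast_2d(x)                                             # (n, m)
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)    # squared distances
    m = centers.shape[1]
    return np.exp(-d2 / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (m / 2.0)

def parzen_kde(x, samples, h):
    """Equal-weight KDE: p(x) = (1/N) * sum_k K(x, x_k)."""
    return gaussian_kernel(x, samples, h).mean(axis=1)

# usage: 200 one-dimensional training points, evaluated on a grid
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 1))
grid = np.linspace(-4, 4, 400)[:, None]
density = parzen_kde(grid, train, h=0.3)
```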

According to the definition of a density function, a density $p(x)$ is defined as the solution of

$$F_p(x) = \int_{-\infty}^{x} p(u)\,du \qquad (3)$$

subject to $\int_{-\infty}^{\infty} p(u)\,du = 1$, where $F_p(x)$ is the cumulative distribution function, which can be estimated from the data set $X$. The empirical distribution function $\hat F(x)$ is thought to be a good approximation of the true distribution function and can be estimated by the following equation:

$$\hat F(x) = \frac{1}{N}\sum_{i=1}^{N} \prod_{j=1}^{m} \theta(x_j - x_{i,j}), \qquad \theta(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases} \qquad (4)$$

where the $m$ elements of $x$ are $x_1, x_2, \ldots, x_j, \ldots, x_m$, and $x_{i,j}$ denotes the $j$th element of $x_i$. Based on Eqs. (1) and (3), we can rewrite the equation as

$$F_p(x) = \int_{-\infty}^{x} p(u)\,du = \int_{-\infty}^{x} \sum_{k=1}^{N} \beta_k K(u, x_k)\,du = \sum_{k=1}^{N} \beta_k \int_{-\infty}^{x} K(u, x_k)\,du \approx \hat F(x) \qquad (5)$$

For the whole data set $X$, we can write the matrix form

$$\mathbf{F}_N = \Phi \boldsymbol{\beta} \qquad (6)$$

where $\mathbf{F}_N = [\hat F(x_1), \hat F(x_2), \ldots, \hat F(x_N)]^T$, $\Phi$ is the design matrix defined by the following equations, and $\boldsymbol{\beta}$ is the vector of kernel weights:

$$\Phi = \begin{bmatrix} \varphi(1,1) & \varphi(1,2) & \cdots & \varphi(1,N) \\ \varphi(2,1) & \varphi(2,2) & \cdots & \varphi(2,N) \\ \vdots & \vdots & & \vdots \\ \varphi(N,1) & \varphi(N,2) & \cdots & \varphi(N,N) \end{bmatrix} \qquad (7)$$

$$\varphi(i,j) = \int_{-\infty}^{x_i} K(u, x_j)\,du, \qquad i, j = 1, \ldots, N \qquad (8)$$
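A hedged sketch of how the design matrix of Eqs. (7)-(8) and the empirical CDF of Eq. (4) might be computed: for the Gaussian kernel of Eq. (2), the integral in Eq. (8) factorizes into a product of univariate standard normal CDFs, one per dimension. The function names are ours, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def design_matrix(eval_points, centers, h):
    """phi(i, j) = prod_d Phi((x_{i,d} - c_{j,d}) / h), with Phi the standard normal CDF."""
    diff = (eval_points[:, None, :] - centers[None, :, :]) / h   # (n, p, m)
    return norm.cdf(diff).prod(axis=2)                           # (n, p)

def empirical_cdf(eval_points, samples):
    """F(x) of Eq. (4): fraction of samples dominated component-wise by x."""
    below = (samples[None, :, :] < eval_points[:, None, :]).all(axis=2)
    return below.mean(axis=1)
```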

It is well known that density estimation is an ill-posed problem: a slight alteration of the distribution function can induce a large change in the shape of the underlying density. Many regularization techniques have been proposed to circumvent ill-posed problems. Here, we adopt the regularization by jittering proposed by Yin and Hao [6]. The target value $F_N^*(x_k)$ is defined as follows:

$$F_N^*(x_k) = \hat F(x_k) + \varepsilon(1+s)\{\hat F(x_k)[1-\hat F(x_k)]/N\}^{0.5} \qquad (9)$$

where $\varepsilon$ denotes a random number in $[-1, 1]$ and $s$ denotes a small positive number.

The weight vector $\boldsymbol{\beta}$ can then be estimated via the sparse Bayesian learning algorithm with the target vector $\mathbf{F}_N^*$ and the input matrix $\Phi$. In the learning algorithm most entries of $\boldsymbol{\beta}$ are driven to zero, so the number of relevance vectors is much smaller than $N$. The density $\hat f_{\mathrm{SBL\text{-}KDE}}(x)$ is then estimated as

$$\hat f_{\mathrm{SBL\text{-}KDE}}(x) = \sum_{j=1}^{n_{RV}} \beta_j^{RV} K(x, x_j^{RV}) \qquad (10)$$

where $x_j^{RV}$ is the $j$th relevance vector estimated by sparse Bayesian learning, $\beta_j^{RV}$ is the $j$th element of the kernel weight vector, and $n_{RV}$ is the number of relevance vectors. Since $n_{RV}$ is much smaller than $N$, the computational efficiency of the density estimation is improved.
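A minimal sketch of the regression step of Eqs. (6), (9) and (10), assuming scikit-learn's ARDRegression as a stand-in for the sparse Bayesian learner; this is not the authors' implementation, and the tolerance used to prune near-zero weights is an assumption.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

def jittered_targets(F, n, s=0.01, rng=None):
    """Eq. (9): F*(x_k) = F(x_k) + eps*(1+s)*sqrt(F(1-F)/N), eps uniform in [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.uniform(-1.0, 1.0, size=F.shape)
    return F + eps * (1.0 + s) * np.sqrt(F * (1.0 - F) / n)

def fit_sparse_weights(Phi, F_star, tol=1e-6):
    """Fit F* ~ Phi @ beta with an ARD prior and keep the kernels that survive."""
    ard = ARDRegression(fit_intercept=False)
    ard.fit(Phi, F_star)
    idx = np.flatnonzero(np.abs(ard.coef_) > tol)   # indices of relevance vectors
    return idx, ard.coef_[idx]
```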

3. Random iterative dictionary learning algorithms and sparse kernel density estimation

In Section 2, we briefly introduced density estimation based on sparse Bayesian learning. With the approach mentioned above, very sparse kernel density estimates can be obtained; however, the computational cost of the training step is very large because $N^2$ kernel computations are needed to build the input matrix $\Phi$. Therefore, we use a dictionary learning algorithm as a preprocessing step for the sparse kernel density estimation. In this section, we improve the dictionary learning algorithm.

Table 1
Dictionary learning algorithm.

Random iterative dictionary learning algorithm
Inputs: a data set X with N samples, sparsity parameter ν
Output: the dictionary D
Initialize: D = {x₁}
Dictionary computation:
for t = 1, 2, ..., N
  observe the sample x_t
  evaluate the ALD condition δ_t
  if δ_t > ν, add x_t to the dictionary: D = D ∪ {x_t}
  else the dictionary D is unchanged
end


Fig. 1. Number of dictionary samples with bandwidth h ranging from 0.5 to 2.


With the new dictionary learning algorithm as a preprocessing step, the number of kernel computations during the sparse Bayesian learning is reduced to $p^2$ ($p$ is much smaller than $N$), and the computational efficiency is improved effectively.

3.1. Random iterative dictionary learning algorithms

As discussed above, the density is a linear combination of kernels. By the properties of kernels in a Hilbert space, each kernel is a dot product in a so-called feature space [19]. More precisely, for any kernel $K$ there exists a mapping $\phi: X \to F$ such that

$$\forall x, y \in X, \quad K(x, y) = \langle \phi(x), \phi(y) \rangle \qquad (11)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. The sparse kernel density estimate can be written as

$$\hat f_{\mathrm{SBL\text{-}KDE}}(x) = \sum_{j=1}^{n_{RV}} \beta_j^{RV} K(x, x_j^{RV}) = \sum_{j=1}^{n_{RV}} \beta_j^{RV} \langle \phi(x), \phi(x_j^{RV}) \rangle \qquad (12)$$

If a point $x_t$ satisfies $\phi(x_t) = \sum_{i=1}^{t-1} a_i \phi(x_i)$ for some weights $a_i$, then in the training algorithm of sparse Bayesian learning this point is of little use for constructing the relevance vectors, while it still increases the computational cost of the training algorithm. Therefore, the training data can be divided into two cases. In the first case, the sample is approximately dependent on past samples; such a sample is considered only through its effect on the existing coefficients of the density [20]. A sample whose feature vector is not approximately dependent on past samples is admitted into a dictionary, and a corresponding coefficient is added to the density. In the following, we outline the dictionary learning algorithm proposed by Engel [16].

The purpose of the dictionary learning algorithm is to find a set $\{\tilde x_1, \tilde x_2, \ldots, \tilde x_p\}$ with $p$ ($p \ll N$) points such that

$$\phi(X) \subset \mathrm{Span}\big(\phi(\tilde x_1), \phi(\tilde x_2), \ldots, \phi(\tilde x_p)\big) \qquad (13)$$

Assume that at time step $t$, having observed $t-1$ training samples $\{x_i\}_{i=1}^{t-1}$, a dictionary $D_{t-1} = \{\tilde x_i\}_{i=1}^{m_{t-1}}$ ($m_{t-1}$ is the number of points in $D_{t-1}$ at time step $t-1$), which is a subset of the training samples, has been collected. A new sample $x_t$ from $X$ is added to the dictionary if $\phi(x_t)$ is linearly independent of $\{\phi(\tilde x_i)\}_{i=1}^{m_{t-1}}$; otherwise, the dictionary is not changed. To test this, the weights $a = (a_1, \ldots, a_{m_{t-1}})^T$ are computed so as to satisfy the approximate linear dependence (ALD) condition

$$\Big\| \sum_{j=1}^{m_{t-1}} a_j \phi(\tilde x_j) - \phi(x_t) \Big\|^2 \le \nu \qquad (14)$$

where $\nu$ is a positive threshold parameter that determines the level of accuracy of the approximation. Based on the kernel properties, the evaluation of the ALD condition $\delta_t$ is given by

$$\delta_t = k_{tt} - \tilde k_{t-1}(x_t)^T a \qquad (15)$$

where $k_{tt}$ and the vector $\tilde k_{t-1}$ are defined as

$$k_{tt} = K(x_t, x_t), \qquad (\tilde k_{t-1}(x))_i = K(\tilde x_i, x), \quad i = 1, \ldots, m_{t-1} \qquad (16)$$

The optimal vector of approximation coefficients $a$ can be solved analytically as

$$a = \tilde K_{t-1}^{-1} \tilde k_{t-1}(x_t) \qquad (17)$$

where the matrix $\tilde K_{t-1}$ is defined as

$$[\tilde K_{t-1}]_{i,j} = K(\tilde x_i, \tilde x_j), \quad i, j = 1, \ldots, m_{t-1} \qquad (18)$$

A recursive formula for $\tilde K_{t-1}^{-1}$ can be derived using the well-known partitioned matrix inversion formula; for more details on the recursive dictionary learning algorithm, please refer to Ref. [16]. The dictionary learning algorithm is summarized in Table 1.
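The following is a sketch of the ALD-based dictionary construction of Table 1 and Eqs. (14)-(18). For clarity it rebuilds and inverts the kernel matrix directly whenever a sample is admitted; an efficient version would use the partitioned-matrix recursion of Ref. [16]. Function names are ours.

```python
import numpy as np

def gauss_k(a, b, h):
    """Scalar Gaussian kernel value between two sample vectors."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * h ** 2))

def ald_dictionary(X, h, nu):
    """Collect the samples of X that violate the ALD condition of Eq. (14)."""
    D = [X[0]]                                             # initialize with the first sample
    K_inv = np.linalg.inv(np.array([[gauss_k(X[0], X[0], h)]]))
    for x in X[1:]:
        k_vec = np.array([gauss_k(d, x, h) for d in D])    # k~_{t-1}(x_t), Eq. (16)
        a = K_inv @ k_vec                                  # Eq. (17)
        delta = gauss_k(x, x, h) - k_vec @ a               # Eq. (15)
        if delta > nu:                                     # ALD violated: admit x_t
            D.append(x)
            K = np.array([[gauss_k(u, v, h) for v in D] for u in D])
            K_inv = np.linalg.inv(K)
    return np.array(D)
```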

With the recursive dictionary learning algorithm, one obtains a dictionary of samples satisfying the approximate linear dependence (ALD) condition. Note that the bandwidth $h$ of the kernel has to be chosen; as an essential step of the dictionary learning, the bandwidth $h$ must be set manually.

In order to analyze the influence of the bandwidth, we run the dictionary learning algorithm with different bandwidths. For this experiment, we use 1000 training data points drawn from a two-dimensional Gaussian distribution with zero mean and unit variance. Fig. 1 shows the simulation results with the bandwidth $h$ ranging from 0.5 to 2.

As shown in Fig. 1, the size of the dictionary changes with the bandwidth $h$. Note that the size of the dictionary does not decrease strictly with the growth of the bandwidth, and a dictionary learned at one bandwidth may no longer satisfy the ALD condition when the bandwidth changes.

For a dictionary $D$ learned with bandwidth $h$, each sample point $x_t$ satisfies the ALD condition mentioned above:

$$\Big\| \sum_{j=1}^{m_{t-1}} a_j \phi(\tilde x_j) - \phi(x_t) \Big\|^2 \le \nu \qquad (19)$$

Therefore, we can write $\phi(x_t) = \sum_{j=1}^{m_{t-1}} a_j \phi(\tilde x_j) + \varepsilon$, where $\varepsilon$ is small.


Fig. 2. Results of the dictionary with random bandwidth: number of dictionary samples versus iteration.


Taking the inner product with the feature vector $\phi(x_t)$ on each side of this equation, we obtain

$$\langle \phi(x_t), \phi(x_t) \rangle = \sum_{j=1}^{m_{t-1}} a_j \langle \phi(\tilde x_j), \phi(x_t) \rangle + \varepsilon\,\phi(x_t) \qquad (20)$$

Substituting the kernel function for the inner product via the kernel trick, this can be written as follows, where $K_h(\cdot,\cdot)$ denotes the kernel $K(\cdot,\cdot)$ with bandwidth $h$:

$$K_h(x_t, x_t) = \sum_{j=1}^{m_{t-1}} a_j K_h(\tilde x_j, x_t) + \varepsilon\,\phi(x_t) \qquad (21)$$

From this equation we conclude that if the bandwidth $h$ changes, the two sides of the equation are no longer equal, so the ALD condition may no longer be satisfied, which makes the dictionary unstable. To obtain a stable dictionary that remains valid over a range of bandwidths, we propose to learn the dictionary with an iteratively randomized bandwidth. In our method, the bandwidth consists of a constant term $h_0$ and a random term:

$$h = h_0 + \mathrm{rand}(h) \qquad (22)$$

The random term $\mathrm{rand}(h)$ stands for a random value drawn from the uniform distribution on the interval $[0, 1]$. With this variable bandwidth, we iterate the dictionary learning algorithm until the maximum number of iterations is reached or the size of the dictionary no longer changes. The random iterative dictionary learning algorithm is described as follows:

(1) Initialize the sparsity parameter $\nu$, the constant bandwidth $h_0$, and the maximum number of iterations $l_{\max}$;
(2) Compute the dictionary $D$ with bandwidth $h_0$;
(3) Update the bandwidth $h$: $h = h_0 + \mathrm{rand}(h)$;
(4) Estimate the dictionary $D_h$ with bandwidth $h$;
(5) Update the dictionary $D = D_h \cup D$;
(6) Go to step (3) until the maximum number of iterations is reached or the dictionary $D$ is unchanged.
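A sketch of this outer loop, steps (1)-(6) and Eq. (22), reusing the ald_dictionary routine sketched earlier; merging by exact row identity and drawing the random term from the uniform distribution on [0, 1] are our assumptions, not prescriptions from the paper.

```python
import numpy as np

def random_iterative_dictionary(X, h0, nu, l_max=30, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    D = ald_dictionary(X, h0, nu)                          # step (2): dictionary at h0
    for _ in range(l_max):
        h = h0 + rng.uniform(0.0, 1.0)                     # step (3): h = h0 + rand(h)
        D_h = ald_dictionary(X, h, nu)                     # step (4)
        merged = np.unique(np.vstack([D, D_h]), axis=0)    # step (5): D = D_h U D
        if merged.shape[0] == D.shape[0]:                  # step (6): stop if unchanged
            return merged
        D = merged
    return D
```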

Consequently, a stable dictionary, valid over a range of bandwidths, is obtained. Suppose $D$ with $p$ samples is the final learned dictionary. For any data point $x_t$ in the training samples, there must be a subset of $D$ satisfying

$$\Big\| \sum_{j=1}^{q} a_j \phi(\tilde x_j) - \phi(x_t) \Big\|^2 \le \nu \qquad (23)$$

where $q$ is smaller than $p$. For the other samples of the dictionary $D$ the weights can be set to zero; hence, with all the samples of the dictionary $D$, the ALD condition is still satisfied.

To test the random iterative dictionary learning algorithm, we use a two-dimensional data set with 1000 samples drawn from the normal distribution. The iterative result is shown in Fig. 2: after about 20 iterations the size of the dictionary has converged to 35, and the dictionary is stable over a range of bandwidths.

3.2. Sparse kernel density estimation based on dictionary preprocessing

With the random iterative dictionary learning algorithm, one obtains a dictionary with $p$ samples, which is much smaller than the size of the training data, and the computational cost of $\Phi$ is reduced to $p^2$ kernel computations. To sum up, density estimation based on sparse Bayesian learning with dictionary preprocessing can be described as follows.

Sparse kernel density estimation based on sparse Bayesian learning with dictionary preprocessing:

(1) Learn the dictionary $D$ with the bandwidth $h$ and maximum iteration $l_{\max}$ according to the random iterative dictionary learning algorithm;
(2) Compute the input matrix $\Phi$ based on the samples of the dictionary $D$;
(3) Calculate the empirical cumulative distribution function at each sample of the dictionary $D$ using the training samples $X$;
(4) With the regularization by jittering of Eq. (9), compute the target vector $\mathbf{F}_N^*$;
(5) Learn the relevance vectors and the weight vector $\boldsymbol{\beta}$ with the sparse Bayesian learning algorithm;
(6) Obtain the density estimate $\hat f_{\mathrm{SBL\text{-}KDE}}(x)$ from Eq. (10).
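Putting the pieces together, a hedged end-to-end sketch of steps (1)-(6), reusing the helper functions sketched in the previous sections (the names and default parameter values are ours, not the paper's):

```python
import numpy as np

def sbl_dict_kde(X, h0=0.4, nu=0.01, rng=None):
    D = random_iterative_dictionary(X, h0, nu, rng=rng)   # step (1)
    Phi = design_matrix(D, D, h0)                         # step (2), Eqs. (7)-(8)
    F = empirical_cdf(D, X)                               # step (3), Eq. (4)
    F_star = jittered_targets(F, len(X), rng=rng)         # step (4), Eq. (9)
    idx, w = fit_sparse_weights(Phi, F_star)              # step (5)
    centers = D[idx]
    def density(x):                                       # step (6), Eq. (10)
        return gaussian_kernel(np.atleast_2d(x), centers, h0) @ w
    return density, centers, w
```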

Note that the parameters $h$ and $\nu$ need to be selected before estimating the density. The smoothness depends heavily on the bandwidth, and an improper bandwidth may lead to undersmoothing in regions with sparse observations and oversmoothing in others. Formally, cross validation can be used to select the bandwidth; in this paper, we choose the bandwidth by experiment for simplicity.

The threshold parameter $\nu$ determines the number of samples in the dictionary. For the ALD condition in the random iterative dictionary learning algorithm, the threshold should be set to a small positive number; however, in high-dimensional spaces it is rather difficult to satisfy the ALD condition. Therefore, in our approach we choose a small value of $\nu$ for low-dimensional data sets and a larger value for high-dimensional ones.

4. Variable selection based on quadratic Renyi entropy

As an application of the sparse kernel density estimation, Renyi entropy based mutual information is used for feature variable selection.

Feature variable selection plays a very important part in machine learning problems. Usually we need to select the features that are most relevant to the predicted variables, with as few redundant variables as possible. The recently proposed normalized mutual information feature selection (NMIFS) method [21] and the minimal-redundancy-maximal-relevance (mRMR) method [22] have shown very good performance in feature variable selection. The normalized mutual information $NI(f_i; f_s)$ between $f_i$ and $f_s$ is defined as follows:

$$NI(f_i; f_s) = \frac{I(f_i; f_s)}{\min\{H(f_i), H(f_s)\}} \qquad (24)$$

where $f_i$ and $f_s$ are feature variables, $I$ denotes the mutual information, and $H(f_i)$ denotes the entropy of $f_i$. The feature variables are selected by maximizing the measure $G$:

$$G = I(y; f_i) - \frac{1}{|S|} \sum_{f_s \in S} NI(f_i; f_s) \qquad (25)$$

where $S$ is the set of selected features, $|S|$ is the cardinality of $S$, and $y$ is the output variable.

However, in [21] the mutual information is estimated by Fraser's algorithm, which may be of limited accuracy. In the research of Hild et al. [23], the quadratic Renyi entropy used in feature extraction has shown good performance. In our approach we use the proposed sparse kernel density estimation method to estimate the mutual information, and in order to make full use of the properties of the sparse kernel density estimate, the quadratic Renyi entropy rather than the Shannon entropy is used to estimate the mutual information.

The Renyi entropy [24] $H_r(x)$ is defined as

$$H_r(x) = \frac{1}{1-r} \log\left( \int f^r(x)\,dx \right) \qquad (26)$$

where $f(x)$ is the density of $x$, $0 < r < \infty$ and $r \ne 1$. In the limit $r \to 1$ it reduces to the Shannon entropy

$$H_1(x) = -\int f(x) \log f(x)\,dx \qquad (27)$$

If $r = 2$, we obtain the quadratic Renyi entropy

$$H_2(x) = -\log\left( \int f^2(x)\,dx \right) \qquad (28)$$

With the quadratic Renyi entropy we can compute the mutual information between $x$ and $y$ as

$$I_2(x; y) = H_2(x) + H_2(y) - H_2(x, y) \qquad (29)$$

Based on the sparse kernel density estimate with Gaussian kernels, the quadratic Renyi entropy can be estimated as follows:

$$H_2(x) = -\log \int_{-\infty}^{\infty} \left( \sum_{k=1}^{n_{RV}} \beta_k^{RV} K_h(x - x_k^{RV}) \right)^2 dx = -\log \int_{-\infty}^{\infty} \sum_{i=1}^{n_{RV}} \sum_{j=1}^{n_{RV}} \beta_i^{RV} \beta_j^{RV} K_h(x - x_i^{RV})\, K_h(x - x_j^{RV})\,dx = -\log \sum_{i=1}^{n_{RV}} \sum_{j=1}^{n_{RV}} \beta_i^{RV} \beta_j^{RV} \int_{-\infty}^{\infty} K_h(x - x_i^{RV})\, K_h(x - x_j^{RV})\,dx \qquad (30)$$

The computation of the integral in the above equation is difficult; for simplicity we derive only the one-dimensional case [25]. Define $c_h(x_i, x_j) = \int_a^b K_h(x - x_i) K_h(x - x_j)\,dx$. For Gaussian kernels, $c_h(x_i, x_j)$ becomes

$$c_h(x_i, x_j) = \frac{1}{4h\sqrt{\pi}}\, e^{-(x_i - x_j)^2 / 4h^2} \left[ \mathrm{erf}\!\left( \frac{2b - x_i - x_j}{2h} \right) - \mathrm{erf}\!\left( \frac{2a - x_i - x_j}{2h} \right) \right] \qquad (31)$$

where $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\,dt$. For $a \to -\infty$ and $b \to +\infty$, using $\mathrm{erf}(\pm\infty) = \pm 1$, we obtain

$$c_h(x_i, x_j) \approx \frac{1}{2h\sqrt{\pi}}\, e^{-(x_i - x_j)^2 / 4h^2} = K_{\sqrt{2}\,h}(x_i - x_j) \qquad (32)$$

Applying Eq. (32), $H_2$ can be estimated as follows:

$$H_2(x) = -\log \int_{-\infty}^{\infty} \left( \sum_{k=1}^{n_{RV}} \beta_k^{RV} K_h(x - x_k^{RV}) \right)^2 dx = -\log \sum_{i=1}^{n_{RV}} \sum_{j=1}^{n_{RV}} \beta_i^{RV} \beta_j^{RV} \int_{-\infty}^{\infty} K_h(x - x_i^{RV})\, K_h(x - x_j^{RV})\,dx = -\log\left( \sum_{i=1}^{n_{RV}} \sum_{j=1}^{n_{RV}} \beta_i^{RV} \beta_j^{RV} K_{\sqrt{2}\,h}(x_i^{RV} - x_j^{RV}) \right) \qquad (33)$$

According to this equation, the quadratic Renyi entropy can be estimated directly from the sparse kernels and their weights. Compared with the Parzen estimate, the computational efficiency is improved.
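A small sketch of Eq. (33) for the one-dimensional case: the quadratic Renyi entropy is computed directly from the relevance-vector weights and centres via the Gaussian identity of Eq. (32). The function name is ours.

```python
import numpy as np

def quadratic_renyi_entropy(weights, centers, h):
    """H_2 = -log sum_ij b_i b_j K_{sqrt(2) h}(x_i - x_j), for 1-D centres."""
    diff = centers[:, None] - centers[None, :]
    h2 = np.sqrt(2.0) * h                                  # effective bandwidth of Eq. (32)
    K = np.exp(-diff ** 2 / (2.0 * h2 ** 2)) / np.sqrt(2.0 * np.pi * h2 ** 2)
    return -np.log(weights @ K @ weights)

# The mutual information of Eq. (29) then follows as
#   I2 = H2(x) + H2(y) - H2(x, y),
# with the joint term estimated from a sparse KDE fitted on the (x, y) pairs.
```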

Consequently, the normalized quadratic Renyi entropy mutual information feature variable selection (NQRMIFS) method can be described as follows:

(1) Initialization: $F$ is the set of all feature variables to be selected, and $S$ is the empty set;
(2) Based on the sparse kernel density and the quadratic Renyi entropy, compute the mutual information $I_2(f_i; y)$;
(3) Select the first feature: $f_i = \arg\max_i \{I_2(f_i; y)\}$, and set $S \leftarrow \{f_i\}$;
(4) Select further features: compute $NI_2(f_i; f_s)$ for all pairs $(f_i, f_s)$ with $f_i \in F$ and $f_s \in S$, and select the feature variable that maximizes $G$;
(5) Output the set $S$ if $|S| = k$; otherwise go to step (4).

In this way, $k$ feature variables are selected by the NQRMIFS. Note that the number of feature variables $k$ is a parameter set manually. In the following, some examples are used to show the properties of the algorithm.
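A hedged sketch of the NQRMIFS greedy loop, steps (1)-(5) above. The callables mi2 and h2 are placeholders for the sparse-KDE-based estimates of $I_2$ (Eq. (29)) and $H_2$ (Eq. (33)); they are assumptions, not the paper's code.

```python
import numpy as np

def nqrmifs(features, y, k, mi2, h2):
    """features: (N, d) array, y: (N,) target, k: number of variables to select."""
    d = features.shape[1]
    remaining = list(range(d))
    rel = [mi2(features[:, i], y) for i in remaining]
    selected = [remaining.pop(int(np.argmax(rel)))]               # step (3)
    while len(selected) < k and remaining:                        # step (4)
        scores = []
        for i in remaining:
            fi = features[:, i]
            ni = [mi2(fi, features[:, s]) / min(h2(fi), h2(features[:, s]))
                  for s in selected]                              # NI of Eq. (24)
            scores.append(mi2(fi, y) - np.mean(ni))               # criterion G, Eq. (25)
        selected.append(remaining.pop(int(np.argmax(scores))))
    return selected                                               # step (5)
```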

5. Numerical simulations

Three examples are used in the simulations to test the proposed density estimation algorithm based on dictionary preprocessing. In the last example, the Friedman data and the Housing data are used to test the performance of the NQRMIFS. The results are compared with the Parzen window estimator, and comparisons with other sparse kernel density (SKD) methods are quoted directly from the existing literature. The $L_1$ test error is used to quantify the estimation results:

$$L_1 = \frac{1}{N_{\mathrm{test}}} \sum_{k=1}^{N_{\mathrm{test}}} \big| p(x_k) - \hat p(x_k) \big| \qquad (34)$$

where $p(x_k)$ and $\hat p(x_k)$ are the true and estimated values, respectively, and $N_{\mathrm{test}}$ is the number of test data points.

5.1. One-dimensional density estimation of the Gaussian mixture

This is a one-dimensional example, and the density to be estimated is a mixture of eight Gaussian distributions given by

$$p(x) = \frac{1}{8} \sum_{i=0}^{7} \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-(x - \mu_i)^2 / 2\sigma_i^2} \qquad (35)$$

with $\sigma_i = \sqrt{(2/3)^i}$, $\mu_i = 3\big[(2/3)^i - 1\big]$, $0 \le i \le 7$.

In this simulation, the number of data points for density estimation is $N = 200$, the experiment is repeated 200 times, and the number of test data points is $N_{\mathrm{test}} = 10{,}000$. Fig. 3a shows a typical result of the density estimation based on dictionary preprocessing; for comparison, the density estimation based on the Parzen kernel is shown in Fig. 3b.
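For reproducibility, a small sketch (ours, not the authors') for drawing samples from the eight-Gaussian mixture of Eq. (35) and evaluating its true density:

```python
import numpy as np

i = np.arange(8)
sigmas = np.sqrt((2.0 / 3.0) ** i)          # sigma_i of Eq. (35)
mus = 3.0 * ((2.0 / 3.0) ** i - 1.0)        # mu_i of Eq. (35)

def true_density(x):
    """Equal-weight mixture density p(x) of Eq. (35)."""
    x = np.asarray(x)[..., None]
    comps = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) / (np.sqrt(2 * np.pi) * sigmas)
    return comps.mean(axis=-1)

def sample(n, rng=None):
    """Draw n points by picking a component uniformly and sampling from it."""
    rng = np.random.default_rng() if rng is None else rng
    comp = rng.integers(0, 8, size=n)
    return rng.normal(mus[comp], sigmas[comp])[:, None]
```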


Fig. 3. (a) True density of the one-dimensional 8-Gaussian mixture and the estimate based on sparse Bayesian learning with dictionary preprocessing, bandwidth h = 0.40 and threshold parameter ν = 0.01. (b) True density of the one-dimensional 8-Gaussian mixture and the estimate based on the Parzen kernel, bandwidth h = 0.17.

Table 2
Simulation results for the mixture of 8 Gaussian distributions compared with the Parzen estimator and other sparse kernel density estimation methods.

Method | L1 test error | Kernel number
Parzen KDE | (2.2299 ± 0.4891) × 10⁻² | 200 ± 0
SVM KDE [10] | (3.2773 ± 1.4752) × 10⁻² | 10.3 ± 1.4
RSDE [11] | (5.8163 ± 0.8358) × 10⁻² | 14.0 ± 4.3
SKD of [13] | (4.1886 ± 1.3457) × 10⁻² | 10.2 ± 1.7
Proposed method | (2.8275 ± 0.6863) × 10⁻² | 6.4 ± 0.9


Fig. 4. (a) True density of the one-dimensional Gaussian and Laplacian mixture and the estimate based on sparse Bayesian learning with dictionary preprocessing, bandwidth h = 1.17 and threshold parameter ν = 0.01. (b) True density of the one-dimensional Gaussian and Laplacian mixture and the estimate based on the Parzen kernel, bandwidth h = 1.1.


According to the simulation, the proposed density estimation method performs very well, and the result is comparable with the typical Parzen kernel density estimate. The data selected by the dictionary preprocessing are shown in Fig. 3a; about 10% of the training data were used for the density estimation, so a very sparse kernel density estimate is obtained. The simulation is compared with the kernel density estimate using support vector machines [10], the reduced set density estimator (RSDE) of Ref. [11], and the sparse kernel density (SKD) estimate of Ref. [13]. Table 2 shows the $L_1$ test error of each method.

As shown in Table 2, our method is the most accurate of the four methods in terms of the $L_1$ test error. As to the number of kernels required in the estimation, about 30% fewer kernels are used compared with the SKD of Ref. [13], so our method is also the sparsest one. With the dictionary preprocessing, the most significant training data points are selected, and together with the sparse Bayesian learning algorithm a favorable density estimation method is obtained.


5.2. One-dimensional density estimation of the Gaussian and Laplacian mixture

This one-dimensional density estimation concerns a mixture of a Gaussian and a Laplacian distribution. The density to be estimated in this example is given by

$$p(x) = \frac{1}{2\sqrt{2\pi}}\, e^{-(x-2)^2/2} + \frac{0.7}{4}\, e^{-0.7|x+2|} \qquad (36)$$

In this example, the number of data points for the estimation is $N = 100$. The experiment is repeated 200 times and the number of test data points is again $N_{\mathrm{test}} = 10{,}000$. Fig. 4a shows the estimation result of sparse Bayesian learning based on dictionary preprocessing; for comparison, the Parzen estimation result is shown in Fig. 4b.

In this simulation our approach also performs very well: the density is estimated accurately by our method, and the result is comparable with the typical Parzen kernel method. To compare with other existing methods, the $L_1$ test error and the number of kernels required are listed in Table 3.

With a small number of training data points, all the sparse kernel density estimation methods give very sparse solutions, and our approach based on sparse Bayesian learning with dictionary preprocessing gives a more accurate result than the other sparse kernel density estimates. Therefore, we conclude that our method compares favorably with these kernel density estimation methods.

5.3. Two-dimensional density estimation

For this two-dimensional example, the true density to be estimated is a mixture of five Gaussian distributions:

$$p(x, y) = \sum_{i=1}^{5} \frac{1}{10\pi}\, e^{-(x - \mu_{i,1})^2/2}\, e^{-(y - \mu_{i,2})^2/2} \qquad (37)$$

Table 3
Simulation results for the Gaussian and Laplacian mixture compared with the Parzen estimator and other sparse kernel density estimation methods.

Method | L1 test error | Kernel number
Parzen KDE | (1.2475 ± 0.3029) × 10⁻² | 100 ± 0
SKD of [12] | (2.1785 ± 0.7468) × 10⁻² | 4.8 ± 0.9
SKD of [13] | (1.9436 ± 0.6208) × 10⁻² | 5.1 ± 1.3
Proposed method | (1.4496 ± 0.4561) × 10⁻² | 4.7 ± 0.9

Fig. 5. True density and contour plot for the two-dimensional example of five Gaussian mixtures.

The means of the five components are [0.0, −4.0], [0.0, −2.0], [0.0, 0.0], [−2.0, 0.0] and [−4.0, 0.0]. Fig. 5 shows the true distribution and its contour plot. The number of data points for estimation is 500, and the experiment is repeated 100 times; the number of test data points is again $N_{\mathrm{test}} = 10{,}000$. A typical estimation result of our method is shown in Fig. 6, while Fig. 7 shows the estimation result of the Parzen kernel.

According to the figures, our approach approximates the true density very well. The $L_1$ test error is used to report the results in Table 4; for comparison, the results of the typical Parzen estimate and the sparse kernel density estimate under the same experimental conditions are quoted from Ref. [13]. It can be seen from Table 4 that for this example our approach is comparable to both density estimates. The simulation shows that, with the relevance vectors estimated by sparse Bayesian learning, the proposed method achieves better accuracy.

5.4. Experiment of the NQRMIFS

In this part, the Friedman data and the Housing data are used to show the properties of the NQRMIFS. In the Friedman data set, there are 10 independent predictor variables $X_1, X_2, \ldots, X_{10}$, each uniformly distributed over $[0, 1]$, and the response is given by

$$Y = 10\sin(\pi X_1 X_2) + 20(X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \varepsilon \qquad (38)$$

where $\varepsilon$ is $N(0, 1)$. Table 5 shows the simulation results with different feature selection methods.

According to the simulation results, the proposed NQRMIFS performs very well for different numbers of data points. In particular, the NQRMIFS selects 4 of the 5 correct feature variables even when the number of data points is only 50. Compared with the NMIFS, the NQRMIFS gives the best feature variable selection results.

In the Housing data, there are 13 attributes (denoted $X_1, X_2, \ldots, X_{13}$); the goal is to predict the value of houses in the suburbs of Boston (denoted $Y$). Table 6 shows the feature variable selection results of different selection methods. To compare the selections, a GRNN network is used to predict the value of $Y$. Of the 506 instances, 338 are learning examples and 169 are test examples. The $E_{\mathrm{RMSE}}$ defined below is used to compare the performance of each method; Table 6 shows the means and variances of the $E_{\mathrm{RMSE}}$ over 50 experiments.

$$E_{\mathrm{RMSE}} = \left( \frac{1}{N_{\mathrm{test}} - 1} \sum_{k=1}^{N_{\mathrm{test}}} \big[\hat y(k) - y(k)\big]^2 \right)^{1/2} \qquad (39)$$


Fig. 6. Proposed density estimate and contour plot for the two-dimensional example of five Gaussian mixtures; bandwidth h = 0.95, threshold parameter ν = 0.01.

Fig. 7. Parzen density estimation and contour plot for the two-dimensional example of five Gaussian mixtures, the bandwidth h was [0.5, 1.1].

Table 4
Simulation results for the two-dimensional Gaussian mixture compared with the Parzen estimator and the sparse kernel density estimation method.

Method | L1 test error | Kernel number
Parzen KDE | (2.3388 ± 0.2067) × 10⁻³ | 500 ± 0
SKD of [13] | (3.6100 ± 0.5025) × 10⁻³ | 13.2 ± 2.9
Proposed method | (2.1822 ± 0.7318) × 10⁻³ | 13.3 ± 1.9

Table 5
Simulation results for the Friedman data with different numbers of data points.

Number of data points | NMIFS | NQRMIFS
N = 50 | X4, X7, X2, X3, X9 | X1, X7, X4, X3, X5
N = 100 | X4, X8, X6, X1, X7 | X4, X5, X1, X9, X8
N = 200 | X4, X2, X10, X9, X1 | X2, X1, X4, X3, X6

Table 6
Simulation results for the Housing data with different methods.

Method | Selected variables | ERMSE
All features | X1, X2, ..., X13 | 16.5431 ± 1.1492
Ref. [26] | X6, X13, X1, X4 | 4.6194 ± 0.3867
mRMR | X7, X6, X13, X1 | 7.6635 ± 0.5819
NMIFS | X13, X6, X11, X7 | 6.4942 ± 0.5349
NQRMIFS | X13, X3, X1, X9 | 5.0356 ± 0.5529


where $\hat y(k)$ are the predicted values, $y(k)$ are the true values, and $N_{\mathrm{test}}$ is the number of test data points.

From Table 6, we see that different methods select different features. All the methods select feature $X_{13}$, which is very important for the predicted variable. According to the $E_{\mathrm{RMSE}}$ of the prediction results, the proposed NQRMIFS and the method of Ref. [26] show very good performance. It can be said that our proposed NQRMIFS is comparable to the mRMR and NMIFS methods.

6. Conclusions

In this research, a very sparse kernel density estimation method is proposed. By computing the empirical cumulative distribution function, kernel density estimation is turned into a regression problem, and the sparse Bayesian learning algorithm is adopted to construct the sparse kernel density estimate. In order to improve the computational efficiency, the random iterative dictionary learning algorithm is introduced as a preprocessing step, which reduces the number of kernel computations during the sparse Bayesian regression from $N^2$ to $p^2$ ($p$ is much smaller than $N$). The simulations demonstrate that the proposed method has accuracy comparable to the typical Parzen kernel density estimate while the number of kernels required in the estimation is much smaller than the number of training data points. The results show that the proposed method offers a viable alternative for SKD estimation. As an application of the sparse kernel density estimation method, the proposed NQRMIFS selects very useful variables in the Friedman data and the Housing data, which further demonstrates the properties of the sparse kernel density estimation method.

Acknowledgments

This research is supported by project 61074096 of the National Natural Science Foundation of China, project 2007AA04Z158 of the National High Technology Research and Development Program of China (863 Program), project 2006BAB14B05 of the National Key Technology R&D Program of China, and project 2006CB403405 of the National Basic Research Program of China (973 Program).

References

[1] Y.J. Oyang, S.C. Hwang, Y.Y. Ou, et al., Data classification with radial basis function networks based on a novel kernel density estimation algorithm, IEEE Transactions on Neural Networks 16 (1) (2005) 225–236.
[2] T.N. Tran, R. Wehrens, L.M.C. Buydens, KNN-kernel density-based clustering for high-dimensional multivariate data, Computational Statistics and Data Analysis 51 (2) (2006) 513–525.
[3] Huawen Liu, Jigui Sun, Lei Liu, Feature selection with dynamic mutual information, Pattern Recognition 42 (7) (2009) 1330–1339.
[4] N. Kwak, C.H. Choi, Input feature selection by mutual information based on Parzen window, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (12) (2002) 1667–1671.
[5] D. Huang, T.W.S. Chow, Effective feature selection scheme using mutual information, Neurocomputing 63 (2005) 325–343.
[6] Xun-fu Yin, Zhi-Feng Hao, Fast kernel distribution function estimation and fast kernel density estimation on sparse Bayesian learning and regularization, in: Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Kunming, 2008, pp. 1756–1761.
[7] Emanuel Parzen, On estimation of a probability density function and mode, The Annals of Mathematical Statistics 33 (3) (1962) 1065–1076.
[8] R. Boscolo, H. Pan, V.P. Roychowdhury, Independent component analysis based on nonparametric density estimation, IEEE Transactions on Neural Networks 15 (1) (2004) 55–65.
[9] Y.F. Xue, Y.J. Wang, J. Yang, Independent component analysis based on gradient equation and kernel density estimation, Neurocomputing 72 (7–9) (2009) 1597–1604.
[10] J. Weston, A. Gammermon, M. Stitson, V. Vapnik, V. Vovk, C. Watkins, Density estimation using support vector machines, Technical Report, 1998.
[11] Mark Girolami, Chao He, Probability density estimation from optimally condensed data samples, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (10) (2003) 1253–1264.
[12] S. Chen, X. Hong, C.J. Harris, Sparse kernel density construction using orthogonal forward regression with leave-one-out test score and local regularization, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics 34 (4) (2004) 1708–1717.
[13] S. Chen, X. Hong, C.J. Harris, An orthogonal forward regression technique for sparse kernel density estimation, Neurocomputing 71 (4–6) (2008) 931–943.
[14] M.E. Tipping, Sparse Bayesian learning and the relevance vector machine, Journal of Machine Learning Research 1 (3) (2001) 211–244.
[15] Dimitris G. Tzikas, Aristidis C. Likas, Nikolaos P. Galatsanos, Sparse Bayesian modeling with adaptive kernel learning, IEEE Transactions on Neural Networks 20 (6) (2009) 926–937.
[16] Y. Engel, Algorithms and representations for reinforcement learning, Ph.D. thesis, Hebrew University, 2005.
[17] Jin Yuan, Liefeng Bo, Kesheng Wang, Tao Yu, Adaptive spherical Gaussian kernel in sparse Bayesian learning framework for nonlinear regression, Expert Systems with Applications 36 (2) (2009) 3982–3989.
[18] Xunfu Yin, Zhifeng Hao, Regularized kernel density estimation algorithm based on sparse Bayesian regression, Journal of South China University of Technology (Natural Science Edition) 37 (5) (2009) 123–129.
[19] Matthieu Geist, Olivier Pietquin, Gabriel Fricout, A sparse nonlinear Bayesian online kernel regression, in: Proceedings of the Second IEEE International Conference on Advanced Engineering Computing and Applications in Sciences, vol. I, Valencia, Spain, 2008, pp. 199–204.
[20] Yaakov Engel, Shie Mannor, Ron Meir, The kernel recursive least-squares algorithm, IEEE Transactions on Signal Processing 52 (8) (2004) 2275–2285.
[21] Pablo A. Estevez, Michel Tesmer, Claudio A. Perez, Jacek M. Zurada, Normalized mutual information feature selection, IEEE Transactions on Neural Networks 20 (2) (2009) 189–201.
[22] Hanchuan Peng, Fuhui Long, Chris Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1226–1238.
[23] Kenneth E. Hild, Deniz Erdogmus, Kari Torkkola, et al., Feature extraction using information-theoretic learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (9) (2006) 1385–1392.
[24] T. Cover, J. Thomas, Elements of Information Theory, Wiley, New York, 2006.
[25] H. Shimazaki, S. Shinomoto, Kernel bandwidth optimization in spike rate estimation, Journal of Computational Neuroscience 29 (1–2) (2010) 171–182.
[26] D. Francois, F. Rossi, V. Wertz, M. Verleysen, Resampling methods for parameter-free and robust feature selection with mutual information, Neurocomputing 70 (7–9) (2007) 1276–1288.

Min Han received the B.S. and M.S. degrees from the Department of Electrical Engineering, Dalian University of Technology, Liaoning, China, in 1982 and 1993, respectively, and the M.S. and Ph.D. degrees from Kyushu University, Fukuoka, Japan, in 1996 and 1999, respectively. She is a Professor at the School of Electronic and Information Engineering, Dalian University of Technology. Her current research interests are neural networks and chaos and their applications to control and identification.

Zhiping Liang received the B.S. degree from the School of Electrical and Electronic Engineering, Shandong University of Technology, Shandong, China, in 2008. He is currently working towards the M.S. degree at the School of Control Science and Engineering, Dalian University of Technology, Liaoning, China. His current research interests include kernel density estimation and multivariable correlation analysis.

Decai Li received the B.S. degree from the School of Electronic and Information Engineering, Dalian University of Technology, Liaoning, China, in 2006. He is currently working towards the Ph.D. degree at the Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Liaoning, China. His current research interests include neural network models and machine learning methods.