Training extreme learning machine via regularized correntropy criterion

ISNN 2012

Training extreme learning machine via regularized correntropy criterion

Hong-Jie Xing · Xin-Mei Wang

Received: 28 April 2012 / Accepted: 11 September 2012 / Published online: 26 September 2012
© Springer-Verlag London Limited 2012
Neural Comput & Applic (2013) 23:1977–1986, DOI 10.1007/s00521-012-1184-y

H.-J. Xing (corresponding author)
Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Computer Science, Hebei University, Baoding 071002, China
e-mail: [email protected]

X.-M. Wang
College of Mathematics and Computer Science, Hebei University, Baoding 071002, China

Abstract  In this paper, a regularized correntropy criterion (RCC) for extreme learning machine (ELM) is proposed to deal with training sets that contain noises or outliers. In RCC, the Gaussian kernel function is utilized to substitute the Euclidean norm of the mean square error (MSE) criterion. Replacing MSE by RCC can enhance the anti-noise ability of ELM. Moreover, the optimal weights connecting the hidden and output layers together with the optimal bias terms can be promptly obtained by the half-quadratic (HQ) optimization technique in an iterative manner. Experimental results on four synthetic data sets and fourteen benchmark data sets demonstrate that the proposed method is superior to the traditional ELM and the regularized ELM, both trained by the MSE criterion.

Keywords  Extreme learning machine · Correntropy · Regularization term · Half-quadratic optimization

1 Introduction

Recently, Huang et al. [1] proposed a novel learning method for single-hidden layer feedforward networks (SLFNs), which is named extreme learning machine (ELM). Interestingly, the weights connecting the input and hidden layers together with the bias terms are all randomly initialized. Then, the weights connecting the hidden and output layers can be directly determined by the least squares method, which is based on the Moore–Penrose generalized inverse [2]. Therefore, the training speed of ELM is extremely fast, which is the main advantage of this method. ELM has been widely used in applications of face recognition [3], image processing [4, 5], text categorization [6], time series prediction [7], and nonlinear model identification [8].

In recent years, many variants of ELM have emerged. Huang et al. [9] provided a detailed survey. Besides the methods reviewed in [9], some new improved models of ELM have been proposed. Wang et al. [10] proposed an improved algorithm named EELM. EELM can ensure the full column rank of the hidden layer output matrix, which the traditional ELM sometimes cannot guarantee. By incorporating a forgetting mechanism, Zhao et al. [11] proposed a novel online sequential ELM named FOS-ELM, which demonstrates higher accuracy with less training time in comparison with the ensemble of online sequential ELMs. Savitha et al. [12] proposed a fast learning fully complex-valued ELM classifier for handling real-valued classification problems. In order to automatically generate networks, Zhang et al. [13] proposed an improved ELM with adaptive growth of hidden nodes (AG-ELM). Experimental results demonstrate that AG-ELM can obtain a more compact network architecture compared with the incremental ELM. Cao et al. [14] proposed an improved learning algorithm called self-adaptive evolutionary ELM (SaE-ELM), which performs better than the evolutionary ELM and possesses better generalization ability in comparison with its related methods.

However, there exist two limitations of ELM, which are listed below.


- When there are noises or outliers in the training set, ELM may yield poor generalization performance [15]. The reason lies in that the mean square error (MSE) criterion assumes Gaussian distributed errors. However, this assumption does not always hold in real-world applications.
- The original least squares solution used in ELM is sensitive to the presence of noises or outliers. These unusual samples may skew the results of the least squares analysis [16].

Recently, the definition and properties of correntropy were proposed by Santamaria et al. [17]. MSE is regarded as a global similarity measure, whereas correntropy is a local one [18]. Due to its flexibility, correntropy has been successfully utilized to design different cost functions. Jeong et al. [19] extended the minimum average correlation energy (MACE) filter to its corresponding nonlinear version by using correntropy. Moreover, they verified that the correntropy MACE is more robust against distortion and possesses better generalization and rejection abilities in comparison with the linear MACE. Yuan and Hu [20] proposed a robust feature extraction framework based on correntropy and first solved its optimization problem by the half-quadratic optimization technique [21]. He et al. [22–24] introduced a sparse correntropy framework for deriving robust sparse representations of face images for recognition. Liu et al. [18] utilized correntropy to construct the objective function for training a linear regression model. They demonstrated that the maximum correntropy criterion outperforms MSE and minimum error entropy on regression examples with noises. However, the coefficients of their linear regressor are updated by a gradient-based optimization method, which is time-consuming. To further enhance the performance of the method proposed by Liu et al. [18], He et al. [25] added an L1-norm regularization term into the maximum correntropy criterion. Moreover, they utilized the half-quadratic optimization technique and the feature-sign search algorithm to optimize the coefficients of their linear model.

In order to enhance the anti-noise ability of ELM, a regularized correntropy criterion (RCC) is proposed in this paper. The main contributions of the proposed method, that is, ELM-RCC, are summarized in the following three aspects.
- In ELM-RCC, the correntropy-based criterion is utilized to replace the MSE criterion of ELM, which makes ELM-RCC more robust against noises compared with ELM.
- To further improve the generalization ability of ELM-RCC, a regularization term is added into its objective function.
- The related propositions of the proposed method are given. Moreover, the algorithmic implementation and computational complexity of ELM-RCC are also provided.

The organization of the paper is as follows. The traditional ELM, the regularized ELM, and the basic concept of correntropy are briefly reviewed in Sect. 2. In Sect. 3, the proposed method, that is, ELM-RCC, together with its optimization method is described in detail. Experiments to validate the proposed method are conducted in Sect. 4. Finally, the conclusions are summarized in Sect. 5.

2 Preliminaries

2.1 Extreme learning machine

Extreme learning machine was first proposed by Huang et al. [1]. For ELM, the weights connecting the input and hidden layers together with the bias terms are initialized randomly, while the weights connecting the hidden and output layers are determined analytically. Therefore, the learning speed of this method is much faster than that of traditional gradient descent-based learning methods.

Given N arbitrary distinct samples {x_p, t_p}_{p=1}^N, where x_p ∈ R^{1×d} and t_p ∈ R^{1×m}, the output vector of a standard single-hidden layer feedforward network (SLFN) with N_H hidden units and activation function f(·) is mathematically modeled as

    y_p = \sum_{j=1}^{N_H} \beta_j f(w_j \cdot x_p + b_j), \quad p = 1, 2, \ldots, N,    (1)

where w_j is the weight vector connecting the jth hidden unit and all the input units, b_j denotes the bias term of the jth hidden unit, w_j · x_p denotes the inner product of w_j and x_p, β_j is the weight vector connecting the jth hidden unit and all the output units, and y_p denotes the output vector of the SLFN for the pth input vector x_p.

For ELM, the weights connecting the input and hidden units and the bias terms are randomly generated rather than tuned. By doing so, the nonlinear system can be converted into a linear system:

    Y = H\beta,    (2)

where

    H = \begin{pmatrix} f(w_1 \cdot x_1 + b_1) & \cdots & f(w_{N_H} \cdot x_1 + b_{N_H}) \\ \vdots & \ddots & \vdots \\ f(w_1 \cdot x_N + b_1) & \cdots & f(w_{N_H} \cdot x_N + b_{N_H}) \end{pmatrix},

    \beta = (\beta_1, \ldots, \beta_{N_H})^T, \quad Y = (y_1, \ldots, y_N)^T, \quad T = (t_1, \ldots, t_N)^T.

Moreover, H = (h_{pj})_{N \times N_H} is the output matrix of the hidden layer, β is the matrix of weights connecting the hidden and output layers, Y is the output matrix of the output layer, and T is the matrix of targets. Thus, the matrix of weights connecting the hidden and output layers can be determined by solving

    \min_{\beta} \| H\beta - T \|_F,    (3)

where ‖·‖_F denotes the Frobenius norm. The solution of (3) can be obtained as

    \hat{\beta} = H^{\dagger} T,    (4)

where H† is the Moore–Penrose generalized inverse. Different methods can be used to determine H†. When H^T H is nonsingular, the orthogonal projection method can be utilized to calculate H† [1]:

    H^{\dagger} = (H^T H)^{-1} H^T.    (5)

However, there are still some shortcomings that restrict the further development of ELM: the noises in training samples are not necessarily Gaussian distributed, and the solution of the original least squares problem (3) is sensitive to these noises.
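The following is a minimal sketch of the training procedure described above (not the authors' Matlab implementation): random input weights and biases drawn from [−1, 1], a sigmoid activation, and the Moore–Penrose pseudo-inverse for the output weights. The function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, n_hidden, seed=0):
    """Basic ELM (Sect. 2.1): random input weights and biases, least-squares output weights."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, d))  # weight vectors w_j, one row per hidden unit
    b = rng.uniform(-1.0, 1.0, size=n_hidden)       # bias terms b_j
    H = sigmoid(X @ W.T + b)                        # N x N_H hidden-layer output matrix, eq. (2)
    beta = np.linalg.pinv(H) @ T                    # eq. (4): beta = H^dagger T
    return W, b, beta

def predict_elm(X, W, b, beta):
    return sigmoid(X @ W.T + b) @ beta              # eq. (1) in matrix form
```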

2.2 Regularized ELM

In order to improve the stability and generalization ability of the traditional ELM, Huang et al. proposed the equality-constrained optimization-based ELM [26]. In that method, the solution (4) is re-expressed as

    \hat{\beta} = \left( H^T H + \frac{I}{C} \right)^{-1} H^T T,    (6)

where C is a constant and I is the identity matrix. Let λ = 1/C; then (6) can be rewritten as

    \hat{\beta} = (H^T H + \lambda I)^{-1} H^T T.    (7)

The solution (7) can be obtained by solving the following optimization problem:

    \min_{\beta} \left( \| H\beta - T \|_F^2 + \lambda \|\beta\|_F^2 \right),    (8)

where \|\beta\|_F^2 = \sum_{j=1}^{N_H} \|\beta_j\|_2^2 is regarded as the regularization term, with ‖·‖_2 denoting the L2-norm of the vector β_j. Moreover, λ in (8) denotes the regularization parameter, which plays an important role in controlling the trade-off between the training error and the model complexity. Appendix 1 provides a detailed mathematical deduction of the solution (7) from the optimization problem (8).
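As a sketch of (7), the ridge-type output weights can be computed directly from the hidden-layer output matrix; here `lam` plays the role of λ = 1/C, and the function name is illustrative rather than taken from the paper.

```python
import numpy as np

def relm_output_weights(H, T, lam):
    """Regularized ELM output weights, eq. (7): beta = (H^T H + lambda * I)^(-1) H^T T."""
    n_h = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(n_h), H.T @ T)
```

Solving the linear system with `np.linalg.solve` rather than forming the explicit inverse is the usual numerically safer route.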

2.3 Correntropy

From the information theoretic learning (ITL) [27] point of view, correntropy [18] is regarded as a generalized correlation function [17]. It is a similarity measure obtained by studying the interaction of the given feature vectors. Given two arbitrary random variables A and B, their correntropy is defined as

    V_{\sigma}(A, B) = E[\kappa_{\sigma}(A - B)],    (9)

where κ_σ(·) is a kernel function that satisfies Mercer's theorem [28] and E[·] denotes the mathematical expectation.

In this paper, we only consider the Gaussian kernel with a finite number of data samples {(a_i, b_i)}_{i=1}^M. Thus, correntropy can be estimated by

    \hat{V}_{M,\sigma}(A, B) = \frac{1}{M} \sum_{i=1}^{M} \kappa_{\sigma}(a_i - b_i),    (10)

where κ_σ(·) is given by \kappa_{\sigma}(a_i - b_i) \triangleq G(a_i - b_i) = \exp\left( -\frac{(a_i - b_i)^2}{2\sigma^2} \right). Therefore, (10) can be rewritten as

    \hat{V}_{M,\sigma}(A, B) = \frac{1}{M} \sum_{i=1}^{M} G(a_i - b_i).    (11)

The maximum of the correntropy function (9) is called the maximum correntropy criterion (MCC) [18]. Since correntropy is insensitive to noises or outliers, it is superior to MSE when there are impulsive noises in the training samples [18].
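A small sketch of the estimator (11) for scalar samples is given below; `sigma` stands for the kernel width σ, and the function name is illustrative.

```python
import numpy as np

def correntropy_estimate(a, b, sigma):
    """Sample estimator of eq. (11): mean Gaussian kernel of the errors a_i - b_i."""
    e = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.mean(np.exp(-e**2 / (2.0 * sigma**2)))
```

For example, `correntropy_estimate(t, y, sigma)` is large when most residuals are small and is barely affected by a few gross outliers, which is the "local" behaviour contrasted above with the global MSE measure.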

3 ELM based on regularized correntropy criterion

In this section, the regularized correntropy criterion for training ELM, namely ELM-RCC, is introduced. Moreover, its optimization method is also presented.

3.1 L2-norm-based RCC

Replacing the objective function (3) of ELM by maximization of the correntropy between the target and network output variables, we can get a new criterion as follows:

    J(\tilde{\beta}) = \max_{\tilde{\beta}} \sum_{p=1}^{N} G(t_p - y_p),    (12)

where t_p denotes the target vector for the pth input vector x_p and y_p denotes the output of the SLFN for x_p, which is given by

    y_p = \sum_{j=1}^{N_H} \beta_j f(w_j \cdot x_p + b_j) + \beta_0, \quad p = 1, 2, \ldots, N,    (13)

where β_j is the weight vector connecting the jth hidden unit and all the output units, while β_0 = (β_{01}, β_{02}, \ldots, β_{0m})^T is the bias term vector of the output units.

The matrix form of (13) can be expressed as

    Y = \tilde{H}\tilde{\beta},    (14)

where

    \tilde{H} = \begin{pmatrix} 1 & f(w_1 \cdot x_1 + b_1) & \cdots & f(w_{N_H} \cdot x_1 + b_{N_H}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & f(w_1 \cdot x_N + b_1) & \cdots & f(w_{N_H} \cdot x_N + b_{N_H}) \end{pmatrix}

and \tilde{\beta} = (\beta_0, \beta_1, \ldots, \beta_{N_H})^T. Moreover, (13) can be rewritten as

    y_p = \sum_{j=0}^{N_H} h_{pj} \tilde{\beta}_j,    (15)

where h_{pj} denotes the (p, j)th element of H for j = 1, 2, ..., N_H and h_{p0} = 1, while β̃_j = β_j for j = 1, 2, ..., N_H and β̃_0 = β_0.
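The augmented matrix H̃ of (14) differs from H only by a leading column of ones, so that h_{p0} = 1 absorbs the output bias β_0. A one-line sketch (illustrative name) is:

```python
import numpy as np

def augment_hidden_matrix(H):
    """Build H-tilde of eq. (14): prepend a column of ones so that h_p0 = 1 carries the bias beta_0."""
    return np.hstack([np.ones((H.shape[0], 1)), H])
```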

Furthermore, we augment (12) by adding an L2 regularization term, which yields

    J_{L2}(\tilde{\beta}) = \max_{\tilde{\beta}} \left\{ \sum_{p=1}^{N} G\left( t_p - \sum_{j=0}^{N_H} h_{pj}\tilde{\beta}_j \right) - \lambda \|\tilde{\beta}\|_F^2 \right\},    (16)

where λ is the regularization parameter. There are many approaches for solving the optimization problem (16) in the literature, for example, the half-quadratic optimization technique [20, 21], the expectation-maximization (EM) method [29], and gradient-based methods [18, 19]. In this paper, the half-quadratic optimization technique is utilized.

According to the theory of convex conjugated functions [21, 30], the following proposition [20] holds.

Proposition 1  For G(z) = \exp\left( -\frac{\|z\|_2^2}{2\sigma^2} \right), there exists a convex conjugate function φ such that

    G(z) = \sup_{\alpha} \left( \alpha \frac{\|z\|_2^2}{2\sigma^2} - \varphi(\alpha) \right).    (17)

Moreover, for a fixed z, the supremum is reached at α = −G(z) [20].
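The paper does not spell out φ explicitly; one standard choice from the ITL literature (an assumption here, not a statement of the authors) is φ(α) = −α ln(−α) + α for α < 0, and a short check recovers both (17) and the stated maximizer:

```latex
% Assumed conjugate: \varphi(\alpha) = -\alpha\ln(-\alpha) + \alpha,\ \alpha < 0; write t = \|z\|_2^2/(2\sigma^2).
\frac{\partial}{\partial\alpha}\bigl(\alpha t - \varphi(\alpha)\bigr) = t + \ln(-\alpha) = 0
  \;\Longrightarrow\; \alpha^{\ast} = -e^{-t} = -G(z),
\qquad
\alpha^{\ast} t - \varphi(\alpha^{\ast}) = -t e^{-t} + t e^{-t} + e^{-t} = e^{-t} = G(z).
```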

Hence, introducing (17) into the objective function of (16), the following augmented objective function can be obtained:

    \hat{J}_{L2}(\tilde{\beta}, \alpha) = \max_{\tilde{\beta}, \alpha} \left\{ \sum_{p=1}^{N} \left( \alpha_p \frac{\left\| t_p - \sum_{j=0}^{N_H} h_{pj}\tilde{\beta}_j \right\|_2^2}{2\sigma^2} - \varphi(\alpha_p) \right) - \lambda \|\tilde{\beta}\|_F^2 \right\},    (18)

where α = (α_1, α_2, ..., α_N) stores the auxiliary variables that appear in the half-quadratic optimization. Moreover, for a fixed β̃, the following equation holds:

    J_{L2}(\tilde{\beta}) = \hat{J}_{L2}(\tilde{\beta}, \alpha).    (19)

The local optimal solution of (18) can be calculated iteratively by

    \alpha_p^{s+1} = -G\left( t_p - \sum_{j=0}^{N_H} h_{pj}\tilde{\beta}_j^{\,s} \right)    (20)

and

    \tilde{\beta}^{s+1} = \arg\max_{\tilde{\beta}} \mathrm{Tr}\left[ (T - \tilde{H}\tilde{\beta})^T K (T - \tilde{H}\tilde{\beta}) - \lambda \tilde{\beta}^T \tilde{\beta} \right],    (21)

where s denotes the sth iteration and K is a diagonal matrix whose primary diagonal elements are K_{ii} = α_i^{s+1}. Through mathematical deduction, we obtain the solution of (21):

    \tilde{\beta}^{s+1} = (\tilde{H}^T K \tilde{H} - \lambda I)^{-1} \tilde{H}^T K T.    (22)

One can refer to Appendix 2 for the detailed deduction.

It should be mentioned here that the objective function Ĵ_{L2}(β̃, α) in (18) converges after a certain number of iterations, which is summarized as follows.

Proposition 2  The sequence {Ĵ_{L2}(β̃^s, α^s), s = 1, 2, ...} generated by the iterations (20) and (22) converges.

Proof  According to Proposition 1 and (21), we can find that Ĵ_{L2}(β̃^s, α^s) ≤ Ĵ_{L2}(β̃^{s+1}, α^s) ≤ Ĵ_{L2}(β̃^{s+1}, α^{s+1}). Therefore, the sequence {Ĵ_{L2}(β̃^s, α^s), s = 1, 2, ...} is non-decreasing. Moreover, it was shown in [18] that correntropy is bounded. Thus, J_{L2}(β̃) is bounded, and by (19), Ĵ_{L2}(β̃^s, α^s) is also bounded. Consequently, {Ĵ_{L2}(β̃^s, α^s), s = 1, 2, ...} converges. □

Furthermore, comparing (22) with (5), we have

Proposition 3  Let the elements of the bias term vector β_0 be zero. When λ = 0 and K = I, RCC is equivalent to the MSE criterion.

Proof  As suggested in Proposition 1, the optimal weight matrix β̃* of (18) can be obtained after a certain number of iterations. Moreover, if λ = 0 and K = I, then β̃* = (H̃^T K H̃ − λI)^{-1} H̃^T K T = (H̃^T H̃)^{-1} H̃^T T. Since all the elements of β_0 are assumed to be zero, we have β̃* = (H^T H)^{-1} H^T T, which is the same as (5). Therefore, the proposition holds. □

The whole procedure of training ELM with the RCC criterion is summarized in Algorithm 1. It should be mentioned here that E_s in Algorithm 1 is the correntropy in the sth iteration, which is calculated by

    E_s = \frac{1}{N} \sum_{p=1}^{N} G(t_p - y_p).    (23)

Algorithm 1  Training ELM using RCC
Input: input matrix X = (x_{p,i})_{N×d}, target matrix T = (t_{p,k})_{N×m}
Output: the optimal weight vectors β̃*_j, j = 1, ..., N_H; the optimal bias term vector β̃*_0
Initialization: number of hidden units N_H, regularization parameter λ, maximum number of iterations I_HQ, termination tolerance ε
Step 1: Randomly initialize the N_H weight vectors {w_j}_{j=1}^{N_H} connecting the input and hidden layers together with their corresponding bias terms {b_j}_{j=1}^{N_H}.
Step 2: Calculate the hidden layer output matrix H̃.
Step 3: Update the weight vectors β̃_j, j = 1, ..., N_H, and the bias term vector β̃_0:
repeat
  for s = 1, 2, ..., I_HQ do
    Update the elements of the auxiliary vector α = (α_1, ..., α_N)^T by (20).
    Update the bias term vector and the weight vectors β̃_j, j = 0, ..., N_H, by (22).
  end for
until |E_s − E_{s−1}| < ε
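Below is a compact sketch of Algorithm 1's inner loop under the sign conventions used above (α_p = −G(e_p), K = diag(α)); because the α_p are negative, the update (22) is equivalent to a weighted ridge regression with positive weights G(e_p). The names, the least-squares initialization, and the default arguments are illustrative, not the authors' code.

```python
import numpy as np

def gaussian_kernel(E, sigma):
    """Row-wise G(e_p) = exp(-||e_p||^2 / (2*sigma^2)) for an N x m residual matrix E."""
    return np.exp(-np.sum(E**2, axis=1) / (2.0 * sigma**2))

def train_elm_rcc(H_tilde, T, lam, sigma, max_iter=50, tol=1e-6):
    """Half-quadratic iterations of eqs. (20) and (22) with the stopping rule |E_s - E_{s-1}| < tol.

    H_tilde: N x (N_H + 1) augmented hidden output matrix; T: N x m target matrix.
    """
    n_cols = H_tilde.shape[1]
    beta = np.linalg.pinv(H_tilde) @ T                       # start from the unweighted LS solution
    E_prev = np.inf
    for _ in range(max_iter):
        alpha = -gaussian_kernel(T - H_tilde @ beta, sigma)  # eq. (20): auxiliary variables (negative)
        K = np.diag(alpha)
        beta = np.linalg.solve(H_tilde.T @ K @ H_tilde - lam * np.eye(n_cols),
                               H_tilde.T @ K @ T)            # eq. (22)
        E = np.mean(gaussian_kernel(T - H_tilde @ beta, sigma))  # eq. (23): correntropy of the fit
        if abs(E - E_prev) < tol:
            break
        E_prev = E
    return beta
```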

3.2 Computational complexity

According to Algorithm 1, Step 2 takes N(N_H + 1)[2(N_H + 1) + m] + (N_H + 1)^3 operations. For Step 3, the computational cost of the auxiliary vector α in each iteration is O(N). Moreover, the calculation of β̃ in each iteration takes N[2(N_H + 1)m + 2(N_H + 1)^2] + (N_H + 1)^3 operations according to (22). Therefore, the overall computational complexity of Step 3 is O(I_HQ[N(4 + 4N_H + mN_H + 2N_H^2)] + I_HQ(N_H + 1)^3), where I_HQ is the number of iterations of the half-quadratic optimization. Usually, N ≫ N_H, so the total computational complexity of Algorithm 1 is O(I_HQ N[(4 + m)N_H + 2N_H^2]).

In contrast, the computational cost of training both the traditional ELM and RELM by the MSE criterion is O(N(mN_H + 2N_H^2)) [31]. Moreover, if appropriate values of the width parameter σ and the regularization parameter λ are chosen, the value of I_HQ is usually less than 5. Therefore, RCC may not greatly increase the computational burden in comparison with MSE. In the experiments, we find that ELM trained with RCC is still suitable for tackling real-time regression and classification tasks.

4 Experimental results

In this section, the performance of ELM-RCC is compared with those of ELM and regularized ELM (RELM) on four synthetic data sets and fourteen benchmark data sets. For the three approaches, that is, ELM, RELM, and ELM-RCC, the weights connecting the input and hidden layers together with the bias terms are randomly generated from the interval [−1, 1]. The activation function used in the three methods is the sigmoid function f(z) = 1/(1 + e^{−z}). The error function for the regression problems is the root mean square error (RMSE), while the error function for the classification problems is the error rate. The input vectors of a given data set are scaled to mean zero and unit variance for both the regression and classification problems, while the output vectors are normalized to [0, 1] for the regression problems. In addition, all the code is written in Matlab 7.1.

4.1 Synthetic data sets

In this subsection, two synthetic regression data sets and two synthetic classification data sets are utilized to validate the proposed method. They are described as follows.

Sinc: This synthetic data set is generated by the function y = sinc(x) = sin(x)/x + ρ, where ρ is a Gaussian distributed noise. For each noise level, we generate a set of data points {(x_i, y_i)}_{i=1}^{100} with x_i drawn uniformly from [−6, 6].

Func: This artificial data set is generated by the function y(x_1, x_2) = x_1 exp(−(x_1^2 + x_2^2)) + ρ, where ρ is also a Gaussian distributed noise. For each noise level, 200 data points are randomly chosen from the evenly spaced 30 × 30 grid on [−2, 2].

Two-Moon: There are 200 two-dimensional data points in this synthetic data set. Let z ~ U(0, π). Then, the 100 positive samples in the upper moon are generated by x_1 = cos z, x_2 = sin z, while the remaining 100 negative samples in the lower moon are generated by x_1 = 1 − cos z, x_2 = 1/2 − sin z. To vary the noise level, different percentages of the data points in each class are randomly selected and their class labels are reversed, namely from +1 to −1 and from −1 to +1.

Ripley: Ripley's synthetic data set [32] is generated from mixtures of two Gaussian distributions. There are 250 two-dimensional samples in the training set and 1,000 samples in the test set. The manner of changing the noise level of the training samples is the same as that of Two-Moon.
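A small sketch that generates two of the synthetic sets described above is given below; the noise level N(0, 0.1) for Sinc and the 20% label flips per class for Two-Moon are assumed values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sinc: y = sin(x)/x + rho with x ~ U(-6, 6); note np.sinc(t) = sin(pi*t)/(pi*t)
x = rng.uniform(-6.0, 6.0, size=100)
y = np.sinc(x / np.pi) + rng.normal(0.0, 0.1, size=100)

# Two-Moon: z ~ U(0, pi); upper moon = positive class, lower moon = negative class
z_pos, z_neg = rng.uniform(0.0, np.pi, size=100), rng.uniform(0.0, np.pi, size=100)
upper = np.column_stack([np.cos(z_pos), np.sin(z_pos)])
lower = np.column_stack([1.0 - np.cos(z_neg), 0.5 - np.sin(z_neg)])
X = np.vstack([upper, lower])
labels = np.concatenate([np.ones(100), -np.ones(100)])

# reverse the labels of 20% of the points in each class to simulate label noise
flip = np.concatenate([rng.choice(100, 20, replace=False),
                       100 + rng.choice(100, 20, replace=False)])
labels[flip] *= -1
```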
The settings of the parameters of the three methods on the four synthetic data sets are summarized in Table 1.

Table 1  Settings of parameters for the three different approaches

            ELM    RELM            ELM-RCC
Data sets   NH     NH    λ         NH    σ    λ
Sinc        25     25    10^-7     25    2    10^-7
Func        35     35    10^-6     35    3    10^-6
Two-Moon    25     25    10^-3     25    1    10^-3
Ripley      25     25    10^-2     25    6    10^-2

NH: number of hidden units; σ: width parameter; λ: regularization parameter

Figure 1 demonstrates the results of the three methods on Sinc with two different noise levels. In Fig. 1a, the regression errors of ELM, RELM, and ELM-RCC are 0.0781, 0.0623, and 0.0581. Moreover, in Fig. 1b, the errors of ELM, RELM, and ELM-RCC are 0.1084, 0.0883, and 0.0791. Therefore, ELM-RCC is more robust against noises on Sinc than ELM and RELM.

(Fig. 1  The regression results of the three methods on Sinc: a Sinc with Gaussian noise ρ ~ N(0, 0.1); b Sinc with Gaussian noise ρ ~ N(0, 0.2))

Figure 2 demonstrates the results of the three approaches on Func with the Gaussian distributed noise ρ ~ N(0, 0.16). The regression errors of ELM, RELM, and ELM-RCC are 0.0952, 0.0815, and 0.0775. Thus, ELM-RCC is more robust against noises on Func than ELM and RELM.

(Fig. 2  The regression results of the three methods on Func with Gaussian noise ρ ~ N(0, 0.16): a the original function and the polluted training samples; b the result of ELM; c the result of RELM; d the result of ELM-RCC)

For Two-Moon, we randomly choose 20% of the samples in each class and reverse their class labels to conduct the anti-noise experiment. The classification results of ELM, RELM, and ELM-RCC are illustrated in Fig. 3a. The error rates of ELM, RELM, and ELM-RCC are 3, 2, and 0.5%. Therefore, ELM-RCC outperforms ELM and RELM on Two-Moon with noises. We can also observe from Fig. 3a that ELM-RCC produces a smoother boundary in comparison with ELM and RELM.

For Ripley, 20% of the training samples in each class are chosen and their class labels are reversed. The classification results of the three approaches are demonstrated in Fig. 3b. The classification boundaries of ELM-RCC are smoother than those of ELM and RELM. The training error rates of ELM, RELM, and ELM-RCC are 12, 13.6, and 14.8%, while their testing error rates are, respectively, 9.8, 9.9, and 9.3%. Note that the samples in the test set are noise-free. Therefore, ELM-RCC generalizes better than ELM and RELM.

(Fig. 3  The classification results of the three criteria on Two-Moon and Ripley with noises: a Two-Moon; b Ripley)

4.2 Benchmark data sets

In the following comparisons, the parameters of the three methods, that is, ELM, RELM, and ELM-RCC, are all chosen by fivefold cross-validation on each training set. The optimal number of hidden units for ELM is chosen from {5, 10, ..., 100} and {5, 10, ..., 50} for the regression and classification cases, respectively. For RELM, there are two parameters, that is, the number of hidden units and the penalty parameter λ. The number of hidden units is chosen from the same set as that for ELM.
The penalty parameter λ is chosen from {1, 0.5, 0.25, 0.1, 0.075, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005, 0.0001}. For ELM-RCC, there are three parameters, that is, the number of hidden units, the width parameter σ, and the penalty parameter λ. The number of hidden units is chosen from the same set as that for ELM, and the penalty parameter is chosen from the same set as that for RELM. The width parameter σ is chosen from {0.001, 0.005, 0.01, 0.05, 0.25, 0.1, 0.2, ..., 1} for the regression tasks and from {15, 14, ..., 1, 0.1, 0.25, 0.5, 0.75} for the classification cases. Except for Sunspots, all the data sets are taken from the UCI repository of machine learning databases [33]. The descriptions of the regression data sets and the classification data sets are given in Tables 2 and 3, respectively.

Table 2  Benchmark data sets used for regression

Data sets      # Features   # Train   # Test
Servo          5            83        83
Breast heart   11           341       341
Concrete       9            515       515
Wine red       12           799       799
Housing        14           253       253
Sunspots       9            222       50
Slump          10           52        51

# Features: number of features; # Train: number of training samples; # Test: number of testing samples

Table 3  Benchmark data sets used for classification

Data sets      # Features   # Train   # Test   # Classes
Glass          11           109       105      7
Hayes          5            132       28       3
Balance scale  5            313       312      3
Dermatology    35           180       178      6
Vowel          13           450       450      10
Spect heart    45           134       133      2
Wdbc           31           285       284      2

# Classes: number of classes

For Sunspots and Hayes, the training sets and test sets are fixed. For each of the other data sets, 50% of the samples are randomly chosen for training while the remaining 50% are used for testing. All the parameters of the three approaches in the experiments for the regression and classification tasks are summarized in Tables 4 and 5, respectively.

Table 4  Settings of the parameters of the three methods on the regression data sets

               ELM    RELM             ELM-RCC
Data sets      NH     NH    λ          NH    σ      λ
Breast heart   30     20    0.1        50    0.1    0.05
Concrete       35     50    0.1        50    0.25   0.025
Housing        40     45    0.075      45    0.3    0.0005
Servo          35     50    0.0005     45    0.9    0.001
Slump          80     20    0.25       20    0.2    0.075
Sunspots       35     40    0.05       50    0.3    0.025
Wine red       30     40    0.0001     35    0.9    1

Table 5  Settings of the parameters of the three methods on the classification data sets

               ELM    RELM            ELM-RCC
Data sets      NH     NH    λ         NH    σ     λ
Balance scale  45     15    0.25      45    9     0.1
Dermatology    40     30    0.25      35    7     0.5
Glass          40     50    0.01      45    8     0.075
Hayes          30     50    0.005     35    6     0.0005
Spect heart    5      15    0.5       45    12    0.5
Vowel          50     50    0.01      50    3     0.0001
Wdbc           20     35    1         50    14    0.0005

For the three methods, one hundred trials are conducted on each data set and the corresponding average results are reported. The average training and testing errors together with their corresponding standard deviations over the 100 trials on the regression data sets are shown in Table 6. Moreover, the average training and testing error rates with their corresponding standard deviations on the classification data sets are given in Table 7.

Table 6  The results of the three methods on the regression data sets

               ELM                                   RELM                                  ELM-RCC
Data sets      Etrain (Mean±SD)   Etest (Mean±SD)    Etrain (Mean±SD)   Etest (Mean±SD)    Etrain (Mean±SD)   Etest (Mean±SD)
Breast heart   0.2937±0.0262      0.3487±0.0320      0.3100±0.0291      0.3490±0.0327      0.3185±0.0387      0.3387±0.0371
Concrete       8.1769±0.3445      8.9393±0.4460      7.5839±0.3171      8.4891±0.3822      7.4522±0.3323      8.4795±0.3839
Housing        3.7983±0.3211      5.1187±0.4737      3.6908±0.3141      4.9211±0.4148      3.6699±0.3964      4.8779±0.4335
Servo          0.4440±0.0865      0.8768±0.1894      0.4028±0.0861      0.7849±0.1289      0.4434±0.0895      0.7757±0.1119
Slump          0.0000±0.0000      3.9376±0.8711      2.5592±0.3338      3.4182±0.5297      2.3035±0.3393      3.2873±0.5018
Sunspots       11.3481±0.3723     22.0417±1.7435     11.3353±0.3141     21.2425±1.2183     10.8793±0.2328     21.1526±1.1702
Wine red       0.6277±0.0142      0.6627±0.0161      0.6174±0.0137      0.6646±0.0163      0.6234±0.0132      0.6574±0.0138

Etrain: training RMSE; Etest: testing RMSE

Table 7  The results of the three methods on the classification data sets (%)

               ELM                                       RELM                                      ELM-RCC
Data sets      Errtrain (Mean±SD)   Errtest (Mean±SD)    Errtrain (Mean±SD)   Errtest (Mean±SD)    Errtrain (Mean±SD)   Errtest (Mean±SD)
Balance scale  7.22±0.78            10.19±1.07           11.01±0.89           11.54±1.18           9.17±0.62            9.96±0.82
Dermatology    1.36±0.86            4.80±1.61            2.04±0.94            4.86±1.49            1.64±0.87            4.48±1.59
Glass          2.67±1.36            21.33±4.88           1.88±1.23            19.30±4.86           3.95±1.52            17.77±4.22
Hayes          14.73±1.49           30.50±5.89           13.54±0.50           28.36±3.89           13.64±0.91           26.57±3.83
Spect heart    21.02±1.63           21.26±1.58           10.30±2.03           23.08±3.15           17.95±2.34           20.92±2.09
Vowel          20.95±2.00           36.64±2.88           21.25±1.78           36.38±2.66           20.87±2.07           35.88±3.03
Wdbc           2.83±0.88            4.40±1.24            2.44±0.79            4.00±1.29            1.82±0.60            4.16±1.13

Errtrain: training error rate; Errtest: testing error rate

It is shown in Table 6 that the proposed ELM-RCC outperforms ELM and RELM on all seven regression data sets.
Moreover, the standard deviations in Table 6 show that ELM-RCC is more stable than ELM and RELM, even though the standard deviation of ELM-RCC is higher than that of RELM on Breast heart, Concrete, and Housing. On the classification data sets, the results in Table 7 demonstrate that, in comparison with ELM and RELM, ELM-RCC has a lower classification error rate except on Wdbc and a lower standard deviation except on Dermatology and Vowel.

5 Conclusions

In this paper, a novel criterion for training ELM, namely the regularized correntropy criterion, is proposed. The proposed training method inherits the advantages of both ELM and the regularized correntropy criterion. Experimental results on both synthetic data sets and benchmark data sets indicate that the proposed ELM-RCC achieves higher generalization performance and lower standard deviation in comparison with ELM and the regularized ELM.

To make our method more promising, there are four tasks for future investigation. First, it is a tough issue to find appropriate parameters for ELM-RCC; the influences of the parameters on the model remain for further study. Second, the optimization method of ELM-RCC needs to be improved to avoid the unfavorable influence of the singularity of the generalized inverse. Third, the ELM-RCC ensemble based on the fuzzy integral [34] will be investigated to further improve the generalization ability of ELM-RCC. Fourth, our future work will also focus on the localized generalization error bound [35] of ELM-RCC and the comparison of the prediction accuracy between ELM-RCC and rule-based systems [36, 37] together with their refinements [38, 39].

Acknowledgments  This work is partly supported by the National Natural Science Foundation of China
(Nos. 60903089, 61075051, 61073121, 61170040) and the Foundation of Hebei University (No. 3504020). In addition, the authors would like to express their special thanks to the referees for their valuable comments and suggestions for improving this paper.

Appendix 1

From the optimization problem (8), we can obtain

    \min_{\beta}\left( \|H\beta - T\|_F^2 + \lambda\|\beta\|_F^2 \right) = \min_{\beta}\left[ \mathrm{Tr}\left( (H\beta - T)^T (H\beta - T) \right) + \lambda\,\mathrm{Tr}(\beta^T\beta) \right]
    = \min_{\beta}\left[ \mathrm{Tr}\left( \beta^T H^T H \beta - \beta^T H^T T - T^T H \beta + T^T T \right) + \lambda\,\mathrm{Tr}(\beta^T\beta) \right],    (24)

where Tr(·) denotes the trace operator. Taking the derivative of the objective function in (24) with respect to the weight matrix β and setting it equal to zero, we get

    2H^T H \hat{\beta} - 2H^T T + 2\lambda\hat{\beta} = 0.    (25)

Therefore, the optimal solution (7) is obtained:

    \hat{\beta} = (H^T H + \lambda I)^{-1} H^T T.    (26)

Appendix 2

From the optimization problem (18), we can get

    \hat{J}_{L2}(\tilde{\beta}, \alpha) = \max_{\tilde{\beta}, \alpha} \left\{ \sum_{p=1}^{N} \left( \alpha_p \frac{\left\| t_p - \sum_{j=0}^{N_H} h_{pj}\tilde{\beta}_j \right\|_2^2}{2\sigma^2} - \varphi(\alpha_p) \right) - \lambda \|\tilde{\beta}\|_F^2 \right\}
    = \max_{\tilde{\beta}, \alpha} \left\{ \sum_{p=1}^{N} \alpha_p (t_p - y_p)^T (t_p - y_p) - \lambda\,\mathrm{Tr}(\tilde{\beta}^T\tilde{\beta}) + \mathrm{const} \right\}.    (27)

The optimal solution of (27) can be derived by (20) and

    \tilde{\beta}^{s+1} = \arg\max_{\tilde{\beta}} \mathrm{Tr}\left[ (T - \tilde{H}\tilde{\beta})^T K (T - \tilde{H}\tilde{\beta}) - \lambda \tilde{\beta}^T \tilde{\beta} \right]
    = \arg\max_{\tilde{\beta}} \mathrm{Tr}\left( T^T K T - T^T K \tilde{H}\tilde{\beta} - \tilde{\beta}^T \tilde{H}^T K T + \tilde{\beta}^T \tilde{H}^T K \tilde{H}\tilde{\beta} - \lambda \tilde{\beta}^T \tilde{\beta} \right).    (28)

Taking the derivative of the objective function in (28) with respect to the weight matrix β̃ and setting it equal to zero, we obtain

    2\tilde{H}^T K \tilde{H}\tilde{\beta} - 2\lambda\tilde{\beta} - 2\tilde{H}^T K T = 0.    (29)

Therefore, the solution (22) for the (s+1)th iteration is obtained:

    \tilde{\beta}^{s+1} = (\tilde{H}^T K \tilde{H} - \lambda I)^{-1} \tilde{H}^T K T.    (30)

References
1. Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: theory and applications. Neurocomputing 70:489–501
2. Rao CR, Mitra SK (1971) Generalized inverse of matrices and its applications. Wiley, New York
3. Zong WW, Huang GB (2011) Face recognition based on extreme learning machine. Neurocomputing 74:2541–2551
4. Wu J, Wang S, Chung FL (2011) Positive and negative fuzzy rule system, extreme learning machine and image classification. Int J Mach Learn Cybern 2(4):261–271
5. Chacko BP, Vimal Krishnan V, Raju G, Babu Anto P (2012) Handwritten character recognition using wavelet energy and extreme learning machine. Int J Mach Learn Cybern 3(2):149–161
6. Zheng WB, Qian YT, Lu HJ (2012) Text categorization based on regularization extreme learning machine. Neural Comput Appl. doi:10.1007/s00521-011-0808-y
7. Wang H, Qian G, Feng XQ (2012) Predicting consumer sentiments using online sequential extreme learning machine and intuitionistic fuzzy sets. Neural Comput Appl. doi:10.1007/s00521-012-0853-1
8. Deng J, Li K, Irwin WG (2011) Fast automatic two-stage nonlinear model identification based on the extreme learning machine. Neurocomputing 74:2422–2429
9. Huang GB, Wang DH, Lan Y (2011) Extreme learning machines: a survey. Int J Mach Learn Cybern 2(2):107–122
10. Wang YG, Cao FL, Yuan YB (2011) A study of effectiveness of extreme learning machine. Neurocomputing 74:2483–2490
11. Zhao JW, Wang ZH, Park DS (2012) Online sequential extreme learning machine with forgetting mechanism. Neurocomputing 87:79–89
12. Savitha R, Suresh S, Sundararajan N (2012) Fast learning circular complex-valued extreme learning machine (CC-ELM) for real-valued classification problems. Inform Sci 187:277–290
13. Zhang R, Lan Y, Huang GB, Xu ZB (2012) Universal approximation of extreme learning machine with adaptive growth of hidden nodes. IEEE Trans Neural Netw Learn Syst 23:365–371
14. Cao JW, Lin ZP, Huang GB (2012) Self-adaptive evolutionary extreme learning machine. Neural Process Lett. doi:10.1007/s11063-012-9236-y
15. Miche Y, Bas P, Jutten C, Simula O, Lendasse A (2008) A methodology for building regression models using extreme learning machine: OP-ELM. In: European symposium on artificial neural networks, Bruges, Belgium, pp 247–252
16. Golub GH, Van Loan CF (1983) Matrix computation. Johns Hopkins University Press, Baltimore
17. Santamaria I, Pokharel PP, Principe JC (2006) Generalized correlation function: definition, properties, and application to blind equalization. IEEE Trans Signal Process 54(6):2187–2197
18. Liu W, Pokharel PP, Principe JC (2007) Correntropy: properties and applications in non-Gaussian signal processing. IEEE Trans Signal Process 55(11):5286–5297
19. Jeong KH, Liu W, Han S, Hasanbelliu E, Principe JC (2009) The correntropy MACE filter. Pattern Recognit 42(5):871–885
20. Yuan X, Hu BG (2009) Robust feature extraction via information theoretic learning. In: Proceedings of the 26th annual international conference on machine learning. ACM, New York, pp 1193–1200
21. Rockfellar R (1970) Convex analysis. Princeton University Press, Princeton
22. He R, Zheng WS, Hu BG (2011) Maximum correntropy criterion for robust face recognition. IEEE Trans Pattern Anal Mach Intell 33(8):1561–1576
23. He R, Hu BG, Zheng WS, Kong XW (2011) Robust principal component analysis based on maximum correntropy criterion. IEEE Trans Image Process 20(6):1485–1494
24. He R, Tan T, Wang L, Zheng WS (2012) L2,1 regularized correntropy for robust feature selection. In: Proceedings of IEEE international conference on computer vision and pattern recognition
25. He R, Zheng WS, Hu BG, Kong XW (2011) A regularized correntropy framework for robust pattern recognition. Neural Comput 23(8):2074–2100
26. Huang GB, Zhou H, Ding X, Zhang R (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans Syst Man Cybern Part B Cybern 42(2):513–529
27. Principe JC, Xu D, Fisher J (2012) Information theoretic learning. In: Haykin S (ed) Unsupervised adaptive filtering. Wiley, New York
28. Vapnik V (1995) The nature of statistical learning theory. Springer, New York
29. Yang S, Zha H, Zhou S, Hu B (2009) Variational graph embedding for globally and locally consistent feature extraction. In: Proceedings of the European conference on machine learning, vol 5782, pp 538–553
30. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
31. Zhao GP, Shen ZQ, Man ZH (2011) Robust input weight selection for well-conditioned extreme learning machine. Int J Inform Technol 17(1):1–13
32. Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge
33. Frank A, Asuncion A (2010) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, Irvine, CA. http://archive.ics.uci.edu/ml
34. Wang X, Chen A, Feng H (2011) Upper integral network with extreme learning mechanism. Neurocomputing 74:2520–2525
35. Yeung D, Ng W, Wang D, Tsang E, Wang X (2007) Localized generalization error model. IEEE Trans Neural Netw 18(5):1294–1305
36. Wang X, Hong JR (1999) Learning optimization in simplifying fuzzy rules. Fuzzy Set Syst 106(3):349–356
37. Wang XZ, Wang YD, Xu XF, Ling WD, Yeung DS (2001) A new approach to fuzzy rule generation: fuzzy extension matrix. Fuzzy Set Syst 123(3):291–306
38. Tsang ECC, Wang X, Yeung DS (2000) Improving learning accuracy of fuzzy decision trees by hybrid neural networks. IEEE Trans Fuzzy Syst 8(5):601–614
39. Wang X, Yeung DS, Tsang ECC (2001) A comparative study on heuristic algorithms for generating fuzzy decision trees. IEEE Trans Syst Man Cybern Part B Cybern 31(2):215–226