
Information Sciences 185 (2012) 66–77


Voting based extreme learning machine

Jiuwen Cao a,*, Zhiping Lin a, Guang-Bin Huang a, Nan Liu b

a School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore
b Department of Emergency Medicine, Singapore General Hospital, Outram Road, Singapore 169608, Singapore


Article history: Received 3 November 2010; Received in revised form 21 May 2011; Accepted 15 September 2011; Available online 22 September 2011.

Keywords: Classification; Extreme learning machine; Majority voting; Single hidden layer feedforward networks; Ensemble methods


* Corresponding author. E-mail addresses: [email protected], [email protected] (J. Cao).

Abstract: This paper proposes an improved learning algorithm for classification which is referred to as the voting based extreme learning machine. The proposed method incorporates the voting method into the popular extreme learning machine (ELM) in classification applications. Simulations on many real world classification datasets have demonstrated that this algorithm generally outperforms the original ELM algorithm as well as several recent classification algorithms.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

During the past several decades, many methods have been developed for classification. The most representative approaches are the traditional Bayesian decision theory [8,21], the support vector machine (SVM) and its variants [12,25], artificial neural networks (ANNs) and their variants [9–11,24,27,33], fuzzy methods and their variants [5,13,20,23,36], etc. Among these methods, ANNs offer several particular characteristics. First, by incorporating learning algorithms that change the network structure and parameters based on external or internal information flowing through the network, a neural network can adaptively fit the data without any explicit specification of the underlying model. Second, neural networks have the universal approximation property [10,11]: they can approximate any function to arbitrary accuracy. In classification applications, the general procedure is to construct a functional relationship between several given attributes of an object and its class label, so the universal approximation property makes neural networks efficient classification tools. Finally, the adjustable network structure and the nonlinear basic computing neurons make neural networks flexible in modeling the complex functional relationships of real world applications.

Recently, a least squares based learning algorithm named the extreme learning machine (ELM) [14] was developed for single hidden layer feedforward networks (SLFNs). Using random computational nodes that are independent of the training samples, ELM has several promising features. It is a tuning-free algorithm and learns much faster than traditional gradient-based approaches, such as the Back-Propagation (BP) algorithm [9] and the Levenberg–Marquardt algorithm [24,27]. Moreover, ELM tends to reach a small norm of the network output weights; Bartlett's theory [1] states that for feedforward neural networks reaching a small training error, the smaller the norm of the weights is, the better the generalization performance the network tends to have. It has been further shown [15–18] that many types of computational hidden nodes that need not be neuron-alike can be used in ELM as long as they are piecewise nonlinear, such as radial basis function (RBF) hidden nodes [16], fully complex nodes [18], wavelets [3,4], etc. A number of real world applications based on ELM have been reported in recent years [19,34,35].



ELM performs classification by mapping the class label to a high dimensional vector and transforming the classification task into a multi-output function regression problem. An issue with ELM is that, since the hidden node learning parameters are randomly assigned and remain unchanged during the training procedure, the classification boundary may not be an optimal one. Some samples may be misclassified by ELM, especially those near the classification boundary.

To reduce the number of such misclassified samples, we propose in this paper an improved algorithm called the voting based extreme learning machine (V-ELM for short). The main idea of V-ELM is to perform multiple independent ELM trainings instead of a single ELM training, and then to make the final decision based on the majority voting method [29]. Compared with the original ELM algorithm, the proposed V-ELM is able not only to enhance the classification performance and reduce the number of misclassified samples, but also to lower the variance among different realizations. Simulations on many real world classification datasets demonstrate that V-ELM generally outperforms several recent methods, including the original ELM [14], the support vector machine (SVM) [12], the optimally pruned extreme learning machine (OP-ELM) [28], the Back-Propagation algorithm (BP) [9,24,27], the K nearest neighbors algorithm (KNN) [2,7], the robust fuzzy relational classifier (RFRC) [5], the radial basis function neural network (RBFNN) [33] and the multiobjective simultaneous learning framework (MSCC) [6].

The organization of this paper is as follows. In Section 2, we first briefly review the basic concept of ELM, then analyze an issue with ELM in classification applications, and present V-ELM. Simulation results and comparisons are provided in Section 3. In Section 4, we discuss the performance of V-ELM with respect to different independent training numbers and three recent methods [22,26,32] that also exploit multiple classifiers in ELM. Conclusions are drawn in Section 5. An appendix is given at the end to illustrate the three propositions presented in Section 2.

2. Voting based extreme learning machine

In this section, we first review the basic concept of the ELM algorithm for SLFNs in Section 2.1. Then, we analyze an issue that may exist in ELM when performing classification in Section 2.2. Finally, the proposed V-ELM algorithm is presented in Section 2.3 to tackle this issue and enhance the classification performance of ELM.

2.1. Review of extreme learning machine

Different from traditional theories in which all the parameters of a feedforward neural network need to be tuned to minimize the cost function, ELM theories [14–16] claim that the hidden node learning parameters can be randomly assigned independently and the network output weights can be analytically determined by solving a linear system using the least squares method. The training phase can thus be completed efficiently without time-consuming learning iterations, and ELM can achieve good generalization performance.

For N arbitrary training samples $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ and $t_i \in \mathbb{R}^m$, the output of an SLFN with L hidden nodes is

$$o_i = \sum_{j=1}^{L} \beta_j G(a_j, b_j, x_i), \quad i = 1, 2, \ldots, N \qquad (1)$$

where $a_j \in \mathbb{R}^d$ and $b_j \in \mathbb{R}$ ($j = 1, 2, \ldots, L$) are the learning parameters of the jth hidden node, $\beta_j \in \mathbb{R}^m$ is the weight linking the jth hidden node to the output nodes, and $G(a_j, b_j, x_i)$ is the output of the jth hidden node with respect to the input sample $x_i$.

For all N samples, an equivalent compact form of (1) can be written as

$$O = H\beta \qquad (2)$$

where $H_{ij} = G(a_j, b_j, x_i)$ is the entry in the ith row and jth column of the hidden layer output matrix H, $\beta = (\beta_1, \beta_2, \ldots, \beta_L)$ and $O = (o_1, o_2, \ldots, o_N)$.

To minimize the network cost function $\|O - T\|$, where $T = (t_1, t_2, \ldots, t_N)$ is the target output matrix, ELM theories claim that the hidden node learning parameters $a_j$ and $b_j$ can be randomly assigned a priori without considering the input data. Thus, system (1) becomes a linear model and the output weights can be analytically determined by finding a least squares solution of this linear system as

$$\beta = H^{\dagger} T \qquad (3)$$

where $H^{\dagger}$ is the Moore–Penrose generalized inverse [30] of the hidden layer output matrix H.

ELM adopts a One-Against-All (OAA) method to decompose a multi-class classification application into multiple binary classifiers and transforms the classification application into a multi-output function regression problem. The output with the largest value is then used to represent the class label of the given sample. For a C-label classification application, the output label $t_i$ of a sample $x_i$ is usually encoded as a C-dimensional vector $(t_{i1}, t_{i2}, \ldots, t_{iC})^T$ with $t_{ic} \in \{1, -1\}$ ($c = 1, 2, \ldots, C$). In the OAA approach, if the class label $t_i$ of the sample $x_i$ is c, then $t_{ic}$ is set to 1 and the other entries are set to -1 in the newly formed C-dimensional output vector. Therefore, the class label $c_{test}$ of a testing sample $x_{test}$ predicted by the ELM algorithm is the index of the largest entry in the corresponding output vector.
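As a concrete illustration of the training and decoding steps just reviewed, the following NumPy sketch implements Eqs. (1)–(3) with sigmoid hidden nodes and the OAA decoding rule. The function names and the uniform [-1, 1] initialization range are our own illustrative choices, not taken from the authors' Matlab code.

```python
import numpy as np

def elm_train(X, T, L, rng):
    """Train an SLFN with L sigmoid hidden nodes using the ELM least-squares step.

    X: (N, d) training inputs; T: (N, C) one-against-all targets in {-1, +1}.
    """
    d = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(d, L))    # random input weights a_j, fixed after initialization
    b = rng.uniform(-1.0, 1.0, size=L)         # random hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))     # hidden layer output matrix H, Eq. (2)
    beta = np.linalg.pinv(H) @ T               # output weights via the Moore-Penrose inverse, Eq. (3)
    return A, b, beta

def elm_predict(X, A, b, beta):
    """Predicted class index = index of the largest output entry (OAA decoding)."""
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return np.argmax(H @ beta, axis=1)
```

Because only the linear least-squares step involves the data, training reduces to one pseudoinverse computation once the random hidden nodes are drawn.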


2.2. An issue with ELM in classification applications

ELM constructs a nonlinear separation boundary in classification applications. Since randomized hidden nodes are used and they remain unchanged during the training phase, some samples may be misclassified in certain realizations, especially samples near the classification boundary. As illustrated in Fig. 1, for a 2-label classification problem, a sample that belongs to one class is misclassified to the other class by ELM in realization n = 1. However, with different randomly initialized hidden node parameters in the ELM algorithm, the same sample is correctly classified in realizations n = 2 and n = 3. Therefore, decisions based on a single realization of ELM may not be reliable, and the classification results in different realizations may vary because different nonlinear separation boundaries are constructed with different random hidden node learning parameters.

2.3. Voting based extreme learning machine

To tackle the issue mentioned above and improve the classification performance of ELM, an algorithm referred to as the voting based extreme learning machine (V-ELM), which incorporates multiple independent ELMs and makes decisions with a majority voting method, is proposed in this subsection.

In V-ELM, several individual ELMs with the same number of hidden nodes and the same activation function in each hidden node are used. As in ELM, the optimal number of hidden nodes for a given application usually lies in a small region, and the variation of the network performance with respect to different numbers of hidden nodes selected from this region is usually very small. Hence, for ease of implementation, we fix the number of hidden nodes of all the individual ELMs to the same value. All these individual ELMs are trained with the same dataset, and the learning parameters of each ELM are randomly initialized independently. The final class label is then determined by majority voting on the results obtained by these independent ELMs. Suppose that K independent networks trained with the ELM algorithm are used in V-ELM. Then, for each testing sample $x_{test}$, K prediction results can be obtained from these independent ELMs. A corresponding vector $S_{K,x_{test}} \in \mathbb{R}^C$, with dimension equal to the number of class labels, is used to store all these K results of $x_{test}$: if the class label predicted by the kth ($k \in [1, \ldots, K]$) ELM is i, the value of the corresponding entry i of the vector $S_{K,x_{test}}$ is increased by one, that is

$$S_{K,x_{test}}(i) = S_{K,x_{test}}(i) + 1$$

After all K results are assigned to $S_{K,x_{test}}$, the final class label of $x_{test}$ is determined by conducting a majority vote:

$$c_{test} = \arg\max_{i \in [1,\ldots,C]} \left\{ S_{K,x_{test}}(i) \right\} \qquad (4)$$
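The voting step itself is simple. A minimal sketch is given below; it reuses the illustrative elm_train/elm_predict routines from the sketch in Section 2.1, and K = 7 mirrors the setting used in most of the simulations reported later.

```python
import numpy as np

def velm_predict(X_train, T_train, X_test, L, C, K=7, seed=0):
    """Majority-vote prediction over K independently initialized ELMs."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((X_test.shape[0], C), dtype=int)      # S_{K,x_test} for every testing sample
    for _ in range(K):
        A, b, beta = elm_train(X_train, T_train, L, rng)   # same data, independent random hidden nodes
        labels = elm_predict(X_test, A, b, beta)
        votes[np.arange(X_test.shape[0]), labels] += 1     # S(i) <- S(i) + 1 for the predicted class
    return np.argmax(votes, axis=1)                        # majority voting, Eq. (4)
```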

The algorithmic description of the proposed V-ELM is presented in Fig. 2. For a C-label classification application, assuming that the true label of the testing sample $x_{test}$ is c, we have the following observations.

Proposition 2.1. Given a standard SLFN with L hidden nodes, an output function $G(\cdot)$ and a set of training samples $\{(x_i, t_i)\}_{i=1}^{N}$, assume that the probability of correctly predicting the testing sample $x_{test}$ using ELM [14] under all different possible hidden node learning parameters a and b is $P_{ELM}(c \mid x_{test})$. If the following inequality holds

$$P_{ELM}(c \mid x_{test}) > \max_{i \in [1,\ldots,C],\ i \neq c} \left\{ P_{ELM}(i \mid x_{test}) \right\} \qquad (5)$$

where $P_{ELM}(i \mid x_{test})$ is the probability that ELM classifies $x_{test}$ to a category i different from the class c, then, with a sufficiently large independent training number K, V-ELM is able to correctly classify $x_{test}$ with probability one. □

The above proposition states that when assumption (5) holds, with a sufficiently large independent training number K, V-ELM makes a correct prediction with probability one. In the following, we give a brief proof of Proposition 2.1.

Fig. 1. Classifiers using ELM with different random parameters.


Fig. 2. Algorithmic description of the proposed V-ELM.


We assume that $P_{ELM}(1 \mid x_{test}), \ldots, P_{ELM}(c-1 \mid x_{test}), P_{ELM}(c+1 \mid x_{test}), \ldots, P_{ELM}(C \mid x_{test})$ are the probabilities that ELM misclassifies the sample $x_{test}$ to classes $1, \ldots, c-1, c+1, \ldots, C$ under all possible hidden node parameters (a, b). We rearrange these misclassification probabilities in descending order and denote them as $P_{ELM}(c_1 \mid x_{test}) \geq \cdots \geq P_{ELM}(c_{C-1} \mid x_{test})$, where $c_i \in (1, \ldots, c-1, c+1, \ldots, C)$ is the class label with the ith largest misclassification probability among $P_{ELM}(1 \mid x_{test}), \ldots, P_{ELM}(c-1 \mid x_{test}), P_{ELM}(c+1 \mid x_{test}), \ldots, P_{ELM}(C \mid x_{test})$. With assumption (5), there exists a scalar $\epsilon > 0$ such that

$$P_{ELM}(c \mid x_{test}) - P_{ELM}(c_1 \mid x_{test}) > \epsilon. \qquad (6)$$

In V-ELM, K independent ELMs are used, where each ELM network is trained with the same dataset and the hidden node learning parameters are initialized independently. Let $P_{ELM,K}(i \mid x_{test})$ ($i = c, c_1, \ldots, c_{C-1}$) be the fraction of these K ELMs that predict the testing sample $x_{test}$ to label i; therefore, $P_{ELM,K}(i \mid x_{test}) = S_{K,x_{test}}(i)/K$. For a sufficiently large number K, we can expect that

$$\lim_{K \to \infty} P_{ELM,K}(i \mid x_{test}) = P_{ELM}(i \mid x_{test}), \quad i = c, c_1, \ldots, c_{C-1}$$

and $\sum_{i=c,c_1,\ldots,c_{C-1}} P_{ELM,K}(i \mid x_{test}) = 1$. Hence, from the definition of the limit [31], for any given positive scalar $\delta$ with $\frac{1}{C}\epsilon \geq \delta > 0$ and for each i, there exists a corresponding positive integer $K_{0i}$ such that, when $K_i \geq K_{0i}$, the following inequalities hold

$$\left| P_{ELM,K_i}(i \mid x_{test}) - P_{ELM}(i \mid x_{test}) \right| < \delta, \quad i = c, c_1, \ldots, c_{C-1}$$

Therefore, among all these $K_i$, set $K = \max\{K_i\}$ ($i = c, c_1, \ldots, c_{C-1}$); then the following inequalities hold

$$\left| P_{ELM,K}(i \mid x_{test}) - P_{ELM}(i \mid x_{test}) \right| < \delta, \quad i = c, c_1, \ldots, c_{C-1}$$

When i = c, we get the following inequality from the above equation


$$P_{ELM}(c \mid x_{test}) - \delta < P_{ELM,K}(c \mid x_{test}) < P_{ELM}(c \mid x_{test}) + \delta \qquad (7)$$

For any $c_j \in (c_1, \ldots, c_{C-1})$, we have

$$P_{ELM,K}(c_j \mid x_{test}) = 1 - P_{ELM,K}(c \mid x_{test}) - \sum_{i \neq c, c_j} P_{ELM,K}(i \mid x_{test}) \qquad (8)$$

From Eqs. (7) and (8), we have

$$P_{ELM,K}(c_j \mid x_{test}) < 1 - P_{ELM}(c \mid x_{test}) + \delta - \sum_{i \neq c, c_j} P_{ELM,K}(i \mid x_{test}) \qquad (9)$$

which leads to the following corresponding inequality

$$P_{ELM,K}(c_j \mid x_{test}) < P_{ELM}(c_j \mid x_{test}) + (C-1)\delta \qquad (10)$$

Using Eqs. (6), (7) and (10), we have

$$P_{ELM,K}(c \mid x_{test}) > P_{ELM}(c_1 \mid x_{test}) + \epsilon - \delta \geq P_{ELM}(c_j \mid x_{test}) + (C-1)\delta > P_{ELM,K}(c_j \mid x_{test}) \quad \text{for any } c_j \in (c_1, \ldots, c_{C-1}) \qquad (11)$$

The above inequality (11) means that with a sufficiently large independent training number K, the number of ELMs that predict the sample $x_{test}$ to category c is the largest among all K ELMs. Therefore, with the majority voting method, V-ELM makes a correct prediction with probability one.
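This convergence behavior can also be checked numerically. The short Monte Carlo sketch below uses assumed per-ELM class probabilities (illustrative values, not taken from the paper) in which the true class has the largest single-ELM probability, as required by condition (5); the frequency with which the majority vote is correct then approaches one as K grows. Ties in argmax are broken toward the first class in this toy sketch, which is immaterial for large K.

```python
import numpy as np

# Toy per-ELM class probabilities (assumed): class 0 plays the role of the true class c.
p = np.array([0.35, 0.30, 0.20, 0.15])
rng = np.random.default_rng(0)
for K in (1, 7, 35, 151, 501):
    votes = rng.multinomial(K, p, size=10000)            # K independent ELM decisions per trial
    accuracy = np.mean(np.argmax(votes, axis=1) == 0)    # fraction of trials where the vote picks class 0
    print(f"K = {K:4d}: voting accuracy ~ {accuracy:.3f}")
```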

Proposition 2.2. Assume condition (5) holds. Let

$$P_{ELM}(c_1 \mid x_{test}) = \max_{i \in [1,\ldots,C],\ i \neq c} \left\{ P_{ELM}(i \mid x_{test}) \right\} \qquad (12)$$

That is, $P_{ELM}(c_1 \mid x_{test})$ is the incorrect probability for the highest-ranked wrong class $c_1$ for the testing sample $x_{test}$. Then, the independent training number K required for V-ELM to correctly classify $x_{test}$ is related to the difference between $P_{ELM}(c \mid x_{test})$ and $P_{ELM}(c_1 \mid x_{test})$: the larger the difference is, the smaller the required independent training number K. This means that for samples whose features clearly discriminate them from the samples of other classes, correct predictions could always be obtained by each individual ELM. For samples that are close to the classification boundary but still satisfy condition (5), the proposed V-ELM can improve the classification performance if a sufficiently large independent training number K is used. □

Proposition 2.3. If $P_{ELM}(c \mid x_{test}) < P_{ELM}(c_1 \mid x_{test})$, then with a sufficiently large independent training number K, the successful classification rate obtained by V-ELM will go to 0%. That is, V-ELM will predict $x_{test}$ to category $c_1$ under this condition. □

An illustrative example for the above three propositions is given in Appendix A.

3. Performance verification

In this section, the performance of the proposed V-ELM is compared with the original ELM [14], SVM [12] and several other recent classification methods, including OP-ELM [28], the Back-Propagation algorithm (BP) [9,24,27], the K nearest neighbors algorithm (KNN) [2,7], RFRC [5], RBFNN [33] and MSCC [6]. Simulations are conducted on many real world classification datasets. All experiments on ELM, V-ELM, SVM and OP-ELM are carried out in the Matlab 7.4 environment running on an ordinary PC with a 2.66 GHz CPU and 2 GB RAM. Both the Gaussian radial basis function (RBF) activation function $G(a, b, x) = \exp(-\|x - a\|^2 / b)$ and the sigmoid additive activation function $G(a, b, x) = 1/(1 + \exp(-(a \cdot x + b)))$ are used in ELM and V-ELM.

3.1. Comparisons with ELM

The performance of V-ELM is compared with that of ELM on 19 real world datasets in this subsection. The first 18 datasets are downloaded from the UCI database,1 while the last dataset, a protein sequence classification dataset, is from the Protein Information Resource (PIR)2 center. Table 1 shows the specifications of all 19 datasets. In the experiments, the training and testing data of the first 13 datasets in Table 1 are reshuffled at each trial of simulation, while the training and testing data of the last 6 datasets are fixed for all trials of simulations.

For both V-ELM and ELM, the number of hidden nodes is gradually increased and the optimal number of nodes is selected by cross-validation. Average results of 50 trials of simulations for each fixed size of SLFN are obtained, and the performance of V-ELM and ELM is reported in this paper. For V-ELM, 7 independent ELMs are used for voting. Table 2 lists the classification results of V-ELM and ELM on all 19 datasets. As observed from this table, the proposed V-ELM outperforms ELM on all 19 datasets. For several datasets, the enhancement of the success rate is apparent.
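A hedged sketch of this model-selection loop is shown below: each candidate hidden node count is evaluated over repeated random trials on a validation split and the best average accuracy wins. The node grid, the trial count and the velm_predict routine (from the sketch in Section 2.3) are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np

def select_hidden_nodes(X_tr, T_tr, X_val, y_val, C,
                        node_grid=range(10, 201, 10), trials=50, K=7):
    """Pick the hidden node count with the best average validation accuracy."""
    best_L, best_acc = None, -1.0
    for L in node_grid:
        accs = [np.mean(velm_predict(X_tr, T_tr, X_val, L, C, K=K, seed=s) == y_val)
                for s in range(trials)]
        if np.mean(accs) > best_acc:
            best_L, best_acc = L, float(np.mean(accs))
    return best_L, best_acc
```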

1 http://www.ics.uci.edu/~mlearn/MLRepository.html.
2 http://pir.georgetown.edu/.


Table 1. Specifications of classification datasets.

Datasets     # Attributes   # Classes   # Training data   # Testing data
Balance      4              3           400               225
Breast-can   30             2           300               269
Diabetes     8              2           576               192
Ecoli        7              8           168               168
Glass        10             2           100               114
Heart        13             2           100               170
Iris         4              3           100               50
Liver        6              2           200               145
Sonar        60             2           100               108
Soybean      35             4           22                21
Spambase     57             2           2300              2301
Waveform     21             3           3000              2000
Wine         13             3           100               78
Digit        16             10          7494              3498
Hayes        4              3           132               28
Monk1        6              2           124               432
Monk2        6              2           169               432
Monk3        6              2           122               432
Protein      56             10          949               533

Table 2. Performance comparisons with ELM (testing rate and standard deviation, %).

              ELM [14], Sigmoid    ELM [14], RBF        V-ELM (K=7), Sigmoid   V-ELM (K=7), RBF
Datasets      Rate       Dev       Rate       Dev       Rate       Dev         Rate       Dev
Balance       90.53      1.79      90.16      2.16      91.24      1.49        91.00      1.63
Breast-can    95.52      1.38      94.45      1.60      96.75      0.94        96.39      0.92
Diabetes      77.69      2.62      77.24      2.65      78.56      2.46        78.10      2.68
Ecoli         85.56      2.31      85.43      2.31      86.42      1.81        86.41      2.03
Glass         91.91      2.54      92.33      2.45      92.77      2.31        93.33      2.21
Heart         80.68      2.79      71.34      4.66      82.53      2.27        77.05      2.68
Iris          96.60      5.93      97.60      4.76      98.40      4.22        98.00      4.95
Liver         70.01      3.84      68.36      3.75      72.17      2.81        71.37      3.11
Sonar         73.31      4.29      72.67      3.91      79.11      3.51        78.69      3.72
Soybean       99.00      2.21      88.61      7.77      99.83      0.86        95.54      5.54
Spambase      90.50      0.68      88.81      0.79      91.60      0.64        90.65      0.58
Waveform      85.50      0.83      84.59      0.79      86.48      0.56        86.26      0.63
Wine          96.79      1.66      92.62      3.10      98.31      1.61        96.79      2.17
Digit         97.01      0.23      96.40      0.39      97.22      0.11        97.13      0.15
Hayes         75.86      4.76      61.14      7.09      79.00      4.87        66.57      4.99
Monk1         82.78      2.34      78.45      3.26      85.75      1.41        82.98      2.10
Monk2         79.27      2.11      81.25      2.37      82.81      1.15        85.84      1.27
Monk3         87.38      2.83      79.02      2.81      90.44      1.00        85.12      1.64
Protein       87.90      1.34      87.89      1.43      90.06      0.66        90.35      0.66


For example, compared with ELM, the enhancements of the success rates on the datasets Sonar, Hayes, Monk2, Monk3 and Protein are more than 3%. The standard deviations of V-ELM are much lower than those of ELM for all 19 datasets, as shown in Table 2. For V-ELM, using the sigmoid activation function generally performs better than using the RBF activation function. Between these two activation functions, V-ELM with the sigmoid function wins on 16 out of 19 datasets, while V-ELM with RBF wins on the remaining 3 datasets.

Table 3 shows the comparisons of the network complexities of ELM and V-ELM, including the training time, the time factor and the number of hidden nodes. It can be found from this table that the computational time of V-ELM with K = 7 is about 7 times that of ELM on average in the above simulations. However, the training phase can still be completed in less than one second for most applications. The training time of V-ELM increases roughly linearly with the number of independent trainings used for voting.

3.2. Comparisons with SVM, OP-ELM, BP, and KNN

In this subsection, simulations using V-ELM, SVM [12], OP-ELM [28], BP [9,24,27], and KNN [2,7] are conducted on all the above 19 datasets. Note that the results reported here are obtained by us using the Matlab codes provided in [12,28,24,27,2,7].


Table 3. Comparisons of network complexities with ELM (training time, time factor relative to ELM, and number of hidden nodes).

              ELM, Sigmoid              ELM, RBF                  V-ELM (K=7), Sigmoid      V-ELM (K=7), RBF
Datasets      Time      Factor  Nodes   Time      Factor  Nodes   Time      Factor  Nodes   Time      Factor  Nodes
Balance       0.0088    1       58      0.0094    1       60      0.0278    3.1     38      0.0584    6.2     60
Breast-can    0.0053    1       45      0.0153    1       85      0.0506    9.5     60      0.0856    5.5     75
Diabetes      0.0016    1       16      0.0044    1       28      0.0134    8.3     22      0.0159    3.61    22
Ecoli         0.0003    1       20      0.0003    1       24      0.0103    34      24      0.0091    30      24
Glass         0.0006    1       26      0.0003    1       26      0.0056    9.3     24      0.0100    33      34
Heart         0.0006    1       16      0.0009    1       16      0.0019    3.1     12      0.0081    9       20
Iris          0.0003    1       28      0.0019    1       24      0.0063    21      24      0.0066    3.4     18
Liver         0.0022    1       22      0.0019    1       30      0.0075    3.4     20      0.0119    6.2     24
Sonar         0.0044    1       56      0.0028    1       50      0.0334    7.5     64      0.0356    12.7    64
Soybean       0.0008    1       90      0.0014    1       100     0.0037    4.6     90      0.0078    5.5     100
Spambase      0.3322    1       240     0.4594    1       280     3.2725    9.8     260     2.3675    5.1     240
Waveform      0.1266    1       100     0.1128    1       100     0.8734    6.8     100     0.8944    7.9     100
Wine          0.0006    1       24      0.0016    1       40      0.0053    8.8     20      0.0153    9.5     38
Digit         0.8741    1       200     0.8484    1       200     6.1709    7       200     6.2997    7.4     200
Hayes         0.0022    1       46      0.0053    1       58      0.0191    8.6     48      0.0294    5.5     58
Monk1         0.0031    1       45      0.0063    1       70      0.0631    20      85      0.0394    6.2     70
Monk2         0.0219    1       115     0.0225    1       110     0.1325    6       110     0.1231    5.4     105
Monk3         0.0075    1       70      0.0072    1       75      0.0494    6.5     75      0.0450    6.2     70
Protein       0.0951    1       125     0.0917    1       130     0.6853    7.2     155     0.6878    7.5     155

Table 4. Performance comparisons with SVM, OP-ELM, BP, and KNN (testing rate and standard deviation, %).

              SVM (RBF)           OP-ELM              BP                  KNN                  V-ELM (K=7)
Datasets      Rate       Dev      Rate       Dev      Rate       Dev      Rate       Dev       Rate       Dev
Balance       95.88      1.31     92.31      1.83     90.92      2.14     87.00      1.80      91.24      1.49
Breast-can    95.55      0.82     95.33      1.29     95.01      1.66     96.32      1.03      96.75      0.94
Diabetes      77.31      2.73     77.34      3.17     77.23      2.81     74.09      2.73      78.56      2.46
Ecoli         85.83      2.79     85.20      2.88     80.27      3.91     83.68      2.22      86.42      1.81
Glass         91.84      2.78     91.65      3.23     91.30      3.09     90.18      2.60      93.33      2.21
Heart         76.10      3.46     81.05      2.96     71.25      8.54     80.79      2.57      82.53      2.27
Iris          94.36      2.76     97.80      8.93     95.60      3.00     96.04      2.23      98.40      4.22
Liver         68.24      4.58     65.85      4.75     66.50      4.45     61.46      3.27      72.17      2.81
Sonar         83.48      3.88     71.70      4.79     70.31      5.40     66.30      4.93      79.11      3.51
Soybean       99.56      1.32     99.12      1.51     88.17      9.38     79.74      11.47     99.83      0.86
Spambase      93.50      0.45     91.23      0.78     92.06      0.78     88.61      0.53      91.60      0.64
Waveform      85.78      0.62     85.46      0.64     85.94      0.76     82.65      0.72      86.48      0.56
Wine          97.48      1.57     98.18      1.72     94.10      3.12     96.23      2.01      98.31      1.61
Digit         98.14      0.01     98.34      0.25     n/a        n/a      97.54      0.01      97.22      0.11
Hayes         75.00      0        70.43      4.95     74.43      7.08     75.00      0         79.00      4.87
Monk1         94.44      0.01     74.79      3.91     69.99      13.82    80.56      0.01      85.75      1.41
Monk2         84.72      0.01     70.35      3.58     72.84      2.92     71.53      0.01      85.84      1.27
Monk3         90.04      0.01     88.77      2.31     80.41      6.07     80.79      0.01      90.44      1.00
Protein       90.63      0.01     91.26      1.15     87.45      2.13     89.33      0.01      90.35      0.66


For V-ELM, 7 independent ELMs are adopted for training and majority voting. For SVM, the Gaussian RBF is used as the kernel function; the cost parameter C and the kernel parameter γ are searched over a grid formed by C = [2^12, 2^11, ..., 2^-2] and γ = [2^4, 2^3, ..., 2^-10] as suggested in [12], and the best combination of these two parameters is obtained in terms of the generalization performance. For OP-ELM, the three possible kernels, linear, sigmoid and Gaussian, are used in combination, as suggested in [28], since using the combination of these three functions as the activation function in OP-ELM is more robust and efficient than utilizing only one of them. For BP, the most frequently adopted Levenberg–Marquardt algorithm [24,27] is used to train the neural network. For KNN, 7 nearest neighbors are used and the Euclidean norm is adopted to calculate the distance. Average results over 50 trials of simulations are reported in this paper.
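For readers who wish to reproduce the SVM and KNN parts of this setup outside Matlab, the hedged scikit-learn sketch below covers the stated (C, γ) grid and the 7-nearest-neighbour configuration. The 5-fold cross-validation used to pick (C, γ) is our assumption (the paper selects them by generalization performance), and X_train/y_train stand for any of the dataset splits of Table 1.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

param_grid = {"C": 2.0 ** np.arange(12, -3, -1),        # C = 2^12, 2^11, ..., 2^-2
              "gamma": 2.0 ** np.arange(4, -11, -1)}     # gamma = 2^4, 2^3, ..., 2^-10
svm_search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # pick the best (C, gamma) pair
knn = KNeighborsClassifier(n_neighbors=7, metric="euclidean")    # 7 nearest neighbours, Euclidean distance
# svm_search.fit(X_train, y_train); knn.fit(X_train, y_train)    # X_train, y_train: a training split as in Table 1
```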

Table 4 shows the performance comparisons of our method with SVM, OP-ELM, BP, and KNN. For each dataset, the key figures are the highest testing success rate and the lowest standard deviation. From this table, it is easy to see that V-ELM obtains the highest testing success rate on 13 datasets, while SVM has the highest testing success rate on 4 datasets and OP-ELM reaches the highest testing success rate on 2 datasets. From these results, we can see that V-ELM outperforms the other four algorithms in general. For the dataset Digit, the BP algorithm always ran out of memory on our ordinary PC. Table 5 shows the network complexities of these five algorithms.


Table 5. Comparisons of network complexities with SVM, OP-ELM, BP, and KNN (training time, time factor relative to V-ELM (K = 7), support vectors or hidden nodes, and SVM parameters (C, γ)).

              SVM (RBF)                                 OP-ELM                    BP                        KNN                V-ELM (K=7)
Datasets      Time       Factor  SVs     (C, γ)         Time       Factor  Nodes  Time      Factor  Nodes   Time     Factor    Time     Factor  Nodes
Balance       23.1122    831     47.18   (2^12, 2^-3)   0.4550     16      40     0.8478    30      25      0.0544   1.9       0.0278   1       38
Breast-can    18.0719    357     56.22   (2^1, 2^-3)    0.6100     12      25     1.9875    39      20      0.0725   1.4       0.0506   1       60
Diabetes      364.6      27208   62.28   (2^-1, 2^-1)   0.6709     50      16     0.4516    33      10      0.1353   10        0.0134   1       22
Ecoli         10.942     1062    67.86   (2^10, 2^-7)   0.6021     58      20     0.8712    84      20      0.0437   4.2       0.0103   1       24
Glass         4.1444     414     47.88   (2^1, 2^2)     0.0459     4.5     8      0.3459    34      20      0.0384   3.8       0.0100   1       34
Heart         6.7094     3531    81.6    (2^4, 2^-1)    0.0447     23      8      0.3063    161     5       0.0384   20        0.0019   1       12
Iris          4.5488     722     23.3    (2^8, 2^0)     0.0063     1       4      0.3050    48      5       0.0103   1.6       0.0063   1       24
Liver         5.0789     677     156.2   (2^3, 2^1)     0.0409     5.4     10     0.3234    43      15      0.0297   4.7       0.0075   1       20
Sonar         8.9113     266     69.7    (2^8, 2^-1)    0.1391     4.1     12     3.1541    94      20      0.0681   2         0.0334   1       64
Soybean       4.2516     1149    21.20   (2^7, 2^-5)    0.0524     14      80     0.4197    113     10      0.0037   1         0.0037   1       90
Spambase      913.76     279     452.6   (2^11, 2^-6)   227.5824   69      240    9.5344    2.9     10      6.6422   2         3.2725   1       260
Waveform      1253.3     1434    1863.6  (2^4, 2^-10)   3.2894     3.7     110    3.9066    4.4     10      2.3575   2.6       0.8734   1       100
Wine          5.8941     1112    47.3    (2^2, 2^-1)    0.0628     11      28     0.5175    97      30      0.0181   3.4       0.0053   1       20
Digit         5894.8     955     1133    (2^3, 2^-1)    628.5689   101     240    n/a       n/a     n/a     9.2844   1.5       6.1709   1       200
Hayes         8.1366     426     82      (2^8, 2^-6)    0.0988     5.1     16     0.7588    39      45      0.0059   0.3       0.0191   1       48
Monk1         6.6609     105     51      (2^12, 2^-3)   0.0500     0.7     28     0.3253    5.1     10      0.2597   4.1       0.0631   1       85
Monk2         11.4766    93      101     (2^12, 2^-1)   0.0594     0.4     25     1.6578    13      150     0.2778   2.2       0.1231   1       105
Monk3         6.4887     131     55      (2^8, 2^-2)    0.0516     1       36     0.3159    6.3     10      0.2719   5.5       0.0494   1       75
Protein       274.9672   399     261     (2^7, 2^4)     19.9381    28      160    1326      1927    60      0.3225   0.46      0.6878   1       155

Table 6. Performance comparisons with several recent methods (testing rate and standard deviation, %).

              RFRC [5]          RBFNN [33]        MSCC [6]          V-ELM (K=7)        V-ELM (K=51)
Datasets      Rate       Dev    Rate       Dev    Rate       Dev    Rate       Dev     Rate       Dev
Balance       84.7       1.5    90.5       1.0    90.8       1.2    90.89      1.25    91.35      1.03
Breast-can    92.0       1.6    95.0       1.2    97.3       0.7    96.70      1.02    97.34      0.60
Diabetes      70.7       3.2    74.2       2.3    76.5       1.1    77.61      1.62    77.78      1.77
Ecoli         81.8       3.3    85.2       2.7    85.0       2.4    86.42      1.81    86.68      1.99
Heart         80.9       2.2    82.5       2.3    84.2       1.8    82.89      2.80    83.31      2.55
Iris          95.3       1.1    96.4       1.6    97.1       1.7    97.07      2.01    97.60      0.84
Liver         61.0       2.4    70.8       3.6    68.2       5.7    71.38      2.66    72.00      2.79
Sonar         77.5       3.9    80.2       3.0    85.6       4.1    78.96      3.67    81.69      3.75
Soybean       99.1       1.7    98.1       1.7    100        0      99.83      0.86    100        0
Spambase      85.1       1.1    80.7       1.0    89.9       0.8    91.60      0.64    91.98      0.50
Wine          96.0       1.7    97.3       1.1    98.3       1.1    98.16      1.32    98.29      1.40


It is easy to find from Table 5 that with K = 7, V-ELM learns much faster than SVM and BP on all the datasets, faster than OP-ELM on all the datasets except Iris, Monk1 and Monk2, and faster than KNN on all the datasets except Soybean, Hayes and Protein.

3.3. Comparisons with several other recent algorithms

The performance of V-ELM is next compared with three other classification algorithms recently reported in the literature, namely the robust fuzzy relational classifier (RFRC) [5], the radial basis function neural network (RBFNN) [33] and the multiobjective simultaneous learning framework (MSCC) [6]. Note that the results of RFRC, RBFNN and MSCC are taken directly from references [5,6,33]. To make the comparison fair, the training and testing datasets for V-ELM are randomly generated as half-half splits, as done in [6]. Two different independent training numbers, K = 7 and K = 51, are used in V-ELM. All the results of V-ELM are obtained by averaging 50 trials.

Table 6 shows the simulation results of V-ELM and of the three recent methods mentioned above. When K = 7, V-ELM performs better than RFRC and RBFNN on almost all datasets except Sonar. The performance of V-ELM is about the same as that of MSCC, with V-ELM winning the highest testing success rate on 5 out of 11 datasets, while MSCC reaches the highest success rate on the remaining 6 datasets. However, when the number of independent ELMs is increased to K = 51, V-ELM performs better, winning the highest success rate on 8 out of 11 datasets.


Fig. 3. Success rates of the datasets Heart and Sonar using V-ELM (Sigmoid) with respect to the independent training number K.


4. Discussions

In this section, we first discuss the performance of V-ELM with respect to different independent training numbers K. Then, we briefly discuss three recent methods [22,26,32] that also exploit multiple classifiers in ELM.

It was stated in Proposition 2.1 that for V-ELM to work properly, a sufficiently large independent training number K is required. However, it may not be practically feasible to use V-ELM if the required K is too large, say well above 100, as the computational time of V-ELM increases proportionally with K, roughly about K times that of a single ELM. It is hence important to know the performance of V-ELM as a function of K.

From the simulations conducted in the previous section, we know that for K = 7, V-ELM outperforms the original ELM by about 2% on average over the 19 datasets and also outperforms most of the other methods except MSCC, which has about the same performance. When K increases to 35, the performance of V-ELM increases and, in such a case, it also outperforms MSCC. The purpose of this section is to illustrate through simulations that in most practical applications, it suffices to implement V-ELM with K between 5 and 35. For this purpose, two datasets, Heart and Sonar, are selected. The independent training number K, which is also the number of ELMs, is gradually increased from 1 to 151 with an interval of 2. For each K, 5000 repetitions are conducted and the average results are reported here.

Fig. 3 depicts the success rates of the datasets Heart and Sonar using V-ELM (with the sigmoid function) as a function of the number of ELMs K. It is easy to find from Fig. 3 that the success rates for both the Heart and Sonar datasets increase monotonically with K. The changes are clearly more dramatic at the beginning than at the end. For example, for the dataset Heart, when K changes from 1 (corresponding to the original ELM algorithm) to 23, the success rate of V-ELM increases from 80.15% to 83.04%, an increase of 2.9%. However, when K is further increased from 23 to 151, the success rate only increases to 83.23%, or slightly less than 0.2% more. As the training time of V-ELM with K = 151 is nearly 10 times that with K = 23, we conclude that choosing K to be around 23 would be best for this dataset. The situation is even more obvious for the dataset Sonar. When K changes from 1 (corresponding to the original ELM algorithm) to 35, the success rate of V-ELM increases from 70.03% to 80.05%, an increase of more than 10%. However, when K is further increased from 35 to 151, the success rate only increases to 80.74%, or less than 0.8% more. As the training time of V-ELM with K = 151 is about 5 times that with K = 35, we conclude that choosing K to be around 35 would be best for the dataset Sonar.

Fig. 4 displays the standard deviations of the success rates among 5000 realizations for the datasets Heart and Sonar using V-ELM (sigmoid) with respect to different independent training numbers. One can find that the standard deviations decrease dramatically when the number of independent trainings K increases from 1 to 30 for both datasets. However, when the number of independent trainings is increased beyond 30, the variations of the standard deviations with K are very small for these two datasets.

From the simulations on the above two datasets and many other datasets, we recommend implementing V-ELM by choosing K between 5 and 35. The actual value of K depends on the application at hand. As the original ELM runs much faster than most other classification algorithms, the computational time of V-ELM with K in this range should still be comparable with or faster than that of other classification algorithms.
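A small sketch of this recommendation is given below: start with a small K and increase it until the validation accuracy stops improving noticeably or the upper bound of 35 is reached. The grid of candidate K values, the tolerance and the velm_predict routine (from the sketch in Section 2.3) are illustrative assumptions.

```python
import numpy as np

def choose_K(X_tr, T_tr, X_val, y_val, L, C,
             k_grid=(5, 7, 11, 15, 23, 35), tol=0.002):
    """Increase K until the validation accuracy gain becomes negligible."""
    prev_acc, chosen = -1.0, k_grid[0]
    for K in k_grid:
        acc = np.mean(velm_predict(X_tr, T_tr, X_val, L, C, K=K) == y_val)
        if prev_acc >= 0 and acc - prev_acc < tol:   # negligible improvement: keep the previous K
            break
        prev_acc, chosen = acc, K
    return chosen
```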

For the other datasets used in our performance evaluation, the same behavior as for the two datasets Heart and Sonar discussed here can be observed. We do not display all these results to save space.



Fig. 4. Standard deviations of success rates of the datasets Heart and Sonar using V-ELM (Sigmoid) with respect to the independent training number K.


Before ending this section, it is worth pointing out the differences between our method and three recent methods [22,26,32] that also exploit multiple classifiers in ELM. In [32], a multiple classifier system based on ELM was proposed. The main idea of [32] is to generate a set of classifiers using ELM, discard the ones whose recognition rate is lower than a threshold calculated by averaging the recognition rates of all these ELM classifiers, and then update the whole classification system by generating new ELM classifiers to replace the discarded ones until the performance goal is reached. The difference between our algorithm and the method in [32] is that our voting procedure operates on the individual sample, while the threshold in [32] is an average value derived from the whole validation set over all ELM classifiers. In addition, we do not adopt the updating procedure in our approach. In [22], an ensemble of online sequential extreme learning machine (EOS-ELM) was developed to lower the variance of the online sequential extreme learning machine (OS-ELM). By training multiple OS-ELM networks and taking the average value as the final output, EOS-ELM is more stable than OS-ELM. However, EOS-ELM focuses on online sequential learning, and its performance is not improved beyond the lowered variance, as shown by the simulation results in [22]. In [26], the training dataset is divided into several non-overlapping subsets and the training and validation phases are conducted on each subset independently. This differs from our algorithm in that we make use of all the information in the whole dataset to train each independent ELM network, while only partial samples are used to train each ELM in [26].

5. Conclusions

In this paper, we have proposed an improved classification algorithm named the voting based extreme learning machine to train SLFNs. Based on the simulation results, comparisons and discussions, we draw the following conclusions:

(1) Compared with the original ELM algorithm, the incorporation of the voting method enables V-ELM to achieve a much higher classification success rate in general. However, this improvement comes at the price of increasing the training time to about K times that of a single ELM. Hence, a small independent training number K should always be tried first and gradually increased.

(2) Compared with several recent classification algorithms, V-ELM is able to obtain better success rates in most cases for the many datasets we have at hand.

(3) Generally, the success rate obtained by V-ELM increases with K. For practical applications, we recommend implementing V-ELM by choosing K between 5 and 35, starting from 5 and gradually increasing K until 35 or until a satisfactory validation result is obtained before K reaches 35.

Acknowledgments

The authors thank the anonymous reviewers whose insightful and helpful comments greatly improved this paper.


Table 7. Comparisons of misclassification numbers between V-ELM and ELM.

Samples         s1      s2      s3
ELM             1361    1055    1612
V-ELM (K = 19)  800     196     1993
V-ELM (K = 77)  491     0       2000


Fig. 5. Misclassification numbers of the three samples with respect to the independent training number K.


Appendix A

In this appendix, we use a real world application to illustrate Propositions 2.1–2.3. Through this application, we show that the number of ELMs required for V-ELM to correctly classify $x_{test}$ is related to the difference between $P_{ELM}(c \mid x_{test})$ and $P_{ELM}(c_1 \mid x_{test})$, where $P_{ELM}(c \mid x_{test})$ is the probability of correctly predicting the testing sample using ELM under all different possible hidden node learning parameters and $P_{ELM}(c_1 \mid x_{test}) = \max_{i \in [1,\ldots,C],\ i \neq c} \{ P_{ELM}(i \mid x_{test}) \}$ is the incorrect probability for the highest-ranked wrong class $c_1$ for the testing sample $x_{test}$. The larger the difference is, the smaller the number of ELMs required.

The 10-label dataset Protein is adopted in this appendix. In the simulation, 949 samples are used as the training set. Three samples, denoted s1, s2 and s3, are selected as testing samples, where s1, s2 and s3 belong to classes 2, 8 and 5, respectively. 2000 repetitions of ELM are conducted, and the rate at which a testing sample $x_{test}$ is classified into category i by these 2000 ELMs (2000 possible input weight matrices a and hidden bias vectors b) is used to approximate the probability that ELM classifies $x_{test}$ to category i under all different possible a and b, i.e., $P_{ELM}(i \mid x_{test})$. In the 2000 repeated trials, s1, s2 and s3 are correctly classified to classes 2, 8 and 5 for 639, 945 and 388 times, respectively. Hence, the approximate correct probabilities for s1, s2 and s3 are $P_{ELM}(2 \mid s_1) \approx 31.95\%$, $P_{ELM}(8 \mid s_2) \approx 47.25\%$ and $P_{ELM}(5 \mid s_3) \approx 19.4\%$, respectively. It follows that in these 2000 repeated trials, s1, s2 and s3 are misclassified to other classes 1361, 1055 and 1612 times, i.e., they have approximate total incorrect probabilities of 68.05%, 52.75% and 80.6%, respectively. Note that for all three testing samples, the correct probability is lower than the corresponding total incorrect probability. Moreover, s1, s2 and s3 are misclassified to classes 3, 2 and 4 for 586, 322 and 1540 times, respectively, which are the corresponding highest misclassification classes. Hence, the approximate incorrect probabilities for the highest wrong class for s1, s2 and s3 are $P_{ELM}(3 \mid s_1) \approx 29.3\%$, $P_{ELM}(2 \mid s_2) \approx 16.6\%$ and $P_{ELM}(4 \mid s_3) \approx 77\%$, respectively.

For samples s1 and s2, since the approximate correct probability is larger than the corresponding approximate incorrect probability for the highest wrong class, i.e., $P_{ELM}(2 \mid s_1) > P_{ELM}(3 \mid s_1)$ and $P_{ELM}(8 \mid s_2) > P_{ELM}(2 \mid s_2)$, condition (5) is satisfied. By Proposition 2.1, the correct classification rates for these two samples using V-ELM could go to 100% if the independent training number K is sufficiently large. In our simulations with a total of 2000 trials, each trial is now run by V-ELM instead of ELM. For V-ELM with K = 19 at each trial, the misclassification numbers of s1 and s2 decrease from 1361 to 800, and from 1055 to 196, respectively. When K is further increased to 77, the corresponding misclassification numbers decrease dramatically to 491 and 0, respectively. On the other hand, the sample s3 goes in the opposite direction, as stated in Proposition 2.3, because its approximate correct probability is smaller than its approximate incorrect probability for the highest wrong class, i.e., $P_{ELM}(5 \mid s_3) < P_{ELM}(4 \mid s_3)$. As a result, the misclassification number of s3 increases from 1612 to 1993 for K = 19 and further


to 2000 for K = 77. The comparisons of the misclassification numbers between V-ELM and ELM are shown in Table 7 for the two cases K = 19 and K = 77. For s1 and s2, the misclassification numbers decrease markedly when 19 and 77 independent ELMs are used in our proposed V-ELM.

Fig. 5 depicts the misclassification numbers of the three samples s1, s2 and s3 with respect to the independent training number K, ranging from 1 to 120. It illustrates all three propositions, Propositions 2.1–2.3. In particular, since the difference between $P_{ELM}(8 \mid s_2)$ and $P_{ELM}(2 \mid s_2)$ (about 30%) is far greater than that between $P_{ELM}(2 \mid s_1)$ and $P_{ELM}(3 \mid s_1)$ (about 2.65%), it only requires K to be 77 for the classification rate of s2 to reach 100%, while for s1 the required K is much bigger, far beyond 120, as is evident from Fig. 5. This clearly verifies Proposition 2.2.

References

[1] P.L. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Transactions on Information Theory 44 (2) (1998) 525–536.
[2] D. Bremner, E. Demaine, J. Erickson, J. Iacono, S. Langerman, P. Morin, G. Toussaint, Output-sensitive algorithms for computing nearest-neighbor decision boundaries, Discrete and Computational Geometry 33 (4) (2005) 593–604.
[3] J.W. Cao, Z.P. Lin, G.-B. Huang, Composite function wavelet neural networks with extreme learning machine, Neurocomputing 73 (7–9) (2010) 1405–1416.
[4] J.W. Cao, Z.P. Lin, G.-B. Huang, Composite function wavelet neural networks with differential evolution and extreme learning machine, Neural Processing Letters 33 (3) (2011) 251–265.
[5] W.L. Cai, S.C. Chen, D.Q. Zhang, Robust fuzzy relational classifier incorporating the soft class labels, Pattern Recognition Letters 28 (16) (2007) 2250–2263.
[6] W.L. Cai, S.C. Chen, D.Q. Zhang, A multiobjective simultaneous learning framework for clustering and classification, IEEE Transactions on Neural Networks 21 (2) (2010) 185–200.
[7] T.M. Cover, P.E. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1) (1967) 21–27.
[8] P.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[9] S. Haykin, Neural Networks: A Comprehensive Foundation, second ed., Pearson Education Press, 2001.
[10] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks 4 (1991) 251–257.
[11] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
[12] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks 13 (2) (2002) 415–425.
[13] Y.C. Hu, Finding useful fuzzy concepts for pattern classification using genetic algorithm, Information Sciences 175 (1–2) (2005) 1–19.
[14] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1–3) (2006) 489–501.
[15] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Transactions on Neural Networks 17 (4) (2006) 879–892.
[16] G.-B. Huang, M.-B. Li, L. Chen, C.-K. Siew, Incremental extreme learning machine with fully complex hidden nodes, Neurocomputing 71 (4–6) (2008) 576–583.
[17] G.-B. Huang, L. Chen, Convex incremental extreme learning machine, Neurocomputing 70 (16–18) (2007) 3056–3062.
[18] G.-B. Huang, L. Chen, Enhanced random search based incremental extreme learning machine, Neurocomputing 71 (16–18) (2008) 3460–3468.
[19] M. van Heeswijk, Y. Miche, E. Oja, A. Lendasse, GPU-accelerated and parallelized ELM ensembles for large-scale regression, Neurocomputing, in press, doi:10.1016/j.neucom.2010.11.034.
[20] B. Jin, Y.C. Tang, Y.Q. Zhang, Support vector machines with genetic fuzzy feature transformation for biomedical data classification, Information Sciences 177 (2) (2007) 476–489.
[21] S.K. Kay, Fundamentals of Statistical Signal Processing: Detection Theory, first ed., Prentice Hall, 1998.
[22] Y. Lan, Y.C. Soh, G.-B. Huang, Ensemble of online sequential extreme learning machine, Neurocomputing 72 (13–15) (2009) 3391–3395.
[23] C.Y. Lee, C.J. Lin, H.J. Chen, A self-constructing fuzzy CMAC model and its applications, Information Sciences 177 (1) (2007) 264–280.
[24] K. Levenberg, A method for the solution of certain non-linear problems in least squares, The Quarterly of Applied Mathematics (2) (1944) 164–168.
[25] P. Lingras, C. Butz, Rough set based 1-v-1 and 1-v-r approaches to support vector machine multi-classification, Information Sciences 177 (18) (2007) 3782–3798.
[26] N. Liu, H. Wang, Ensemble based extreme learning machine, IEEE Signal Processing Letters 17 (8) (2010) 754–757.
[27] D. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM Journal on Applied Mathematics 11 (1963) 431–441.
[28] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Transactions on Neural Networks 21 (1) (2010) 158–162.
[29] R. Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems Magazine 6 (3) (2006) 21–45.
[30] D. Serre, Matrices: Theory and Applications, Springer, New York, 2002.
[31] W. Rudin, Principles of Mathematical Analysis, third ed., McGraw-Hill, 1976.
[32] D.H. Wang, ELM-based multiple classifier systems, in: Proceedings of the 9th International Conference on Control, Automation, Robotics and Vision, Singapore, December 2006.
[33] Z.R. Yang, A novel radial basis function neural network for discriminant analysis, IEEE Transactions on Neural Networks 17 (3) (2006) 604–612.
[34] C.-W.T. Yeu, M.-H. Lim, G.-B. Huang, A. Agarwal, Y.S. Ong, A new machine learning paradigm for terrain reconstruction, IEEE Geoscience and Remote Sensing Letters 3 (3) (2006) 382–386.
[35] R. Zhang, G.-B. Huang, N. Sundararajan, P. Saratchandran, Multi-category classification using extreme learning machine for microarray gene expression cancer diagnosis, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 (3) (2007) 485–495.
[36] M.J. Zolghadr, E.G. Mansoori, Weighting fuzzy classification rules using receiver operating characteristics (ROC) analysis, Information Sciences 177 (11) (2007) 2296–2307.