
    International Journal on Computational Sciences & Applications (IJCSA) Vol.5, No.3, June 2015

DOI: 10.5121/ijcsa.2015.5305

Regularized Weighted Ensemble of Deep Classifiers

Shruti Asmita¹ and K.K. Shukla²

¹Department of Computer Science, Banasthali University, Jaipur-302001, Rajasthan, India

²Department of Computer Science and Engineering, Indian Institute of Technology, Banaras Hindu University, Varanasi-221005, Uttar Pradesh, India

ABSTRACT

An ensemble of classifiers increases classification performance, since the decisions of many experts are fused to generate the resultant decision for prediction making. Deep learning is a classification approach in which fine-tuning is performed on top of the basic learning technique to improve the precision of learning, and deep classifier ensemble learning offers good scope for research. Feature subset selection is another technique for creating the individual classifiers to be fused in ensemble learning. All of these ensemble techniques face the ill-posed problem of overfitting. A regularized weighted ensemble of deep support vector machines performs prediction analysis on three UCI repository problems, the IRIS, Ionosphere and Seed datasets, thereby increasing the generalization of the boundary plotted between the classes of each dataset. The singular value decomposition reduced norm 2 regularization with the two-level deep classifier ensemble gives the best result in our experiments.

KEYWORDS

Deep learning, support vector machine, feature subset selection, singular value decomposition, regularization

1. INTRODUCTION

Machine learning is a domain of computational statistics and a specialized field of prediction making. It aims at artificial learning, i.e. the construction of algorithms that are capable of learning from data [1]. Such learning is based on building a model from training data and then making decisions with that model on test data. Supervised machine learning [2] is marked by the presence of a supervisor, in the sense that a training set comprising a number of inputs and their corresponding outputs, i.e. associated labels, is provided to the machine for initial learning and model building. Later, with the help of this generated model, the required output is produced for inputs not present in the training set. Unsupervised learning [2], on the other hand, contains no such supervisor; it tries to find hidden relations between the unlabelled data. Classification, regression, etc. are supervised learning techniques, whereas clustering, self-organizing neural network maps, etc. are unsupervised learning techniques. Other learning approaches in existence are semi-supervised learning, reinforcement learning, developmental learning, etc.

In classification [3], the training data is divided into two or more classes. A model is required that can distinguish between the categories and place new input instances in the correct class to which they belong. The performance measure of classification is classification accuracy, and the goal of any learning method is to achieve the best possible classification


accuracy. Several classification algorithms have been applied to various datasets, but there is always scope for improving performance through new techniques. Machine learning aims at obtaining high test accuracy. Popular classifiers used widely for classification include the k-nearest neighbour classifier, decision tree classifier, frequent pattern classifier, Bayes classifier, rule-based classifier, support vector machine (SVM) classifier, etc. [4]. Among these, the SVM [5] classifier is the most studied and implemented classifier these days because of its high accuracy and exceptional ability to model complex non-linear decision boundaries by mapping non-linear data to higher dimensions. Hence both linear and non-linear data can be classified well by this classifier. Also, because of the presence of support vectors in SVM classifiers, the compactness of the classification is very high. Groups of people can often make better decisions than individuals [6]; hence an ensemble of classification models yields better classification accuracy than an individual classifier model.

The prediction task can be a time series problem, where the training data for model generation is recorded over a long span of time; in such cases batch learning is done [7]. In batch learning, the models generated on the individual batches up to the previous time unit are ensembled to form the resultant model for testing the present batch of data. A prediction task can also be non-time series, where the training data for model generation contains various instances at one particular time instant. Batch learning is not feasible in such classifications since all the instances are equally related to each other. Hence, to obtain an ensemble of classifiers, the techniques available for generating the individual models are bagging [6] with bootstrap subsampling, deep learning, feature subset selection, etc. These techniques aim at increasing the diversity of the ensemble of classifiers.

Even in an ensemble of classifier models, the ill-posed problem of overfitting occurs. This problem can be handled through regularization. The vector norms applied in the process of regularization handle overfitting by reducing the mean squared distance between the training instances.

This paper deals with three prediction problems: first, the prediction of the type of IRIS plant from among Iris Setosa, Iris Versicolour and Iris Virginica; second, the prediction of a good or bad radar return from the Ionosphere; and third, the prediction of the type of wheat kernel from among the Kama, Rosa and Canadian varieties. The predictions are made through a regularized weighted ensemble of deep support vector machine classifiers. The individual models for the ensemble learning are generated through feature subset selection and deep learning. A weight is assigned to each individual model by the majority voting technique. These weights are then regularized through four variations, i.e. norm 1, norm 2, Tikhonov and singular value decomposition (SVD) reduced norm 2 regularization. This form of regularization reduces the curvature of each depression and convolution of the non-linear boundary plot of the SVM, and hence the loss function is modified to promote generalization and provide the essential curve fitting over the input feature vectors for classification. To the best of our knowledge, this technique of regularizing the weights with deep learning and such ensemble learning approaches in a supervised machine learning task, to deal with the problem of classifier overfitting, has not yet been applied to these prediction problems. In the remainder of the paper, the datasets and background concepts are described first, followed by the algorithm, framework, experimental results and comparative analysis.

    2. DATA SET

The three prediction problems used in this paper are summarized in Table 1. The training set and test set comprise 70% and 30% of the whole database respectively. This 7:3 ratio is arbitrary, but it was chosen because it is a practical ratio used in most machine learning experiments.
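As an illustration of this split (not part of the original experiments), the following sketch performs a stratified 70/30 partition of the Iris data using scikit-learn, an assumed stand-in for direct use of libSVM, so that the per-class counts of Table 2 are preserved.

# Illustrative sketch only: stratified 70/30 train/test split of the Iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% training / 30% test, stratified so each class keeps the 35/15 proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)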


2.1. IRIS Dataset

The Iris database was created by R.A. Fisher and donated by Michael Marshall in July 1988 [8]. It is a popular dataset and has been used successfully in several prediction and pattern recognition problems. The dataset contains 3 classes specifying the type of iris plant from among Iris Setosa, Iris Versicolour and Iris Virginica, with a total of 50 instances per class. The classification problem is the prediction of the category of Iris plant. The four attributes, or features, in each record of the dataset are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm). Table 2 describes the number of instances of each class in the total, training and test data of the Iris dataset. Table 3 describes the major previous related work done on the Iris data.

Table 1. Instance distribution in the training and test sets of each dataset

S.No.  Dataset     Year of dataset creation  Number of classes  Number of features  Total number of instances  Training number of instances  Test number of instances
1      IRIS        1988                      3                  4                   150                        105                           45
2      Ionosphere  1989                      2                  34                  351                        246                           105
3      Seed        2012                      3                  7                   210                        147                           63

Table 2. Number of instances of each class in the total, training and test data of the Iris dataset

S. No.  Class             Total number of instances  Training number of instances  Test number of instances
1       Iris Setosa       50                         35                            15
2       Iris Versicolour  50                         35                            15
3       Iris Virginica    50                         35                            15

Table 3. Previous major experiments reported on the Iris dataset and the classification accuracy achieved in each case

S. No.  Year of research  Problem statement                                                                                    Reported classification accuracy (%)
1       2014              Neuro-fuzzy classifier system [9]                                                                    96.70
2       2013              Evolving neural network ensembles using string genetic algorithms for pattern classification [10]   93.30
3       2012              Hybrid SVM and decision tree classifier [11]                                                         97.08
4       2012              Classifier ensemble for SVM [12]                                                                     95.00
5       2011              One-class SVM weighted bagging [13]                                                                  92.00
6       2010              Large margin classifier SVM [14]                                                                     95.30
7       2010              Feature subset selection in neural network classifier [15]                                          97.00
8       2008              SVM based semi-supervised classification [16]                                                        95.00
9       2003              SVM ensemble with majority voting [17]: SVM / Bagging / Boosting                                     96.50 / 96.80 / 97.20


2.2. Ionosphere Dataset

The Ionosphere database was created at Johns Hopkins University and donated by Vince Sigillito in 1989 [18]. The dataset was collected using a radar system containing a phased array of 16 high-frequency antennas, with the help of which the free electrons in the ionosphere are recorded. The two classes into which the instances have to be categorized are "good" and "bad" ionosphere returns. Predictions are made on the basis of 34 attributes; this large number of attributes distinguishes this dataset from the other two datasets described in this section. Table 4 shows the number of instances of each class in the total, training and test data of the Ionosphere dataset. Table 5 shows the previous major similar contributions on the Ionosphere dataset.

Table 4. Number of instances of each class in the total, training and test data of the Ionosphere dataset

S. No.  Class              Total number of instances  Training number of instances  Test number of instances
1       Good radar signal  224                        168                           56
2       Bad radar signal   127                        78                            49

Table 5. Previous major experiments reported on the Ionosphere dataset and the classification accuracy achieved in each case

S. No.  Year of research  Problem statement                                                    Reported classification accuracy (%)
1       2014              Classifier ensemble based on weighted accuracy and diversity [19]    94.00
2       2014              Weighted classifier ensemble SVM [20]                                94.00
3       2013              Artificial immune recognition through SVM classification [21]        93.00
4       2013              One-class ensemble classifier majority voting approach [22]          89.80
5       2010              Fast local radial basis function kernel SVM classification [23]      93.72
6       2008              Oblique decision tree embedded with SVM classification [24]          92.59
7       2008              SVM infinite ensemble learning [25]                                  92.00
8       2006              Evolving ensemble of classifiers with majority voting [26]           81.00

    2.3. Seed Dataset

The Seed database is relatively new and hence has very few previous experiments reported on it. The dataset records the geometrical properties of wheat kernels, which characterize and differentiate the wheat varieties Kama, Rosa and Canadian. X-ray techniques were used for the collection of the dataset [27]. The seven parameters of the wheat kernels forming the feature set are area (A), compactness (C = 4πA/P²), perimeter (P), length of kernel, asymmetry coefficient, width of kernel, and length of kernel groove. Table 6 shows the number of instances of each class in the total, training and test data of the Seed dataset. Table 7 shows the major


previous similar contribution on the Seed data. (To the best of our knowledge, this data has been used for similar proposals only by its developers to date; hence a single previous work is reported in Table 7.)

Table 6. Number of instances of each class in the total, training and test data of the Seed dataset

S. No.  Class     Total number of instances  Training number of instances  Test number of instances
1       Kama      70                         49                            21
2       Rosa      70                         49                            21
3       Canadian  70                         49                            21

Table 7. Previous major experiments reported on the Seed dataset and the classification accuracy achieved in each case

S. No.  Year of research  Problem statement                                          Reported classification accuracy (%)
1       2012              Complete gradient clustering with K-Means algorithm [28]   92.00

3. BACKGROUND APPROACH

3.1. SVM Classifier

The origin of SVM classifiers lies in VC dimensions. The VC dimension is defined on a set of functions; it is the maximum number of points that can be separated in all possible ways by that set of functions. Non-linearly separable data are transformed to higher dimensions to achieve classification through SVM (Figure 1). The margin between the classes can be a soft margin or a hard margin (Figure 2). In the case of soft margin classifiers, the generated model compensates for the misclassified instances. The hard margin, however, does not allow any misclassification; instead, it plots a strict non-linear boundary to avoid misclassification. SVM classifies the data by optimizing the hinge loss function. Soft margin classification is more prevalent than hard margin classification since the latter suffers a very high rate of overfitting.

    Figure 1. SVM
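The soft/hard margin distinction can be sketched with scikit-learn's SVC, which wraps the same libSVM used in this paper (the specific parameter values below are illustrative assumptions): a small C tolerates margin violations, while a very large C approximates a hard margin and is prone to overfitting.

# Minimal sketch: soft margin vs. an approximately hard margin via the C parameter.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

soft = SVC(kernel="rbf", C=1.0).fit(X, y)      # soft margin: misclassifications allowed
hard = SVC(kernel="rbf", C=1e6).fit(X, y)      # ~hard margin: violations heavily penalized

print(len(soft.support_), len(hard.support_))  # number of support vectors in each model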


3.2. Ensemble of Classifiers

Ensemble learning is the process of training multiple learning machines individually and then combining their outputs, similar to a committee of decision makers. The principle behind this method of decision making is that the individual predictions, combined appropriately, should have better overall accuracy, on average, than any individual committee member [29]. The prime aggregation methods applied in ensemble learning are voting techniques such as majority voting, Borda count aggregation, behaviour knowledge based aggregation, dynamic classifier selection, etc. [30]. Of these, our proposed learning technique uses majority voting [31] aggregation. The three versions of majority voting are unanimous voting, simple voting and plurality voting; plurality voting is the most optimal form of majority voting.

Majority voting in the proposal of this paper aims at giving high weightage to the more qualified experts in the ensemble of classifiers. The expertise is inversely proportional to the classification error, as illustrated in the sketch below.

    Figure 2. Hard Margin SVM plot

3.3. Feature Subset Selection

Feature selection algorithms attempt to select features that are useful and to deselect features that are not helpful, or are even destructive, to learning [32]. Feature subset selection is an important pre-processing phase in machine learning [33]. At times, some features are removed entirely in this phase; however, these removed features may become important when incorporated in some combination with other features. This disadvantage of feature selection can be removed by utilizing it in ensemble learning, where several combinations of features are selected through some algorithm to form the individual models to be ensembled. Various selection algorithms are exhaustive selection (evaluation of all possible subsets of features), branch and bound selection (evaluation using the branch and bound algorithm), sequential forward selection (SFS) (select the best single feature and then add one feature at a time in the combination that maximizes decision accuracy), sequential backward selection (SBS) (select all the features and remove one feature at a time such that decision accuracy is maximized) and best individual feature selection (evaluation of all N features individually and then taking the best set of features), etc. [34]. SFS is a bottom-up procedure and SBS is a top-down procedure. Exhaustive selection is the most ideal approach but is feasible only when the number of attributes is small; otherwise the number of possible combinations grows exponentially and becomes impossible to handle.


3.4. Deep Learning

The deep SVM is inspired by the success of deep neural networks [35], deep belief networks [35], deep Boltzmann machines [36], etc. A multilayer perceptron with many hidden layers is an example of deep learning. Deep learning is a class of machine learning techniques that learns multiple levels of representation in deep architectures [37]. Conventional classifiers risk getting trapped in local optima of the objective function, but deep architectures learn the feature representations through both supervised training and fine tuning at the further, deeper phases of learning. The first phase of the deep SVM is the standard training process. In the second phase, the kernel activations of the support vectors from the first phase are used as inputs for another SVM, and so on, up to whatever level of tuning is required [38]. Usually the tuning starts to repeat after 3-4 levels of deep learning. This training procedure is greedy in nature, which makes it computationally very efficient. Ensembling each phase of learning in the deep learning further increases the precision of the model. However, even with the fine-tuning learning, the model function still overfits the data points due to the non-linear kernel activation learning.
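A minimal two-phase sketch of this deep SVM idea is given below, assuming scikit-learn interfaces (the paper itself builds on libSVM) and an illustrative gamma value: the RBF kernel activations of the first phase's support vectors become the input representation for a second SVM.

# Hedged sketch of deep SVM level-1 stacking: kernel activations of support vectors as new features.
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

gamma = 0.5                                        # kernel bandwidth (illustrative value)
level1 = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)

sv = level1.support_vectors_                       # support vectors of the first phase
phi_tr = rbf_kernel(X_tr, sv, gamma=gamma)         # kernel activations as the new representation
phi_te = rbf_kernel(X_te, sv, gamma=gamma)

level2 = SVC(kernel="rbf", gamma=gamma).fit(phi_tr, y_tr)   # second (fine-tuning) phase
print(level1.score(X_te, y_te), level2.score(phi_te, y_te))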

3.5. Regularization

The concept of regularization came into existence in the 1990s. In supervised machine learning problems, accurate prediction is more important than a close fit of the function to the data; hence generalization is appreciated, or in other words, overfitting of the function has to be checked. In Figure 3 the blue curve is a degree-2 curve, the red curve is a degree-4 curve and the green curve is a degree-8 curve, the highest degree of the three. The green curve plots a close-fit boundary between the two classes, but its test accuracy decreases. The blue curve, in contrast, shows the minimum training accuracy but the greatest chance of improvement in test accuracy. The green curve marks overfitting. Hence it can be said that overfitting occurs when generalization decreases. Regularization is a measure to check this overfitting and provides problem stability. Regularization restricts the hypothesis space to a linear function or a polynomial of a particular degree according to the scenario, and smoothness is imparted to the function by placing it in a Reproducing Kernel Hilbert Space (RKHS) [39]. A regularization parameter λ associated with the regularization term of the optimization function controls the trade-off between stability and accuracy.

    Figure 3. Fitting of classifier on the data set


In the case of ensemble learning, regularization can be applied in the optimization of the loss function. Doing so reduces the degree of the best-fit polynomial and improves test classification accuracy. Alternatively, overfitting can also be dealt with by keeping the degree of the best-fit function constant and regularizing the weight associated with each individual classifier participating in the ensemble learning. This reduces the curvature of each positive or negative depression in the curve without reducing the degree of the whole curve. Hence the loss function is modified to provide the boundary fitting over the input feature vectors.

Another statistical technique is bootstrap resampling, in which a new dataset DT′ is drawn from the previous dataset DT by random sampling with replacement. Bagging is performed by applying this over several iterations and then performing ensemble learning on the result. For a large DT, the number of individual samples that are not present in a given bootstrapped dataset is large. The probability that the first training sample is not selected in one draw is (1 − 1/N), and the probability that it is not selected at all is (1 − 1/N)^N [1]. As N → ∞ this tends to 1/e ≈ 0.37, so only about 63% of the original training samples are represented in any bootstrapped set. Since bagging reduces variance, it provides an alternative approach to regularization [6], because even if each classifier is individually overfit, they are likely to be overfit to different things.

4. PROPOSED WORK

In our work, a regularized ensemble of deep SVM classifiers has been used, which shows a marked improvement in the classification accuracy of the prediction problems. For training and optimization of our problem, we have used the popular library libSVM [40, 41]. The ensemble of deep classifiers is generated using the four different frameworks shown in Figures 4, 5, 6 and 7. Figure 4 shows the ensemble of classifiers based on the feature subset selection framework, where the individual models are formed by training on different feature subsets. Even features that do not contribute well in isolation or in the total combination may work well in some combinations; this framework explores the best possible decisions using feature combinations. Figure 5 shows the level-1 ensemble of deep classifiers, where each individual model is generated by the training in each phase of deep learning, providing improved basic training through the fine tuning of the deep phases. Figure 6 shows the level-2 ensemble of deep classifiers, where fine tuning is done at a further level. Figure 7 combines the aims of Figures 4 and 6, i.e. ensemble learning of deep classifiers with feature subset selection.

    Figure 4. Ensemble of classifiers based on feature subset selection framework


    Figure 5. Ensemble of deep classifiers level 1 framework

    Figure 6. Ensemble of deep classifiers level 2 framework


    Figure 7. Ensemble of deep classifiers learning with feature subset selection

For SVM, the loss function optimized is the hinge loss L(f(x), y) = max(0, 1 − y·f(x)). It has been observed that the regularization technique that produces the best accuracy for our proposed work is the singular value decomposition (SVD) reduced weight matrix with regularization parameter λ1 together with the square of the norm 2 of the weight matrix with regularization parameter λ2. The other regularization factors are norm 1, norm 2 and Tikhonov regularization. The objective function is described in equation (1):

minβ Σj max(0, 1 − yj · Σi βi fi(xj)) + λ1·SVD(β) + λ2·(||β||2)²        (1)

Here fi denotes the i-th individual classifier and βi is its weight, achieved through regularized majority voting.
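Since the original image of equation (1) did not survive extraction, the sketch below only follows the surrounding prose: hinge loss of the weighted ensemble output plus the SVD reduced norm 2 penalty on the weight vector. All names and values are illustrative assumptions, and for a plain weight vector the single singular value reduces to its 2-norm (the paper applies SVD to the weight matrix).

# Illustrative evaluation of the regularised weighted-ensemble hinge-loss objective.
import numpy as np

def objective(beta, F, y, lam1, lam2):
    """beta: (t,) classifier weights; F: (n, t) per-classifier outputs; y: (n,) labels in {-1,+1}."""
    ensemble_score = F @ beta                            # weighted ensemble output f(x)
    hinge = np.maximum(0.0, 1.0 - y * ensemble_score)    # L(f(x), y) = max(0, 1 - y.f(x))
    svd_term = np.sum(np.linalg.svd(beta.reshape(1, -1), compute_uv=False))
    return hinge.sum() + lam1 * svd_term + lam2 * np.sum(beta ** 2)

rng = np.random.default_rng(0)
F = rng.choice([-1.0, 1.0], size=(20, 3))                # made-up outputs of 3 classifiers
y = rng.choice([-1.0, 1.0], size=20)
print(objective(np.array([0.5, 0.3, 0.2]), F, y, lam1=0.1, lam2=0.1))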

Algorithm 1: Regularized ensemble of classifiers using exhaustive feature subset selection

1: Start
2: Find all the possible combinations of features
3: Train an SVM classifier on each combination obtained in step 2
4: Estimate the weights {β1, …, βt} associated with each individual model through regularized majority voting
5: Evaluate the ensemble classifier model
6: Report the ensemble model, the classification accuracy on the test data set and the weights {β1, …, βt}
7: End

Algorithm 2: Regularized ensemble of classifiers using best N feature subset selection

1: Start
2: Train an SVM classifier on each individual feature
3: Record the accuracy obtained and the corresponding feature, in descending order of accuracy
4: Train SVM classifiers on Classifierset = {Best N, Best N−1, Best N−2, …, Best 1}


5: Estimate the weights {β1, …, βt} associated with each member of Classifierset through the regularized majority voting technique
6: Evaluate the ensemble classifier model
7: Report the ensemble model, the classification accuracy on the test data set and the weights {β1, …, βt}
8: End

Algorithm 3: Regularized ensemble of deep classifiers

1: Start
2: for level = 1 : t
3:     Train an SVM classifier on data set D and record the model generated in [Model]
4:     Generate a new data set D′ from the support vectors of the model generated
5:     D = D′
6: end for
7: Estimate the weights {β1, …, βt} associated with each member of [Model] through the regularized majority voting technique
8: Evaluate the ensemble classifier model
9: Report the ensemble model, the classification accuracy on the test data set and the weights {β1, …, βt}
10: End

The regularization parameter λ associated with the regularization term is important for controlling the trade-off between stability and accuracy. Many regularization techniques exist, and this remains a topic of further research. L1 regularization is the norm 1 regularization factor, which penalizes all the factors equally and focuses on selecting only the relevant factors. Its numerical definition is λ1·||β||1. The L1 penalty is linear, which tends to produce many points with zero curvature; a disadvantage of this regularizer is slow convergence on large-scale problems. Second, the L2 regularizer minimizes curvature at all points of the curve by applying a penalty that scales with the square of the curvature. Its numerical definition is λ1·||β||2, and its complexity is greater than that of the L1 regularizer. Third, the Tikhonov regularizer is a special case of L2 regularization, numerically defined by the term (λ1)²·(||β||2)². Finally, the SVD reduced norm 2 regularization is represented as λ1·SVD(β) + λ2·(||β||2)². SVD has multiple roles: it can be viewed as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items, as a method for identifying and ordering the dimensions along which data points exhibit the most variation, and as a method for data reduction by finding the best approximation of the original data points using fewer dimensions. The regularization path varies with the experimental conditions. A small sketch of these four terms is given below.
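The following sketch writes the four regularization terms named above for a weight vector β. SVD(β) is interpreted here as the sum of singular values of the weight matrix, which for a plain vector reduces to its 2-norm; the exact matrix shape used in the paper is not reproduced in this extraction, so treat this as an illustrative assumption.

# Illustrative implementations of the four regularization terms.
import numpy as np

def norm1(beta, lam1):                 # L1: lam1 * ||beta||_1
    return lam1 * np.sum(np.abs(beta))

def norm2(beta, lam1):                 # L2: lam1 * ||beta||_2
    return lam1 * np.sqrt(np.sum(beta ** 2))

def tikhonov(beta, lam1):              # Tikhonov: (lam1)^2 * (||beta||_2)^2
    return (lam1 ** 2) * np.sum(beta ** 2)

def svd_norm2(beta, lam1, lam2):       # SVD reduced norm 2: lam1*SVD(beta) + lam2*(||beta||_2)^2
    s = np.linalg.svd(np.atleast_2d(beta), compute_uv=False)
    return lam1 * s.sum() + lam2 * np.sum(beta ** 2)

beta = np.array([0.5, 0.3, 0.2])
print(norm1(beta, 0.1), norm2(beta, 0.1), tikhonov(beta, 0.1), svd_norm2(beta, 0.1, 0.1))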

5. EXPERIMENTS

In all the experiments listed below, the SVM classifier is used because it evaluates dot products of vectors in the higher dimension to construct the dividing boundary. The choice of kernel function depends on the model to be plotted. A polynomial kernel allows modelling feature conjunctions up to the order of the polynomial. Radial basis function (RBF) kernels allow plotting circular boundaries in higher dimensions, and a linear kernel allows plotting linear boundaries in higher dimensions. Multiclass classification is best achieved through the RBF kernel. If γ is the kernel bandwidth parameter and (Xi, Xj) are the vectors to be transformed to higher dimensions, equation (2) shows the RBF kernel equation:

K(Xi, Xj) = exp(−γ ||Xi − Xj||²)        (2)
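A direct, self-contained implementation of equation (2) for two feature vectors reads as follows (values are illustrative).

# RBF kernel of equation (2): K(Xi, Xj) = exp(-gamma * ||Xi - Xj||^2).
import numpy as np

def rbf(xi, xj, gamma):
    """RBF kernel value for two feature vectors; gamma is the kernel bandwidth."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

print(rbf(np.array([1.0, 2.0]), np.array([1.5, 1.0]), gamma=0.5))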


Another important algorithm used is grid search for parameter estimation. In v-fold cross-validation, the training set is divided into v subsets of equal size; classifiers are trained on v−1 subsets and tested on the remaining subset. Hence each instance is predicted once, and the cross-validation accuracy is the percentage of data that is correctly classified. The kernel parameters (C, γ) are estimated using cross-validation: various combinations of (C, γ) are tried and the one with the best cross-validation accuracy is picked. In the experiments of our proposed work, the libSVM library [40, 41] is used for training multi-class SVMs with the RBF kernel. The features in the training and test datasets were scaled to the range [−1, +1]. 10-fold cross-validation is used for choosing the kernel bandwidth parameter γ and the SVM C parameter through grid search. The ranges of C and γ are [2⁻¹⁰, 2⁻⁹, …, 2⁵] and [2⁻⁵, 2⁻⁴, …, 2¹⁰] respectively. The ranges of the regularization parameters are 0 < λ1 < 0.5 and 0 < λ2 < 0.5. Five cases of experiments are described below; the results of the bagging technique are listed in Table 8 for comparison.
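The parameter search can be sketched as below, assuming scikit-learn's GridSearchCV over an RBF-kernel SVC as a stand-in for driving libSVM directly; the grid follows the C and γ ranges stated above, and features are first scaled to [−1, +1].

# Hedged sketch of the (C, gamma) grid search with 10-fold cross-validation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # scale features to [-1, +1]

param_grid = {"C": 2.0 ** np.arange(-10, 6),                # 2^-10 ... 2^5
              "gamma": 2.0 ** np.arange(-5, 11)}            # 2^-5 ... 2^10
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)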

Case 1: Bagging ensemble of classifiers
Case 2: Ensemble of classifiers based on feature subset selection
Case 3: Ensemble of classifiers in deep learning level 1
Case 4: Ensemble of classifiers in deep learning level 2
Case 5: Ensemble of classifiers in deep learning level 1 with feature subset selection

Cases 2, 3, 4 and 5 have subcases for the following regularization schemes:

Setting 1: SVD reduced norm 2 regularization
Setting 2: Norm 1 regularization
Setting 3: Norm 2 regularization
Setting 4: Tikhonov regularization

Table 8. Results of the bagging ensemble of classifiers on all three datasets

S.No.  Dataset              Classification accuracy (%) with bagging ensemble of classifiers
1      IRIS dataset         96.66
2      Ionosphere dataset   87.87
3      Seed dataset         95.38

For feature subset selection, the IRIS dataset uses Algorithm 1, i.e. exhaustive feature subset selection, which is the most optimal selection algorithm. For the Ionosphere and Seed datasets, since the number of features or attributes is very large, it is very lengthy and highly complex to evaluate all the possible combinations of attributes; hence they both use Algorithm 2, i.e. best N feature subset selection. Figure 8 presents the classification accuracy results of the experiments on the IRIS dataset. Figures 9 and 10 present the 2D and 3D scatter plots respectively, where different colours mark the different class vectors. Similarly, Figures 11, 12 and 13 are the corresponding results on the Ionosphere dataset, and Figures 14, 15 and 16 are the corresponding results on the Seed dataset.


    Figure 8. Results of experiments on IRIS Dataset

Figure 9. 2D scatter plot between all pairs of attributes in the IRIS dataset

Figure 10. 3D scatter plot between all pairs of attributes in the IRIS dataset


    Figure 11. Results of experiment on Ionosphere dataset

    Figure 12. 2D Scatter plot on the best set of features in Ionosphere dataset.

    Figure 13. 3D Scatter plot on the best set of features in Ionosphere dataset


    Figure 14. Results of experiment on seed dataset

    Figure 15. 2D Scatter plot on the best set of features in Seed dataset.


    Figure 16. 3D Scatter plot on the best set of features in Seed dataset.

6. OBSERVATION

The results of all three sets of experiments above show improved classification accuracy over the major previously reported results in the case of the level-2 ensemble of deep classifiers with SVD reduced norm 2 regularization, which reaches nearly 99%. The time taken in this particular case for the various datasets is reported in Table 9. Note that the time taken for the Ionosphere data is comparatively larger than for the other two datasets because of its comparatively large number of features. Deep learning on the complete dataset generates better results than deep learning on the feature subset selection schemes, because fine tuning in the presence of all the features is better than fine tuning on a feature subset. The penalty in norm 1 regularization deletes many noisy features by estimating their coefficients as zero, since it is not differentiable at zero, whereas the penalty in norm 2 regularization uses all the input features in classification because it is differentiable at all points of the function. Hence norm 2 regularization achieves higher-order smoothness in the curve estimation.

Table 9. Time taken for deep learning level 2 with full feature set ensemble learning and SVD reduced norm 2 regularization

S. No.  Dataset              Time (sec)
1       IRIS dataset         16.37
2       Ionosphere dataset   123.78
3       Seed dataset         32.78

Next, since the bagging model includes only about 63% of the original training samples in any bootstrapped set (as discussed in section 3.5), the regularization provided by this technique is not as smooth as that of the ensemble of deep classifiers. The regularizers applied above can also be analysed on the basis of worst-case time complexity. In norm 1 regularization, a total of (t−1) sum operations are computed at run time, giving time complexity O(t). In norm 2 regularization, there are (t−1) sum operations, t squaring operations and 1 square root operation, giving time complexity O(3t).


A first-degree regularization parameter is applied in these cases. In Tikhonov regularization, the time complexity O(3t) is the same as for L2 regularization, but here a second-degree regularization parameter is applied. In (SVD + norm 2), two expressions are involved: O(t²) for the SVD computation plus O(3t) for the norm 2 computation, so the reported time complexity is O(t²).

7. CONCLUSION

The deep learning approach to improving classification accuracy is very prevalent in the artificial neural network field, while the deep SVM classifier is still an emerging concept. The experiments here demonstrate good scope for deep learning with SVM classifiers, and regularization of the deep learning marks a further improvement in classification accuracy. Many other regularization techniques could be applied for comparison and better results, and other feature selection strategies such as SFS and SBS could also be applied for feature subset selection.

    REFERENCES

[1] Rob Schapire, "Theoretical Machine Learning", COS 511, Lecture No. 1, p. 1-6, 2008.
[2] R. Sathya, Annamma Abraham, "Comparison of Supervised and Unsupervised Learning Algorithms for Pattern Classification", IJARAI, Vol. 2, No. 2, 2013.
[3] D. Michie, D.J. Spiegelhalter, C.C. Taylor, "Machine Learning, Neural and Statistical Classification", Tutorial section 2.1, p. 6-16, 1994.
[4] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica, Vol. 31, p. 249-268, 2007.
[5] Koby Crammer, Yoram Singer, "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines", JMLR 2, p. 256-295, 2001.
[6] Hal Daume III, "A Course in Machine Learning", Ensemble Learning, CIML, V0-8, Ch. 1, p. 148-155, 2012.
[7] A. Vergara, Shankar Vembu, Tuba Ayhan, Margaret A. Ryan, Margie L. Homer, Ramon Huerta, "Chemical gas sensor drift compensation using classifier ensembles", Sensors and Actuators B, p. 166-167, 2012.
[8] Fisher's IRIS dataset, UCI repository, https://archive.ics.uci.edu/ml/datasets/Iris, 1988.
[9] Vaishali Arya, R.K. Rathy, "An Efficient Neuro-Fuzzy Approach for Classification of Iris Dataset", International Conference on Reliability, Optimization and Information Technology, p. 161-165, 2014.
[10] Xiaoyang Fu, Shuqing Zhang, "Evolving Neural Network Ensembles Using Variable String Genetic Algorithm for Pattern Classification", Sixth International Conference on Computational Intelligence, p. 81-85, 2013.
[11] Anshu Bharadwaj, Sonajharia Minz, "Hybrid Approach for Classification using Support Vector Machine and Decision Tree", International Conference on Advances in Electronics, Electrical and Computer Science Engineering, p. 337-341, 2012.
[12] Hamid Parvin, Sajad Parvin, "Robust Classifier Ensemble for Improving the Performance of Classification", Eleventh Mexican International Conference on Artificial Intelligence, IEEE special session, Vol. 11, p. 52-57, 2012.
[13] Xue-Fang Chen, Hong-Jie Xing, Xi-Zhao Wang, "A modified AdaBoost method for one-class SVM and its application to novelty detection", IEEE, Vol. 11, p. 3506-3511, 2011.
[14] Hakan Cevikalp, Bill Triggs, Hasan Serhan Yavuz, Yalc, Mahide, Atalay Barkana, "Large margin classifiers based on affine hulls", Elsevier, Vol. 73, p. 3160-3168, 2010.
[15] A. Marcano-Cedeño, J. Quintanilla-Domínguez, M.G. Cortina-Januchs, D. Andina, "Feature Selection Using Sequential Forward Selection and Classification Applying Artificial Metaplasticity Neural Network", IEEE, No. 36, p. 2845-2850, 2010.
[16] Narendra S. Chaudhari, Aruna Tiwari, Jaya Thomas, "Performance Evaluation of SVM Based Semi-supervised Classification Algorithm", 10th Intl. Conf. on Control, Automation, Robotics and Vision, No. 10, p. 1942-1947, 2008.
[17] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, Sung Yang Bang, "Constructing support vector machine ensemble", The Journal of the Pattern Recognition Society, Vol. 36, p. 2757-2767, 2003.


[18] Vince Sigillito, Ionosphere Dataset, UCI repository, https://archive.ics.uci.edu/ml/datasets/Ionosphere, 1989.
[19] Xiaodong Zeng, Derek F. Wong, Lidia S. Chao, "Constructing Better Classifier Ensemble Based on Weighted Accuracy and Diversity Measure", The Scientific World Journal, Volume 2014, Article No. 961747, p. 1-12, 2014.
[20] Shasha Mao, Licheng Jiao, Lin Xiong, Shuiping Gou, Bo Chen, Sai-Kit Yeung, "Weighted classifier ensemble based on quadratic form", Elsevier, Vol. 48, Issue 5, p. 1688-1706, 2014.
[21] Darwin Tay, Chueh Loo Poh, Richard I. Kitney, "An Evolutionary Data-Conscious Artificial Immune Recognition System", Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, p. 1101-1108, 2013.
[22] Eitan Menahem, Lior Rokach, Yuval Elovici, "Combining One-Class Classifiers via Meta Learning", Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, No. 22, p. 2435-2440, 2013.
[23] Nicola Segata, Enrico Blanzieri, "Fast and Scalable Local Kernel Machines", JMLR, Vol. 1, p. 1883-1926, 2010.
[24] Vlado Menkovski, Ioannis T. Christou, Sofoklis Efremidis, "Oblique Decision Trees Using Embedded Support Vector Machines in Classifier Ensembles", Vol. 11, p. 1-6, 2008.
[25] Hsuan-Tien Lin, Ling Li, "Support Vector Machinery for Infinite Ensemble Learning", JMLR, Vol. 9, p. 285-312, 2008.
[26] Albert Hung-Ren Ko, Robert Sabourin, Alceu de Souza Britto, "Evolving Ensemble of Classifiers in Random Subspace", Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, p. 1473-1480, 2006.
[27] Gorzata's Seed Data set, UCI repository, https://archive.ics.uci.edu/ml/datasets/seeds, 2012.
[28] M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, "A Complete Gradient Clustering Algorithm for Feature Analysis of X-ray Images", Information Technology in Biomedicine, Springer-Verlag, p. 15-24, 2010.
[29] Gavin Brown, Encyclopaedia of Machine Learning, Vol. 1, p. 312-320, 2010.
[30] Robi Polikar, "Ensemble based systems in decision making", IEEE, Vol. 6, Issue 3, p. 21-45, 2006.
[31] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, Sung-Yang Bang, "Support Vector Machine Ensemble with Bagging", Springer, LNCS 2388, p. 397-408, 2002.
[32] David W. Opitz, "Feature Selection for Ensembles", American Association for Artificial Intelligence, AAAI Proceeding No. 99, p. 1-6, 1999.
[33] Mohamed A. Aly, "Novel Methods for the Feature Subset Ensembles Approach", International Journal of Artificial Intelligence and Machine Learning, Vol. 6, No. 4, p. 1-7, 2006.
[34] Anil K. Jain, Robert P.W. Duin, Jianchang Mao, "Statistical Pattern Recognition: A Review", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, Issue 1, p. 4-37, 2000.
[35] Dong Yu, Li Deng, "Deep Learning and Its Applications to Signal and Information Processing", IEEE Signal Processing Magazine, Vol. 28, Issue 1, p. 145-154, 2011.
[36] Nitish Srivastava, Ruslan Salakhutdinov, "Multimodal Learning with Deep Boltzmann Machines", ICML, 25th Annual Conference on Learning Theory, No. 25, p. 1-9, 2012.
[37] Xue-Wen Chen, Xiaotong Lin, "Big Data Deep Learning: Challenges and Perspectives", IEEE Access, Vol. 2, p. 514-525, 2014.
[38] Azizi Abdullah, Remco C. Veltkamp, Marco A. Wiering, "An Ensemble of Deep Support Vector Machines for Image Categorization", International Conference of Soft Computing and Pattern Recognition, p. 301-306, 2009.
[39] Hal Daume III, "From Zero to Reproducing Kernel Hilbert Spaces in Twelve Pages or Less", p. 1-12, 2004.
[40] C.-C. Chang, C.-J. Lin, LIBSVM: A Library for Support Vector Machines, software available at http://www.csie.ntu.edu.tw/cjlin/libsvm, 2001.
[41] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, "A Practical Guide to Support Vector Classification", Department of Computer Science, National Taiwan University, Taipei 106, Taiwan, p. 1-16, 2003.


    Authors

Ms. Shruti Asmita (B.Tech., 2013, KEC Ghaziabad, Uttar Pradesh Technical University, Lucknow) is an M.Tech. Computer Science scholar at Banasthali University, Jaipur, and is pursuing her research internship at IIT-BHU (CSE), Varanasi. Her research interests include data mining, image processing, machine learning and sensor networks.

Dr. K.K. Shukla (Ph.D., 1993, Institute of Technology (BHU), Varanasi) is professor and current head of department at the Indian Institute of Technology, Banaras Hindu University, Varanasi, India. He was awarded a B.Tech. from APSU, Rewa in 1980, an M.Tech. from IT (BHU) in 1982 and a Ph.D. from IT (BHU) in 1993. He has 30 years of research and teaching experience, with more than 120 research papers in reputed journals and conferences and more than 90 citations. His present research collaborations in India include ISRO and TCS; his collaborations outside India include INRIA, France and ETS, Canada. He has authored several popular books on neuro-computers, RTS scheduling, fuzzy modelling and image compression. His fields of research include image processing and pattern recognition, fuzzy logic, wireless sensor networks and machine learning.