A BA-based algorithm for parameter optimization of support vector machine
Alaa Tharwat
Electrical Dept., Suez Canal University, Egypt; Scientific Research Group in Egypt (SRGE)
Email: [email protected]
November 3, 2016
Alaa Tharwat November 3, 2016 1 / 26
Agenda
The main objective.
Support Vector Machine (SVM).
Bat Algorithm (BA).
The proposed Model.
Experimental Results and Discussions.
Conclusions and Future Work.
The main objective
The main objective is to:
Optimize the SVM parameters.
Support Vector Machine (SVM)
[Figure omitted: a two-class scatter plot (axes x1, x2) showing the separating hyperplane w.xi + b = 0, the two planes H1 (w.xi + b = +1) and H2 (w.xi + b = −1), the support vectors, and the margin of width 2/‖w‖.]
Figure: The structure of building a classifier, which includes N samples and c discriminant functions or classes.
Support Vector Machine (SVM)
The aim of SVM is to select the values of w and b that orient the hyperplane to be as far as possible from the closest samples, and to construct the two planes, H1 and H2, as follows:

H1 → wT xi + b = +1 for yi = +1
H2 → wT xi + b = −1 for yi = −1    (1)

These two equations can be combined as follows:

yi(wT xi + b) − 1 ≥ 0, ∀i = 1, 2, . . . , N    (2)

In SVM, the margin width needs to be maximized subject to Eq. (2) as follows:

min (1/2)‖w‖²
s.t. yi(wT xi + b) − 1 ≥ 0, ∀i = 1, 2, . . . , N    (3)
Support Vector Machine (SVM)
In Eq. (3), minimizing ‖w‖ is equivalent to minimizing (1/2)‖w‖². Moreover, Eq. (3) represents a quadratic programming problem, which is formalized into a Lagrange formula by combining the objective function (min (1/2)‖w‖²) and the constraints (yi(wT xi + b) − 1 ≥ 0) as follows:

min LP = ‖w‖²/2 − ∑_i αi (yi(wT xi + b) − 1)
       = ‖w‖²/2 − ∑_i αi yi(wT xi + b) + ∑_{i=1}^N αi    (4)

where αi ≥ 0, i = 1, 2, . . . , N, are the Lagrange multipliers; each Lagrange multiplier (αi) corresponds to one training sample (xi, yi), and LP denotes the primal problem.
Support Vector Machine (SVM)
To calculate the w, b, and α that minimize Eq. (4), LP is differentiated with respect to w and b, and the derivatives are set to zero:

∂LP/∂w = 0 ⇒ w = ∑_{i=1}^N αi yi xi    (5)

∂LP/∂b = 0 ⇒ ∑_{i=1}^N αi yi = 0    (6)

Substituting Eqs. (5) and (6) into Eq. (4), the dual problem can be written as follows:

max LD = ∑_{i=1}^N αi − (1/2) ∑_{i,j} αi αj yi yj xiT xj
s.t. αi ≥ 0, ∑_{i=1}^N αi yi = 0, ∀i = 1, 2, . . . , N    (7)
Support Vector Machine (SVM)
In SVM, most of the αi's are zero; thus, sparseness is a common property of SVM. The non-zero αi's correspond to the Support Vectors (SVs), which are the samples closest to the separating hyperplane; hence, the SVs determine the maximum-width margin.
A new sample x0 is classified by evaluating y0 = sgn(wT x0 + b); if y0 is positive, the new sample belongs to the positive class; otherwise, it belongs to the negative class.
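The decision rule above can be sketched in a few lines. This is a minimal illustration (not the authors' code) using scikit-learn's SVC with a linear kernel on synthetic two-class data; the clusters and the test point are made up for the example.

```python
# Sketch of the SVM decision rule y0 = sgn(w^T x0 + b) with a linear SVM.
import numpy as np
from sklearn.svm import SVC

# Toy two-class data: class +1 clustered around (2, 2), class -1 around (0, 0).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2)])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]        # the weight vector w
b = clf.intercept_[0]   # the bias b

# Classify a new sample by the sign of w^T x0 + b.
x0 = np.array([2.5, 2.5])          # a point well inside the +1 cluster
y0 = int(np.sign(w @ x0 + b))
print(y0)                          # sign of the decision value
print(y0 == int(clf.predict([x0])[0]))  # matches sklearn's own prediction
```

The manual `sgn(w @ x0 + b)` and `clf.predict` agree, since for a linear kernel the classifier is exactly the learned hyperplane.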
Support Vector Machine (SVM)
In the case of non-separable data, more misclassified samples result. Therefore, the constraints of the linear SVM must be relaxed by adding slack variables, εi, as denoted in Eq. (8):

wT xi + b ≥ +1 − εi for yi = +1
wT xi + b ≤ −1 + εi for yi = −1    (8)

yi(wT xi + b) − 1 + εi ≥ 0, where εi ≥ 0    (9)

min (1/2)‖w‖² + C ∑_{i=1}^N εi
s.t. yi(wT xi + b) − 1 + εi ≥ 0, ∀i = 1, 2, . . . , N    (10)

Equation (10) is formalized into a Lagrange formula as follows:

LP = (1/2)‖w‖² + C ∑_{i=1}^N εi − ∑_{i=1}^N αi[yi(wT xi + b) − 1 + εi] − ∑_{i=1}^N μi εi    (11)

where μi ≥ 0 are the Lagrange multipliers enforcing the constraints εi ≥ 0.
Support Vector Machine (SVM)
∂LP/∂εi = 0 ⇒ C = αi + μi    (12)

where μi ≥ 0 is the Lagrange multiplier of the constraint εi ≥ 0. From Eq. (12) it can be noticed that αi is limited by the upper bound C, i.e. 0 ≤ αi ≤ C. If the data are non-linearly separable, SVM uses kernel functions to map the data into a higher-dimensional space through a nonlinear mapping φ:

min (1/2)‖w‖² + C ∑_{i=1}^N εi
s.t. yi(wT φ(xi) + b) − 1 + εi ≥ 0, ∀i = 1, 2, . . . , N    (13)
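The kernel trick of Eq. (13) can be tried on data that are not linearly separable. A hedged sketch using scikit-learn's SVC with the RBF kernel on a synthetic concentric-circles dataset (not the slides' data); the slides parameterize the RBF kernel by σ, while scikit-learn uses gamma = 1/(2σ²).

```python
# Nonlinear SVM via the RBF kernel: concentric circles cannot be separated
# by a linear hyperplane in the input space, but are separable after the
# implicit mapping phi induced by the kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

sigma = 0.5
clf = SVC(kernel="rbf", C=10.0, gamma=1.0 / (2 * sigma**2)).fit(X, y)
train_error = 1.0 - clf.score(X, y)   # training error rate
print(f"training error: {train_error:.3f}")
print(f"support vectors: {len(clf.support_)}")
```

With a suitable (C, σ) the training error is near zero; the next slides show how badly chosen values of either parameter degrade this.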
Support Vector Machine (SVM): SVM Parameter Optimization
[Figure omitted: three scatter-plot panels, (a) C = 0.01, (b) C = 1, (c) C = 100.]
Figure: The effect of the penalty parameter (C) with linear SVM. Decision boundaries (blue lines), two planes (black lines), support vectors marked with green squares, and misclassified samples marked with red squares.
Support Vector Machine (SVM): SVM Parameter Optimization
Table: The training error rate, number of SVs, and number of misclassified samples of the linear and RBF kernel SVM using different values of C.

        Linear kernel                    RBF kernel (σ = 0.1)
C       Error (%)  # SVs  # Misc.       Error (%)  # SVs  # Misc.
0.01    53.75      52     26            9.51       768    82
0.1     7.14       34     4             6.14       453    53
1       0          18     0             1.04       187    9
10      0          8      0             1.04       75     9
100     0          4      0             0.46       40     4
1000    0          3      0             3.82       40     33
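The table's trend for the linear kernel can be reproduced on toy data: as C grows, the soft margin narrows and the number of support vectors shrinks. This is a hedged illustration on synthetic well-separated clusters, not the slides' dataset, so the exact counts will differ.

```python
# Effect of C on the number of support vectors for a linear SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + 3, rng.randn(100, 2) - 3])  # well separated
y = np.array([1] * 100 + [-1] * 100)

counts = []
for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    counts.append(len(clf.support_))   # number of support vectors at this C

print(dict(zip([0.01, 0.1, 1, 10, 100], counts)))
```

Small C bounds every αi by a tiny value (Eq. (12)), so many samples must become support vectors; large C lets a few samples carry the whole margin.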
Support Vector Machine (SVM): SVM Parameter Optimization
[Figure omitted: three scatter-plot panels, (a) C = 0.01, (b) C = 1, (c) C = 100.]
Figure: The effect of the penalty parameter (C) with nonlinear SVM, where the RBF kernel was used and σ = 0.1. Decision boundaries (black lines), support vectors marked with green squares, and misclassified samples marked with red squares.
Support Vector Machine (SVM): SVM Parameter Optimization
[Figure omitted: three scatter-plot panels, (a) σ = 0.05, (b) σ = 0.1, (c) σ = 0.2.]
Figure: The effect of the RBF parameter (σ) with nonlinear SVM when C = 10. Decision boundaries (blue lines), support vectors marked with green squares, and misclassified samples marked with red squares.
Support Vector Machine (SVM): SVM Parameter Optimization
Table: Training error, number of support vectors, and number of misclassifiedsamples of the RBF kernel SVM when C = 10 using different values of σ.
σ       Training error (%)   # SVs   # Misc. Samples
0.01    44.96                324     388
0.05    0.23                 100     2
0.1     1.04                 74      9
0.2     30.59                216     264
Bat Algorithm (BA)
1 Bats’ Positions (Xi): The positions of the bats are used to calculatethe objective function at that location.
2 Bats’ Velocity (Vi): The directed velocity of the bats is used tomove the bats in the search space to the optimal solution.
3 Pulse Rate (ri): ri ∈ [0, 1] is updated, i.e. increased, as the iterations proceed: ri^{t+1} = ri^0 [1 − exp(−γt)], where γ > 0 is a constant and t is the current iteration.
4 Frequency (fi): this parameter is used to adjust the velocity of bats.
5 Loudness (Ai): The loudness of the emitted sound varies from high when the bat is searching for prey to low when the prey is near.
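The five components above fit together in the standard bat algorithm. The following is a minimal sketch (not the paper's implementation) minimizing the sphere function f(x) = Σx²; the constants, the local-walk step size of 0.1, and the loudness decay α = 0.9 are illustrative choices.

```python
# Standard bat algorithm updates: frequency f_i, velocity V_i, position X_i,
# pulse rate r_i^{t+1} = r_i^0 [1 - exp(-gamma*t)], and decaying loudness A_i.
import numpy as np

def sphere(x):
    """Objective to minimize; global minimum 0 at the origin."""
    return float(np.sum(x ** 2))

rng = np.random.RandomState(1)
n_bats, dim, n_iter = 20, 2, 200
f_lo, f_hi = 0.0, 2.0            # frequency range [fmin, fmax]
alpha, gamma = 0.9, 0.9          # loudness decay, pulse-rate growth constant

X = rng.uniform(-5, 5, (n_bats, dim))   # bat positions
V = np.zeros((n_bats, dim))             # bat velocities
A = np.full(n_bats, 0.5)                # loudness A_i
r0 = np.full(n_bats, 0.5)               # initial pulse rate r_i^0
fitness = np.array([sphere(x) for x in X])
best = X[np.argmin(fitness)].copy()

for t in range(1, n_iter + 1):
    r = r0 * (1.0 - np.exp(-gamma * t))          # pulse rate increases with t
    for i in range(n_bats):
        f_i = f_lo + (f_hi - f_lo) * rng.rand()  # frequency adjusts velocity
        V[i] = V[i] + (X[i] - best) * f_i        # velocity update
        x_new = X[i] + V[i]                      # position update
        if rng.rand() > r[i]:                    # local random walk near best
            x_new = best + 0.1 * rng.randn(dim)
        f_new = sphere(x_new)
        if f_new < fitness[i] and rng.rand() < A[i]:  # accept, loudness drops
            X[i], fitness[i] = x_new, f_new
            A[i] *= alpha
        if f_new < sphere(best):                 # track global best
            best = x_new.copy()

print(f"best solution: {best}, f = {sphere(best):.6f}")
```

In BA-SVM the positions X would be candidate (C, σ) pairs and the objective would be the SVM error rate instead of the sphere function.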
The Proposed Model: BA-SVM
[Flowchart omitted: the dataset is scaled and split into training and testing sets; the BA parameters are initialized and a population of candidate parameters (C and σ) is generated; each candidate is used to train an SVM classifier and its fitness (F) is evaluated; if the stopping criterion is satisfied, the optimized (C and σ) are returned; otherwise, the bat algorithm generates new solutions and the loop repeats.]
Figure: Flowchart of the proposed model (BA-SVM).
The Proposed Model: BA-SVM
Data preprocessing.
Parameters’ Initialization.
Fitness evaluation.
Minimize: F = Ne / N    (14)

where Ne is the number of misclassified samples and N is the total number of samples.
Termination criteria.
Updating positions.
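The fitness evaluation step can be sketched as a function of a candidate (C, σ). This is a hedged illustration of Eq. (14), F = Ne/N, using scikit-learn and a simple hold-out split on the iris dataset; the dataset, split, and conversion gamma = 1/(2σ²) are assumptions of the sketch, not the paper's exact protocol.

```python
# Fitness of a candidate (C, sigma): the fraction of misclassified samples.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def fitness(C, sigma):
    """F = Ne / N for an RBF-SVM trained with the candidate (C, sigma)."""
    clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma**2)).fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    Ne = int(np.sum(y_pred != y_te))   # number of misclassified samples
    return Ne / len(y_te)              # F in [0, 1]; lower is better

print(fitness(10.0, 0.5))
```

The bat algorithm then treats each bat's position as a (C, σ) pair and minimizes this F over the search space.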
Experimental Results and Discussions
Table: Datasets description.
Dataset Dimension # Samples # Classes
Iris (D1)                    4    150   3
Ionosphere (D2)              34   351   2
Liver-disorders (D3)         6    345   2
Breast Cancer (D4)           13   683   2
Sonar (D5)                   60   208   2
Tic-Tac-Toe (D6)             9    958   2
Glass (D7)                   9    214   7
Wine (D8)                    13   178   3
Pima Indians Diabetes (D9)   8    768   2
Experimental Results and Discussions: Parameter setting for BA
[Figure omitted: two panels, (a) testing error rate (%) vs. number of bats; (b) CPU time (secs) vs. number of bats.]
Figure: Effect of the number of bats on the performance of the BA-SVM model for the iris dataset: (a) testing error rate of the BA-SVM model with different numbers of bats; (b) CPU time of the BA-SVM using different numbers of bats.
Experimental Results and Discussions: Parameter setting for BA
[Figure omitted: two panels, (a) testing error rate (%) vs. number of iterations for three runs; (b) CPU time (secs) vs. number of iterations for three runs.]
Figure: Effect of the number of iterations on the performance of the BA-SVM model for the iris dataset using three runs: (a) testing error rate of the BA-SVM model with different numbers of iterations; (b) CPU time of the BA-SVM using different numbers of iterations.
Experimental Results and Discussions: Parameter setting for BA
Table: The initial parameters of bat algorithm.
Parameter Value
Frequency (fmin and fmax)        fmin = 0 and fmax = 2
Pulse rate (r)                   0.5
Loudness (A)                     0.5
Population size                  20
Maximum number of iterations     20
Experimental Results and Discussions: BA-SVM vs. Grid Search
Table: Results of the proposed BA-SVM algorithm and the grid search SVM algorithm (using the RBF kernel).

          Grid Search SVM               BA-SVM                       p-value for
Dataset   Cost time (s)  Test err (%)   Cost time (s)  Test err (%)  Wilcoxon test
D1        268.8          0 ± 0.2        168.2          0 ± 0         <0.005
D2        1064.2         2.1 ± 0.6      645.3          0.3 ± 0.2     <0.005
D3        4399.2         15.1 ± 2.3     2820.4         12 ± 1.2      <0.005
D4        35630.0        2.4 ± 0.7      25450.0        0.8 ± 0.3     <0.005
D5        532.9          1.2 ± 0.5      319.1          0.9 ± 0.8     0.0052
D6        53824.2        6.7 ± 1.4      37120.6        2.1 ± 0.6     <0.005
D7        1710.6         17.4 ± 2.4     1056.7         13.5 ± 1.2    <0.005
D8        587.36         3.7 ± 1.0      367.1          0.3 ± 0.2     <0.005
D9        2028.8         19.8 ± 2.7     1276.0         14.3 ± 1.5    <0.005
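The grid-search baseline can be sketched as an exhaustive scan over a (C, σ) grid scored by cross-validation. This is a minimal illustration, not the paper's code: the grid values, the wine dataset, and 5-fold cross-validation are assumptions of the sketch.

```python
# Grid search over (C, sigma) for an RBF-SVM, the baseline BA-SVM beats in
# cost time: every pair on the grid must be trained and scored.
from itertools import product
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scaling step, as in the proposed model

C_grid = [0.1, 1, 10, 100]
sigma_grid = [0.1, 0.5, 1.0, 2.0]

best_err, best_pair = 1.0, None
for C, sigma in product(C_grid, sigma_grid):
    clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2 * sigma**2))
    acc = cross_val_score(clf, X, y, cv=5).mean()   # mean CV accuracy
    if 1.0 - acc < best_err:
        best_err, best_pair = 1.0 - acc, (C, sigma)

print(f"best (C, sigma) = {best_pair}, CV error = {best_err:.3f}")
```

The cost grows with the product of the grid sizes, which is why BA-SVM, which samples the space adaptively, reaches comparable or lower error in less time in the table above.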
Experimental Results and Discussions: BA-SVM vs. other Optimization Algorithms
Table: Comparison between the BA-SVM, the PSO+SVM approach proposed by Lin [15], and the GA+SVM approach proposed by Huang [16] (test error, %). The last two columns give p-values for the Wilcoxon test.

Dataset   (1) BA-SVM     (2) PSO+SVM     (3) GA+SVM      p (1 vs. 2)   p (1 vs. 3)
D1        0.0 ± 0.0      2.0 ± 0.23      0.0 ± 0.0       <0.005        0.0052
D2        0.30 ± 0.20    2.50 ± 0.65     0.57 ± 0.53     <0.005        <0.005
D3        12.0 ± 1.20    14.23 ± 1.93    16.86 ± 2.45    <0.005        <0.005
D4        0.80 ± 0.30    2.05 ± 0.74     1.0 ± 0.42      <0.005        <0.005
D5        0.90 ± 0.80    11.68 ± 2.64    8.40 ± 3.14     <0.005        <0.005
D6        2.10 ± 0.60    6.48 ± 2.47     8.41 ± 2.19     <0.005        <0.005
D7        13.50 ± 1.20   21.96 ± 4.59    19.26 ± 4.55    <0.005        <0.005
D8        0.30 ± 0.20    0.40 ± 0.33     0.0 ± 0.15      <0.005        0.0058
D9        14.30 ± 1.50   19.79 ± 6.12    16.4 ± 4.21     <0.005        <0.005
Experimental Results and Discussions: BA-SVM vs. other Optimization Algorithms
[Figure omitted: (a) the test-error (%) surface over log(C) and log(σ); (b) the corresponding contour plot of the test error over the same axes.]
Figure: Test error surface and contour plot over the (C, σ) parameter space on the iris dataset.
Conclusions and Future Work
The parameters of SVM (C and σ) and the influence of each parameter on classification performance.
How BA optimizes the SVM parameters.