
Page 1

Support Vector Machine

Figure 6.5 displays the architecture of a support vector machine.

Irrespective of how a support vector machine is implemented, it differs from the conventional approach to the design of a multilayer perceptron in a fundamental way.

In the conventional approach, model complexity is controlled by keeping the number of features (i.e., hidden neurons) small. On the other hand, the support vector

Page 2

machine offers a solution to the design of a learning machine by controlling model complexity independently of dimensionality, as summarized here (Vapnik, 1995,1998):

• Conceptual problem. Dimensionality of the feature (hidden) space is purposely made very large to enable the construction of a decision surface in the form of a hyperplane in that space. For good generalization

Page 3

performance, the model complexity is controlled by imposing certain constraints on the construction of the separating hyperplane, which results in the extraction of a fraction of the training data as support vectors.

Page 4

• Computational problem. Numerical optimization in a high-dimensional space suffers from the curse of dimensionality. This computational problem is avoided by using the notion of an inner-product kernel (defined in accordance with Mercer's theorem) and solving the dual form of the constrained optimization problem formulated in the input (data) space.
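To make the kernel idea concrete, the short numpy sketch below (not part of the original slides; the data points are invented) checks that the degree-2 polynomial kernel (x^T z + 1)^2 evaluated in the input space equals an ordinary inner product taken in a six-dimensional feature space, so the high-dimensional computation never has to be carried out explicitly.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x^T z + 1)^2, evaluated in input space."""
    return (np.dot(x, z) + 1.0) ** 2

def phi(x):
    """Explicit feature map induced by the degree-2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])

k_input = poly_kernel(x, z)        # computed directly in the 2-D input space
k_feature = phi(x) @ phi(z)        # inner product in the 6-D feature space

print(k_input, k_feature)          # both equal 2.89
assert np.isclose(k_input, k_feature)
```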

Page 5

Page 6

Support vector machine

• An approximate implementation of the method of structural risk minimization

• Pattern classification and nonlinear regression

• Construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized

• We may use SVM to construct: RBFN and BP networks
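As an illustration of the maximum-margin idea (a minimal sketch, assuming scikit-learn is available; the toy data are invented), a linear SVC with a very large C approximates the hard-margin machine described above:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes labelled +1 / -1 (invented toy data).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])
d = np.array([+1, +1, +1, -1, -1, -1])

# A linear kernel with a very large C approximates the hard-margin machine:
# it maximizes the margin of separation between the two classes.
svm = SVC(kernel="linear", C=1e6).fit(X, d)

print("support vectors:\n", svm.support_vectors_)
print("w =", svm.coef_[0], "b =", svm.intercept_[0])
print("margin 2/||w|| =", 2.0 / np.linalg.norm(svm.coef_[0]))
```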

Page 7

Optimal Hyperplane for Linearly Separable Patterns

• Consider the training sample {(x_i, d_i)}_{i=1}^N, where x_i is the input pattern and d_i is the desired output:

  w^T x_i + b ≥ 0  for d_i = +1
  w^T x_i + b < 0  for d_i = −1

• w^T x + b = 0 is the equation of a decision surface in the form of a hyperplane.

Page 8

• The separation between the hyperplane and the closest data point is called the margin of separation, denoted by ρ.

• The goal of an SVM is to find the particular hyperplane for which the margin of separation ρ is maximized.

• Optimal hyperplane:

  w_0^T x + b_0 = 0

Page 9

Given the training set T = {(x_i, d_i)}, the pair (w_0, b_0) must satisfy the constraint:

  w_0^T x_i + b_0 ≥ +1  for d_i = +1
  w_0^T x_i + b_0 ≤ −1  for d_i = −1

The particular data points (x_i, d_i) for which the first or second line of the above constraint is satisfied with the equality sign are called support vectors.

Page 10

• Finding the optimal hyperplane with maximum margin of separation 2ρ, where ρ = 1/||w_0||, is equivalent to minimizing the cost function

  Φ(w) = (1/2) w^T w

subject to the constraints

  d_i(w^T x_i + b) ≥ 1   for i = 1, 2, ..., N

• According to Kuhn-Tucker optimization theory, we state the problem as follows:

Page 11

Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function

  Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j

subject to the constraints

  (1) Σ_{i=1}^N α_i d_i = 0

  (2) α_i ≥ 0   for i = 1, 2, ..., N

Page 12

and find the optimal weight vector

  w_0 = Σ_{i=1}^N α_{0,i} d_i x_i    …(1)
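A minimal numerical sketch of this dual problem is given below (not from the slides; the toy data, the SLSQP solver choice, and the variable names are assumptions). It maximizes Q(α) under the two constraints and then recovers the optimal weight vector from Eq. (1) and the bias from a support vector:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable sample (invented data, not from the slides).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])
d = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
N = len(d)

# H_ij = d_i d_j x_i^T x_j, the matrix appearing in the dual objective Q(alpha).
H = (d[:, None] * X) @ (d[:, None] * X).T

def neg_Q(alpha):
    # We minimize -Q(alpha), i.e. maximize Q(alpha) = sum(alpha) - 1/2 alpha^T H alpha.
    return -(alpha.sum() - 0.5 * alpha @ H @ alpha)

res = minimize(
    neg_Q,
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, None)] * N,                               # constraint (2): alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ d}],   # constraint (1): sum alpha_i d_i = 0
)
alpha = res.x

# Optimal weight vector from Eq. (1): w_0 = sum_i alpha_i d_i x_i,
# and bias from a support vector x_s: b_0 = d_s - w_0^T x_s.
w0 = (alpha * d) @ X
s = int(np.argmax(alpha))
b0 = d[s] - w0 @ X[s]

print("alpha =", np.round(alpha, 4))
print("w0 =", w0, "b0 =", b0)
```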

Page 13

• We may solve the constrained optimization problem using the method of Lagrange multipliers (Bertsekas, 1995)

• First, we construct the Lagrangian function:

  J(w, b, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + b) − 1]

where the nonnegative variables α_i are called Lagrange multipliers.

• The optimal solution is determined by the saddle point of the Lagrangian function J, which has to be minimized with respect to w and b; it also has to be maximized with respect to α.

Page 14

• Condition 1:

  ∂J(w, b, α)/∂w = 0   ⇒   w = Σ_{i=1}^N α_i d_i x_i

• Condition 2:

  ∂J(w, b, α)/∂b = 0   ⇒   Σ_{i=1}^N α_i d_i = 0

Page 15

• The previous Lagrangian function can be expanded term by term, as follows:

  J(w, b, α) = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i − b Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i

• The third term on the right-hand side is zero by virtue of the optimality condition Σ_{i=1}^N α_i d_i = 0. Furthermore, we have

  w^T w = Σ_{i=1}^N α_i d_i w^T x_i = Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j

Page 16

• Accordingly, setting the objective function J(w,b, α)=Q(α), we may reformulate the Lagrangian equation as:

  Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j

• We may now state the dual problem:

Given the training sample {(x_i, d_i)}, find the Lagrange multipliers α_i that maximize the objective function Q(α), subject to the constraints:

  (1) Σ_{i=1}^N α_i d_i = 0

  (2) α_i ≥ 0   for all i

Page 17

Optimal Hyperplane for Nonseparable Patterns

1. Nonlinear mapping of an input vector into a high-dimensional feature space

2. Construction of an optimal hyperplane for separating the features

Page 18

• Given a set of nonseparable training data, it is not possible to construct a separating hyperplane without encountering classification errors.

• Nevertheless, we would like to find an optimal hyperplane that minimizes the probability of classification error, averaged over the training set.

Page 19

• The constraint defining the optimal hyperplane,

  d_i(w^T x_i + b) ≥ 1,   i = 1, 2, ..., N

will be violated under two conditions:

1. The data point (x_i, d_i) falls inside the region of separation but on the right side of the decision surface.

2. The data point (x_i, d_i) falls on the wrong side of the decision surface.

Thus, we introduce a new set of nonnegative "slack variables" {ξ_i} into the definition of the separating hyperplane:

  d_i(w^T x_i + b) ≥ 1 − ξ_i,   i = 1, 2, ..., N

Page 20

• For 0 ≦ ξ_i ≦ 1, the data point falls inside the region of separation but on the right side of the decision surface.

• For ξ_i > 1, it falls on the wrong side of the separating hyperplane.

• The support vectors are those particular data points that satisfy the new separating-hyperplane constraint precisely, even if ξ_i > 0.

Page 21

• We may now formally state the primal problem for the nonseparable case:

Given the training sample {(x_i, d_i)}, find w and b such that the constraints

  d_i(w^T x_i + b) ≥ 1 − ξ_i,   i = 1, 2, ..., N
  ξ_i ≥ 0   for all i

hold, and such that the weight vector w and the slack variables ξ_i minimize the cost function:

  Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^N ξ_i

where C is a user-specified positive parameter.
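The following sketch (illustrative only; scikit-learn and the synthetic overlapping data are assumptions) shows the role of the parameter C: small C tolerates slack and gives a wide margin, while large C penalizes slack and narrows the margin:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping (nonseparable) Gaussian clouds labelled +1 / -1 (invented data).
X = np.vstack([rng.normal(+1.0, 1.2, size=(50, 2)),
               rng.normal(-1.0, 1.2, size=(50, 2))])
d = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, d)
    w = svm.coef_[0]
    # Small C tolerates slack (wide margin); large C penalizes slack (narrow margin).
    print(f"C = {C:<7}  margin 2/||w|| = {2/np.linalg.norm(w):.3f}  "
          f"support vectors = {len(svm.support_)}")
```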

Page 22

• We may formulate the dual problem for nonseparable patterns as:

Given the training sample {(xi,di)}, find the Lagrange multipliers αi that maximize the objective function Q(α),

  Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j

subject to the constraints:

  (1) Σ_{i=1}^N α_i d_i = 0

  (2) 0 ≤ α_i ≤ C   for all i

Page 23

Inner-Product Kernel

Let φ denote a set of nonlinear transformations from the input space to the feature space. We may define a hyperplane acting as the decision surface as follows:

  Σ_{j=1}^{m1} w_j φ_j(x) + b = 0

We may simplify it as

  Σ_{j=0}^{m1} w_j φ_j(x) = 0

by assuming φ_0(x) = 1 and w_0 = b, where

  φ(x) = [φ_0(x), φ_1(x), φ_2(x), ..., φ_{m1}(x)]^T

so that the decision surface can be written compactly as

  w^T φ(x) = 0

with the weight vector expressed in terms of the feature vectors of the training data as

  w = Σ_{i=1}^N α_i d_i φ(x_i)

Page 24

• According to condition 1 of the optimal solution of the Lagrangian function, we now transform the sample points to the feature space and obtain:

  w = Σ_{i=1}^N α_i d_i φ(x_i)

• Substituting this into w^T φ(x) = 0, we obtain:

  Σ_{i=1}^N α_i d_i φ^T(x_i) φ(x) = 0

Page 25

Define the inner-product kernel

  K(x, x_i) = φ^T(x) φ(x_i) = Σ_{j=0}^{m1} φ_j(x) φ_j(x_i)

Types of SVM kernels:

  Polynomial:   K(x, x_i) = (x^T x_i + 1)^p
  RBF:          K(x, x_i) = exp(−||x − x_i||² / (2σ²))
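As a small illustration (a sketch, not from the slides; the function and parameter names are invented), the two kernel types above can be written directly in numpy, together with the Gram matrix K_ij = K(x_i, x_j) that enters the dual objective:

```python
import numpy as np

def polynomial_kernel(x, xi, p=2):
    """Polynomial kernel K(x, x_i) = (x^T x_i + 1)^p."""
    return (np.dot(x, xi) + 1.0) ** p

def rbf_kernel(x, xi, sigma=1.0):
    """RBF kernel K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(xi)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def gram_matrix(X, kernel):
    """Gram matrix K_ij = K(x_i, x_j) used in the dual objective Q(alpha)."""
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(gram_matrix(X, polynomial_kernel))
print(gram_matrix(X, rbf_kernel))
```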

Page 26

• We may formulate the dual problem for nonseparable patterns in terms of the kernel as: Given the training sample {(x_i, d_i)}, find the Lagrange multipliers α_i that maximize the objective function

  Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j K(x_i, x_j)

subject to the constraints:

  (1) Σ_{i=1}^N α_i d_i = 0

  (2) 0 ≤ α_i ≤ C   for all i

Page 27

• According to the Kuhn-Tucker conditions, the solution αi has to satisfy the following conditions:

  (1) α_i [d_i(w_0^T x_i + b_0) − 1 + ξ_i] = 0,   i = 1, 2, ..., N

  (2) (C − α_i) ξ_i = 0,   i = 1, 2, ..., N

• Those points with α_i > 0 are called support vectors, which can be divided into two types. If 0 < α_i < C, the corresponding training point lies exactly on one of the margins. If α_i = C, that type of support vector is regarded as misclassified data.
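A brief check of this classification of support vectors (illustrative; scikit-learn and the random data are assumptions) can be made from a fitted soft-margin SVM, since scikit-learn exposes d_i·α_i for the support vectors through dual_coef_:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Overlapping classes so that some slack variables are nonzero (invented data).
X = np.vstack([rng.normal(+1.0, 1.5, size=(40, 2)),
               rng.normal(-1.0, 1.5, size=(40, 2))])
d = np.hstack([np.ones(40), -np.ones(40)])

C = 1.0
svm = SVC(kernel="linear", C=C).fit(X, d)

# scikit-learn stores d_i * alpha_i for the support vectors in dual_coef_;
# its absolute value gives the Lagrange multipliers alpha_i.
alpha = np.abs(svm.dual_coef_[0])

on_margin = np.sum(alpha < C - 1e-8)       # 0 < alpha_i < C : lies exactly on a margin
at_bound = np.sum(np.isclose(alpha, C))    # alpha_i = C     : margin violator / misclassified
print(f"{alpha.size} support vectors: {on_margin} on the margin, {at_bound} at the bound C")
```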

Page 28

EXAMPLE: XOR

• To illustrate the procedure for the design of a support vector machine, we revisit the XOR (Exclusive OR) problem discussed in Chapters 4 and 5. Table 6.2 presents a summary of the input vectors and desired responses for the four possible states.

To proceed, let (Cherkassky and Mulier, 1998)

  K(x, x_i) = (1 + x^T x_i)^2

Page 29

• With x = [x_1, x_2]^T and x_i = [x_{i1}, x_{i2}]^T, we may thus express the inner-product kernel K(x, x_i) in terms of monomials of various orders as follows:

  K(x, x_i) = 1 + x_1² x_{i1}² + 2 x_1 x_2 x_{i1} x_{i2} + x_2² x_{i2}² + 2 x_1 x_{i1} + 2 x_2 x_{i2}

The image of the input vector x induced in the feature space is therefore deduced to be

  φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]^T

Page 30

Similarly,

  φ(x_i) = [1, x_{i1}², √2 x_{i1} x_{i2}, x_{i2}², √2 x_{i1}, √2 x_{i2}]^T,   i = 1, 2, 3, 4

From Eq. (6.41), we also find that

  K = | 9  1  1  1 |
      | 1  9  1  1 |
      | 1  1  9  1 |
      | 1  1  1  9 |
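The Gram matrix above is easy to verify numerically. The sketch below (not in the original slides) uses the bipolar XOR coding of Table 6.2, inputs (−1,−1), (−1,+1), (+1,−1), (+1,+1) with desired responses −1, +1, +1, −1, and the degree-2 polynomial kernel:

```python
import numpy as np

# XOR inputs in bipolar coding and desired responses (Table 6.2).
X = np.array([[-1, -1],
              [-1, +1],
              [+1, -1],
              [+1, +1]], dtype=float)
d = np.array([-1, +1, +1, -1], dtype=float)

# Degree-2 polynomial kernel K(x, x_i) = (1 + x^T x_i)^2 used in this example.
K = (1.0 + X @ X.T) ** 2
print(K)
# [[9. 1. 1. 1.]
#  [1. 9. 1. 1.]
#  [1. 1. 9. 1.]
#  [1. 1. 1. 9.]]
```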

Page 31

• The objective function for the dual form is therefore (see Eq. (6.40))

  Q(α) = α_1 + α_2 + α_3 + α_4 − (1/2)(9α_1² − 2α_1α_2 − 2α_1α_3 + 2α_1α_4 + 9α_2² + 2α_2α_3 − 2α_2α_4 + 9α_3² − 2α_3α_4 + 9α_4²)

Optimizing Q(α) with respect to the Lagrange multipliers yields the following set of simultaneous equations:

Page 32

  9α_1 − α_2 − α_3 + α_4 = 1
  −α_1 + 9α_2 + α_3 − α_4 = 1
  −α_1 + α_2 + 9α_3 − α_4 = 1
  α_1 − α_2 − α_3 + 9α_4 = 1

Page 33

• Hence, the optimum values of the Lagrange multipliers are

  α_{0,1} = α_{0,2} = α_{0,3} = α_{0,4} = 1/8

• This result indicates that in this example all four input vectors {x_i}_{i=1}^4 are support vectors. The optimum value of Q(α) is

  Q_0(α) = 1/4
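A quick numerical check of this solution (not part of the slides) solves the four simultaneous equations from the previous page and evaluates Q(α):

```python
import numpy as np

# Coefficient matrix of the simultaneous equations on the previous page;
# its entries are d_i d_j K(x_i, x_j) for the XOR example.
M = np.array([[ 9, -1, -1,  1],
              [-1,  9,  1, -1],
              [-1,  1,  9, -1],
              [ 1, -1, -1,  9]], dtype=float)

alpha = np.linalg.solve(M, np.ones(4))
print(alpha)                         # [0.125 0.125 0.125 0.125]  ->  all alpha_i = 1/8

Q = alpha.sum() - 0.5 * alpha @ M @ alpha
print(Q)                             # 0.25  ->  optimum value Q(alpha) = 1/4
```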

Page 34

• Correspondingly, we may write

  (1/2) ||w_0||² = 1/4

or

  ||w_0|| = 1/√2

From Eq. (6.42), we find that the optimum weight vector is

Page 35

  w_0 = (1/8)[−φ(x_1) + φ(x_2) + φ(x_3) − φ(x_4)]

      = (1/8) ( −[1, 1, √2, 1, −√2, −√2]^T + [1, 1, −√2, 1, −√2, √2]^T + [1, 1, −√2, 1, √2, −√2]^T − [1, 1, √2, 1, √2, √2]^T )

      = [0, 0, −1/√2, 0, 0, 0]^T

• The first element of w_0 indicates that the bias b is zero.

• The optimal hyperplane is defined by (see Eq. 6.33)

  w_0^T φ(x) = 0

Page 36

Page 37

• That is,

  [0, 0, −1/√2, 0, 0, 0] [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]^T = 0

which reduces to

  −x_1 x_2 = 0

Page 38

• The polynomial form of the support vector machine for the XOR problem is as shown in Fig. 6.6a. For both x_1 = x_2 = −1 and x_1 = x_2 = +1, the output is y = −1; and for both x_1 = −1, x_2 = +1 and x_1 = +1, x_2 = −1, we have y = +1. Thus the XOR problem is solved, as indicated in Fig. 6.6b.
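As a final check (a short sketch, not from the original slides), the decision function y = −x_1 x_2 derived above reproduces the XOR truth table in the bipolar coding of Table 6.2:

```python
import numpy as np

# XOR inputs in bipolar coding and their desired responses (Table 6.2).
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
d = np.array([-1, +1, +1, -1], dtype=float)

# Decision function of the polynomial SVM derived above: y = -x1 * x2.
y = -X[:, 0] * X[:, 1]

print(y)                        # [-1.  1.  1. -1.]
assert np.array_equal(y, d)     # the machine reproduces the XOR truth table
```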