
Page 1

Support Vector Machine

Figure 6.5 displays the architecture of a support vector machine.

Irrespective of how a support vector machine is implemented, it differs from the conventional approach to the design of a multilayer perceptron in a fundamental way.

In the conventional approach, model complexity is controlled by keeping the number of features (i.e., hidden neurons) small. On the other hand, the support vector

Page 2

machine offers a solution to the design of a learning machine by controlling model complexity independently of dimensionality, as summarized here (Vapnik, 1995,1998):

• Conceptual problem. Dimensionality of the feature (hidden) space is purposely made very large to enable the construction of a decision surface in the form of a hyperplane in that space. For good generalization

Page 3

performance, the model complexity is controlled by imposing certain constraints on the construction of the separating hyperplane, which results in the extraction of a fraction of the training data as support vectors.

Page 4

• Computational problem. Numerical optimization in a high-dimensional space suffers from the curse of dimensionality. This computational problem is avoided by using the notion of an inner-product kernel (defined in accordance with Mercer's theorem) and solving the dual form of the constrained optimization problem formulated in the input (data) space.
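To make the kernel idea concrete, the short numpy sketch below (not part of the original slides; the data points are invented) checks that the degree-2 polynomial kernel (x^T z + 1)^2 evaluated in the input space equals an ordinary inner product taken in a six-dimensional feature space, so the high-dimensional computation never has to be carried out explicitly.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x^T z + 1)^2, evaluated in input space."""
    return (np.dot(x, z) + 1.0) ** 2

def phi(x):
    """Explicit feature map induced by the degree-2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])

k_input = poly_kernel(x, z)        # computed directly in the 2-D input space
k_feature = phi(x) @ phi(z)        # inner product in the 6-D feature space

print(k_input, k_feature)          # both equal 2.89
assert np.isclose(k_input, k_feature)
```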

Page 5

Page 6

Support vector machine

• An approximate implementation of the method of structural risk minimization

• Pattern classification and nonlinear regression

• Construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized

• We may use SVM to construct: RBFN and BP networks
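As an illustration of the maximum-margin idea (a minimal sketch, assuming scikit-learn is available; the toy data are invented), a linear SVC with a very large C approximates the hard-margin machine described above:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes labelled +1 / -1 (invented toy data).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])
d = np.array([+1, +1, +1, -1, -1, -1])

# A linear kernel with a very large C approximates the hard-margin machine:
# it maximizes the margin of separation between the two classes.
svm = SVC(kernel="linear", C=1e6).fit(X, d)

print("support vectors:\n", svm.support_vectors_)
print("w =", svm.coef_[0], "b =", svm.intercept_[0])
print("margin 2/||w|| =", 2.0 / np.linalg.norm(svm.coef_[0]))
```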

Page 7

Optimal Hyperplane for Linearly Separable Patterns

• Consider the training sample {(x_i, d_i)}_{i=1}^N, where x_i is the input pattern and d_i is the desired output:

  w^T x_i + b ≥ 0  for d_i = +1
  w^T x_i + b < 0  for d_i = −1

• w^T x + b = 0 is the equation of a decision surface in the form of a hyperplane.

Page 8

• The separation between the hyperplane and the closest data point is called the margin of separation, denoted by ρ.

• The goal of an SVM is to find the particular hyperplane for which the margin of separation ρ is maximized.

• Optimal hyperplane:

  w_0^T x + b_0 = 0

Page 9

Given the training set T = {(x_i, d_i)}, the pair (w_0, b_0) must satisfy the constraint:

  w_0^T x_i + b_0 ≥ +1  for d_i = +1
  w_0^T x_i + b_0 ≤ −1  for d_i = −1

The particular data points (x_i, d_i) for which the first or second line of the above constraint is satisfied with the equality sign are called support vectors.

Page 10

• Finding the optimal hyperplane with maximum margin of separation 2ρ, where ρ = 1/||w_0||, is equivalent to minimizing the cost function

  Φ(w) = (1/2) w^T w

subject to the constraints

  d_i(w^T x_i + b) ≥ 1   for i = 1, 2, ..., N

• According to Kuhn-Tucker optimization theory, we state the problem as follows:

Page 11

Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function

  Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j

subject to the constraints

  (1) Σ_{i=1}^N α_i d_i = 0

  (2) α_i ≥ 0   for i = 1, 2, ..., N

Page 12

and find the optimal weight vector

  w_0 = Σ_{i=1}^N α_{0,i} d_i x_i    …(1)
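A minimal numerical sketch of this dual problem is given below (not from the slides; the toy data, the SLSQP solver choice, and the variable names are assumptions). It maximizes Q(α) under the two constraints and then recovers the optimal weight vector from Eq. (1) and the bias from a support vector:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable sample (invented data, not from the slides).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])
d = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
N = len(d)

# H_ij = d_i d_j x_i^T x_j, the matrix appearing in the dual objective Q(alpha).
H = (d[:, None] * X) @ (d[:, None] * X).T

def neg_Q(alpha):
    # We minimize -Q(alpha), i.e. maximize Q(alpha) = sum(alpha) - 1/2 alpha^T H alpha.
    return -(alpha.sum() - 0.5 * alpha @ H @ alpha)

res = minimize(
    neg_Q,
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, None)] * N,                               # constraint (2): alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ d}],   # constraint (1): sum alpha_i d_i = 0
)
alpha = res.x

# Optimal weight vector from Eq. (1): w_0 = sum_i alpha_i d_i x_i,
# and bias from a support vector x_s: b_0 = d_s - w_0^T x_s.
w0 = (alpha * d) @ X
s = int(np.argmax(alpha))
b0 = d[s] - w0 @ X[s]

print("alpha =", np.round(alpha, 4))
print("w0 =", w0, "b0 =", b0)
```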

Page 13

• We may solve the constrained optimization problem using the method of Lagrange multipliers (Bertsekas, 1995)

• First, we construct the Lagrangian function:

  J(w, b, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + b) − 1]

where the nonnegative variables α_i are called Lagrange multipliers.

• The optimal solution is determined by the saddle point of the Lagrangian function J, which has to be minimized with respect to w and b; it also has to be maximized with respect to α.

Page 14

• Condition 1:

  ∂J(w, b, α)/∂w = 0   ⇒   w = Σ_{i=1}^N α_i d_i x_i

• Condition 2:

  ∂J(w, b, α)/∂b = 0   ⇒   Σ_{i=1}^N α_i d_i = 0

Page 15

• The previous Lagrangian function can be expanded term by term, as follows:

  J(w, b, α) = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i − b Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i

• The third term on the right-hand side is zero by virtue of the optimality condition Σ_{i=1}^N α_i d_i = 0. Furthermore, we have

  w^T w = Σ_{i=1}^N α_i d_i w^T x_i = Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j

Page 16

• Accordingly, setting the objective function J(w,b, α)=Q(α), we may reformulate the Lagrangian equation as:

  Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j

• We may now state the dual problem:

Given the training sample {(x_i, d_i)}, find the Lagrange multipliers α_i that maximize the objective function Q(α), subject to the constraints:

  (1) Σ_{i=1}^N α_i d_i = 0

  (2) α_i ≥ 0   for all i

Page 17

Optimal Hyperplane for Nonseparable Patterns

1. Nonlinear mapping of an input vector into a high-dimensional feature space

2. Construction of an optimal hyperplane for separating the features

Page 18

• Given a set of nonseparable training data, it is not possible to construct a separating hyperplane without encountering classification errors.

• Nevertheless, we would like to find an optimal hyperplane that minimizes the probability of classification error, averaged over the training set.

Page 19

• The constraint defining the optimal hyperplane,

  d_i(w^T x_i + b) ≥ 1,   i = 1, 2, ..., N

will be violated under two conditions:

1. The data point (x_i, d_i) falls inside the region of separation but on the right side of the decision surface.

2. The data point (x_i, d_i) falls on the wrong side of the decision surface.

Thus, we introduce a new set of nonnegative "slack variables" {ξ_i} into the definition of the separating hyperplane:

  d_i(w^T x_i + b) ≥ 1 − ξ_i,   i = 1, 2, ..., N

Page 20

• For 0 ≦ ξ_i ≦ 1, the data point falls inside the region of separation but on the right side of the decision surface.

• For ξ_i > 1, it falls on the wrong side of the separating hyperplane.

• The support vectors are those particular data points that satisfy the new separating-hyperplane constraint precisely, even if ξ_i > 0.

Page 21

• We may now formally state the primal problem for the nonseparable case:

Given the training sample {(x_i, d_i)}, find w and b such that the constraints

  d_i(w^T x_i + b) ≥ 1 − ξ_i,   i = 1, 2, ..., N
  ξ_i ≥ 0   for all i

hold, and such that the weight vector w and the slack variables ξ_i minimize the cost function:

  Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^N ξ_i

where C is a user-specified positive parameter.
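The following sketch (illustrative only; scikit-learn and the synthetic overlapping data are assumptions) shows the role of the parameter C: small C tolerates slack and gives a wide margin, while large C penalizes slack and narrows the margin:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping (nonseparable) Gaussian clouds labelled +1 / -1 (invented data).
X = np.vstack([rng.normal(+1.0, 1.2, size=(50, 2)),
               rng.normal(-1.0, 1.2, size=(50, 2))])
d = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, d)
    w = svm.coef_[0]
    # Small C tolerates slack (wide margin); large C penalizes slack (narrow margin).
    print(f"C = {C:<7}  margin 2/||w|| = {2/np.linalg.norm(w):.3f}  "
          f"support vectors = {len(svm.support_)}")
```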

Page 22

• We may formulate the dual problem for nonseparable patterns as:

Given the training sample {(xi,di)}, find the Lagrange multipliers αi that maximize the objective function Q(α),

  Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j

subject to the constraints:

  (1) Σ_{i=1}^N α_i d_i = 0

  (2) 0 ≤ α_i ≤ C   for all i

Page 23

Inner-Product Kernel

Let φ denote a set of nonlinear transformations from the input space to the feature space. We may define a hyperplane acting as the decision surface as follows:

  Σ_{j=1}^{m1} w_j φ_j(x) + b = 0

We may simplify it as

  Σ_{j=0}^{m1} w_j φ_j(x) = 0

by assuming φ_0(x) = 1 and w_0 = b, where

  φ(x) = [φ_0(x), φ_1(x), φ_2(x), ..., φ_{m1}(x)]^T

so that the decision surface can be written compactly as

  w^T φ(x) = 0

with the weight vector expressed in terms of the feature vectors of the training data as

  w = Σ_{i=1}^N α_i d_i φ(x_i)

Page 24

• According to condition 1 of the optimal solution of the Lagrangian function, we now transform the sample points to the feature space and obtain:

  w = Σ_{i=1}^N α_i d_i φ(x_i)

• Substituting this into w^T φ(x) = 0, we obtain:

  Σ_{i=1}^N α_i d_i φ^T(x_i) φ(x) = 0

Page 25

Define the inner-product kernel

  K(x, x_i) = φ^T(x) φ(x_i) = Σ_{j=0}^{m1} φ_j(x) φ_j(x_i)

Types of SVM kernels:

  Polynomial:   K(x, x_i) = (x^T x_i + 1)^p
  RBF:          K(x, x_i) = exp(−||x − x_i||² / (2σ²))
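As a small illustration (a sketch, not from the slides; the function and parameter names are invented), the two kernel types above can be written directly in numpy, together with the Gram matrix K_ij = K(x_i, x_j) that enters the dual objective:

```python
import numpy as np

def polynomial_kernel(x, xi, p=2):
    """Polynomial kernel K(x, x_i) = (x^T x_i + 1)^p."""
    return (np.dot(x, xi) + 1.0) ** p

def rbf_kernel(x, xi, sigma=1.0):
    """RBF kernel K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(xi)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def gram_matrix(X, kernel):
    """Gram matrix K_ij = K(x_i, x_j) used in the dual objective Q(alpha)."""
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(gram_matrix(X, polynomial_kernel))
print(gram_matrix(X, rbf_kernel))
```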

Page 26

• We may formulate the dual problem for nonseparable patterns in terms of the kernel as: Given the training sample {(x_i, d_i)}, find the Lagrange multipliers α_i that maximize the objective function

  Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j K(x_i, x_j)

subject to the constraints:

  (1) Σ_{i=1}^N α_i d_i = 0

  (2) 0 ≤ α_i ≤ C   for all i

Page 27

• According to the Kuhn-Tucker conditions, the solution αi has to satisfy the following conditions:

  (1) α_i [d_i(w_0^T x_i + b_0) − 1 + ξ_i] = 0,   i = 1, 2, ..., N

  (2) (C − α_i) ξ_i = 0,   i = 1, 2, ..., N

• Those points with α_i > 0 are called support vectors, which can be divided into two types. If 0 < α_i < C, the corresponding training point lies exactly on one of the margins. If α_i = C, that type of support vector is regarded as misclassified data.
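A brief check of this classification of support vectors (illustrative; scikit-learn and the random data are assumptions) can be made from a fitted soft-margin SVM, since scikit-learn exposes d_i·α_i for the support vectors through dual_coef_:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Overlapping classes so that some slack variables are nonzero (invented data).
X = np.vstack([rng.normal(+1.0, 1.5, size=(40, 2)),
               rng.normal(-1.0, 1.5, size=(40, 2))])
d = np.hstack([np.ones(40), -np.ones(40)])

C = 1.0
svm = SVC(kernel="linear", C=C).fit(X, d)

# scikit-learn stores d_i * alpha_i for the support vectors in dual_coef_;
# its absolute value gives the Lagrange multipliers alpha_i.
alpha = np.abs(svm.dual_coef_[0])

on_margin = np.sum(alpha < C - 1e-8)       # 0 < alpha_i < C : lies exactly on a margin
at_bound = np.sum(np.isclose(alpha, C))    # alpha_i = C     : margin violator / misclassified
print(f"{alpha.size} support vectors: {on_margin} on the margin, {at_bound} at the bound C")
```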

Page 28

EXAMPLE: XOR

• To illustrate the procedure for the design of a support vector machine, we revisit the XOR (Exclusive OR) problem discussed in Chapters 4 and 5. Table 6.2 presents a summary of the input vectors and desired responses for the four possible states.

To proceed, let (Cherkassky and Mulier, 1998)

  K(x, x_i) = (1 + x^T x_i)^2

Page 29

• With x = [x_1, x_2]^T and x_i = [x_{i1}, x_{i2}]^T, we may thus express the inner-product kernel K(x, x_i) in terms of monomials of various orders as follows:

  K(x, x_i) = 1 + x_1² x_{i1}² + 2 x_1 x_2 x_{i1} x_{i2} + x_2² x_{i2}² + 2 x_1 x_{i1} + 2 x_2 x_{i2}

The image of the input vector x induced in the feature space is therefore deduced to be

  φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]^T

Page 30

Similarly,

  φ(x_i) = [1, x_{i1}², √2 x_{i1} x_{i2}, x_{i2}², √2 x_{i1}, √2 x_{i2}]^T,   i = 1, 2, 3, 4

From Eq. (6.41), we also find that

  K = | 9  1  1  1 |
      | 1  9  1  1 |
      | 1  1  9  1 |
      | 1  1  1  9 |
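The Gram matrix above is easy to verify numerically. The sketch below (not in the original slides) uses the bipolar XOR coding of Table 6.2, inputs (−1,−1), (−1,+1), (+1,−1), (+1,+1) with desired responses −1, +1, +1, −1, and the degree-2 polynomial kernel:

```python
import numpy as np

# XOR inputs in bipolar coding and desired responses (Table 6.2).
X = np.array([[-1, -1],
              [-1, +1],
              [+1, -1],
              [+1, +1]], dtype=float)
d = np.array([-1, +1, +1, -1], dtype=float)

# Degree-2 polynomial kernel K(x, x_i) = (1 + x^T x_i)^2 used in this example.
K = (1.0 + X @ X.T) ** 2
print(K)
# [[9. 1. 1. 1.]
#  [1. 9. 1. 1.]
#  [1. 1. 9. 1.]
#  [1. 1. 1. 9.]]
```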

Page 31

• The objective function for the dual form is therefore (see Eq. (6.40))

  Q(α) = α_1 + α_2 + α_3 + α_4 − (1/2)(9α_1² − 2α_1α_2 − 2α_1α_3 + 2α_1α_4 + 9α_2² + 2α_2α_3 − 2α_2α_4 + 9α_3² − 2α_3α_4 + 9α_4²)

Optimizing Q(α) with respect to the Lagrange multipliers yields the following set of simultaneous equations:

Page 32

  9α_1 − α_2 − α_3 + α_4 = 1
  −α_1 + 9α_2 + α_3 − α_4 = 1
  −α_1 + α_2 + 9α_3 − α_4 = 1
  α_1 − α_2 − α_3 + 9α_4 = 1

Page 33

• Hence, the optimum values of the Lagrange multipliers are

  α_{0,1} = α_{0,2} = α_{0,3} = α_{0,4} = 1/8

• This result indicates that in this example all four input vectors {x_i}_{i=1}^4 are support vectors. The optimum value of Q(α) is

  Q_0(α) = 1/4
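A quick numerical check of this solution (not part of the slides) solves the four simultaneous equations from the previous page and evaluates Q(α):

```python
import numpy as np

# Coefficient matrix of the simultaneous equations on the previous page;
# its entries are d_i d_j K(x_i, x_j) for the XOR example.
M = np.array([[ 9, -1, -1,  1],
              [-1,  9,  1, -1],
              [-1,  1,  9, -1],
              [ 1, -1, -1,  9]], dtype=float)

alpha = np.linalg.solve(M, np.ones(4))
print(alpha)                         # [0.125 0.125 0.125 0.125]  ->  all alpha_i = 1/8

Q = alpha.sum() - 0.5 * alpha @ M @ alpha
print(Q)                             # 0.25  ->  optimum value Q(alpha) = 1/4
```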

Page 34

• Correspondingly, we may write

  (1/2) ||w_0||² = 1/4

or

  ||w_0|| = 1/√2

From Eq. (6.42), we find that the optimum weight vector is

Page 35

  w_0 = (1/8)[−φ(x_1) + φ(x_2) + φ(x_3) − φ(x_4)]

      = (1/8) ( −[1, 1, √2, 1, −√2, −√2]^T + [1, 1, −√2, 1, −√2, √2]^T + [1, 1, −√2, 1, √2, −√2]^T − [1, 1, √2, 1, √2, √2]^T )

      = [0, 0, −1/√2, 0, 0, 0]^T

• The first element of w_0 indicates that the bias b is zero.

• The optimal hyperplane is defined by (see Eq. 6.33)

  w_0^T φ(x) = 0

Page 36

Page 37

• That is,

  [0, 0, −1/√2, 0, 0, 0] [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2]^T = 0

which reduces to

  −x_1 x_2 = 0

Page 38

• The polynomial form of the support vector machine for the XOR problem is as shown in Fig. 6.6a. For both x_1 = x_2 = −1 and x_1 = x_2 = +1, the output is y = −1; and for both x_1 = −1, x_2 = +1 and x_1 = +1, x_2 = −1, we have y = +1. Thus the XOR problem is solved, as indicated in Fig. 6.6b.
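As a final check (a short sketch, not from the original slides), the decision function y = −x_1 x_2 derived above reproduces the XOR truth table in the bipolar coding of Table 6.2:

```python
import numpy as np

# XOR inputs in bipolar coding and their desired responses (Table 6.2).
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
d = np.array([-1, +1, +1, -1], dtype=float)

# Decision function of the polynomial SVM derived above: y = -x1 * x2.
y = -X[:, 0] * X[:, 1]

print(y)                        # [-1.  1.  1. -1.]
assert np.array_equal(y, d)     # the machine reproduces the XOR truth table
```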