Classification: Support Vector Machine (10/10/07)



TRANSCRIPT

Page 1: Classification: Support Vector Machine

Classification: Support Vector Machine

10/10/07

Page 2: Classification: Support Vector Machine

What hyperplane (line) can separate the two classes of data?

Page 3: Classification: Support Vector Machine

What hyperplane (line) can separate the two classes of data?

But there are many other choices!

Which one is the best?

Page 4: Classification: Support Vector Machine

What hyperplane (line) can separate the two classes of data?

But there are many other choices!

Which one is the best?

M: margin

[Figure: the two classes, labeled $y_i = +1$ and $y_i = -1$, separated by the hyperplane $x^T\beta + \beta_0 = 0$, with margin M.]

Page 5: Classification: Support Vector Machine

Optimal separating hyperplane

The best hyperplane is the one that maximizes the margin, M.

[Figure: the two classes, labeled $y_i = +1$ and $y_i = -1$, with the maximal-margin hyperplane and margin M on either side.]

Page 6: Classification: Support Vector Machine

Computing the margin width

A hyperplane is $\{x : f(x) = x^T\beta + \beta_0 = 0\}$.

[Figure: classes $y_i = +1$ and $y_i = -1$ with the three parallel planes $x^T\beta + \beta_0 = 1$, $x^T\beta + \beta_0 = 0$, and $x^T\beta + \beta_0 = -1$, and points $x_+$, $x_-$ on the outer two.]

Find $x_+$ and $x_-$ on the "plus" and "minus" planes, so that $x_+ - x_-$ is perpendicular to $\beta$.

Then $M = \|x_+ - x_-\|$.

Page 7: Classification: Support Vector Machine

Computing the margin width

Find $x_+$ and $x_-$ on the "plus" and "minus" planes, so that $x_+ - x_-$ is perpendicular to $\beta$. Then $M = \|x_+ - x_-\|$.

Since $x_+^T\beta + \beta_0 = 1$ and $x_-^T\beta + \beta_0 = -1$, subtracting gives $(x_+ - x_-)^T\beta = 2$.

Because $x_+ - x_-$ is parallel to $\beta$, it follows that

$M = \|x_+ - x_-\| = 2/\|\beta\|$.
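For concreteness, a minimal sketch (not from the slides; it assumes scikit-learn and a small made-up 2-D dataset) of reading the margin width $M = 2/\|\beta\|$ off a fitted linear SVM:

```python
# A sketch, not the slides' code: fit a (nearly) hard-margin linear SVM on toy
# data and compute the margin width M = 2 / ||beta|| from the learned weights.
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable point clouds (made-up data).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # a very large C approximates the hard-margin case
clf.fit(X, y)

beta = clf.coef_[0]                  # beta is perpendicular to the separating hyperplane
M = 2.0 / np.linalg.norm(beta)       # margin width M = 2 / ||beta||
print("margin width M =", M)
```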

Page 8: Classification: Support Vector Machine

Computing the margin width

The hyperplane is separating if

$y_i(x_i^T\beta + \beta_0) > 0, \quad \forall i$.

The maximizing problem is

$\max_{\beta, \beta_0} \frac{2}{\|\beta\|}$

subject to

$y_i(x_i^T\beta + \beta_0) \geq 1, \quad \forall i$.

[Figure: margin M between the classes $y_i = +1$ and $y_i = -1$; the points lying on the margin boundaries are the support vectors.]

Page 9: Classification: Support Vector Machine

Optimal separating hyperplane

Rewrite the problem as

$\min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2$

subject to

$y_i(x_i^T\beta + \beta_0) \geq 1, \quad \forall i$.

Lagrange function:

$L_P = \frac{1}{2}\|\beta\|^2 - \sum_i \alpha_i \left[ y_i(x_i^T\beta + \beta_0) - 1 \right]$

To minimize, set the partial derivatives to 0:

$\beta = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$

Can be solved by quadratic programming.
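As a concrete illustration of the quadratic program above, a minimal sketch (assumptions: tiny made-up data, and scipy's SLSQP used as a stand-in for a dedicated QP solver):

```python
# Solve min (1/2)||beta||^2  subject to  y_i (x_i^T beta + beta_0) >= 1 for all i,
# on four toy points. This is the primal problem stated above.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(p):
    beta = p[:-1]                               # p packs (beta, beta_0)
    return 0.5 * beta @ beta

def margin_constraints(p):
    beta, beta0 = p[:-1], p[-1]
    return y * (X @ beta + beta0) - 1.0         # each entry must be >= 0

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
beta, beta0 = res.x[:-1], res.x[-1]
print("beta =", beta, "beta_0 =", beta0, "M =", 2 / np.linalg.norm(beta))
```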

Page 10: Classification: Support Vector Machine

When the two classes are non-separable

What is the best hyperplane?

Idea: allow some points to lie on the wrong side, but not by much. Introduce slack variables $\xi_i \geq 0$ measuring by how much each point violates its margin.

[Figure: overlapping classes $y_i = +1$ and $y_i = -1$, with the slack variables $\xi_i$ marking the points on the wrong side of their margin.]

Page 11: Classification: Support Vector Machine

Support vector machine

When the two classes are not separable, the problem is slightly modified:

Find

$\min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2$

subject to

$y_i(x_i^T\beta + \beta_0) \geq 1 - \xi_i, \quad \forall i, \qquad \xi_i \geq 0, \qquad \sum_i \xi_i \leq \text{constant}$.

Can be solved using quadratic programming.

[Figure: overlapping classes $y_i = +1$ and $y_i = -1$ with the soft-margin hyperplane.]
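In practice the non-separable problem is usually solved in its equivalent penalized form, where a cost parameter C takes the place of the bound on $\sum_i \xi_i$. A minimal sketch (assuming scikit-learn and made-up overlapping data):

```python
# Soft-margin SVM on two overlapping Gaussian clouds; a smaller C tolerates
# more slack (more points with xi_i > 0).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)
```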

Page 12: Classification: Support Vector Machine

Convert a non-separable case to a separable one by a nonlinear transformation

non-separable in 1D

Page 13: Classification: Support Vector Machine

Convert a non-separable case to a separable one by a nonlinear transformation

separable after the transformation $h(x) = (x, f(x))$
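A minimal numeric illustration of this idea (made-up 1-D points; $f(x) = x^2$ is just one example of a transform):

```python
# 1-D points whose label depends on |x| are not linearly separable on the line,
# but the transform h(x) = (x, f(x)) with f(x) = x^2 makes them separable in 2-D.
import numpy as np

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.where(np.abs(x) > 1.0, 1, -1)         # outer points vs. inner points

h = np.column_stack([x, x ** 2])             # h(x) = (x, x^2)
# In the transformed space the horizontal line x^2 = 1 separates the classes.
print("separable after the transform:", np.all((h[:, 1] > 1.0) == (y == 1)))
```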

Page 14: Classification: Support Vector Machine

Kernel function

• Introduce nonlinear transformations h(x), and work with the transformed features. Then the separating function is

$\hat{y} = \mathrm{sign}\left(h(x)^T\beta + \beta_0\right)$.

• In fact, all you need is the kernel function:

$K(x, x') = \langle h(x), h(x') \rangle$

• Common kernels: for example, the polynomial kernel and the radial basis (Gaussian) kernel.
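A minimal sketch (assuming scikit-learn and made-up data with a ring-shaped class boundary): the SVM only ever touches the data through $K(x, x')$, so a built-in kernel name and an explicitly supplied kernel function give the same classifier.

```python
# Fit the same kernel SVM twice: once with the built-in "rbf" kernel and once
# with an explicit kernel function K(x, x').
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)   # ring-shaped boundary

clf_builtin = SVC(kernel="rbf", gamma=1.0).fit(X, y)
clf_explicit = SVC(kernel=lambda A, B: rbf_kernel(A, B, gamma=1.0)).fit(X, y)

print(clf_builtin.score(X, y), clf_explicit.score(X, y))
```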

Page 15: Classification: Support Vector Machine

Applications

Page 16: Classification: Support Vector Machine

Prediction of central nervous system embryonic tumor outcome

• 42 patient samples

• 5 cancer types

• Array contains 6817 genes

• Question: are different tumor types distinguishable from their gene expression patterns?

(Pomeroy et al. 2002)

Page 17: Classification: Support Vector Machine

(Pomeroy et al. 2002)

Page 18: Classification: Support Vector Machine

Gene expressions within a cancer type cluster together

(Pomeroy et al. 2002)

Page 19: Classification: Support Vector Machine

PCA based on all genes

(Pomeroy et al. 2002)

Page 20: Classification: Support Vector Machine

PCA based on a subset of informational genes

(Pomeroy et al. 2002)

Page 21: Classification: Support Vector Machine
Page 22: Classification: Support Vector Machine
Page 23: Classification: Support Vector Machine

(Khan et al. 2001)

Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks

• Four different cancer types

• 88 samples

• 6567 genes

• Goal: to predict cancer types from gene expression data

Page 24: Classification: Support Vector Machine

(Khan et al. 2001)

Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks

Page 25: Classification: Support Vector Machine

Procedures

• Filter out genes that have low expression values (retain 2308 genes)

• Dimension reduction using PCA: select the top 10 principal components

• 3-fold cross-validation (a code sketch of this pipeline follows below)

(Khan et al. 2001)
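A minimal sketch of a pipeline of this shape (random placeholder data with the stated dimensions; scikit-learn's MLPClassifier is used as a generic stand-in for the linear ANNs of Khan et al.):

```python
# Dimension reduction to the top 10 principal components, then a small neural
# network, evaluated with 3-fold cross-validation.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(88, 2308))        # 88 samples x 2308 retained genes (placeholder values)
y = rng.integers(0, 4, size=88)        # 4 cancer types (random labels, shapes only)

pipe = Pipeline([
    ("pca", PCA(n_components=10)),                                # top 10 principal components
    ("ann", MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)),
])
print("3-fold CV accuracy:", cross_val_score(pipe, X, y, cv=3))
```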

Page 26: Classification: Support Vector Machine

Artificial Neural Network

Page 27: Classification: Support Vector Machine
Page 28: Classification: Support Vector Machine

(Khan et al. 2001)

Page 29: Classification: Support Vector Machine

Procedures

• Filter out genes that have low expression values (retain 2308 genes)

• Dimension reduction using PCA: select the top 10 principal components

• 3-fold cross-validation

• Repeat 1250 times

(Khan et al. 2001)

Page 30: Classification: Support Vector Machine

(Khan et al. 2001)

Page 31: Classification: Support Vector Machine

(Khan et al. 2001)

Page 32: Classification: Support Vector Machine

Acknowledgement

• Sources of slides:

– Cheng Li

– http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf

– www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt

Page 33: Classification: Support Vector Machine

Aggregating predictors

• Sometimes aggregating several predictors can perform better than any single predictor alone. Aggregation is achieved by a weighted sum of different predictors, which can be the same kind of predictor obtained from slightly perturbed training datasets.

• The key to the improvement in accuracy is the instability of the individual classifiers, such as classification trees.
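A minimal sketch of the aggregation idea (bagging, with scikit-learn and a synthetic dataset): the same kind of unstable predictor, a classification tree, is fit to bootstrap-perturbed training sets and the predictions are aggregated.

```python
# Compare a single classification tree with an aggregate of trees fit to
# bootstrap resamples of the training data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

print("single tree CV accuracy :", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees CV accuracy:", cross_val_score(bagged, X, y, cv=5).mean())
```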

Page 34: Classification: Support Vector Machine

AdaBoost

• Step 1: Initialize the observation weights $w_i = 1/N$, $i = 1, \dots, N$.

• Step 2: For m = 1 to M:

– Fit a classifier $G_m(x)$ to the training data using the weights $w_i$.

– Compute

$err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$

– Compute

$\alpha_m = \log\left((1 - err_m)/err_m\right)$

– Set

$w_i \leftarrow w_i \exp\left[\alpha_m I(y_i \neq G_m(x_i))\right], \quad i = 1, \dots, N$

(misclassified observations are given more weight)

• Step 3: Output

$G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$
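A minimal sketch of the algorithm above (assumptions: synthetic data, depth-1 trees as the weak classifiers $G_m$, and a small numerical guard on $err_m$):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y01 = make_classification(n_samples=200, n_features=10, random_state=0)
y = 2 * y01 - 1                              # labels in {-1, +1}

N, M = len(y), 50
w = np.full(N, 1.0 / N)                      # Step 1: w_i = 1/N
classifiers, alphas = [], []

for m in range(M):                           # Step 2
    G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = (G.predict(X) != y).astype(float)
    err = np.clip((w * miss).sum() / w.sum(), 1e-10, 1 - 1e-10)   # err_m (guarded)
    alpha = np.log((1.0 - err) / err)                             # alpha_m
    w = w * np.exp(alpha * miss)             # misclassified obs get more weight
    classifiers.append(G)
    alphas.append(alpha)

# Step 3: G(x) = sign( sum_m alpha_m G_m(x) )
agg = sum(a * G.predict(X) for a, G in zip(alphas, classifiers))
print("training accuracy:", np.mean(np.sign(agg) == y))
```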

Page 35: Classification: Support Vector Machine

Boosting

Page 36: Classification: Support Vector Machine

Optimal separating hyperplane

• Substituting, we get the Lagrange (Wolfe) dual function

$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$

subject to

$\alpha_i \geq 0, \quad \forall i, \qquad \sum_i \alpha_i y_i = 0$.

To complete the steps, see Burges et al.

• If $\alpha_i > 0$, then $y_i(x_i^T\beta + \beta_0) = 1$.

These $x_i$'s are called the support vectors.

$\beta = \sum_i \alpha_i y_i x_i$ is determined only by the support vectors.
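A minimal numeric check of this last point (assuming scikit-learn and made-up separable data): $\beta$ can be rebuilt from the support vectors alone, since the fitted model stores the products $\alpha_i y_i$ in `dual_coef_`.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.5, 1.0, (40, 2)), rng.normal(1.5, 1.0, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
beta_from_sv = clf.dual_coef_[0] @ clf.support_vectors_   # sum_i (alpha_i y_i) x_i
print(np.allclose(beta_from_sv, clf.coef_[0]))            # True: beta depends only on the SVs
```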

Page 37: Classification: Support Vector Machine

Support vector machine

The Lagrange function is

$L_P = \frac{1}{2}\|\beta\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i(x_i^T\beta + \beta_0) - (1 - \xi_i) \right] - \sum_i \mu_i \xi_i$

Setting the partial derivatives to 0:

$\beta = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0, \qquad \alpha_i = C - \mu_i$

Substituting, we get

$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$

subject to

$0 \leq \alpha_i \leq C, \qquad \sum_i \alpha_i y_i = 0$.
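For completeness, a short worked substitution for the separable case of page 36 (the slides defer the full steps to Burges et al.; the non-separable case is analogous, with the $C\sum_i \xi_i$ and $\mu_i \xi_i$ terms cancelling). Plugging $\beta = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$ into $L_P$ gives

$$\begin{aligned}
L_P &= \tfrac{1}{2}\|\beta\|^2 - \sum_i \alpha_i\left[y_i(x_i^T\beta + \beta_0) - 1\right] \\
    &= \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j x_i^T x_j
       - \sum_{i,j}\alpha_i\alpha_j y_i y_j x_i^T x_j
       - \beta_0 \sum_i \alpha_i y_i + \sum_i \alpha_i \\
    &= \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j x_i^T x_j = L_D.
\end{aligned}$$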