
Classification: Support Vector Machine

10/10/07

What hyperplane (line) can separate the two classes of data?

But there are many other choices! Which one is the best?

[Figure: candidate separating hyperplanes between the two classes ($y_i = +1$ and $y_i = -1$), with the margin $M$ marked around the line $x^T\beta + \beta_0 = 0$]

Optimal separating hyperplane

The best hyperplane is the one that maximizes the margin, M.


A hyperplane is

$$\{x : f(x) = x^T\beta + \beta_0 = 0\}$$
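As a quick illustration (the numbers here are made up), the sign of $f(x)$ tells which side of the hyperplane a point falls on:

```python
import numpy as np

# Hypothetical 2-D hyperplane: f(x) = beta0 + x^T beta
beta = np.array([2.0, -1.0])
beta0 = 0.5

def f(x):
    return beta0 + x @ beta

for x in np.array([[1.0, 1.0], [-1.0, 2.0]]):
    print(x, f(x), "positive side" if f(x) > 0 else "negative side")
```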

Computing the margin width

[Figure: class $y_i = +1$ on the side of the "plus" plane $x^T\beta + \beta_0 = 1$, class $y_i = -1$ on the side of the "minus" plane $x^T\beta + \beta_0 = -1$, with the separating hyperplane $x^T\beta + \beta_0 = 0$ in between; $x^+$ and $x^-$ mark points on the plus and minus planes]

Find $x^+$ and $x^-$ on the "plus" and "minus" planes, so that $x^+ - x^-$ is perpendicular to the two planes (i.e., parallel to the normal vector $\beta$).

Then $M = |x^+ - x^-|$.

Since

$$x^{+T}\beta + \beta_0 = 1, \qquad x^{-T}\beta + \beta_0 = -1,$$

subtracting gives

$$(x^+ - x^-)^T\beta = 2$$


$$M = |x^+ - x^-| = \frac{2}{\|\beta\|}$$
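A quick numerical check of this identity, with made-up values of $\beta$ and $\beta_0$:

```python
import numpy as np

beta = np.array([3.0, 4.0])       # hypothetical normal vector, ||beta|| = 5
beta0 = -2.0

# Start from a point x0 on the hyperplane (x^T beta + beta0 = 0) and walk
# along the unit normal to reach the "plus" and "minus" planes.
x0 = -beta0 * beta / (beta @ beta)          # x0^T beta + beta0 = 0
u = beta / np.linalg.norm(beta)             # unit normal
x_plus = x0 + u / np.linalg.norm(beta)      # satisfies x^T beta + beta0 = +1
x_minus = x0 - u / np.linalg.norm(beta)     # satisfies x^T beta + beta0 = -1

print(np.linalg.norm(x_plus - x_minus))     # 0.4
print(2 / np.linalg.norm(beta))             # 0.4, the same value
```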

The hyperplane is separating if

$$y_i(x_i^T\beta + \beta_0) > 0, \quad \forall i$$

The maximizing problem is

$$\max_{\beta,\,\beta_0} \frac{2}{\|\beta\|}$$

subject to

$$y_i(x_i^T\beta + \beta_0) \ge 1, \quad \forall i$$

Computing the margin width

[Figure: the maximum-margin hyperplane with margin $M$; the points lying exactly on the margin boundaries (the $y_i = +1$ and $y_i = -1$ planes) are the support vectors]
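A minimal sketch of this maximization, assuming scikit-learn is available: a linear SVC with a very large cost parameter behaves like the hard-margin problem, and the achieved margin can be read off as $2/\|\beta\|$ (the data are synthetic).

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds (toy data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)) + [2, 2],
               rng.normal(0, 0.5, (20, 2)) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

# Very large C ~ hard margin (slack is effectively forbidden)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

beta = clf.coef_.ravel()
print("margin width M =", 2 / np.linalg.norm(beta))
print("support vectors:\n", clf.support_vectors_)
```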

Optimal separating hyperplane

Rewrite the problem as

$$\min_{\beta,\,\beta_0} \frac{1}{2}\|\beta\|^2$$

subject to

$$y_i(x_i^T\beta + \beta_0) \ge 1, \quad \forall i$$

Lagrange function

$$L_P = \frac{1}{2}\|\beta\|^2 - \sum_i \alpha_i\big[y_i(x_i^T\beta + \beta_0) - 1\big]$$

To minimize, set the partial derivatives to 0:

$$\beta = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0$$

Can be solved by quadratic programming.
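The stationarity condition $\beta = \sum_i \alpha_i y_i x_i$ can be checked on a fitted linear SVC: scikit-learn stores $\alpha_i y_i$ for the support vectors in `dual_coef_`, so summing $\alpha_i y_i x_i$ over the support vectors should reproduce `coef_`. A small sketch on toy data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.6, (30, 2)) + [1.5, 1.5],
               rng.normal(0, 0.6, (30, 2)) - [1.5, 1.5]])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_[0, k] = alpha_k * y_k for the k-th support vector
beta_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(beta_from_dual, clf.coef_.ravel()))   # True: beta = sum_i alpha_i y_i x_i
```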

What is the best hyperplane?

[Figure: two overlapping classes ($y_i = +1$ and $y_i = -1$) that no single hyperplane can separate]

When the two classes are non-separable

Idea: allow some points to lie on the wrong side, but not by much (measure the violations with slack variables $\xi_i$).

Support vector machine

When the two classes are not separable, the problem is slightly modified:

Find

$$\min_{\beta,\,\beta_0} \frac{1}{2}\|\beta\|^2$$

subject to

$$y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad \sum_i \xi_i \le \text{constant}$$

Can be solved using quadratic programming.

[Figure: the non-separable case, with some points of each class ($y_i = +1$, $y_i = -1$) inside the margin or on the wrong side]
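In practice the constraint $\sum_i \xi_i \le \text{constant}$ is handled through a cost parameter $C$ in the quadratic program; scikit-learn's `SVC` exposes exactly that. A sketch on overlapping synthetic classes, showing that a smaller $C$ tolerates more margin violations and keeps more support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1.2, (50, 2)) + [1, 1],
               rng.normal(0, 1.2, (50, 2)) - [1, 1]])   # the classes overlap
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"margin = {2 / np.linalg.norm(clf.coef_):.3f}")
```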

Convert a non-separable case to a separable one by a nonlinear transformation

[Figure: data that are non-separable in 1D become separable after the transformation $h(x) = (x, f(x))$]

Kernel function

• Introduce nonlinear transformations $h(x)$, and work on the transformed features.

Then the separating function is

$$\hat{y} = \operatorname{sign}\!\big(h(x)^T\beta + \beta_0\big)$$

In fact, all you need is the kernel function:

$$K(x, x') = \langle h(x), h(x') \rangle$$

Common kernels include the polynomial, radial basis (Gaussian), and sigmoid kernels.
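A sketch of the 1-D picture above, with made-up data and thresholds: the points cannot be split by one cut on the line, but become linearly separable after the explicit map $h(x) = (x, x^2)$; a kernel SVC achieves the same effect without ever forming $h(x)$.

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: class +1 in the middle, class -1 on both sides -> one threshold cannot separate them
x = np.linspace(-3, 3, 61)
y = np.where(np.abs(x) < 1.5, 1, -1)

# Explicit feature map h(x) = (x, x^2): now a straight line separates the classes
H = np.column_stack([x, x ** 2])
print(SVC(kernel="linear", C=100).fit(H, y).score(H, y))   # 1.0: separable after the transformation

# Same idea through a kernel, without computing h(x) explicitly
x1 = x.reshape(-1, 1)
print(SVC(kernel="rbf", gamma=2.0, C=10).fit(x1, y).score(x1, y))   # should also be (near-)perfect
```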

Applications

Prediction of central nervous system embryonic tumor outcome

• 42 patient samples

• 5 cancer types

• Array contains 6817 genes

• Question: are the different tumor types distinguishable from their gene expression patterns?

(Pomeroy et al. 2002)

(Pomeroy et al. 2002)

Gene expressions within a cancer type cluster together

(Pomeroy et al. 2002)

PCA based on all genes

(Pomeroy et al. 2002)

PCA based on a subset of informative genes

(Pomeroy et al. 2002)

(Khan et al. 2001)

Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks

• Four different cancer types

• 88 samples

• 6567 genes

• Goal: to predict cancer types from gene expression data

(Khan et al. 2001)

Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks

Procedures

• Filter out genes that have low expression values (retain 2308 genes)

• Dimension reduction by PCA --- select the top 10 principal components

• 3-fold cross-validation

(Khan et al. 2001)

Artificial Neural Network

(Khan et al. 2001)

Procedures

• Filter out genes that have low expression values (retain 2308 genes)

• Dimension reduction by PCA --- select the top 10 principal components

• 3-fold cross-validation

• Repeat 1250 times (a code sketch of this pipeline follows below)

(Khan et al. 2001)
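A minimal sketch of that procedure with scikit-learn; the expression matrix, filter threshold, and classifier here are stand-ins (Khan et al. trained a neural network, an SVM is used for illustration, and the 1250 repeats are reduced to 10 to keep the sketch fast):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Stand-in expression matrix: 88 samples x 6567 genes, 4 tumour classes
rng = np.random.default_rng(0)
X = rng.lognormal(mean=1.0, sigma=1.0, size=(88, 6567))
y = rng.integers(0, 4, size=88)

# Step 1: drop genes with low average expression (arbitrary cutoff for illustration)
keep = X.mean(axis=0) > np.quantile(X.mean(axis=0), 0.65)
X = X[:, keep]

# Steps 2-3: PCA down to 10 components, then a classifier, scored by repeated 3-fold CV
model = make_pipeline(PCA(n_components=10), SVC(kernel="linear"))
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=0)
print(cross_val_score(model, X, y, cv=cv).mean())
```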

(Khan et al. 2001)

(Khan et al. 2001)

Acknowledgement

• Sources of slides:

– Cheng Li

– http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf

– www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt

Aggregating predictors

• Sometimes aggregating several predictors can perform better than any single predictor alone. Aggregation is achieved by a weighted sum of different predictors, which can be the same kind of predictor trained on slightly perturbed training datasets.

• The key to the improvement in accuracy is the instability of the individual classifiers, such as classification trees.
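One common way to realize this is bagging: train the same unstable learner (e.g., a classification tree) on bootstrap-perturbed copies of the training set and aggregate their votes. A minimal scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)                      # unstable base classifier
bagged = BaggingClassifier(tree, n_estimators=50, random_state=0)  # aggregate 50 bootstrap trees

print("single tree :", cross_val_score(tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```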

AdaBoost

• Step 1: Initialize the observation weights

$$w_i = 1/N, \quad i = 1, \ldots, N$$

• Step 2: For m = 1 to M,

– Fit a classifier $G_m(x)$ to the training data using the weights $w_i$

– Compute the weighted error

$$\mathrm{err}_m = \frac{\sum_{i=1}^N w_i\, I\big(y_i \ne G_m(x_i)\big)}{\sum_{i=1}^N w_i}$$

– Compute

$$\alpha_m = \log\big((1 - \mathrm{err}_m)/\mathrm{err}_m\big)$$

– Set

$$w_i \leftarrow w_i \cdot \exp\big[\alpha_m\, I\big(y_i \ne G_m(x_i)\big)\big], \quad i = 1, \ldots, N$$

(misclassified observations are given more weight)

• Step 3: Output

$$G(x) = \operatorname{sign}\left[\sum_{m=1}^M \alpha_m G_m(x)\right]$$
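A direct NumPy transcription of these steps, using decision stumps as the weak classifiers $G_m$ and synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)                 # labels in {-1, +1}
N, M = len(y), 50

w = np.full(N, 1.0 / N)                       # Step 1: w_i = 1/N
stumps, alphas = [], []
for m in range(M):                            # Step 2
    G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = G.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)   # weighted error err_m
    alpha = np.log((1.0 - err) / err)           # alpha_m
    w = w * np.exp(alpha * (pred != y))         # up-weight misclassified observations
    stumps.append(G)
    alphas.append(alpha)

# Step 3: output G(x) = sign( sum_m alpha_m G_m(x) )
F = sum(a * g.predict(X) for a, g in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```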

Boosting

Optimal separating hyperplane

• Substituting, we get the Lagrange (Wolfe) dual function

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

subject to

$$\alpha_i \ge 0, \quad \forall i$$

To complete the steps, see Burges et al.

• If $\alpha_i > 0$, then $y_i(x_i^T\beta + \beta_0) = 1$. These $x_i$'s are called the support vectors.

$$\beta = \sum_i \alpha_i y_i x_i$$

is determined only by the support vectors.
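The dual is a standard quadratic program, so an off-the-shelf QP solver can be used. A sketch with cvxopt (assuming it is installed): it minimizes $\frac{1}{2}\alpha^T Q\alpha - \mathbf{1}^T\alpha$ with $Q_{ij} = y_i y_j\, x_i^T x_j$, subject to $\alpha_i \ge 0$ and the stationarity condition $\sum_i \alpha_i y_i = 0$, then recovers $\beta$ and $\beta_0$ from the support vectors; the data are synthetic.

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy separable data
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (15, 2)) + [2, 2],
               rng.normal(0, 0.5, (15, 2)) - [2, 2]])
y = np.array([1.0] * 15 + [-1.0] * 15)
n = len(y)

Q = (y[:, None] * X) @ (y[:, None] * X).T + 1e-8 * np.eye(n)  # Q_ij = y_i y_j x_i.x_j (+ tiny ridge)
P, q = matrix(Q), matrix(-np.ones(n))
G, h = matrix(-np.eye(n)), matrix(np.zeros(n))                 # -alpha_i <= 0
A, b = matrix(y.reshape(1, -1)), matrix(0.0)                   # sum_i alpha_i y_i = 0

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

sv = alpha > 1e-6                                # support vectors: alpha_i > 0
beta = (alpha * y)[sv] @ X[sv]                   # beta = sum_i alpha_i y_i x_i
beta0 = np.mean(y[sv] - X[sv] @ beta)            # from y_i(x_i^T beta + beta0) = 1
print(beta, beta0)
```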

Support vector machine

Here the constraint $\sum_i \xi_i \le \text{constant}$ is replaced by the equivalent penalized form $\min \frac{1}{2}\|\beta\|^2 + C\sum_i \xi_i$, with cost parameter $C$.

The Lagrange function is

$$L_P = \frac{1}{2}\|\beta\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\big[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)\big] - \sum_i \mu_i \xi_i$$

Setting the partial derivatives to 0:

$$\beta = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0, \qquad \alpha_i = C - \mu_i$$

Substituting, we get

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j$$

subject to

$$0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0$$
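The box constraint $0 \le \alpha_i \le C$ can be observed on a fitted model: scikit-learn's `dual_coef_` stores $\alpha_i y_i$ for the support vectors, so its entries never exceed $C$ in absolute value. A quick check on toy data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1.5, (40, 2)) + [1, 1],
               rng.normal(0, 1.5, (40, 2)) - [1, 1]])
y = np.array([1] * 40 + [-1] * 40)

C = 0.5
clf = SVC(kernel="linear", C=C).fit(X, y)
# dual_coef_ holds alpha_i * y_i for the support vectors, so |dual_coef_| <= C
print(np.abs(clf.dual_coef_).max() <= C + 1e-9)   # True
```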
