Support Vector Machines
H. Clara Pong
Julie Horrocks 1, Marianne Van den Heuvel 2, Francis Tekpetey 3, B. Anne Croy 4
1 Mathematics & Statistics, University of Guelph; 2 Biomedical Sciences, University of Guelph; 3 Obstetrics and Gynecology, University of Western Ontario; 4 Anatomy & Cell Biology, Queen's University
Outline
Background
Separating Hyper-plane & Basis Expansion
Support Vector Machines
Simulations
Remarks
Background
Motivation: the IVF (In-Vitro Fertilization) project
18 infertile women, each undergoing the IVF treatment
Outcome (Output, Y): binary (pregnancy)
Predictor (Input, X): longitudinal data (adhesion of CD56 bright cells)
Background
Classification methods
Relatively new method: Support Vector Machines
- V. Vapnik: first proposed in 1979
- Maps the input space into a high-dimensional feature space
- Constructs a linear classifier in the new feature space
Traditional method: Discriminant Analysis
- R.A. Fisher: 1936
- Classifies according to the values of the discriminant functions
- Assumption: the predictors X in a given class have a multivariate normal distribution
Separating Hyper-plane
Suppose there are 2 classes (A, B): y = 1 for group A, y = -1 for group B.
Let a hyper-plane be defined as f(X) = β0 + βᵀX = 0;
then f(X) is the decision boundary that separates the two groups:
f(X) = β0 + βᵀX > 0 for X ∈ A
f(X) = β0 + βᵀX < 0 for X ∈ B
Given X0 ∈ A, X0 is misclassified when f(X0) < 0; given X0 ∈ B, it is misclassified when f(X0) > 0.
[Figure: the hyper-plane f(X) = β0 + βᵀX = 0, with region A (f(X) > 0) on one side and region B (f(X) < 0) on the other.]
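The decision rule above can be written out in a few lines of Python (an illustrative sketch; `b0` and `b` stand for β0 and β):

```python
# Sketch of classification by a separating hyper-plane f(X) = β0 + βᵀX.
def f(x, b0, b):
    # Decision function: b0 plus the dot product of b and x.
    return b0 + sum(bj * xj for bj, xj in zip(b, x))

def classify(x, b0, b):
    # Group A (y = 1) when f(x) > 0, group B (y = -1) when f(x) < 0.
    return 1 if f(x, b0, b) > 0 else -1
```

For example, with b0 = -1 and b = (1, 0) the boundary is the vertical line x1 = 1.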
Separating Hyper-plane
The perceptron learning algorithm searches for a hyper-plane that minimizes the distance of misclassified points to the decision boundary:
D(β, β0) = - Σ_{i ∈ M} y_i (x_iᵀβ + β0), where M is the set of misclassified points.
However, this does not provide a unique solution.
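As a concrete illustration (not the authors' code), the perceptron's stochastic updates, where each misclassified point pulls the hyper-plane toward itself, can be sketched as:

```python
import random

def perceptron(X, y, rho=1.0, epochs=200, seed=0):
    # Stochastic gradient steps on the perceptron criterion: whenever a point
    # is misclassified (y_i * f(x_i) <= 0), move beta toward it.
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    beta0 = 0.0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            fx = beta0 + sum(b * xj for b, xj in zip(beta, X[i]))
            if y[i] * fx <= 0:  # misclassified or on the boundary
                beta = [b + rho * y[i] * xj for b, xj in zip(beta, X[i])]
                beta0 += rho * y[i]
    return beta0, beta
```

The solution depends on the starting values and the order in which points are visited, which is exactly the non-uniqueness noted above.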
Optimal Separating Hyper-plane
Let C be the distance from the closest points of the two groups to the hyper-plane.
The optimal separating hyper-plane is the unique separating hyper-plane f(X) = β0* + β*ᵀX = 0 whose coefficients (β0*, β*ᵀ) maximize C.
[Figure: the optimal hyper-plane f(X) = β0* + β*ᵀX = 0 with margin C on each side.]
Optimal Separating Hyper-plane
Maximization Problem:
max_{β, β0, ||β|| = 1} C   subject to   y_i (x_iᵀβ + β0) ≥ C,  i = 1…N
or, equivalently,
min_{β, β0} (1/2) ||β||²   subject to   y_i (x_iᵀβ + β0) ≥ 1,  i = 1…N
[Figure: the optimal hyper-plane f(X) = β0* + β*ᵀX = 0 with margin C on each side; the points on the margins are the support vectors.]
Dual Lagrange problem:
max_α  L_D = Σ_{i=1…N} α_i - (1/2) Σ_{i=1…N} Σ_{j=1…N} α_i α_j y_i y_j x_iᵀx_j,
where the α_i's are Lagrange multipliers.
Subject to:
1. α_i [y_i (x_iᵀβ + β0) - 1] = 0
2. α_i ≥ 0 for all i = 1…N
3. β = Σ_{i=1…N} α_i y_i x_i
4. Σ_{i=1…N} α_i y_i = 0
5. the Kuhn–Tucker conditions
f(X) depends only on the x_i's where α_i ≠ 0 (the support vectors).
Basis Expansion
Suppose there are p inputs, X = (x1, …, xp).
Let hk(X) be a transformation that maps X from R^p to R.
hk(X) is called a basis function.
H = {h1(X), …, hm(X)} is the basis of a new feature space (dim = m).
Example: X = (x1, x2), H = {h1(X), h2(X), h3(X)} with
h1(X) = h1(x1, x2) = x1,
h2(X) = h2(x1, x2) = x2,
h3(X) = h3(x1, x2) = x1x2
X_new = H(X) = (x1, x2, x1x2)
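The example basis, written out as code (an illustrative sketch):

```python
# The basis H = {h1, h2, h3} of the example:
# h1(X) = x1, h2(X) = x2, h3(X) = x1*x2.
def H(x):
    x1, x2 = x
    return (x1, x2, x1 * x2)  # X_new = H(X) = (x1, x2, x1x2)
```

A classifier that is linear in H(X), e.g. x1 + x2 + x1x2 = c, is non-linear in the original inputs.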
Support Vector Machines
The optimal hyper-plane is {X | f(X) = β0* + β*ᵀX = 0}.
f(X) = β0* + β*ᵀX is called the Support Vector Classifier.
Separable case: all points are outside of the margins, i.e.
f(X_i) ≥ C for y_i = 1 and f(X_i) ≤ -C for y_i = -1.
The classification rule is the sign of the decision function:
G(X) = sign[β0* + β*ᵀX]
[Figure: the optimal hyper-plane f(X) = β0* + β*ᵀX = 0 with margin C on each side.]
Support Vector Machines
Hyper-plane: {X | f(X) = β0 + βᵀX = 0}
Non-separable case: the training data are not separable.
X_i crosses the margin of its group when C - y_i f(X_i) > 0.
The slack S_i = C - y_i f(X_i) when X_i crosses the margin, and S_i = 0 when X_i lies outside the margin.
[Figure: a point X_i inside the margin, with y_i f(X_i) < C and slack S_i.]
Let ξ_i C = S_i; ξ_i is the proportion of the margin C by which the prediction has crossed the margin.
Misclassification occurs when S_i > C (ξ_i > 1).
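In code form, the quantities just defined can be sketched as (illustrative only):

```python
# S_i = C - y_i f(x_i) when x_i crosses its margin, 0 otherwise.
def slack(fx, yi, C):
    return max(0.0, C - yi * fx)

# xi_i = S_i / C: the proportion of the margin C that x_i has crossed.
def xi(fx, yi, C):
    return slack(fx, yi, C) / C

def misclassified(fx, yi, C):
    # Misclassification occurs when S_i > C, i.e. xi_i > 1.
    return xi(fx, yi, C) > 1.0
```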
Support Vector Machines
The overall misclassification is Σ_i ξ_i, and is bounded by δ.
Maximization Problem:
max_{β, β0, ||β|| = 1} C   subject to   y_i (x_iᵀβ + β0) ≥ C (1 - ξ_i) for all i
or, equivalently,
min_{β, β0} (1/2) ||β||² + ζ Σ_{i=1…N} ξ_i   subject to   ξ_i ≥ 0,  y_i (x_iᵀβ + β0) ≥ 1 - ξ_i for all i
Dual Lagrange problem (non-separable case):
max_α  L_D = Σ_{i=1…N} α_i - (1/2) Σ_{i=1…N} Σ_{j=1…N} α_i α_j y_i y_j x_iᵀx_j
s.t.  0 ≤ α_i ≤ ζ,  Σ α_i y_i = 0
Subject also to:
1. α_i [y_i (x_iᵀβ + β0) - (1 - ξ_i)] = 0
2. ν_i ≥ 0 for all i = 1…N
3. β = Σ α_i y_i x_i
4. the Kuhn–Tucker conditions
Support Vector Machines
SVM searches for an optimal hyper-plane in a new feature space where the data are more separable.
Suppose H = {h1(X), …, hm(X)} is the basis for the new feature space F.
All elements of the new feature space are linear basis expansions of X:
f(X) ∈ F,  f(X) = Σ_{k=1…m} β_k h_k(X)  for some β_k.
The linear classifier becomes f(X) = H(X)ᵀβ + β0.
Dual Lagrange problem:
max_α  L_D = Σ_{i=1…N} α_i - (1/2) Σ_{i=1…N} Σ_{j=1…N} α_i α_j y_i y_j H(X_i)ᵀH(X_j)
Support Vector Machines
H(X_i)ᵀH(X_j) = ⟨H(X_i), H(X_j)⟩ is the inner product of H(X_i) and H(X_j).
Kernel:  K(X, X') = ⟨H(X), H(X')⟩
For example, with X = (x1, x2):
K(X, X') = ⟨X, X'⟩² = (x1x1' + x2x2')²
= (x1x1')² + (x2x2')² + 2(x1x1')(x2x2')
= ⟨H(X), H(X')⟩
This implies H(X) = (h1(X), h2(X), h3(X)) with h1(X) = x1², h2(X) = x2², h3(X) = √2 x1x2.
The kernel and the basis transformation define one another.
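The identity between the kernel and the basis inner product is easy to verify numerically (an illustrative sketch):

```python
import math

def K(x, xp):
    # K(X, X') = <X, X'>^2 for X, X' in R^2.
    return (x[0] * xp[0] + x[1] * xp[1]) ** 2

def H(x):
    # H(X) = (x1^2, x2^2, sqrt(2)*x1*x2), the basis implied by K.
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))
```

For any pair of points, inner(H(x), H(xp)) reproduces K(x, xp).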
Support Vector Machines
Dual Lagrange function:
max_α  L_D = Σ_{i=1…N} α_i - (1/2) Σ_{i=1…N} Σ_{j=1…N} α_i α_j y_i y_j K(X_i, X_j)
This shows that the basis transformation in SVM does not need to be defined explicitly.
The most common kernels:
1. dth-degree polynomial:  K(X, X') = (1 + ⟨X, X'⟩)^d
2. Radial basis:  K(X, X') = exp(-||X - X'||² / γ)
3. Neural network:  K(X, X') = tanh(C1 ⟨X, X'⟩ + C2)
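The three kernels can be sketched directly (the default parameter values here are illustrative, not from the slides):

```python
import math

def dot(x, xp):
    return sum(a * b for a, b in zip(x, xp))

def poly_kernel(x, xp, d=2):
    # dth-degree polynomial: K(X, X') = (1 + <X, X'>)^d
    return (1.0 + dot(x, xp)) ** d

def rbf_kernel(x, xp, gamma=1.0):
    # Radial basis: K(X, X') = exp(-||X - X'||^2 / gamma)
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xp)) / gamma)

def nn_kernel(x, xp, c1=1.0, c2=0.0):
    # Neural network: K(X, X') = tanh(C1 <X, X'> + C2)
    return math.tanh(c1 * dot(x, xp) + c2)
```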
Simulations
3 cases, 100 simulations per case.
Each simulation consists of 200 points, 100 points from each group.
Input space: 2-dimensional. Output: 0 or 1 (2 groups). X = (x1, x2), Y ∈ {0, 1}.
Half of the points are randomly selected as the training set.
Simulations
Case 1 (normal with the same covariance matrix)
Black ~ group 0, red ~ group 1
X | Y=0 ~ N2((0, 1)ᵀ, I),  X | Y=1 ~ N2((2, 1)ᵀ, I),  where I is the 2×2 identity matrix.
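A stdlib-only sketch of a Case 1-style draw (the means (0, 1) and (2, 1) and unit variances are assumed here; the original simulations were likely run in R):

```python
import random

def simulate_case1(n_per_group=100, seed=0):
    # Two bivariate normals with identity covariance: independent unit-variance
    # coordinates, group 0 centred at (0, 1), group 1 centred at (2, 1).
    rng = random.Random(seed)
    X, y = [], []
    for label, (m1, m2) in [(0, (0.0, 1.0)), (1, (2.0, 1.0))]:
        for _ in range(n_per_group):
            X.append((rng.gauss(m1, 1.0), rng.gauss(m2, 1.0)))
            y.append(label)
    return X, y
```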
Simulations
Case 1: misclassifications (in 100 simulations)

         Training          Testing
         Mean     Sd       Mean     Sd
LDA      7.85%    2.65     8.07%    2.51
SVM      6.98%    2.33     8.48%    2.81
Simulations
Case 2 (normal with unequal covariance matrices)
Black ~ group 0, red ~ group 1
X | Y=0 ~ N2((0, 1)ᵀ, Σ0),  X | Y=1 ~ N2((2, 1)ᵀ, Σ1),  with unequal diagonal covariance matrices Σ0 = diag(2, 1) and Σ1 = diag(1, 2).
Simulations
Case 2: misclassifications (in 100 simulations)

         Training          Testing
         Mean     Sd       Mean     Sd
QDA      15.5%    3.75     16.84%   3.48
SVM      13.6%    4.03     18.8%    4.01
Simulations
Case 3 (non-normal)
Black ~ group 0, red ~ group 1
x1 | Y=0 ~ Uniform(1.5, 6.5),  x2 | Y=0 ~ N(mean = 8, sd = 2.5)
(x1, x2) | Y=1 scatter randomly around group 0
Simulations
Case 3: misclassifications (in 100 simulations)

         Training          Testing
         Mean     Sd       Mean     Sd
QDA      14%      3.79     16.8%    3.63
SVM      9.34%    3.46     14.8%    3.21
Simulations
Paired t-test for differences in misclassification
H0: mean difference = 0;  Ha: mean difference ≠ 0
Case 1: mean difference (LDA - SVM) = -0.41, se = 0.3877; t = -1.057, p-value = 0.29 (not significant)
Case 2: mean difference (QDA - SVM) = -1.96, se = 0.4170; t = -4.70, p-value = 8.42e-06 (significant)
Case 3: mean difference (QDA - SVM) = 2, se = 0.4218; t = 4.74, p-value = 7.13e-06 (significant)
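The reported t statistics are just the mean difference divided by its standard error, which is easy to check:

```python
def t_stat(mean_diff, se):
    # Paired t statistic: t = mean difference / standard error.
    return mean_diff / se
```

For instance, t_stat(-0.41, 0.3877) gives approximately -1.057, matching Case 1 above.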
Remarks
Support Vector Machines
- Map the original input space onto a feature space of higher dimension
- Make no assumption about the distributions of the X's
Performance
- Discriminant analysis and SVM perform similarly when (X | Y) has a normal distribution and the groups share the same Σ
- Discriminant analysis performs better when the covariance matrices of the two groups are different
- SVM performs better when the inputs X violate the distributional assumption
References
1. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. New York: Cambridge University Press, 2000.
2. J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. New York: Springer, 2001.
3. D. Meyer, C. Chang, and C. Lin. R Documentation: Support Vector Machines. http://www.maths.lth.se/help/R/.R/library/e1071/html/svm.html. Last updated: March 2006.
4. H. Planatscher and J. Dietzsch. SVM-Tutorial using R (e1071-package). http://www.potschi.de/svmtut/svmtut.htm
5. M. Van Den Heuvel, J. Horrocks, S. Bashar, S. Taylor, S. Burke, K. Hatta, E. Lewis, and A. Croy. Menstrual Cycle Hormones Induce Changes in Functional Interactions Between Lymphocytes and Endothelial Cells. Journal of Clinical Endocrinology and Metabolism, 2005.