Page 1: SVM Lecture 2

Support Vector Machine (SVM)

Slides for guest lecture presented by Linda Sellie in Spring 2012 for CS6923, Machine Learning, NYU-Poly, with a few corrections.
Page 2: SVM Lecture 2

http://www.svms.org/tutorials/Hearst-etal1998.pdf

http://www.cs.cornell.edu/courses/cs578/2003fa/slides_sigir03_tutorial-modified.v3.pdf

These slides were prepared by Linda Sellie and Lisa Hellerstein.
Page 3: SVM Lecture 2
Page 4: SVM Lecture 2

Which Hyperplane?

[Figure: + and − training points in the plane; several candidate separating lines g(x) are possible.]

Page 5: SVM Lecture 2
Page 6: SVM Lecture 2

If w^T = (3, 4) and w_0 = −10:

g(x) = w^T x + w_0 = (3, 4)^T x − 10

[Figure: the line g(x) = 0 passes through (2,1) and (0,2.5); the points (2,2) and (3,1) lie on the + side, (1,1) on the − side.]

If g(x) > 0 then f(x) = 1; if g(x) ≤ 0 then f(x) = −1.

Page 7: SVM Lecture 2

If w^T = (3, 4) and w_0 = −10:

g(x) = w^T x + w_0 = (3, 4)^T x − 10

g(2, 1) = (3, 4)·(2, 1) − 10 = 0
g(0, 5/2) = (3, 4)·(0, 5/2) − 10 = 0
g(2, 2) = (3, 4)·(2, 2) − 10 = 4 > 0
g(3, 1) = (3, 4)·(3, 1) − 10 = 3 > 0
g(1, 1) = (3, 4)·(1, 1) − 10 = −3 ≤ 0

If g(x) > 0 then f(x) = 1; if g(x) ≤ 0 then f(x) = −1.

So f(2, 2) = 1, f(3, 1) = 1, f(2, 1) = f(0, 5/2) = −1, and f(1, 1) = −1.
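A quick numerical check of this example (a minimal sketch using NumPy; the weights and points are the ones above):

```python
import numpy as np

w = np.array([3.0, 4.0])   # weight vector w
w0 = -10.0                 # bias term w_0

points = np.array([[2, 1], [0, 2.5], [2, 2], [3, 1], [1, 1]], dtype=float)

g = points @ w + w0                  # g(x) = w^T x + w_0
f = np.where(g > 0, 1, -1)           # f(x) = 1 if g(x) > 0, else -1

for x, gx, fx in zip(points, g, f):
    print(f"x = {x}, g(x) = {gx:+.1f}, f(x) = {fx:+d}")
# g values: 0, 0, +4, +3, -3  ->  labels: -1, -1, +1, +1, -1
```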

Page 8: SVM Lecture 2
Page 9: SVM Lecture 2
with shared variance for each feature (for each x_i, requiring the estimated variances of the distributions p[x_i | +] and p[x_i | −] to be the same)

(For the usual Gaussian Naive Bayes, where you don't require shared variance for each feature, the discriminant function is quadratic.)

That is, if you have Boolean features and you treat them as discrete/categorical features, running the standard NB algorithm for discrete/categorical features will produce a linear discriminant.
Page 10: SVM Lecture 2
Page 11: SVM Lecture 2

Which line (hyperplane) to choose?

[Figure: two candidate separating lines g(x) for the same + and − points; the maximal margin hyperplane leaves the widest margin on both sides.]

Maximal Margin Hyperplane

Page 12: SVM Lecture 2

How to compute the distance from a point x to the hyperplane

Write x = x_p + r·(w/||w||), where x_p is the normal projection of x onto the hyperplane {x : g(x) = w^T x + w_0 = 0} and r is the signed distance.

Then
g(x) = w^T x + w_0 = w^T (x_p + r·w/||w||) + w_0 = (w^T x_p + w_0) + r·(w^T w)/||w||.

Observe that w^T x_p + w_0 = 0 (x_p lies on the hyperplane) and w^T w = ||w||², so g(x) = r·||w||.

Thus r = g(x)/||w||.

[Figure: a point x, its projection x_p onto the hyperplane, the normal vector w, and the distance r.]

Page 13: SVM Lecture 2

w^T = (3, 4), w_0 = −10, g(x) = w^T x + w_0

Distance Formula: r = g(x)/||w||

g(2, 2)/||(3, 4)|| = ((3, 4)·(2, 2) − 10)/5 = 4/5
g(3, 1)/||(3, 4)|| = ((3, 4)·(3, 1) − 10)/5 = 3/5
g(1, 1)/||(3, 4)|| = ((3, 4)·(1, 1) − 10)/5 = −3/5
g(1, 0.5)/||(3, 4)|| = ((3, 4)·(1, 0.5) − 10)/5 = −1

[Figure: the points (2,2), (3,1), (1,1), and (1,0.5) and their signed distances to the line g(x) = 0.]
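The same distances, checked numerically (a NumPy sketch):

```python
import numpy as np

w = np.array([3.0, 4.0])
w0 = -10.0

points = np.array([[2, 2], [3, 1], [1, 1], [1, 0.5]])

# signed distance r = g(x) / ||w||
r = (points @ w + w0) / np.linalg.norm(w)
print(r)   # [ 0.8  0.6 -0.6 -1. ]  i.e. 4/5, 3/5, -3/5, -1
```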

Page 14: SVM Lecture 2

g'(x) = w'^T x + w'_0 = (1/3)·g(x), where w' = (1/3)·w = (1, 4/3)^T and w'_0 = −10/3.

Distance Formula: r = g'(x)/||w'||, with ||w'|| = ||(1, 4/3)|| = 5/3.

g'(2, 2)/||(1, 4/3)|| = ((1, 4/3)·(2, 2) − 10/3)/(5/3) = 4/5
g'(3, 1)/||(1, 4/3)|| = ((1, 4/3)·(3, 1) − 10/3)/(5/3) = 3/5
g'(1, 1)/||(1, 4/3)|| = ((1, 4/3)·(1, 1) − 10/3)/(5/3) = −3/5
g'(1, 0.5)/||(1, 4/3)|| = ((1, 4/3)·(1, 0.5) − 10/3)/(5/3) = −1

The distances are the same as before: rescaling (w, w_0) does not change the hyperplane or the geometric distances.

[Figure: the same points (2,2), (3,1), (1,1), (1,0.5) and the same line g(x) = 0.]

Page 15: SVM Lecture 2

We want to classify points in space. Which hyperplane does SVM choose?

[Figure: the same + and − points with several candidate separating hyperplanes.]

Page 16: SVM Lecture 2

Maximal Margin Hyperplane

The margin is the geometric distance from the closest training example to the hyperplane.

[Figure: the maximal margin hyperplane g(x) = 0 with the margin marked on each side; the closest + and − examples are the support vectors.]

For the closest examples x^(1) and x^(2) on either side (the support vectors),

y^(1)·g(x^(1))/||w|| = y^(2)·g(x^(2))/||w||

Page 17: SVM Lecture 2

g(x) = (3, 4)·x − 10

The hyperplane is defined by all the points which satisfy g(x) = (3, 4)·x − 10 = 0,
e.g. g(2, 1) = (3, 4)·(2, 1) − 10 = 0 and g(0, 2.5) = 0.

All the points above the line are positive: g(x) = (3, 4)·x − 10 > 0,
e.g. g(2, 2) = (3, 4)·(2, 2) − 10 = 4 and g(3, 1) = (3, 4)·(3, 1) − 10 = 3.

All the points below the line are negative: g(x) = (3, 4)·x − 10 < 0,
e.g. g(1, 1) = (3, 4)·(1, 1) − 10 = −3.

We use the hyperplane to classify a point x: f(x) = 1 if w·x + w_0 > 0, and f(x) = −1 if w·x + w_0 ≤ 0.

[Figure: the line (3, 4)·x = 10 with the labeled points (2,1), (0,2.5), (2,2), (3,1), (1,1).]

Page 18: SVM Lecture 2

Notice that for any hyperplane we have an infinite number of formulas that describe it!

If (3, 4)·x − 10 = 0, then also (1/3)·((3, 4)·x − 10) = 0, and 23·((3, 4)·x − 10) = 0, and 0.9876·((3, 4)·x − 10) = 0.
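A small sketch confirming that rescaling (w, w_0) by a positive constant leaves the signed distances, and hence the classifier, unchanged (the scale factors are the ones above):

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -10.0
points = np.array([[2, 2], [3, 1], [1, 1], [1, 0.5]])

for c in [1.0, 1/3, 23.0, 0.9876]:               # positive rescalings of (w, w0)
    wc, w0c = c * w, c * w0
    r = (points @ wc + w0c) / np.linalg.norm(wc)  # geometric (signed) distance
    print(c, np.round(r, 3))                      # same distances for every c
```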

Page 19: SVM Lecture 2
Page 20: SVM Lecture 2
if it is a maximum margin hyperplane -- since for such a hyperplane, the distance to the closest + example must be equal to the distance to the closest − example
Page 21: SVM Lecture 2

The canonical hyperplane for a set of training examples is one satisfying

y^(i)·(w'^T x^(i) + w'_0) ≥ 1 for all i

(the functional margin is 1).

For S = {⟨(1,1), −1⟩, ⟨(2,2), 1⟩, ⟨(1, 1/2), −1⟩, ⟨(3,2), 1⟩, ...} and g'(x) = (1, 4/3)·x − 10/3:

1·((1, 4/3)·(2, 2) − 10/3) = 4/3 ≥ 1
−1·((1, 4/3)·(1, 1/2) − 10/3) = 5/3 ≥ 1
g'(3, 1) = (1, 4/3)·(3, 1) − 10/3 = 1
g'(1, 1) = (1, 4/3)·(1, 1) − 10/3 = −1

[Figure: the points of S with the lines g'(x) = 0, g'(x) = 1, and g'(x) = −1; the support vectors (3,1) and (1,1) lie on the ±1 lines.]

Page 22: SVM Lecture 2

For a canonical hyperplane w.r.t. a fixed set of training examples S, the margin is computed by

ρ = g(x^+)/||w|| = −g(x^−)/||w|| = 1/||w||

where x^+ and x^− are the closest + and − examples (the support vectors).

Remember the distance from a point x to the hyperplane is g(x)/||w||, where g(x) = w^T x + w_0.

For g(x) = (1, 4/3)^T x − 10/3:

ρ = 1/||(1, 4/3)|| = 1/√(1 + 16/9) = 3/5

[Figure: the points of S with the canonical hyperplane and its margin 3/5 on each side.]

Page 23: SVM Lecture 2

Distance of x to the hyperplane is: r = g(x)/||w|| = ±1/||w||, assuming a canonical hyperplane (and x a support vector).

For the canonical hyperplane the margin is 1/||w||.

To find the maximal margin canonical hyperplane, the goal is to minimize ||w||.
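A sketch computing the margin 1/||w|| for the canonical hyperplane of the example, and checking that the support vectors (3,1) and (1,1) lie exactly at that distance:

```python
import numpy as np

# Canonical hyperplane from the example: g(x) = (1, 4/3)^T x - 10/3
w = np.array([1.0, 4.0 / 3.0])
w0 = -10.0 / 3.0

margin = 1.0 / np.linalg.norm(w)
print(margin)                      # 0.6  (= 3/5)

# The support vectors (3,1) and (1,1) satisfy g(x) = +1 and -1,
# so their geometric distance to the hyperplane equals the margin.
for x in [np.array([3.0, 1.0]), np.array([1.0, 1.0])]:
    print(x, w @ x + w0, abs(w @ x + w0) / np.linalg.norm(w))
```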

Page 24: SVM Lecture 2
Page 25: SVM Lecture 2

For a set of training examples, we can find the maximum margin hyperplane in polynomial time!

To do this, we reformulate the problem as an optimization problem.

Page 26: SVM Lecture 2

There is an algorithm to solve an optimization problem if it has this form:

min: f(w)
Subject to: g_i(w) ≤ 0 for all i
            h_i(w) = 0 for all i

where f is convex, each g_i is convex, and each h_i is affine. We can use standard techniques to find the optimum.

Page 27: SVM Lecture 2

Finding the largest geometric margin by finding the g(x) which solves:

max: ρ
Subject to: y^(i)·(w^T x^(i) + w_0)/||w|| ≥ ρ for all i

Page 28: SVM Lecture 2

Finding the largest geometric margin by finding the g(x) which solves:

min: ||w||
Subject to: y^(i)·(w^T x^(i) + w_0) ≥ 1 for all i

Page 29: SVM Lecture 2

Finding the largest margin by finding the g(x) which solves:

min: (1/2)·||w||²
Subject to: y^(i)·(w^T x^(i) + w_0) ≥ 1 for all i
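A sketch of solving this quadratic program directly with scipy.optimize. The labeled points are the ones listed on these slides; the full training set S is elided there, so the hyperplane found here need not match the slides' example exactly:

```python
import numpy as np
from scipy.optimize import minimize

# Training points taken from the slides (the "..." in S is omitted)
X = np.array([[1, 1], [2, 2], [1, 0.5], [3, 2], [3, 1]], dtype=float)
y = np.array([-1, 1, -1, 1, 1], dtype=float)

# Variables: z = (w1, w2, w0). Objective: (1/2)||w||^2
objective = lambda z: 0.5 * (z[0]**2 + z[1]**2)

# Constraints: y_i (w^T x_i + w0) - 1 >= 0 for all i
constraints = [{'type': 'ineq',
                'fun': (lambda z, xi=xi, yi=yi: yi * (z[:2] @ xi + z[2]) - 1.0)}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print(w, w0, 1.0 / np.linalg.norm(w))   # weight vector, bias, and margin
```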

Page 30: SVM Lecture 2
Page 31: SVM Lecture 2
Page 32: SVM Lecture 2

Solving this constrained quadratic optimization requires that the Karush-Kuhn-Tucker (KKT) conditions are met.

The Karush-Kuhn-Tucker (KKT) conditions imply that v_i is non-zero only if x^(i) is a support vector!

Page 33: SVM Lecture 2
Page 34: SVM Lecture 2

The hypothesis for the set of training examples S = {⟨(1,1), −1⟩, ⟨(2,2), 1⟩, ⟨(1, 1/2), −1⟩, ⟨(3,2), 1⟩, ...} is

g(x) = v_1·(2, 7/4)·x + v_2·(3, 1)·x + v_3·(1, 1)·x + w_0

Note that only the support vectors are in the hypothesis.

[Figure: the points of S with the lines g(x) = w^T x + b = 0, g(x) = 1, and g(x) = −1; the support vectors (2, 7/4), (3, 1), and (1, 1) lie on the ±1 lines.]
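A sketch of the same idea with scikit-learn's linear SVC (an off-the-shelf solver, not the derivation on these slides): after fitting, the hypothesis is expressed through the support vectors and their dual coefficients only. The point set below is assumed from the slides and is incomplete:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [1, 0.5], [3, 2], [2, 1.75], [3, 1]])
y = np.array([-1, 1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6)    # very large C approximates a hard margin
clf.fit(X, y)

# Dual form: g(x) = sum_i (alpha_i * y_i) <x_i, x> + b, over support vectors only
print(clf.support_vectors_)          # the support vectors
print(clf.dual_coef_)                # alpha_i * y_i for each support vector
print(clf.coef_, clf.intercept_)     # the equivalent primal (w, w_0)
```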

Page 35: SVM Lecture 2

III. Non-Linearly Separable Data

Page 36: SVM Lecture 2
Page 37: SVM Lecture 2

[Figure: positive and negative examples.]

Page 38: SVM Lecture 2


Page 39: SVM Lecture 2

Linearly separable?

[Figure: + and − points including (0,0), (0.5,0), (1,1), (−1,−1), (0,1.25), (−0.25,0.25), (0.5,−0.5), (−0.75,−0.25), (−1.25,0), (−1,1), (1,−0.75); no linear function g(x) = w^T x + w_0 separates them.]

Page 40: SVM Lecture 2

Linearly separable?

Transform the feature space: φ : x → φ(x), with φ(x) = [x_1², x_2²].

g(x) = w^T φ(x) + w_0, with w = (1, 1) and w_0 = −1.

[Figure: the same points and their images under φ, e.g. (1, −0.75) → (1, 0.5625), (−1.25, 0) → (1.5625, 0), (0, 1.25) → (0, 1.5625), (−0.75, −0.25) → (0.5625, 0.0625); after the transformation the + and − points are linearly separable.]

Page 41: SVM Lecture 2

Linearly separable?

Transform the feature space: φ : x → φ(x), with φ(x) = [x_1², x_2²].

g(x) = w^T φ(x) + w_0, with w = (1, 1) and w_0 = −1.

g(1, 1) = (1, 1)·(1, 1) − 1 > 0
g(−1, −1) = (1, 1)·(1, 1) − 1 > 0
g(0.5, −0.5) = (1, 1)·(0.25, 0.25) − 1 ≤ 0

[Figure: the same points and their images under φ; in the transformed space they are linearly separable.]
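A sketch of this transformation in code (the map φ, weights w, and bias w_0 are the ones above):

```python
import numpy as np

def phi(x):
    """Feature transform phi(x) = [x1^2, x2^2]."""
    return np.array([x[0]**2, x[1]**2])

w = np.array([1.0, 1.0])
w0 = -1.0

def g(x):
    return w @ phi(x) + w0      # g(x) = w^T phi(x) + w_0

for x in [(1, 1), (-1, -1), (0.5, -0.5), (0, 1.25), (-0.75, -0.25)]:
    gx = g(np.array(x, dtype=float))
    print(x, round(gx, 4), '+' if gx > 0 else '-')
# (1,1), (-1,-1), (0,1.25) come out +; (0.5,-0.5), (-0.75,-0.25) come out -
```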

Page 42: SVM Lecture 2

Linearly separable?

[Figure: + and − labeled points on the number line from 0 to 3; no single threshold separates them.]

Page 43: SVM Lecture 2

Linearly separable?

Yes, by transforming the feature space! φ(x) = [x, x²]

g(φ(x)) = (1, −3)·φ(x) + 2
f(x) = 1 if g(φ(x)) = (1, −3)·φ(x) + 2 > 0
f(x) = −1 if g(φ(x)) = (1, −3)·φ(x) + 2 < 0

[Figure: the labeled points on the number line from 0 to 3.]

There is an error in this slide. The given g is equal to x − 3x² + 2, which is positive iff −2/3 < x < 1 (check this by factoring). So this slide and the next one can be fixed by relabeling the points on the line accordingly. (Alternatively, change φ(x) to be [x², x] instead of [x, x²]. Then the labeling on the line is correct, but the rest of the example needs to be changed.)
Page 44: SVM Lecture 2

Linearly separable?

Yes, by transforming the feature space! φ(x) = [x, x²]

g(φ(x)) = (1, −3)·φ(x) + 2
f(x) = 1 if g(φ(x)) = (1, −3)·φ(x) + 2 > 0
f(x) = −1 if g(φ(x)) = (1, −3)·φ(x) + 2 < 0

g(φ(1/2)) = g(1/2, 1/4) = (1, −3)·(1/2, 1/4) + 2 = 7/4
g(φ(3/2)) = g(3/2, 9/4) = (1, −3)·(3/2, 9/4) + 2 = −13/4
g(φ(2)) = g(2, 4) = (1, −3)·(2, 4) + 2 = −8

[Figure: the labeled points on the number line from 0 to 3.]
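A sketch evaluating g(φ(x)) at the sample points above:

```python
import numpy as np

phi = lambda x: np.array([x, x**2])                      # phi(x) = [x, x^2]
g = lambda x: np.array([1.0, -3.0]) @ phi(x) + 2.0       # g(phi(x)) = (1,-3).phi(x) + 2

for x in [0.5, 1.5, 2.0]:
    print(x, g(x))        # 1.75, -3.25, -8.0
```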

Page 45: SVM Lecture 2

These points become linearly separable by transforming the feature space using φ(x) = [x, x²] and g(φ(x)) = (1, −3)·φ(x) + 2.

[Figure: the 1-D points mapped onto the parabola (x, x²), e.g. (3/4, 9/16), (5/4, 25/16), (3/2, 9/4), (7/4, 49/16); in this 2-D space the + and − points are now linearly separable.]

Page 46: SVM Lecture 2

Kernel Function

K(x, z) = φ(x)·φ(z)

Transform the feature space: map x to φ(x).

KERNEL TRICK: Never compute φ(x). Just compute K(x, z). Why is this enough? If we work with the dual representation of the hyperplane (and the dual quadratic program), the only use of the new features is in inner products!
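A sketch of the kernel trick on a standard example not taken from these slides: the degree-2 polynomial kernel K(x, z) = (x·z)² equals the inner product of the explicit features φ(x) = (x_1², √2·x_1·x_2, x_2²), so K can be computed without ever forming φ(x):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2-D input)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, z):
    """Kernel computed directly in the original space: K(x, z) = (x . z)^2."""
    return (x @ z) ** 2

x = np.array([1.0, -0.75])
z = np.array([0.5, 2.0])

print(phi(x) @ phi(z))   # inner product in the transformed space
print(K(x, z))           # same value, without ever forming phi(x)
```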
Page 47: SVM Lecture 2
Page 48: SVM Lecture 2
Page 49: SVM Lecture 2
Page 50: SVM Lecture 2
Page 51: SVM Lecture 2

IV. Non-Separable Data

Page 52: SVM Lecture 2

What if the data is not linearly separable because of only a few points?

[Figure: mostly separable + and − points, with a few points x on the wrong side of any separating line.]

Page 53: SVM Lecture 2

What if a small number of points prevents the margin from being large?

[Figure: separable + and − points where one point x close to the boundary forces a small margin.]

Page 54: SVM Lecture 2
Page 55: SVM Lecture 2
Page 56: SVM Lecture 2

What if a small number of points prevents the margin from being large?

[Figure: two soft-margin solutions for the same data, one with λ large and one with λ small; the tradeoff between margin width and how much the point x is accommodated changes with λ.]

What if λ = ∞?
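A sketch of this tradeoff with scikit-learn, whose parameter C plays a role analogous to the slides' λ (the exact correspondence depends on the soft-margin formulation used); larger C penalizes margin violations more heavily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two mostly separated clusters plus one stray point near the other class
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),
               rng.normal([0, 0], 0.3, (20, 2)),
               [[0.3, 0.3]]])                  # stray + point inside the - cluster
y = np.array([1] * 20 + [-1] * 20 + [1])

for C in [0.1, 1.0, 1000.0]:                   # soft ... nearly hard margin
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:7.1f}  margin={1 / np.linalg.norm(w):.3f}  "
          f"support vectors={len(clf.support_)}")
```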

Page 57: SVM Lecture 2
Page 58: SVM Lecture 2