Page 1: SVM Lecture 2

Support Vector Machine (SVM)

Slides for guest lecture presented by Linda Sellie in Spring 2012 for CS6923, Machine Learning, NYU-Poly, with a few corrections.
Page 2: SVM Lecture 2

http://www.svms.org/tutorials/Hearst-etal1998.pdf

http://www.cs.cornell.edu/courses/cs578/2003fa/slides_sigir03_tutorial-modified.v3.pdf

These slides were prepared by Linda Sellie and Lisa Hellerstein.
Page 3: SVM Lecture 2
Page 4: SVM Lecture 2

Which Hyperplane?

[Figure: + and − training points in the plane; several candidate separating lines g(x) are possible.]

Page 5: SVM Lecture 2
Page 6: SVM Lecture 2

If w^T = (3, 4) and w_0 = −10:

g(x) = w^T x + w_0 = (3, 4)^T x − 10

[Figure: the line g(x) = 0 passes through (2,1) and (0,2.5); the points (2,2) and (3,1) lie on the + side, (1,1) on the − side.]

If g(x) > 0 then f(x) = 1; if g(x) ≤ 0 then f(x) = −1.

Page 7: SVM Lecture 2

If w^T = (3, 4) and w_0 = −10:

g(x) = w^T x + w_0 = (3, 4)^T x − 10

g(2, 1) = (3, 4)·(2, 1) − 10 = 0
g(0, 5/2) = (3, 4)·(0, 5/2) − 10 = 0
g(2, 2) = (3, 4)·(2, 2) − 10 = 4 > 0
g(3, 1) = (3, 4)·(3, 1) − 10 = 3 > 0
g(1, 1) = (3, 4)·(1, 1) − 10 = −3 ≤ 0

If g(x) > 0 then f(x) = 1; if g(x) ≤ 0 then f(x) = −1.

So f(2, 2) = 1, f(3, 1) = 1, f(2, 1) = f(0, 5/2) = −1, and f(1, 1) = −1.
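A quick numerical check of this example (a minimal sketch using NumPy; the weights and points are the ones above):

```python
import numpy as np

w = np.array([3.0, 4.0])   # weight vector w
w0 = -10.0                 # bias term w_0

points = np.array([[2, 1], [0, 2.5], [2, 2], [3, 1], [1, 1]], dtype=float)

g = points @ w + w0                  # g(x) = w^T x + w_0
f = np.where(g > 0, 1, -1)           # f(x) = 1 if g(x) > 0, else -1

for x, gx, fx in zip(points, g, f):
    print(f"x = {x}, g(x) = {gx:+.1f}, f(x) = {fx:+d}")
# g values: 0, 0, +4, +3, -3  ->  labels: -1, -1, +1, +1, -1
```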

Page 8: SVM Lecture 2
Page 9: SVM Lecture 2
with shared variance for each feature (for each x_i, requiring the estimated variances of the distributions p[x_i | +] and p[x_i | −] to be the same)

(For the usual Gaussian Naive Bayes, where you don't require shared variance for each feature, the discriminant function is quadratic.)

That is, if you have Boolean features and you treat them as discrete/categorical features, running the standard NB algorithm for discrete/categorical features will produce a linear discriminant.
Page 10: SVM Lecture 2
Page 11: SVM Lecture 2

Which line (hyperplane) to choose?

[Figure: two candidate separating lines g(x) for the same + and − points; the maximal margin hyperplane leaves the widest margin on both sides.]

Maximal Margin Hyperplane

Page 12: SVM Lecture 2

How to compute the distance from a point x to the hyperplane

Write x = x_p + r·(w/||w||), where x_p is the normal projection of x onto the hyperplane {x : g(x) = w^T x + w_0 = 0} and r is the signed distance.

Then
g(x) = w^T x + w_0 = w^T (x_p + r·w/||w||) + w_0 = (w^T x_p + w_0) + r·(w^T w)/||w||.

Observe that w^T x_p + w_0 = 0 (x_p lies on the hyperplane) and w^T w = ||w||², so g(x) = r·||w||.

Thus r = g(x)/||w||.

[Figure: a point x, its projection x_p onto the hyperplane, the normal vector w, and the distance r.]

Page 13: SVM Lecture 2

w^T = (3, 4), w_0 = −10, g(x) = w^T x + w_0

Distance Formula: r = g(x)/||w||

g(2, 2)/||(3, 4)|| = ((3, 4)·(2, 2) − 10)/5 = 4/5
g(3, 1)/||(3, 4)|| = ((3, 4)·(3, 1) − 10)/5 = 3/5
g(1, 1)/||(3, 4)|| = ((3, 4)·(1, 1) − 10)/5 = −3/5
g(1, 0.5)/||(3, 4)|| = ((3, 4)·(1, 0.5) − 10)/5 = −1

[Figure: the points (2,2), (3,1), (1,1), and (1,0.5) and their signed distances to the line g(x) = 0.]
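The same distances, checked numerically (a NumPy sketch):

```python
import numpy as np

w = np.array([3.0, 4.0])
w0 = -10.0

points = np.array([[2, 2], [3, 1], [1, 1], [1, 0.5]])

# signed distance r = g(x) / ||w||
r = (points @ w + w0) / np.linalg.norm(w)
print(r)   # [ 0.8  0.6 -0.6 -1. ]  i.e. 4/5, 3/5, -3/5, -1
```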

Page 14: SVM Lecture 2

g'(x) = w'^T x + w'_0 = (1/3)·g(x), where w' = (1/3)·w = (1, 4/3)^T and w'_0 = −10/3.

Distance Formula: r = g'(x)/||w'||, with ||w'|| = ||(1, 4/3)|| = 5/3.

g'(2, 2)/||(1, 4/3)|| = ((1, 4/3)·(2, 2) − 10/3)/(5/3) = 4/5
g'(3, 1)/||(1, 4/3)|| = ((1, 4/3)·(3, 1) − 10/3)/(5/3) = 3/5
g'(1, 1)/||(1, 4/3)|| = ((1, 4/3)·(1, 1) − 10/3)/(5/3) = −3/5
g'(1, 0.5)/||(1, 4/3)|| = ((1, 4/3)·(1, 0.5) − 10/3)/(5/3) = −1

The distances are the same as before: rescaling (w, w_0) does not change the hyperplane or the geometric distances.

[Figure: the same points (2,2), (3,1), (1,1), (1,0.5) and the same line g(x) = 0.]

Page 15: SVM Lecture 2

We want to classify points in space. Which hyperplane does SVM choose?

[Figure: the same + and − points with several candidate separating hyperplanes.]

Page 16: SVM Lecture 2

Maximal Margin Hyperplane

The margin is the geometric distance from the closest training example to the hyperplane.

[Figure: the maximal margin hyperplane g(x) = 0 with the margin marked on each side; the closest + and − examples are the support vectors.]

For the closest examples x^(1) and x^(2) on either side (the support vectors),

y^(1)·g(x^(1))/||w|| = y^(2)·g(x^(2))/||w||

Page 17: SVM Lecture 2

g(x) = (3, 4)·x − 10

The hyperplane is defined by all the points which satisfy g(x) = (3, 4)·x − 10 = 0,
e.g. g(2, 1) = (3, 4)·(2, 1) − 10 = 0 and g(0, 2.5) = 0.

All the points above the line are positive: g(x) = (3, 4)·x − 10 > 0,
e.g. g(2, 2) = (3, 4)·(2, 2) − 10 = 4 and g(3, 1) = (3, 4)·(3, 1) − 10 = 3.

All the points below the line are negative: g(x) = (3, 4)·x − 10 < 0,
e.g. g(1, 1) = (3, 4)·(1, 1) − 10 = −3.

We use the hyperplane to classify a point x: f(x) = 1 if w·x + w_0 > 0, and f(x) = −1 if w·x + w_0 ≤ 0.

[Figure: the line (3, 4)·x = 10 with the labeled points (2,1), (0,2.5), (2,2), (3,1), (1,1).]

Page 18: SVM Lecture 2

Notice that for any hyperplane we have an infinite number of formulas that describe it!

If (3, 4)·x − 10 = 0, then also (1/3)·((3, 4)·x − 10) = 0, and 23·((3, 4)·x − 10) = 0, and 0.9876·((3, 4)·x − 10) = 0.
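A small sketch confirming that rescaling (w, w_0) by a positive constant leaves the signed distances, and hence the classifier, unchanged (the scale factors are the ones above):

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -10.0
points = np.array([[2, 2], [3, 1], [1, 1], [1, 0.5]])

for c in [1.0, 1/3, 23.0, 0.9876]:               # positive rescalings of (w, w0)
    wc, w0c = c * w, c * w0
    r = (points @ wc + w0c) / np.linalg.norm(wc)  # geometric (signed) distance
    print(c, np.round(r, 3))                      # same distances for every c
```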

Page 19: SVM Lecture 2
Page 20: SVM Lecture 2
if it is a maximum margin hyperplane -- since for such a hyperplane, the distance to the closest + example must be equal to the distance to the closest − example
Page 21: SVM Lecture 2

The canonical hyperplane for a set of training examples is one satisfying

y^(i)·(w'^T x^(i) + w'_0) ≥ 1 for all i

(the functional margin is 1).

For S = {⟨(1,1), −1⟩, ⟨(2,2), 1⟩, ⟨(1, 1/2), −1⟩, ⟨(3,2), 1⟩, ...} and g'(x) = (1, 4/3)·x − 10/3:

1·((1, 4/3)·(2, 2) − 10/3) = 4/3 ≥ 1
−1·((1, 4/3)·(1, 1/2) − 10/3) = 5/3 ≥ 1
g'(3, 1) = (1, 4/3)·(3, 1) − 10/3 = 1
g'(1, 1) = (1, 4/3)·(1, 1) − 10/3 = −1

[Figure: the points of S with the lines g'(x) = 0, g'(x) = 1, and g'(x) = −1; the support vectors (3,1) and (1,1) lie on the ±1 lines.]

Page 22: SVM Lecture 2

For a canonical hyperplane w.r.t. a fixed set of training examples S, the margin is computed by

ρ = g(x^+)/||w|| = −g(x^−)/||w|| = 1/||w||

where x^+ and x^− are the closest + and − examples (the support vectors).

Remember the distance from a point x to the hyperplane is g(x)/||w||, where g(x) = w^T x + w_0.

For g(x) = (1, 4/3)^T x − 10/3:

ρ = 1/||(1, 4/3)|| = 1/√(1 + 16/9) = 3/5

[Figure: the points of S with the canonical hyperplane and its margin 3/5 on each side.]

Page 23: SVM Lecture 2

Distance of x to the hyperplane is: r = g(x)/||w|| = ±1/||w||, assuming a canonical hyperplane (and x a support vector).

For the canonical hyperplane the margin is 1/||w||.

To find the maximal margin canonical hyperplane, the goal is to minimize ||w||.
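A sketch computing the margin 1/||w|| for the canonical hyperplane of the example, and checking that the support vectors (3,1) and (1,1) lie exactly at that distance:

```python
import numpy as np

# Canonical hyperplane from the example: g(x) = (1, 4/3)^T x - 10/3
w = np.array([1.0, 4.0 / 3.0])
w0 = -10.0 / 3.0

margin = 1.0 / np.linalg.norm(w)
print(margin)                      # 0.6  (= 3/5)

# The support vectors (3,1) and (1,1) satisfy g(x) = +1 and -1,
# so their geometric distance to the hyperplane equals the margin.
for x in [np.array([3.0, 1.0]), np.array([1.0, 1.0])]:
    print(x, w @ x + w0, abs(w @ x + w0) / np.linalg.norm(w))
```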

Page 24: SVM Lecture 2
Page 25: SVM Lecture 2

For a set of training examples, we can find the maximum margin hyperplane in polynomial time!

To do this, we reformulate the problem as an optimization problem.

Page 26: SVM Lecture 2

There is an algorithm to solve an optimization problem if it has this form:

min: f(w)
Subject to: g_i(w) ≤ 0 for all i
            h_i(w) = 0 for all i

where f is convex, each g_i is convex, and each h_i is affine. We can use standard techniques to find the optimum.

Page 27: SVM Lecture 2

Finding the largest geometric margin by finding the g(x) which solves:

max: ρ
Subject to: y^(i)·(w^T x^(i) + w_0)/||w|| ≥ ρ for all i

Page 28: SVM Lecture 2

Finding the largest geometric margin by finding the g(x) which solves:

min: ||w||
Subject to: y^(i)·(w^T x^(i) + w_0) ≥ 1 for all i

Page 29: SVM Lecture 2

Finding the largest margin by finding the g(x) which solves:

min: (1/2)·||w||²
Subject to: y^(i)·(w^T x^(i) + w_0) ≥ 1 for all i
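A sketch of solving this quadratic program directly with scipy.optimize. The labeled points are the ones listed on these slides; the full training set S is elided there, so the hyperplane found here need not match the slides' example exactly:

```python
import numpy as np
from scipy.optimize import minimize

# Training points taken from the slides (the "..." in S is omitted)
X = np.array([[1, 1], [2, 2], [1, 0.5], [3, 2], [3, 1]], dtype=float)
y = np.array([-1, 1, -1, 1, 1], dtype=float)

# Variables: z = (w1, w2, w0). Objective: (1/2)||w||^2
objective = lambda z: 0.5 * (z[0]**2 + z[1]**2)

# Constraints: y_i (w^T x_i + w0) - 1 >= 0 for all i
constraints = [{'type': 'ineq',
                'fun': (lambda z, xi=xi, yi=yi: yi * (z[:2] @ xi + z[2]) - 1.0)}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print(w, w0, 1.0 / np.linalg.norm(w))   # weight vector, bias, and margin
```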

Page 30: SVM Lecture 2
Page 31: SVM Lecture 2
Page 32: SVM Lecture 2

Solving this constrained quadratic optimization requires that the Karush-Kuhn-Tucker (KKT) conditions are met.

The Karush-Kuhn-Tucker (KKT) conditions imply that v_i is non-zero only if x^(i) is a support vector!

Page 33: SVM Lecture 2
Page 34: SVM Lecture 2

The hypothesis for the set of training examples S = {⟨(1,1), −1⟩, ⟨(2,2), 1⟩, ⟨(1, 1/2), −1⟩, ⟨(3,2), 1⟩, ...} is

g(x) = v_1·(2, 7/4)·x + v_2·(3, 1)·x + v_3·(1, 1)·x + w_0

Note that only the support vectors are in the hypothesis.

[Figure: the points of S with the lines g(x) = w^T x + b = 0, g(x) = 1, and g(x) = −1; the support vectors (2, 7/4), (3, 1), and (1, 1) lie on the ±1 lines.]
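A sketch of the same idea with scikit-learn's linear SVC (an off-the-shelf solver, not the derivation on these slides): after fitting, the hypothesis is expressed through the support vectors and their dual coefficients only. The point set below is assumed from the slides and is incomplete:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [1, 0.5], [3, 2], [2, 1.75], [3, 1]])
y = np.array([-1, 1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6)    # very large C approximates a hard margin
clf.fit(X, y)

# Dual form: g(x) = sum_i (alpha_i * y_i) <x_i, x> + b, over support vectors only
print(clf.support_vectors_)          # the support vectors
print(clf.dual_coef_)                # alpha_i * y_i for each support vector
print(clf.coef_, clf.intercept_)     # the equivalent primal (w, w_0)
```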

Page 35: SVM Lecture 2

III. Non-Linearly Separable Data

Page 36: SVM Lecture 2
Page 37: SVM Lecture 2

[Figure: positive and negative examples.]

Page 38: SVM Lecture 2


Page 39: SVM Lecture 2

Linearly separable?

[Figure: + and − points including (0,0), (0.5,0), (1,1), (−1,−1), (0,1.25), (−0.25,0.25), (0.5,−0.5), (−0.75,−0.25), (−1.25,0), (−1,1), (1,−0.75); no linear function g(x) = w^T x + w_0 separates them.]

Page 40: SVM Lecture 2

Linearly separable?

Transform the feature space: φ : x → φ(x), with φ(x) = [x_1², x_2²].

g(x) = w^T φ(x) + w_0, with w = (1, 1) and w_0 = −1.

[Figure: the same points and their images under φ, e.g. (1, −0.75) → (1, 0.5625), (−1.25, 0) → (1.5625, 0), (0, 1.25) → (0, 1.5625), (−0.75, −0.25) → (0.5625, 0.0625); after the transformation the + and − points are linearly separable.]

Page 41: SVM Lecture 2

Linearly separable?

Transform the feature space: φ : x → φ(x), with φ(x) = [x_1², x_2²].

g(x) = w^T φ(x) + w_0, with w = (1, 1) and w_0 = −1.

g(1, 1) = (1, 1)·(1, 1) − 1 > 0
g(−1, −1) = (1, 1)·(1, 1) − 1 > 0
g(0.5, −0.5) = (1, 1)·(0.25, 0.25) − 1 ≤ 0

[Figure: the same points and their images under φ; in the transformed space they are linearly separable.]
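A sketch of this transformation in code (the map φ, weights w, and bias w_0 are the ones above):

```python
import numpy as np

def phi(x):
    """Feature transform phi(x) = [x1^2, x2^2]."""
    return np.array([x[0]**2, x[1]**2])

w = np.array([1.0, 1.0])
w0 = -1.0

def g(x):
    return w @ phi(x) + w0      # g(x) = w^T phi(x) + w_0

for x in [(1, 1), (-1, -1), (0.5, -0.5), (0, 1.25), (-0.75, -0.25)]:
    gx = g(np.array(x, dtype=float))
    print(x, round(gx, 4), '+' if gx > 0 else '-')
# (1,1), (-1,-1), (0,1.25) come out +; (0.5,-0.5), (-0.75,-0.25) come out -
```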

Page 42: SVM Lecture 2

Linearly separable?

[Figure: + and − labeled points on the number line from 0 to 3; no single threshold separates them.]

Page 43: SVM Lecture 2

Linearly separable?

Yes, by transforming the feature space! φ(x) = [x, x²]

g(φ(x)) = (1, −3)·φ(x) + 2
f(x) = 1 if g(φ(x)) = (1, −3)·φ(x) + 2 > 0
f(x) = −1 if g(φ(x)) = (1, −3)·φ(x) + 2 < 0

[Figure: the labeled points on the number line from 0 to 3.]

There is an error in this slide. The given g is equal to x − 3x² + 2, which is positive iff −2/3 < x < 1 (check this by factoring). So this slide and the next one can be fixed by relabeling the points on the line accordingly. (Alternatively, change φ(x) to be [x², x] instead of [x, x²]. Then the labeling on the line is correct, but the rest of the example needs to be changed.)
Page 44: SVM Lecture 2

Linearly separable?

Yes, by transforming the feature space! φ(x) = [x, x²]

g(φ(x)) = (1, −3)·φ(x) + 2
f(x) = 1 if g(φ(x)) = (1, −3)·φ(x) + 2 > 0
f(x) = −1 if g(φ(x)) = (1, −3)·φ(x) + 2 < 0

g(φ(1/2)) = g(1/2, 1/4) = (1, −3)·(1/2, 1/4) + 2 = 7/4
g(φ(3/2)) = g(3/2, 9/4) = (1, −3)·(3/2, 9/4) + 2 = −13/4
g(φ(2)) = g(2, 4) = (1, −3)·(2, 4) + 2 = −8

[Figure: the labeled points on the number line from 0 to 3.]
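A sketch evaluating g(φ(x)) at the sample points above:

```python
import numpy as np

phi = lambda x: np.array([x, x**2])                      # phi(x) = [x, x^2]
g = lambda x: np.array([1.0, -3.0]) @ phi(x) + 2.0       # g(phi(x)) = (1,-3).phi(x) + 2

for x in [0.5, 1.5, 2.0]:
    print(x, g(x))        # 1.75, -3.25, -8.0
```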

Page 45: SVM Lecture 2

These points become linearly separable by transforming the feature space using φ(x) = [x, x²] and g(φ(x)) = (1, −3)·φ(x) + 2.

[Figure: the 1-D points mapped onto the parabola (x, x²), e.g. (3/4, 9/16), (5/4, 25/16), (3/2, 9/4), (7/4, 49/16); in this 2-D space the + and − points are now linearly separable.]

Page 46: SVM Lecture 2

Kernel Function

K(x, z) = φ(x)·φ(z)

Transform the feature space: map x to φ(x).

KERNEL TRICK: Never compute φ(x). Just compute K(x, z). Why is this enough? If we work with the dual representation of the hyperplane (and the dual quadratic program), the only use of the new features is in inner products!
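A sketch of the kernel trick on a standard example not taken from these slides: the degree-2 polynomial kernel K(x, z) = (x·z)² equals the inner product of the explicit features φ(x) = (x_1², √2·x_1·x_2, x_2²), so K can be computed without ever forming φ(x):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2-D input)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, z):
    """Kernel computed directly in the original space: K(x, z) = (x . z)^2."""
    return (x @ z) ** 2

x = np.array([1.0, -0.75])
z = np.array([0.5, 2.0])

print(phi(x) @ phi(z))   # inner product in the transformed space
print(K(x, z))           # same value, without ever forming phi(x)
```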
Page 47: SVM Lecture 2
Page 48: SVM Lecture 2
Page 49: SVM Lecture 2
Page 50: SVM Lecture 2
Page 51: SVM Lecture 2

IV. Non-Separable Data

Page 52: SVM Lecture 2

What if the data is not linearly separable because of only a few points?

[Figure: mostly separable + and − points, with a few points x on the wrong side of any separating line.]

Page 53: SVM Lecture 2

What if a small number of points prevents the margin from being large?

[Figure: separable + and − points where one point x close to the boundary forces a small margin.]

Page 54: SVM Lecture 2
Page 55: SVM Lecture 2
Page 56: SVM Lecture 2

What if a small number of points prevents the margin from being large?

[Figure: two soft-margin solutions for the same data, one with λ large and one with λ small; the tradeoff between margin width and how much the point x is accommodated changes with λ.]

What if λ = ∞?
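A sketch of this tradeoff with scikit-learn, whose parameter C plays a role analogous to the slides' λ (the exact correspondence depends on the soft-margin formulation used); larger C penalizes margin violations more heavily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two mostly separated clusters plus one stray point near the other class
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),
               rng.normal([0, 0], 0.3, (20, 2)),
               [[0.3, 0.3]]])                  # stray + point inside the - cluster
y = np.array([1] * 20 + [-1] * 20 + [1])

for C in [0.1, 1.0, 1000.0]:                   # soft ... nearly hard margin
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:7.1f}  margin={1 / np.linalg.norm(w):.3f}  "
          f"support vectors={len(clf.support_)}")
```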

Page 57: SVM Lecture 2
Page 58: SVM Lecture 2