![Page 1: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/1.jpg)
Introduction to SVMs
![Page 2: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/2.jpg)
SVMs
• Geometric – Maximizing Margin
• Kernel Methods – Making nonlinear decision boundaries linear – Efficiently!
• Capacity – Structural Risk Minimization
![Page 3: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/3.jpg)
Linear Classifiers
x → f → y_est
[Figure: scatter plot of datapoints; one marker denotes class +1, the other class -1]
f(x, w, b) = sign(w · x + b)
How would you classify this data?
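As a minimal sketch of the classifier on this slide, the function below evaluates sign(w · x + b), using the w · x + b convention of the later plus-plane and minus-plane slides. The helper names `dot` and `classify`, the example weights, and the choice to send boundary points to +1 are assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def classify(x, w, b):
    """Linear classifier f(x, w, b) = sign(w . x + b), returning +1 or -1.

    Assumption: points exactly on the boundary (w . x + b == 0) are
    assigned to class +1; the slides leave this case unspecified.
    """
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical 2-D example: the boundary is the line x1 + x2 = 1.
w, b = (1.0, 1.0), -1.0
print(classify((2.0, 2.0), w, b))   # a point on the positive side
print(classify((-1.0, 0.0), w, b))  # a point on the negative side
```

Many different (w, b) pairs classify a separable training set perfectly, which is why the following slides ask which one is best.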
![Page 4: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/4.jpg)
![Page 5: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/5.jpg)
![Page 6: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/6.jpg)
![Page 7: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/7.jpg)
Linear Classifiers
f(x, w, b) = sign(w · x + b)
Any of these would be fine...
...but which is best?
![Page 8: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/8.jpg)
Classifier Margin
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width by which the boundary could be widened before hitting a datapoint.
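The definition above can be sketched numerically. Assuming the band is widened symmetrically around the boundary w · x + b = 0, its width is twice the distance from the boundary to the nearest datapoint. The function name `margin_width` and the sample points are illustrative assumptions.

```python
import math


def margin_width(w, b, points):
    """Width of the widest band centred on the boundary w . x + b = 0
    that contains no datapoint: twice the distance to the nearest point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    nearest = min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x in points
    )
    return 2.0 * nearest


# Hypothetical datapoints (labels are not needed to measure the width).
points = [(2.0, 0.0), (3.0, 1.0), (-1.0, 0.0), (-2.0, 2.0)]
print(margin_width((1.0, 0.0), 0.0, points))   # boundary x1 = 0
print(margin_width((1.0, 0.0), -0.5, points))  # boundary x1 = 0.5
```

Shifting the boundary from x1 = 0 to x1 = 0.5 centres it between the nearest points on each side, which widens the margin.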
![Page 9: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/9.jpg)
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM, called a linear SVM (LSVM).
![Page 10: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/10.jpg)
Maximum Margin
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
Support vectors are those datapoints that the margin pushes up against.
![Page 11: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/11.jpg)
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.
3. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
4. Empirically it works very very well.
![Page 12: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/12.jpg)
A “Good” Separator
[Figure: scatter plot of X and O datapoints]
![Page 13: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/13.jpg)
Noise in the Observations
[Figure: scatter plot of X and O datapoints]
![Page 14: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/14.jpg)
Ruling Out Some Separators
[Figure: scatter plot of X and O datapoints]
![Page 15: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/15.jpg)
Lots of Noise
[Figure: scatter plot of X and O datapoints]
![Page 16: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/16.jpg)
Maximizing the Margin
[Figure: scatter plot of X and O datapoints]
![Page 17: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/17.jpg)
Specifying a line and margin
• How do we represent this mathematically?
• …in m input dimensions?
[Figure labels: Plus-Plane, Minus-Plane, Classifier Boundary, “Predict Class = +1” zone, “Predict Class = -1” zone]
![Page 18: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/18.jpg)
Specifying a line and margin
• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
[Figure: the lines w · x + b = 1 (Plus-Plane), w · x + b = 0 (Classifier Boundary) and w · x + b = -1 (Minus-Plane), with the “Predict Class = +1” and “Predict Class = -1” zones]
Classify as:
+1 if w · x + b >= 1
-1 if w · x + b <= -1
Universe explodes if -1 < w · x + b < 1
![Page 19: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/19.jpg)
Computing the margin width
• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
[Figure: the lines w · x + b = 1, 0 and -1, with the “Predict Class = +1” and “Predict Class = -1” zones; M = Margin Width]
How do we compute M in terms of w and b?
![Page 20: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/20.jpg)
Computing the margin width
Claim: The vector w is perpendicular to the Plus Plane. Why?
Let u and v be two vectors on the Plus Plane. What is w · (u - v)?
And so of course the vector w is also perpendicular to the Minus Plane.
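The question on this slide can be answered in two lines. The derivation below uses only the plus-plane definition w · x + b = +1 for both u and v:

```latex
\begin{aligned}
w \cdot u + b &= 1, \qquad w \cdot v + b = 1 \\
\Rightarrow\; w \cdot (u - v) &= (w \cdot u + b) - (w \cdot v + b) = 1 - 1 = 0
\end{aligned}
```

Since u - v is an arbitrary direction lying within the plus-plane, w is perpendicular to every such direction. The same argument with -1 in place of +1 covers the minus-plane.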
![Page 21: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/21.jpg)
Computing the margin width
• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane point to x-
(x- and x+ may be any locations in R^m: not necessarily datapoints.)
[Figure: the lines w · x + b = 1, 0 and -1 with the points x- and x+ marked; M = Margin Width]
How do we compute M in terms of w and b?
![Page 22: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/22.jpg)
Computing the margin width
• Claim: x+ = x- + λ w for some value of λ. Why?
![Page 23: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/23.jpg)
Computing the margin width
• Claim: x+ = x- + λ w for some value of λ. Why?
The line from x- to x+ is perpendicular to the planes.
So to get from x- to x+, travel some distance in direction w.
![Page 24: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/24.jpg)
Computing the margin width
What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λ w
• |x+ - x-| = M
It’s now easy to get M in terms of w and b.
![Page 25: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/25.jpg)
Computing the margin width
w · (x- + λ w) + b = 1
=> w · x- + b + λ w · w = 1
=> -1 + λ w · w = 1
=> λ = 2 / (w · w)
![Page 26: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/26.jpg)
Computing the margin width
What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λ w
• |x+ - x-| = M
• λ = 2 / (w · w)
$$M = |x^{+} - x^{-}| = |\lambda w| = \lambda \sqrt{w \cdot w} = \frac{2\sqrt{w \cdot w}}{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}$$
So M = Margin Width = 2 / sqrt(w · w).
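The result M = 2 / sqrt(w · w) can be checked numerically. The sketch below picks hypothetical values for w and b, constructs a point x- on the minus-plane, steps by λ w with λ = 2 / (w · w), and measures the distance travelled. All variable names and values are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


w, b = (3.0, 4.0), 2.0  # hypothetical values; here w . w = 25

# Pick a point on the minus-plane of the form t * w, so w . x_minus + b = -1.
t = (-1.0 - b) / dot(w, w)
x_minus = tuple(t * wi for wi in w)

# Step in direction w by lambda = 2 / (w . w) to reach the plus-plane.
lam = 2.0 / dot(w, w)
x_plus = tuple(xm + lam * wi for xm, wi in zip(x_minus, w))

# Distance between the two points, to compare with the closed form.
M = math.sqrt(sum((p - m) ** 2 for p, m in zip(x_plus, x_minus)))
closed_form = 2.0 / math.sqrt(dot(w, w))

print(dot(w, x_plus) + b)  # numerically close to +1
print(M, closed_form)      # the two widths agree up to rounding
```

Note that b does not appear in the closed form: shifting the planes does not change the distance between them.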
![Page 27: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/27.jpg)
Learning the Maximum Margin Classifier
Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the datapoints. How?
Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton’s Method?
[Figure: the lines w · x + b = 1, 0 and -1 with the points x- and x+ marked; M = Margin Width = 2 / sqrt(w · w)]
![Page 28: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/28.jpg)
• Linear Programming
find w = argmax c · w
subject to
w · a_i <= b_i, for i = 1, …, m
w_j >= 0, for j = 1, …, n
Don’t worry… it’s good for you…
There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar’s algorithm.
![Page 29: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/29.jpg)
Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
![Page 30: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/30.jpg)
Quadratic Programming
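Find
$$\arg\max_{u} \; c + d^{T}u + \frac{u^{T}Ru}{2} \qquad \text{(quadratic criterion)}$$
Subject to n additional linear inequality constraints:
$$\begin{aligned}
a_{11}u_1 + a_{12}u_2 + \dots + a_{1m}u_m &\le b_1 \\
a_{21}u_1 + a_{22}u_2 + \dots + a_{2m}u_m &\le b_2 \\
&\;\;\vdots \\
a_{n1}u_1 + a_{n2}u_2 + \dots + a_{nm}u_m &\le b_n
\end{aligned}$$
And subject to e additional linear equality constraints:
$$\begin{aligned}
a_{(n+1)1}u_1 + a_{(n+1)2}u_2 + \dots + a_{(n+1)m}u_m &= b_{(n+1)} \\
a_{(n+2)1}u_1 + a_{(n+2)2}u_2 + \dots + a_{(n+2)m}u_m &= b_{(n+2)} \\
&\;\;\vdots \\
a_{(n+e)1}u_1 + a_{(n+e)2}u_2 + \dots + a_{(n+e)m}u_m &= b_{(n+e)}
\end{aligned}$$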
![Page 31: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/31.jpg)
Quadratic Programming
There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent.
(But they are very fiddly… you probably don’t want to write one yourself.)
![Page 32: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/32.jpg)
Learning the Maximum Margin Classifier
Given a guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = ±1
[Figure: the lines w · x + b = 1, 0 and -1; M = 2 / sqrt(w · w)]
What should our quadratic optimization criterion be? Minimize w · w
How many constraints will we have? R
What should they be?
w · x_k + b >= 1 if y_k = 1
w · x_k + b <= -1 if y_k = -1
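To make the criterion and constraints concrete, here is a minimal sketch that checks the R constraints for a guessed (w, b) and compares feasible guesses by w · w. It is not a QP solver: it only scores a short hand-written list of candidates. The function names, the toy dataset, and the candidate list are all assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def satisfies_constraints(w, b, data):
    """True if every (x_k, y_k) obeys y_k * (w . x_k + b) >= 1, which
    folds the two one-sided constraints into a single inequality."""
    return all(y * (dot(w, x) + b) >= 1.0 for x, y in data)


def criterion(w):
    """Quadratic criterion w . w; smaller values mean a wider margin."""
    return dot(w, w)


# Hypothetical separable dataset of (x_k, y_k) pairs.
data = [((2.0, 0.0), 1), ((3.0, 1.0), 1),
        ((-2.0, 0.0), -1), ((-3.0, -1.0), -1)]

# Hand-picked guesses for (w, b); a real QP solver would search this space.
candidates = [((1.0, 0.0), 0.0), ((0.5, 0.0), 0.0), ((0.25, 0.0), 0.0)]

feasible = [(w, b) for w, b in candidates if satisfies_constraints(w, b, data)]
best = min(feasible, key=lambda wb: criterion(wb[0]))
print(best)
```

Among these guesses, shrinking w widens the margin until a datapoint lands exactly on a plus- or minus-plane; shrinking it further violates a constraint.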
![Page 33: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/33.jpg)
Uh-oh!
[Figure: scatter plot of datapoints; one marker denotes +1, the other -1]
This is going to be a problem!
What should we do?
Idea 1:
Find minimum w · w, while minimizing the number of training set errors.
Problem: Two things to minimize makes for an ill-defined optimization.
![Page 34: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/34.jpg)
Uh-oh!
This is going to be a problem!
What should we do?
Idea 1.1:
Minimize w · w + C (#train errors), where C is a tradeoff parameter.
There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?
![Page 35: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/35.jpg)
Uh-oh!
Idea 1.1:
Minimize w · w + C (#train errors), where C is a tradeoff parameter.
There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?
Can’t be expressed as a Quadratic Programming problem. Solving it may be too slow.
(Also, it doesn’t distinguish between disastrous errors and near misses.)
So… any other ideas?
![Page 36: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/36.jpg)
Uh-oh!
Idea 2.0:
Minimize w.w + C (distance of error points to their correct place)
![Page 37: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/37.jpg)
Learning Maximum Margin with Noise
Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = ±1
[Figure: the lines w · x + b = 1, 0 and -1; M = 2 / sqrt(w · w)]
What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?
![Page 38: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/38.jpg)
04/21/23 38
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible
– We should maximize the margin, m
– Distance between the origin and the line w^T x = k is k/||w||
[Figure: Class 1 and Class 2 datapoints with margin m]
![Page 39: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/39.jpg)
Finding the Decision Boundary
• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i
• The decision boundary should classify all points correctly: y_i (w^T x_i + b) >= 1 for all i
• The decision boundary can be found by solving the following constrained optimization problem:
$$\min_{w,\,b} \; \tfrac{1}{2}\|w\|^{2} \quad \text{subject to} \quad y_i\,(w^{T}x_i + b) \ge 1 \;\; \forall i$$
• Solving this constrained optimization problem requires some new tools
– Feel free to ignore the following several slides; what is important is the constrained optimization problem above
![Page 40: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/40.jpg)
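Back to the Original Problem
• The Lagrangian is
$$\mathcal{L} = \tfrac{1}{2} w^{T}w + \sum_{i=1}^{n} \alpha_i \left(1 - y_i\,(w^{T}x_i + b)\right)$$
– Note that ||w||^2 = w^T w
• Setting the gradient of L w.r.t. w and b to zero, we have
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$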
![Page 41: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/41.jpg)
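• The Karush-Kuhn-Tucker conditions:
$$\alpha_i \ge 0, \qquad y_i\,(w^{T}x_i + b) - 1 \ge 0, \qquad \alpha_i\left[\,y_i\,(w^{T}x_i + b) - 1\right] = 0 \qquad \forall i$$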
![Page 42: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/42.jpg)
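The Dual Problem
• If we substitute w = Σ_i α_i y_i x_i into the Lagrangian L, we have
$$\mathcal{L} = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^{T}x_j - b\sum_{i=1}^{n} \alpha_i y_i$$
• Note that Σ_i α_i y_i = 0, so the last term vanishes
• This is a function of α_i only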
![Page 43: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/43.jpg)
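The Dual Problem
• The new objective function is in terms of α_i only
• It is known as the dual problem: if we know w, we know all α_i; if we know all α_i, we know w
• The original problem is known as the primal problem
• The objective function of the dual problem needs to be maximized!
• The dual problem is therefore:
$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^{T}x_j \quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0$$
– α_i >= 0: properties of α_i when we introduce the Lagrange multipliers
– Σ_i α_i y_i = 0: the result when we differentiate the original Lagrangian w.r.t. b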
![Page 44: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/44.jpg)
The Dual Problem
• This is a quadratic programming (QP) problem
– A global maximum of α_i can always be found
• w can be recovered by
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$
![Page 45: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/45.jpg)
Characteristics of the Solution
• Many of the α_i are zero
– w is a linear combination of a small number of data points
– This “sparse” representation can be viewed as data compression, as in the construction of a kNN classifier
• x_i with non-zero α_i are called support vectors (SV)
– The decision boundary is determined only by the SV
– Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write
$$w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}$$
• For testing with a new data point z
– Compute
$$w^{T}z + b = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} \left(x_{t_j}^{T} z\right) + b$$
and classify z as class 1 if the sum is positive, and class 2 otherwise
– Note: w need not be formed explicitly
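The testing rule above can be sketched without ever forming w: the decision value is a sum over support vectors only, each contributing its multiplier, its label, and a dot product with the test point. The function name `decision_value`, the tuple layout, and the two-point example (with multipliers 0.5 each and b = 0) are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decision_value(z, support_vectors, b):
    """Sum over support vectors of alpha * y * (x . z), plus b.

    support_vectors is a list of (alpha, y, x) triples; points whose
    multiplier is zero are simply left out of the list.
    """
    return sum(alpha * y * dot(x, z) for alpha, y, x in support_vectors) + b


# Hypothetical problem with two support vectors on the x1 axis.
svs = [(0.5, 1, (1.0, 0.0)), (0.5, -1, (-1.0, 0.0))]
b = 0.0

for z in [(2.0, 3.0), (-0.5, 4.0)]:
    value = decision_value(z, svs, b)
    print(z, "class 1" if value > 0 else "class 2")
```

Because only dot products between datapoints and the test point appear, the same routine carries over to the kernel methods discussed later by swapping `dot` for a kernel function.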
![Page 46: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/46.jpg)
A Geometrical Interpretation
[Figure: Class 1 and Class 2 datapoints labelled with their multipliers: α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, and α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0]
![Page 47: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/47.jpg)
Non-linearly Separable Problems
• We allow “error” ξ_i in classification; it is based on the output of the discriminant function w^T x + b
• ξ_i approximates the number of misclassified samples
[Figure: Class 1 and Class 2 datapoints]
![Page 48: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/48.jpg)
Learning Maximum Margin with Noise
Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = ±1
[Figure: the lines w · x + b = 1, 0 and -1 with M = 2 / sqrt(w · w); the error distances of three points are labelled ε_2, ε_7 and ε_11]
What should our quadratic optimization criterion be?
Minimize
$$\tfrac{1}{2}\, w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$$
How many constraints will we have? R
What should they be?
w · x_k + b >= 1 - ε_k if y_k = 1
w · x_k + b <= -1 + ε_k if y_k = -1
![Page 49: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/49.jpg)
Learning Maximum Margin with Noise
Our original (noiseless data) QP had m+1 variables: w_1, w_2, …, w_m, and b.
Our new (noisy data) QP has m+1+R variables: w_1, w_2, …, w_m, b, ε_1, …, ε_R
m = # input dimensions
R = # records
![Page 50: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/50.jpg)
Learning Maximum Margin with Noise
There’s a bug in this QP. Can you spot it?
![Page 51: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/51.jpg)
Learning Maximum Margin with Noise
What should our quadratic optimization criterion be?
Minimize
$$\tfrac{1}{2}\, w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$$
How many constraints will we have? 2R
What should they be?
w · x_k + b >= 1 - ε_k if y_k = 1
w · x_k + b <= -1 + ε_k if y_k = -1
ε_k >= 0 for all k
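For a given guess of (w, b), the smallest slack values that satisfy all 2R constraints are ε_k = max(0, 1 - y_k (w · x_k + b)), so the criterion can be evaluated directly. The sketch below does this; note that ε_k is measured in units of w · x + b rather than Euclidean distance. The function names, toy data, and the value of C are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(w, b, data):
    """Smallest eps_k >= 0 with y_k * (w . x_k + b) >= 1 - eps_k."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]


def soft_margin_objective(w, b, data, C):
    """Criterion 0.5 * w . w + C * sum of slacks for a guessed (w, b)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, data))


# Hypothetical noisy data: the last -1 point sits on the +1 side.
data = [((2.0, 0.0), 1), ((0.5, 0.0), 1),
        ((-2.0, 0.0), -1), ((1.0, 0.0), -1)]
w, b = (1.0, 0.0), 0.0

print(slacks(w, b, data))
print(soft_margin_objective(w, b, data, C=10.0))
```

The second point is correctly classified but inside the margin (slack 0.5), while the fourth is a genuine error (slack 2.0); this graded penalty is what Idea 1.1 lacked.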
![Page 52: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/52.jpg)
![Page 53: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/53.jpg)
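An Equivalent Dual QP
Maximize
$$\sum_{k=1}^{R} \alpha_k - \tfrac{1}{2} \sum_{k=1}^{R}\sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl} \quad \text{where} \quad Q_{kl} = y_k y_l\,(x_k \cdot x_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
This is equivalent to the primal QP:
Minimize
$$\tfrac{1}{2}\, w \cdot w + C \sum_{k=1}^{R} \varepsilon_k$$
subject to
w · x_k + b >= 1 - ε_k if y_k = 1
w · x_k + b <= -1 + ε_k if y_k = -1
ε_k >= 0, for all k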
![Page 54: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/54.jpg)
An Equivalent Dual QP
Maximize
$$\sum_{k=1}^{R} \alpha_k - \tfrac{1}{2} \sum_{k=1}^{R}\sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl} \quad \text{where} \quad Q_{kl} = y_k y_l\,(x_k \cdot x_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C \;\; \forall k, \qquad \sum_{k=1}^{R} \alpha_k y_k = 0$$
Then define:
$$w = \sum_{k=1}^{R} \alpha_k y_k x_k$$
$$b = y_K (1 - \varepsilon_K) - x_K \cdot w \quad \text{where} \quad K = \arg\max_{k} \alpha_k$$
Then classify with:
f(x, w, b) = sign(w · x + b)
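Example: the XOR problem revisited
Let the nonlinear mapping be:
$$\varphi(x) = \left(1,\; x_1^{2},\; \sqrt{2}\,x_1 x_2,\; x_2^{2},\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2\right)^{T}$$
And:
$$\varphi(x_i) = \left(1,\; x_{i1}^{2},\; \sqrt{2}\,x_{i1} x_{i2},\; x_{i2}^{2},\; \sqrt{2}\,x_{i1},\; \sqrt{2}\,x_{i2}\right)^{T}$$
Therefore the feature space is 6-dimensional, with input data in 2 dimensions:
x_1 = (-1,-1), d_1 = -1
x_2 = (-1, 1), d_2 = 1
x_3 = ( 1,-1), d_3 = 1
x_4 = ( 1, 1), d_4 = -1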
![Page 56: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/56.jpg)
$$Q(a) = \sum_{i} a_i - \tfrac{1}{2} \sum_{i}\sum_{j} a_i a_j d_i d_j\, \varphi(x_i)^{T}\varphi(x_j)$$
$$= a_1 + a_2 + a_3 + a_4 - \tfrac{1}{2}\left(9a_1^{2} - 2a_1a_2 - 2a_1a_3 + 2a_1a_4 + 9a_2^{2} + 2a_2a_3 - 2a_2a_4 + 9a_3^{2} - 2a_3a_4 + 9a_4^{2}\right)$$
To maximize Q, we only need to calculate
$$\frac{\partial Q(a)}{\partial a_i} = 0, \qquad i = 1, \dots, 4$$
(due to optimality conditions), which gives
1 = 9 a1 - a2 - a3 + a4
1 = -a1 + 9 a2 + a3 - a4
1 = -a1 + a2 + 9 a3 - a4
1 = a1 - a2 - a3 + 9 a4
![Page 57: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/57.jpg)
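The solution of which gives the optimal values: a_{0,1} = a_{0,2} = a_{0,3} = a_{0,4} = 1/8
$$w_0 = \sum_{i} a_{0,i}\, d_i\, \varphi(x_i) = \tfrac{1}{8}\left[-\varphi(x_1) + \varphi(x_2) + \varphi(x_3) - \varphi(x_4)\right]$$
$$= \tfrac{1}{8}\left[-\begin{pmatrix}1\\1\\\sqrt{2}\\1\\-\sqrt{2}\\-\sqrt{2}\end{pmatrix} + \begin{pmatrix}1\\1\\-\sqrt{2}\\1\\-\sqrt{2}\\\sqrt{2}\end{pmatrix} + \begin{pmatrix}1\\1\\-\sqrt{2}\\1\\\sqrt{2}\\-\sqrt{2}\end{pmatrix} - \begin{pmatrix}1\\1\\\sqrt{2}\\1\\\sqrt{2}\\\sqrt{2}\end{pmatrix}\right] = \begin{pmatrix}0\\0\\-\tfrac{1}{\sqrt{2}}\\0\\0\\0\end{pmatrix}$$
where the first element of w_0 gives the bias b.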
![Page 58: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/58.jpg)
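From earlier we have that the optimal hyperplane is defined by:
$$w_0^{T}\varphi(x) = 0$$
That is:
$$w_0^{T}\varphi(x) = \left(0,\; 0,\; -\tfrac{1}{\sqrt{2}},\; 0,\; 0,\; 0\right)\left(1,\; x_1^{2},\; \sqrt{2}\,x_1x_2,\; x_2^{2},\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2\right)^{T} = -x_1 x_2 = 0$$
which is the optimal decision boundary for the XOR problem. Furthermore we note that the solution is unique since the optimal decision boundary is unique.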
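The XOR example can be verified numerically. The sketch below assumes the standard XOR inputs (-1,-1), (-1,1), (1,-1), (1,1) with labels -1, +1, +1, -1 and the six-dimensional mapping defined in the example. It checks that all multipliers equal to 1/8 satisfy the four optimality equations, forms w_0, and evaluates w_0 · φ(x) at each input. The function and variable names are illustrative.

```python
import math


def phi(x):
    """Feature map (1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2)


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # XOR inputs
d = [-1, 1, 1, -1]                        # desired labels
a = [1.0 / 8.0] * 4                       # candidate optimal multipliers

# Optimality: sum_j a_j d_i d_j phi(x_i).phi(x_j) should equal 1 for each i.
conditions = [
    sum(a[j] * d[i] * d[j] * dot(phi(X[i]), phi(X[j])) for j in range(4))
    for i in range(4)
]

# Weight vector in feature space and the resulting outputs on the inputs.
w0 = [sum(a[i] * d[i] * phi(X[i])[k] for i in range(4)) for k in range(6)]
outputs = [dot(w0, phi(x)) for x in X]

print(conditions)  # each entry numerically close to 1
print(w0)          # only the third component is non-zero
print(outputs)     # equals -x1 * x2 at each input, matching d
```

The outputs reproduce the labels exactly (up to rounding), which confirms that the boundary -x1 x2 = 0 separates the XOR classes.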
![Page 59: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/59.jpg)
Output for polynomial RBF
![Page 60: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/60.jpg)
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer?
Let’s permit them here too:
$$z_k = (x_k,\ x_k^2)$$
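A small sketch of why the basis $z_k = (x_k, x_k^2)$ helps. The data points and the separator (`w`, `b`) below are hypothetical, hand-picked for illustration; they are not taken from the slide's figure. No single threshold on x separates the classes, but a straight line in z-space does.

```python
# Hypothetical 1-D data: the -1 class sits around x = 0, the +1 class on both sides.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [+1, +1, -1, -1, -1, +1, +1]

def basis(x):
    """Map a scalar x to z = (x, x^2)."""
    return (x, x * x)

# Hand-picked linear separator in z-space (an assumption, not a solved SVM):
# the line z2 = 2.5, used with the slides' convention sign(w.z - b).
w, b = (0.0, 1.0), 2.5

def classify(x):
    z = basis(x)
    score = w[0] * z[0] + w[1] * z[1] - b
    return 1 if score > 0 else -1

predictions = [classify(x) for x in xs]
```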
![Page 61: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/61.jpg)
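For a non-linearly separable problem we first have to map the data onto a feature space in which they are linearly separable:
$$x_i \rightarrow \varphi(x_i)$$
Given the training data sample {(xi, yi), i = 1, …, N}, find the optimum values of the weight vector w and bias b:
$$w = \sum_{i=1}^{N} a_{0,i}\,y_i\,\varphi(x_i)$$
where the $a_{0,i}$ are the optimal Lagrange multipliers, determined by maximizing the following objective function (with $d_i = y_i$ the class labels)
$$Q(a) = \sum_{i=1}^{N} a_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j d_i d_j\,\varphi(x_i)^T\varphi(x_j)$$
subject to the constraints
$$\sum_{i=1}^{N} a_i y_i = 0;\qquad a_i \ge 0$$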
![Page 62: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/62.jpg)
SVM building procedure:
1. Pick a nonlinear mapping $\varphi$
2. Solve for the optimal weight vector
However: how do we pick the function $\varphi$?
• In practical applications, finding $\varphi$ is very hard, if not totally impossible
• In the previous example, the function $\varphi$ is quite complex: how would we find it?
Answer: the Kernel Trick
![Page 63: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/63.jpg)
Notice that in the dual problem the images of the input vectors are involved only through an inner product. This means that the optimization can be performed in the (lower-dimensional) input space, and that the inner product can be replaced by an inner-product kernel:
$$Q(a) = \sum_{i=1}^{N} a_i - \frac{1}{2}\sum_{i,j=1}^{N} a_i a_j d_i d_j\,\varphi(x_i)^T\varphi(x_j)$$
$$= \sum_{i=1}^{N} a_i - \frac{1}{2}\sum_{i,j=1}^{N} a_i a_j d_i d_j\,K(x_i, x_j)$$
How do we relate the output of the SVM to the kernel K?
Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation.
![Page 64: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/64.jpg)
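Hyperplane is defined by
$$\sum_{j=1}^{m_1} w_j\,\varphi_j(x) + b = 0$$
or
$$\sum_{j=0}^{m_1} w_j\,\varphi_j(x) = 0,\qquad \text{where } \varphi_0(x) = 1$$
writing
$$\varphi(x) = [\varphi_0(x),\ \varphi_1(x),\ \dots,\ \varphi_{m_1}(x)]^T$$
we get
$$w^T\varphi(x) = 0$$
and from the optimality conditions
$$w = \sum_{i=1}^{N} a_i d_i\,\varphi(x_i)$$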
![Page 65: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/65.jpg)
Thus
$$\sum_{i=1}^{N} a_i d_i\,\varphi^T(x_i)\,\varphi(x) = 0$$
and so the boundary is
$$\sum_{i=1}^{N} a_i d_i\,K(x, x_i) = 0$$
and the output is
$$w^T\varphi(x) = \sum_{i=1}^{N} a_i d_i\,K(x, x_i)$$
where
$$K(x, x_i) = \sum_{j=0}^{m_1} \varphi_j(x)\,\varphi_j(x_i)$$
![Page 66: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/66.jpg)
In the XOR problem, we chose to use the kernel function:
$$K(x, x_i) = (x^T x_i + 1)^2 = 1 + x_1^2 x_{i1}^2 + 2\,x_1 x_2\,x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2\,x_1 x_{i1} + 2\,x_2 x_{i2}$$
which implied the form of our nonlinear functions:
$$\varphi(x) = (1,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2)^T$$
and:
$$\varphi(x_i) = (1,\ x_{i1}^2,\ \sqrt{2}\,x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\,x_{i1},\ \sqrt{2}\,x_{i2})^T$$
However, we did not need to calculate $\varphi$ at all. We could simply have used the kernel to calculate
$$Q(a) = \sum_i a_i - \tfrac{1}{2}\sum_i\sum_j a_i a_j d_i d_j\,K(x_i, x_j),$$
maximized it to solve for the $a_i$, and derived the hyperplane via:
$$\sum_{i=1}^{N} a_i d_i\,K(x, x_i) = 0$$
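To see that the kernel really does replace the feature map, the sketch below (illustrative only; function names are hypothetical) compares $K(x, x_i) = (x^T x_i + 1)^2$ with the explicit inner product $\varphi(x)^T\varphi(x_i)$, and evaluates the SVM output $\sum_i a_i d_i K(x, x_i)$ for the XOR data with $a_i = 1/8$ taken from the worked example.

```python
from math import sqrt, isclose

X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # XOR inputs
D = [-1, 1, 1, -1]                          # XOR labels d_i
A = [0.125, 0.125, 0.125, 0.125]            # optimal multipliers from the example

def phi(x):
    """Explicit 6-D feature map (only used here for comparison)."""
    x1, x2 = x
    r2 = sqrt(2.0)
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

def kernel(x, xi):
    """Degree-2 polynomial kernel (x.xi + 1)^2, computed in the 2-D input space."""
    return (x[0] * xi[0] + x[1] * xi[1] + 1.0) ** 2

def output(x):
    """SVM output sum_i a_i d_i K(x, x_i); phi is never evaluated."""
    return sum(a * d * kernel(x, xi) for a, d, xi in zip(A, D, X))

u, v = (0.3, -1.2), (2.0, 0.7)              # arbitrary test points
explicit = sum(p * q for p, q in zip(phi(u), phi(v)))
via_kernel = kernel(u, v)
```

For any input, `output(x)` should agree with the boundary function $-x_1 x_2$ found earlier.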
![Page 67: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/67.jpg)
We therefore only need a suitable choice of kernel function; cf. Mercer’s Theorem:
Let K(x,y) be a continuous symmetric kernel that is defined on the closed interval [a,b]. The kernel K can be expanded in the form
$$K(x, y) = \varphi(x)^T\varphi(y)$$
provided it is positive definite. Some of the usual choices for K are:
• Polynomial SVM: $(x^T x_i + 1)^p$, with p specified by the user
• RBF SVM: $\exp\left(-\frac{1}{2\sigma^2}\,\|x - x_i\|^2\right)$, with $\sigma^2$ specified by the user
• MLP SVM: $\tanh(s_0\,x^T x_i + s_1)$
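The three kernels can be written down directly. The sketch below is an illustration under stated assumptions: the default parameter values (`p=2`, `sigma2=1.0`, `s0=1.0`, `s1=-1.0`) and the random test points are arbitrary choices, not values from the slides. It also spot-checks the positive-definiteness condition by evaluating the quadratic form $z^T G z$ of the Gram matrix on random vectors; this is a sanity check, not a proof.

```python
import math
import random

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def polynomial_kernel(x, xi, p=2):
    return (dot(x, xi) + 1.0) ** p

def rbf_kernel(x, xi, sigma2=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sigma2))

def mlp_kernel(x, xi, s0=1.0, s1=-1.0):
    # Not checked below: tanh is not guaranteed to pass this test for every s0, s1.
    return math.tanh(s0 * dot(x, xi) + s1)

def gram(kernel, points):
    """Gram matrix G[i][j] = K(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(G, z):
    n = len(z)
    return sum(z[i] * G[i][j] * z[j] for i in range(n) for j in range(n))

random.seed(0)
points = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
probes = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(200)]

smallest = {}
for name, k in [("polynomial", polynomial_kernel), ("rbf", rbf_kernel)]:
    G = gram(k, points)
    smallest[name] = min(quadratic_form(G, z) for z in probes)
```

A negative entry in `smallest` (beyond round-off) would indicate a kernel that violates the condition on this sample.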
![Page 68: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/68.jpg)
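An Equivalent Dual QP
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}\qquad\text{where } Q_{kl} = y_k y_l\,(x_k\cdot x_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C\ \ \forall k,\qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$
Then define:
$$w = \sum_{k=1}^{R}\alpha_k y_k x_k$$
$$b = y_K(1 - \varepsilon_K) - x_K\cdot w_K\qquad\text{where } K = \arg\max_k \alpha_k$$
Then classify with:
f(x,w,b) = sign(w. x - b)
Datapoints with $\alpha_k > 0$ will be the support vectors, so the sum defining w only needs to be over the support vectors.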
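A minimal sketch of the quantities in this QP, assuming a tiny hypothetical dataset. The multipliers in `ALPHA` were derived by hand for these three points (two support vectors, one point with $\alpha_k = 0$); the code does not solve the QP, it only evaluates the objective, checks the constraints, and builds w from the support vectors.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# Hypothetical toy problem (not from the slides).
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
Y = [1, -1, 1]
ALPHA = [0.25, 0.25, 0.0]   # hand-derived optimum; the third point is not a support vector
C = 10.0                    # arbitrary upper bound for the box constraint

def dual_objective(alpha, X, Y):
    """sum_k alpha_k - 1/2 sum_k sum_l alpha_k alpha_l Q_kl, Q_kl = y_k y_l (x_k . x_l)."""
    R = len(X)
    quad = sum(alpha[k] * alpha[l] * Y[k] * Y[l] * dot(X[k], X[l])
               for k in range(R) for l in range(R))
    return sum(alpha) - 0.5 * quad

def feasible(alpha, Y, C):
    """Check 0 <= alpha_k <= C and sum_k alpha_k y_k = 0."""
    box = all(0.0 <= a <= C for a in alpha)
    balance = abs(sum(a * y for a, y in zip(alpha, Y))) < 1e-12
    return box and balance

def weight_vector(alpha, X, Y):
    """w = sum over support vectors (alpha_k > 0) of alpha_k y_k x_k."""
    support = [k for k, a in enumerate(alpha) if a > 0]
    w = [sum(alpha[k] * Y[k] * X[k][i] for k in support) for i in range(len(X[0]))]
    return w, support

w, support = weight_vector(ALPHA, X, Y)
best = dual_objective(ALPHA, X, Y)
```

On this toy set the dual objective at `ALPHA` equals $\tfrac{1}{2}\,w\cdot w$, and nearby feasible multipliers give a lower value.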
![Page 69: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/69.jpg)
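Quadratic Basis Functions
$$\Phi(x) = \left(1,\ \ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \dots,\ \sqrt{2}\,x_m,\ \ x_1^2,\ x_2^2,\ \dots,\ x_m^2,\ \ \sqrt{2}\,x_1x_2,\ \sqrt{2}\,x_1x_3,\ \dots,\ \sqrt{2}\,x_1x_m,\ \sqrt{2}\,x_2x_3,\ \dots,\ \sqrt{2}\,x_{m-1}x_m\right)^T$$
• Constant Term: $1$
• Linear Terms: $\sqrt{2}\,x_1, \dots, \sqrt{2}\,x_m$
• Pure Quadratic Terms: $x_1^2, \dots, x_m^2$
• Quadratic Cross-Terms: $\sqrt{2}\,x_1x_2, \dots, \sqrt{2}\,x_{m-1}x_m$
Number of terms (assuming m input dimensions) = (m+2)-choose-2
= (m+2)(m+1)/2
= (as near as makes no difference) $m^2/2$
You may be wondering what those $\sqrt{2}$’s are doing.
• You should be happy that they do no harm
• You’ll find out why they’re there soon.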
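As a sketch, the quadratic basis can be generated for any m and its length compared with the (m+2)-choose-2 count. The function name `quadratic_basis` is made up for this example.

```python
from math import sqrt, comb

def quadratic_basis(x):
    """Phi(x): constant, linear, pure quadratic and cross terms, in that order."""
    m = len(x)
    r2 = sqrt(2.0)
    constant = [1.0]
    linear = [r2 * xi for xi in x]
    pure_quadratic = [xi * xi for xi in x]
    cross = [r2 * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return constant + linear + pure_quadratic + cross

# Number of terms for a few input dimensions m.
term_counts = {m: len(quadratic_basis([0.0] * m)) for m in (1, 2, 3, 10, 100)}
```

For m = 100 this gives 5151 terms, close to the $m^2/2 = 5000$ approximation.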
![Page 70: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/70.jpg)
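QP with basis functions
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}\qquad\text{where } Q_{kl} = y_k y_l\,(\Phi(x_k)\cdot\Phi(x_l))$$
Subject to these constraints:
$$0 \le \alpha_k \le C\ \ \forall k,\qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$
Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k\,\Phi(x_k)$$
$$b = y_K(1 - \varepsilon_K) - x_K\cdot w_K\qquad\text{where } K = \arg\max_k \alpha_k$$
Then classify with:
f(x,w,b) = sign(w. $\Phi$(x) - b)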
![Page 71: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/71.jpg)
QP with basis functions
$$Q_{kl} = y_k y_l\,(\Phi(x_k)\cdot\Phi(x_l))$$
We must do $R^2/2$ dot products to get this matrix ready.
Each dot product requires $m^2/2$ additions and multiplications.
The whole thing costs $R^2 m^2/4$. Yeeks!
…or does it?
![Page 72: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/72.jpg)
Quadratic Dot Products
With $\Phi(a)$ and $\Phi(b)$ the quadratic basis vectors of a and b (constant, linear, pure quadratic and cross terms):
$$\Phi(a)\cdot\Phi(b) = 1 + \sum_{i=1}^{m} 2\,a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j$$
![Page 73: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/73.jpg)
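Quadratic Dot Products
$$\Phi(a)\cdot\Phi(b) = 1 + 2\sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j$$
Just out of casual, innocent, interest, let’s look at another function of a and b:
$$(a\cdot b + 1)^2 = (a\cdot b)^2 + 2\,a\cdot b + 1$$
$$= \left(\sum_{i=1}^{m} a_i b_i\right)^2 + 2\sum_{i=1}^{m} a_i b_i + 1$$
$$= \sum_{i=1}^{m}\sum_{j=1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1$$
$$= \sum_{i=1}^{m} (a_i b_i)^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1$$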
![Page 74: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/74.jpg)
Quadratic Dot Products
$$\Phi(a)\cdot\Phi(b) = 1 + 2\sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + \sum_{i=1}^{m}\sum_{j=i+1}^{m} 2\,a_i a_j b_i b_j = (a\cdot b + 1)^2$$
They’re the same!
And this is only O(m) to compute!
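A quick numerical check of this identity, as a sketch: the explicit dot product of the two quadratic basis vectors is compared with $(a\cdot b + 1)^2$ for random vectors. The dimension `m = 20` and the random seed are arbitrary choices for illustration.

```python
import random
from math import sqrt, isclose

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def quadratic_basis(x):
    """Phi(x) with the sqrt(2) factors on the linear and cross terms."""
    m = len(x)
    r2 = sqrt(2.0)
    return ([1.0]
            + [r2 * xi for xi in x]
            + [xi * xi for xi in x]
            + [r2 * x[i] * x[j] for i in range(m) for j in range(i + 1, m)])

random.seed(1)
m = 20
a = [random.uniform(-1, 1) for _ in range(m)]
b = [random.uniform(-1, 1) for _ in range(m)]

explicit = dot(quadratic_basis(a), quadratic_basis(b))   # about m^2/2 multiplications
shortcut = (dot(a, b) + 1.0) ** 2                         # about m multiplications
```

Without the $\sqrt{2}$ factors the two values would differ, which is why they were put into the basis.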
![Page 75: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/75.jpg)
QP with Quadratic basis functions
$$Q_{kl} = y_k y_l\,(\Phi(x_k)\cdot\Phi(x_l))$$
We must do $R^2/2$ dot products to get this matrix ready.
Each dot product now only requires m additions and multiplications.
![Page 76: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/76.jpg)
Higher Order Polynomials

| Polynomial | $\Phi(x)$ | Cost to build $Q_{kl}$ matrix traditionally | Cost if 100 inputs | $\Phi(a)\cdot\Phi(b)$ | Cost to build $Q_{kl}$ matrix efficiently | Cost if 100 inputs |
|---|---|---|---|---|---|---|
| Quadratic | All $m^2/2$ terms up to degree 2 | $m^2 R^2/4$ | 2,500 $R^2$ | $(a\cdot b+1)^2$ | $m R^2/2$ | 50 $R^2$ |
| Cubic | All $m^3/6$ terms up to degree 3 | $m^3 R^2/12$ | 83,000 $R^2$ | $(a\cdot b+1)^3$ | $m R^2/2$ | 50 $R^2$ |
| Quartic | All $m^4/24$ terms up to degree 4 | $m^4 R^2/48$ | 1,960,000 $R^2$ | $(a\cdot b+1)^4$ | $m R^2/2$ | 50 $R^2$ |
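The cost formulas in the table can be evaluated directly. This is a rough sketch with hypothetical function names; it returns the coefficient of $R^2$ only. Note that $m^4/48$ for m = 100 comes out near 2,083,000 rather than the 1,960,000 quoted in the table, so the tabulated figures are best read as approximate.

```python
from math import factorial

def explicit_cost(m, degree):
    """Coefficient of R^2 when building Q_kl from explicit basis vectors:
    about m^degree / degree! terms per dot product, R^2 / 2 dot products."""
    return m ** degree / (2 * factorial(degree))

def kernel_cost(m):
    """Coefficient of R^2 when each dot product is computed as (a.b + 1)^degree."""
    return m / 2

m = 100
costs = {degree: explicit_cost(m, degree) for degree in (2, 3, 4)}
speedup = {degree: costs[degree] / kernel_cost(m) for degree in costs}
```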
![Page 77: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/77.jpg)
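QP with Quintic basis functions
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}\qquad\text{where } Q_{kl} = y_k y_l\,(\Phi(x_k)\cdot\Phi(x_l))$$
Subject to these constraints:
$$0 \le \alpha_k \le C\ \ \forall k,\qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$
Then define:
$$w = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k\,\Phi(x_k)$$
$$b = y_K(1 - \varepsilon_K) - x_K\cdot w_K\qquad\text{where } K = \arg\max_k \alpha_k$$
Then classify with:
f(x,w,b) = sign(w. $\Phi$(x) - b)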
![Page 80: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/80.jpg)
QP with Quintic basis functions
$$Q_{kl} = y_k y_l\,(\Phi(x_k)\cdot\Phi(x_l))$$
We must do $R^2/2$ dot products to get this matrix ready.
In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms.
  The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  Because each w. $\Phi$(x) (see below) needs 75 million operations. What can be done?
$$w\cdot\Phi(x) = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k\,\Phi(x_k)\cdot\Phi(x) = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k\,(x_k\cdot x + 1)^5$$
Only Sm operations (S = #support vectors)
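A sketch of the evaluation shortcut, under stated assumptions: the training points and multipliers below are hypothetical (they are not the solution of any QP), chosen only so that some $\alpha_k$ are zero. The point is that $w\cdot\Phi(x)$ can be computed from kernel evaluations against the S support vectors alone, without ever forming $\Phi$.

```python
from math import isclose

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def quintic_kernel(a, b):
    """(a.b + 1)^5: about m multiplications instead of a huge explicit basis."""
    return (dot(a, b) + 1.0) ** 5

# Hypothetical training set and multipliers (illustration only).
X = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0), (-1.0, -1.0), (0.5, -0.5)]
Y = [1, -1, 1, -1, 1]
ALPHA = [0.4, 0.7, 0.0, 0.0, 0.3]

all_terms = list(zip(ALPHA, Y, X))
support_terms = [(a, y, xk) for a, y, xk in all_terms if a > 0]

def w_dot_phi(x, terms):
    """sum_k alpha_k y_k (x_k . x + 1)^5 over the given terms."""
    return sum(a * y * quintic_kernel(xk, x) for a, y, xk in terms)

x_new = (0.25, 0.75)
full = w_dot_phi(x_new, all_terms)          # R kernel evaluations
sparse = w_dot_phi(x_new, support_terms)    # only S kernel evaluations
```

Both sums agree, so prediction cost scales with the number of support vectors rather than with the size of the basis.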
![Page 82: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/82.jpg)
QP with Quintic basis functions
Why SVMs don’t overfit as much as you’d think:
• No matter what the basis function, there are really only up to R parameters: $\alpha_1, \alpha_2, \dots, \alpha_R$, and usually most are set to zero by the Maximum Margin.
• Asking for small w.w is like “weight decay” in Neural Nets, like Ridge Regression parameters in Linear regression, and like the use of Priors in Bayesian Regression: all designed to smooth the function and reduce overfitting.
![Page 83: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/83.jpg)
SVM Kernel Functions
• $K(a,b) = (a\cdot b + 1)^d$ is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
– Radial-Basis-style Kernel Function:
$$K(a,b) = \exp\left(-\frac{(a-b)^2}{2\sigma^2}\right)$$
– Neural-net-style Kernel Function:
$$K(a,b) = \tanh(\kappa\,a\cdot b - \delta)$$
$\sigma$, $\kappa$ and $\delta$ are magic parameters that must be chosen by a model selection method such as CV or VCSRM
![Page 84: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/84.jpg)
SVM Implementations
• Sequential Minimal Optimization, SMO, efficient implementation of SVMs, Platt – in Weka
• SVMlight
– http://svmlight.joachims.org/
![Page 85: Introduction to SVMs. SVMs Geometric –Maximizing Margin Kernel Methods –Making nonlinear decision boundaries linear –Efficiently! Capacity –Structural](https://reader036.vdocuments.mx/reader036/viewer/2022062518/56649ea75503460f94ba9724/html5/thumbnails/85.jpg)
References
• Tutorial on VC-dimension and Support Vector Machines: C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998