CSE 881: Data Mining
Lecture 10: Support Vector Machines
Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
One possible solution
[Figure: a separating hyperplane labeled B1]
Support Vector Machines
Another possible solution
[Figure: a different separating hyperplane labeled B2]
Support Vector Machines
Other possible solutions
[Figure: several more candidate hyperplanes, including B2]
Support Vector Machines
Which one is better? B1 or B2? How do you define better?
[Figure: hyperplanes B1 and B2 drawn on the same data]
Support Vector Machines
Find the hyperplane that maximizes the margin ⇒ B1 is better than B2
[Figure: B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; B1 has the wider margin]
SVM vs Perceptron
Perceptron: minimizes the least-squares error (via gradient descent)
SVM: maximizes the margin
Support Vector Machines
For boundary $B_1$, the decision boundary is the hyperplane $\mathbf{w}\cdot\mathbf{x} + b = 0$; its margin hyperplanes are $b_{11}$: $\mathbf{w}\cdot\mathbf{x} + b = 1$ and $b_{12}$: $\mathbf{w}\cdot\mathbf{x} + b = -1$.

$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b \le -1 \end{cases}$$

$$\text{Margin} = \frac{2}{\|\mathbf{w}\|}$$
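The margin formula can be checked in one step (a standard derivation, not spelled out on the slide): pick a point on each margin hyperplane and project their difference onto the unit normal.

```latex
% x_1 on b_11 and x_2 on b_12 satisfy
%   w . x_1 + b = 1   and   w . x_2 + b = -1.
% Subtracting gives w . (x_1 - x_2) = 2, so projecting onto w/||w||:
\text{Margin}
  = \frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot (\mathbf{x}_1 - \mathbf{x}_2)
  = \frac{2}{\|\mathbf{w}\|}
```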
Learning Linear SVM
We want to maximize: $\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$
– This is equivalent to minimizing: $L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2}$
– But subject to the following constraints:
$$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \le -1 \end{cases}$$
or, equivalently,
$$y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1, 2, \ldots, N$$
This is a constrained optimization problem – solve it using the Lagrange multiplier method. A numerical sketch of the primal problem follows below.
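To make the primal problem concrete, here is a minimal sketch that solves it with an off-the-shelf constrained optimizer. The four-point dataset is made up for illustration, and SciPy's general-purpose solver stands in for the dedicated QP methods used in practice.

```python
# Minimize ||w||^2 / 2 subject to y_i (w . x_i + b) >= 1 on a toy dataset.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]                       # params = (w1, w2, b)
    return 0.5 * np.dot(w, w)            # L(w) = ||w||^2 / 2

constraints = [
    {"type": "ineq",                     # y_i (w . x_i + b) - 1 >= 0
     "fun": lambda p, i=i: y[i] * (np.dot(p[:2], X[i]) + p[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
```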
Unconstrained Optimization
min $E(w_1, w_2) = w_1^2 + w_2^2$

Set the derivatives to zero:
$$\frac{\partial E}{\partial w_1} = 2w_1 = 0 \;\Rightarrow\; w_1 = 0, \qquad \frac{\partial E}{\partial w_2} = 2w_2 = 0 \;\Rightarrow\; w_2 = 0$$

[Figure: bowl-shaped surface of $E(w_1, w_2)$ over the $(w_1, w_2)$ plane]

$w_1 = 0, w_2 = 0$ minimizes $E(w_1, w_2)$; the minimum value is $E(0,0) = 0$.
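The same minimum can also be reached iteratively, in the spirit of the perceptron comparison earlier. A minimal gradient-descent sketch (the starting point and step size are arbitrary choices):

```python
# Gradient descent on E(w1, w2) = w1^2 + w2^2; the gradient is dE/dw = 2w.
import numpy as np

w = np.array([3.0, -2.0])     # arbitrary starting point
eta = 0.1                     # arbitrary learning rate
for _ in range(100):
    w -= eta * (2 * w)        # step against the gradient
print(w)                      # approaches the analytic minimum [0, 0]
```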
Constrained Optimization
min $E(w_1, w_2) = w_1^2 + w_2^2$
subject to $2w_1 + w_2 = 4$ (equality constraint)

Form the Lagrangian:
$$L(w_1, w_2, \lambda) = w_1^2 + w_2^2 + \lambda\,(4 - 2w_1 - w_2)$$

Set the partial derivatives to zero:
$$\frac{\partial L}{\partial w_1} = 2w_1 - 2\lambda = 0 \;\Rightarrow\; w_1 = \lambda$$
$$\frac{\partial L}{\partial w_2} = 2w_2 - \lambda = 0 \;\Rightarrow\; w_2 = \frac{\lambda}{2}$$
$$\frac{\partial L}{\partial \lambda} = 4 - 2w_1 - w_2 = 0 \;\Rightarrow\; 2\lambda + \frac{\lambda}{2} = 4 \;\Rightarrow\; \lambda = \frac{8}{5}$$

Hence $w_1 = \dfrac{8}{5}$, $w_2 = \dfrac{4}{5}$, and $E(w_1, w_2) = \dfrac{16}{5}$.

Dual problem: substitute $w_1 = \lambda$ and $w_2 = \lambda/2$ back into $L$:
$$\max_{\lambda}\; L = \lambda^2 + \frac{\lambda^2}{4} + \lambda\left(4 - 2\lambda - \frac{\lambda}{2}\right) = 4\lambda - \frac{5\lambda^2}{4}$$

[Figure: surface of $E(w_1, w_2)$ with the constraint line $2w_1 + w_2 = 4$]
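This worked example is easy to verify symbolically. A minimal SymPy sketch (the variable names are mine):

```python
# Solve the stationarity conditions of the Lagrangian symbolically.
import sympy as sp

w1, w2, lam = sp.symbols("w1 w2 lam")
L = w1**2 + w2**2 + lam * (4 - 2*w1 - w2)             # the Lagrangian above
sol = sp.solve([sp.diff(L, v) for v in (w1, w2, lam)], [w1, w2, lam])
print(sol)                                            # {w1: 8/5, w2: 4/5, lam: 8/5}
print((w1**2 + w2**2).subs(sol))                      # minimum value: 16/5
```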
Learning Linear SVM
Lagrange multiplier formulation:
$$L = \frac{\|\mathbf{w}\|^2}{2} - \sum_{i=1}^{N} \lambda_i \left[\, y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \,\right]$$

– Take the derivatives w.r.t. $\mathbf{w}$ and $b$ and set them to zero:
$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \lambda_i y_i = 0$$

– Additional constraints:
$$\lambda_i \ge 0, \qquad \lambda_i \left[\, y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \,\right] = 0$$

– Dual problem (a numerical sketch follows below):
$$\max_{\boldsymbol{\lambda}}\; L_D = \sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)$$
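A minimal sketch of solving this dual numerically on the same toy dataset used above; SciPy's general-purpose solver again stands in for the dedicated QP or SMO solvers used in practice.

```python
# Maximize the dual L_D subject to lambda_i >= 0 and sum_i lambda_i y_i = 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)     # Q_ij = y_i y_j (x_i . x_j)

def neg_dual(lam):                            # minimize -L_D
    return -(lam.sum() - 0.5 * lam @ Q @ lam)

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                       # lambda_i >= 0
               constraints={"type": "eq", "fun": lambda l: l @ y})  # sum lambda_i y_i = 0
lam = res.x
w = (lam * y) @ X                             # w = sum_i lambda_i y_i x_i
sv = lam > 1e-6                               # support vectors: lambda_i > 0
b = np.mean(y[sv] - X[sv] @ w)                # from y_i (w . x_i + b) = 1
print("lambda =", lam.round(4), "w =", w, "b =", b)
```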
Learning Linear SVM
Bigger picture:
– The learning algorithm needs to find $\mathbf{w}$ and $b$
– To solve for $\mathbf{w}$:
$$\mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i$$
– But the condition $\lambda_i \left[\, y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \,\right] = 0$ forces $\lambda_i$ to be zero for points that do not reside on the margin hyperplanes $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) = 1$
Data points whose $\lambda_i$'s are not zero are called support vectors
Example of Linear SVM
| $x_1$ | $x_2$ | $y$ | $\lambda$ |
|--------|--------|----|----------|
| 0.3858 | 0.4687 | 1  | 65.5261  |
| 0.4871 | 0.6110 | -1 | 65.5261  |
| 0.9218 | 0.4103 | -1 | 0        |
| 0.7382 | 0.8936 | -1 | 0        |
| 0.1763 | 0.0579 | 1  | 0        |
| 0.4057 | 0.3529 | 1  | 0        |
| 0.9355 | 0.8132 | -1 | 0        |
| 0.2146 | 0.0099 | 1  | 0        |

The two points with nonzero $\lambda$ (the first two rows) are the support vectors.
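From this table, $\mathbf{w}$ and $b$ follow directly from the two support vectors. A quick verification sketch:

```python
# Recover w and b from the two support vectors in the table above.
import numpy as np

X_sv = np.array([[0.3858, 0.4687], [0.4871, 0.6110]])
y_sv = np.array([1.0, -1.0])
lam = np.array([65.5261, 65.5261])

w = (lam * y_sv) @ X_sv        # w = sum_i lambda_i y_i x_i
b = y_sv[0] - w @ X_sv[0]      # from y_1 (w . x_1 + b) = 1 with y_1 = 1
print(w, b)                    # approximately w = (-6.64, -9.32), b = 7.93
```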
Learning Linear SVM
Bigger picture…
– The decision boundary depends only on the support vectors: given another data set with the same support vectors, the decision boundary will not change
– How to classify using the SVM once $\mathbf{w}$ and $b$ are found? Given a test record $\mathbf{x}_i$, predict class 1 if $\mathbf{w}\cdot\mathbf{x}_i + b \ge 0$ and class $-1$ otherwise:
$$f(\mathbf{x}_i) = \operatorname{sign}(\mathbf{w}\cdot\mathbf{x}_i + b)$$
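For comparison, the whole workflow is available off the shelf; a minimal sketch with scikit-learn (not part of the lecture itself; the toy dataset is made up, and a large C approximates the hard margin):

```python
# Fit a linear SVM, inspect its support vectors, and classify a test record.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print(clf.support_vectors_)                   # the support vectors
print(clf.coef_, clf.intercept_)              # w and b
print(clf.predict([[1.5, 1.0]]))              # sign(w . x + b)
```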
Support Vector Machines
What if the problem is not linearly separable?
Support Vector Machines
What if the problem is not linearly separable?
– Introduce slack variables $\xi_i$

Need to minimize:
$$L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{N} \xi_i^{\,k}$$

Subject to:
$$f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \le -1 + \xi_i \end{cases}$$

If $k$ is 1 or 2, this leads to the same objective function as the linear SVM but with different constraints (see textbook)
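The parameter $C$ controls the trade-off between margin width and the slack penalty. A minimal sketch (the overlapping dataset and the two $C$ values are arbitrary illustrations):

```python
# Smaller C tolerates more margin violations; larger C penalizes them harder.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.2, 0.1],
              [0.0, 0.0], [-1.0, 0.5], [1.1, 1.2]])
y = np.array([1, 1, 1, -1, -1, -1])    # the last point overlaps the + class

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_)           # support-vector counts per class
```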
Nonlinear Support Vector Machines
What if decision boundary is not linear?
Nonlinear Support Vector Machines
Trick: transform the data into a higher-dimensional space via a mapping $\Phi$

Decision boundary:
$$\mathbf{w}\cdot\Phi(\mathbf{x}) + b = 0$$
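A minimal sketch of the idea with an explicit map (the 1-D dataset is made up, and $\Phi(x) = (x, x^2)$ is just one simple choice of mapping):

```python
# Data not separable by a threshold on x become linearly separable in (x, x^2).
import numpy as np
from sklearn.svm import SVC

x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.where(np.abs(x) > 1.0, 1, -1)    # class depends on |x|: not linear in x

Phi = np.column_stack([x, x**2])        # explicit mapping Phi(x) = (x, x^2)
clf = SVC(kernel="linear").fit(Phi, y)
print(clf.score(Phi, y))                # 1.0: separable after the mapping
```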
Learning Nonlinear SVM
Optimization problem:
$$\min_{\mathbf{w}} \frac{\|\mathbf{w}\|^2}{2} \quad \text{subject to} \quad y_i\left(\mathbf{w}\cdot\Phi(\mathbf{x}_i) + b\right) \ge 1, \qquad i = 1, \ldots, N$$

This leads to the same set of equations as before, but they involve $\Phi(\mathbf{x})$ instead of $\mathbf{x}$.
Learning Nonlinear SVM
Issues:
– What type of mapping function $\Phi$ should be used?
– How to do the computation in high-dimensional space? Most computations involve the dot product $\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j)$, which seems to run into the curse of dimensionality
Learning Nonlinear SVM
Kernel Trick:
– $\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$
– $K(\mathbf{x}_i, \mathbf{x}_j)$ is a kernel function, expressed in terms of the coordinates in the original space. Examples include the polynomial kernel $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}\cdot\mathbf{y} + 1)^p$ and the Gaussian (RBF) kernel $K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2 / (2\sigma^2)}$; the first is verified numerically below
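A quick numerical check that the degree-2 polynomial kernel really is a dot product in a transformed space (the explicit map below is the standard one for 2-D inputs):

```python
# Verify that K(x, y) = (x . y + 1)^2 equals phi(x) . phi(y); training never
# needs phi, but we compute both sides here once just to compare them.
import numpy as np

def phi(v):
    # Explicit feature map for (x . y + 1)^2 with 2-D inputs
    v1, v2 = v
    return np.array([1.0,
                     np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1**2, v2**2,
                     np.sqrt(2) * v1 * v2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])
print(phi(x) @ phi(y))        # dot product in the 6-D transformed space
print((x @ y + 1.0) ** 2)     # kernel in the original 2-D space: same value
```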
Example of Nonlinear SVM
[Figure: decision boundary of an SVM with a degree-2 polynomial kernel]
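A minimal sketch reproducing this kind of result (the circular dataset is made up; scikit-learn's `poly` kernel computes $(\gamma\,\mathbf{x}_i\cdot\mathbf{x}_j + c_0)^{\text{degree}}$):

```python
# Fit an SVM with a degree-2 polynomial kernel to data with a circular boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X**2).sum(axis=1) < 0.5, 1, -1)    # circle of radius sqrt(0.5)

clf = SVC(kernel="poly", degree=2, coef0=1.0).fit(X, y)
print(clf.score(X, y))                           # near 1.0 on this data
```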
Learning Nonlinear SVM
Advantages of using a kernel:
– Don't have to know the mapping function $\Phi$
– Computing the dot product $\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j)$ in the original space avoids the curse of dimensionality

Not all functions can be kernels
– Must make sure there is a corresponding $\Phi$ in some high-dimensional space
– Mercer's theorem (see textbook)
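One practical consequence of Mercer's theorem: a valid kernel must produce a positive semidefinite Gram matrix on any finite sample. A minimal check (random data; the tolerance is an arbitrary choice):

```python
# Check the positive-semidefiniteness condition on a finite sample.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
K = (X @ X.T + 1.0) ** 2              # Gram matrix of the degree-2 poly kernel
eigvals = np.linalg.eigvalsh(K)       # eigenvalues of the symmetric matrix K
print(eigvals.min() >= -1e-9)         # True: no meaningfully negative eigenvalue
```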