18-661 Introduction to Machine Learning · 2020. 4. 27.
TRANSCRIPT
18-661 Introduction to Machine Learning
Support Vector Machines (SVM) – I
Spring 2020
ECE – Carnegie Mellon University
Midterm Information
Midterm will be on Wednesday, 2/26 in-class.
• Closed-book except for one double-sided letter-size handwritten page
of notes that you can prepare as you wish.
• We will provide formulas for relevant probability distributions.
• You will not need a calculator. Only pen/pencil and scratch paper
are allowed.
The exam will cover all topics presented in class through next Wednesday
(SVM and before).
• (1) point estimation/MLE/MAP, (2) linear regression, (3) naive
Bayes, (4) logistic regression, and (5) SVMs.
• Friday's recitation will go over practice questions.
• Understand all homework questions and derivations in
lecture/recitation.
Outline
1. Review of Non-linear classification boundary
2. Review of Multi-class Logistic Regression
3. Support Vector Machines (SVM): Intuition
4. SVM: Max Margin Formulation
5. SVM: Hinge Loss Formulation
6. Equivalence of These Two Formulations
Review of Non-linear
classification boundary
How to handle more complex decision boundaries?
• This data is not linearly separable in the original feature space
• Use non-linear basis functions to add more features, so that the data
hopefully becomes linearly separable in the “augmented” space.
Adding polynomial features
• New feature vector is x = [1, x1, x2, x1², x2²]
• Pr(y = 1|x) = σ(w0 + w1x1 + w2x2 + w3x1² + w4x2²)
• If w = [−1, 0, 0, 1, 1], the boundary is −1 + x1² + x2² = 0
• If −1 + x1² + x2² ≥ 0, declare spam
• If −1 + x1² + x2² < 0, declare ham
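The spam/ham rule above can be sketched in numpy; the `augment` helper and the toy points are illustrative, not from the slides:

```python
import numpy as np

def augment(X):
    """Map [x1, x2] -> [1, x1, x2, x1^2, x2^2], the polynomial basis above."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones_like(x1), x1, x2, x1**2, x2**2], axis=1)

# Weights from the slide: the boundary -1 + x1^2 + x2^2 = 0 is the unit circle
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

X = np.array([[0.0, 0.0],   # inside the circle
              [2.0, 0.0]])  # outside the circle
scores = augment(X) @ w                       # -1 + x1^2 + x2^2 per point
labels = np.where(scores >= 0, "spam", "ham")
```

The linear classifier in the augmented space thus draws a circular boundary in the original (x1, x2) space.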
Solution to Overfitting: Regularization
• Add a regularization term to the cross-entropy loss function:
E(w) = −∑n {yn log σ(w>xn) + (1 − yn) log[1 − σ(w>xn)]} + (λ/2)‖w‖₂²
(the last term is the regularization)
• Perform gradient descent on this regularized function
• Often, we do NOT regularize the bias term w0
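A minimal sketch of one gradient-descent step on this regularized loss (the toy data and step size are illustrative); note that the bias entry of the regularizer is zeroed out, matching the bullet above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_grad(w, X, y, lam):
    """Gradient of the L2-regularized cross-entropy loss above.
    X has a leading column of ones; w[0] is the bias and is NOT regularized."""
    grad = X.T @ (sigmoid(X @ w) - y)   # data term
    reg = lam * w
    reg[0] = 0.0                        # do not regularize w0
    return grad + reg

# One gradient-descent step on toy data
X = np.array([[1.0, 0.0], [1.0, 2.0], [1.0, -1.0]])
y = np.array([0.0, 1.0, 0.0])
w = np.zeros(2)
w = w - 0.1 * reg_grad(w, X, y, lam=1.0)
```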
Outline
1. Review of Non-linear classification boundary
2. Review of Multi-class Logistic Regression
3. Support Vector Machines (SVM): Intuition
4. SVM: Max Margin Formulation
5. SVM: Hinge Loss Formulation
6. Equivalence of These Two Formulations
Review of Multi-class Logistic
Regression
Three approaches
• One-versus-all
• One-versus-one
• Multinomial regression
The One-versus-Rest or One-Versus-All Approach
• For each class Ck, change the problem into binary classification:
1. Relabel training data with label Ck as positive (or ‘1’)
2. Relabel all the remaining data as negative (or ‘0’)
• Repeat this multiple times: Train K binary classifiers, using logistic
regression to differentiate the two classes each time
The One-versus-Rest or One-Versus-All Approach
How to combine these linear decision boundaries?
• Use the confidence estimates Pr(y = C1|x) = σ(w1>x), . . . ,
Pr(y = CK|x) = σ(wK>x)
• Declare the class Ck∗ that maximizes the posterior:
k∗ = arg max k=1,...,K Pr(y = Ck|x) = arg max k σ(wk>x)
Example: for a test point with Pr(Square) = 0.6, Pr(Circle) = 0.75, and
Pr(Triangle) = 0.2, declare Circle.
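As a sketch, the arg-max rule above in numpy (the weights and the test point are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ova_predict(W, x):
    """One-vs-all: row k of W is w_k; declare the class whose confidence
    sigma(w_k^T x) is largest (equivalently, whose w_k^T x is largest)."""
    return int(np.argmax(sigmoid(W @ x)))

# Illustrative weights for 3 classes and a test point
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
x = np.array([0.2, 0.9])
k_star = ova_predict(W, x)
```

Because σ is monotonic, maximizing σ(wk>x) is the same as maximizing wk>x.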
The One-Versus-One Approach
• For each pair of classes Ck and Ck′, change the problem into binary
classification:
1. Relabel training data with label Ck as positive (or ‘1’)
2. Relabel training data with label Ck′ as negative (or ‘0’)
3. Disregard all other data
The One-Versus-One Approach
• How many binary classifiers for K classes? K(K − 1)/2
• How to combine their outputs?
• Given x , count the K (K − 1)/2 votes from outputs of all binary
classifiers and declare the winner as the predicted class.
• Use confidence scores to resolve ties
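The voting step can be sketched as below; `pairwise_winner` is a stand-in for the trained binary classifiers, rigged here for illustration:

```python
import numpy as np
from itertools import combinations

def ovo_predict(pairwise_winner, K):
    """One-vs-one: poll all K(K-1)/2 pairwise classifiers and return the
    class with the most votes. `pairwise_winner(i, j)` stands in for a
    trained binary classifier and returns either i or j."""
    votes = np.zeros(K, dtype=int)
    for i, j in combinations(range(K), 2):
        votes[pairwise_winner(i, j)] += 1
    return int(np.argmax(votes))

# Illustrative: a rigged set of classifiers where class 2 wins every pairing
winner = ovo_predict(lambda i, j: 2 if j == 2 else i, K=3)
```

A real implementation would break ties with the classifiers' confidence scores, as the slide notes.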
Multinomial logistic regression (Perceptron)
• Model: For each class Ck, we have a parameter vector wk and
model the posterior probability as:
p(Ck|x) = e^(wk>x) / ∑k′ e^(wk′>x) ← This is called the softmax function
• Decision boundary / testing: Assign x the label with the maximum
posterior:
arg max k P(Ck|x) → arg max k wk>x
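The softmax posterior above can be sketched as follows (the parameter vectors and test point are illustrative); shifting the scores by their max before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(scores):
    """p(C_k|x) = exp(w_k^T x) / sum_k' exp(w_k'^T x); subtracting the max
    score first keeps the exponentials from overflowing."""
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative parameter vectors w_k (rows) and a test point
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
x = np.array([2.0, 0.0])
p = softmax(W @ x)
k_star = int(np.argmax(p))   # same label as arg max_k w_k^T x
```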
Parameter estimation
Discriminative approach: maximize the conditional likelihood
log P(D) = ∑n log P(yn|xn)
We will change yn to yn = [yn1 yn2 · · · ynK]>, a K-dimensional vector
using 1-of-K encoding:
ynk = 1 if yn = k, and 0 otherwise
Ex: if yn = 2, then yn = [0 1 0 0 · · · 0]>.
P(yn|xn) = ∏k=1..K P(Ck|xn)^ynk
= P(C1|xn)^yn1 P(C2|xn)^yn2 · · · P(CK|xn)^ynK
therefore, only the term corresponding to ynk = 1 will survive.
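The encoding and the "only one term survives" observation can be checked directly (the posterior values are illustrative):

```python
import numpy as np

def one_hot(y, K):
    """1-of-K encoding: row n has y_nk = 1 iff y_n = k (classes numbered
    1..K, matching the slide's example y_n = 2 -> [0 1 0 ... 0])."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), np.array(y) - 1] = 1.0
    return Y

Y = one_hot([2, 1], K=4)

# P(y_n|x_n) = prod_k P(C_k|x_n)^{y_nk}: only the true class's factor survives
posteriors = np.array([0.1, 0.6, 0.2, 0.1])      # illustrative P(C_k|x_n)
likelihood = float(np.prod(posteriors ** Y[0]))  # picks out P(C_2|x_n)
```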
Cross-entropy error function
∑n log P(yn|xn) = ∑n log ∏k=1..K P(Ck|xn)^ynk = ∑n ∑k ynk log P(Ck|xn)
Definition: negative log likelihood
E(w1, w2, . . . , wK) = −∑n ∑k ynk log P(Ck|xn)
= −∑n ∑k ynk log ( e^(wk>xn) / ∑k′ e^(wk′>xn) )
Properties
• Convex, therefore unique global optimum
• Optimization requires numerical procedures, analogous to those used
for binary logistic regression
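A minimal sketch of this error function (toy data; with all-zero weights the posteriors are uniform, so the loss is N log K):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, X, Y):
    """E(w_1,...,w_K) = -sum_n sum_k y_nk log P(C_k|x_n) with softmax
    posteriors; Y is the N x K one-hot label matrix, W the K x d weights."""
    P = softmax_rows(X @ W.T)
    return float(-np.sum(Y * np.log(P)))

# Two samples, two classes (values illustrative)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = cross_entropy(np.zeros((2, 2)), X, Y)   # uniform predictions
```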
Outline
1. Review of Non-linear classification boundary
2. Review of Multi-class Logistic Regression
3. Support Vector Machines (SVM): Intuition
4. SVM: Max Margin Formulation
5. SVM: Hinge Loss Formulation
6. Equivalence of These Two Formulations
Support Vector Machines
(SVM): Intuition
Why do we need SVM?
Alternative to Logistic Regression and Naive Bayes.
• Logistic regression and naive Bayes train over the whole dataset.
• These can require a lot of memory in high-dimensional settings.
• SVM can give a better and more efficient solution
• SVM is one of the most powerful and commonly used ML algorithms
Binary logistic regression
• We only need to know if p(x) > 0.5 or < 0.5.
• We don’t (always) need to know how far x is from this boundary.
How can we use this insight to improve the classification algorithm?
• What if we just looked at the boundary?
• Maybe then we could ignore some of the samples?
Advantages of SVM
We will see later that SVM:
1. Is less sensitive to outliers.
2. Maximizes the distance of training points from the boundary.
3. Scales better with high-dimensional data.
4. Only requires a subset of the training points.
5. Generalizes well to many nonlinear models.
Outline
1. Review of Non-linear classification boundary
2. Review of Multi-class Logistic Regression
3. Support Vector Machines (SVM): Intuition
4. SVM: Max Margin Formulation
5. SVM: Hinge Loss Formulation
6. Equivalence of These Two Formulations
SVM: Max Margin Formulation
Binary Classification: Finding a Linear Decision Boundary
• Input features x .
• Decision boundary is a hyperplane H : w>x + b = 0.
• All x satisfying w>x + b < 0 lie on the same side of the line and are
in the same “class.”
Intuition: Where to put the decision boundary?
• Consider a separable training dataset (e.g., with two features)
• There are an infinite number of decision boundaries
H : w>x + b = 0!
• Which one should we pick?
Intuition: Where to put the decision boundary?
Idea: Find a decision boundary in the ’middle’ of the two classes that:
• Perfectly classifies the training data
• Is as far away from every training point as possible
Let us apply this intuition to build a classifier that MAXIMIZES THE
MARGIN between training points and the decision boundary
First, we need to review some vector geometry
What is a hyperplane?
• General equation is w>x + b = 0
• Divides the space in half, i.e., w>x + b > 0 and w>x + b < 0
• A hyperplane is a line in 2D and a plane in 3D
• w ∈ R^d is a non-zero normal vector
Vector Norms and Inner Products
• Given two vectors w and x, what is their inner product?
• Inner Product w>x = w1x1 + w2x2 + · · ·+ wdxd
• Inner Product w>x is also equal to ||w|| ||x|| cos θ
• w>w = ||w||²
• If w and x are perpendicular, θ = π/2, and thus the inner product is
zero
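These identities are easy to verify numerically (the two vectors are illustrative):

```python
import numpy as np

w = np.array([3.0, 0.0])
x = np.array([2.0, 2.0])

dot = w @ x                        # w1*x1 + ... + wd*xd
cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))
theta = np.arccos(cos_theta)       # angle between w and x (pi/4 here)

# The same inner product via ||w|| ||x|| cos(theta)
dot_again = np.linalg.norm(w) * np.linalg.norm(x) * np.cos(theta)

perp = np.array([0.0, 5.0])        # perpendicular to w, so w @ perp = 0
```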
Normal vector of a hyperplane
Vector w is normal to the hyperplane. Why?
• If p and q are both on the line, then w>p + b = w>q + b = 0.
• Then w>(p − q) = w>p −w>q = −b − (−b) = 0
• p − q is an arbitrary vector parallel to the line, thus w is orthogonal to it
• w∗ = w/||w|| is the unit normal vector
Distance from a Hyperplane
How to find the distance from a to the hyperplane?
• We want to find the distance between a and the line in the direction of w∗.
• If we define a point a0 on the line, then this distance corresponds to the
length of a − a0 in the direction of w∗, which equals w∗>(a − a0)
• We know w>a0 = −b since w>a0 + b = 0.
• Then the distance equals (1/||w||)(w>a + b)
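The final expression can be sketched directly (the line and test point are illustrative):

```python
import numpy as np

def distance_to_hyperplane(a, w, b):
    """Signed distance (1/||w||)(w^T a + b) from point a to {x : w^T x + b = 0}."""
    return (w @ a + b) / np.linalg.norm(w)

# Illustrative line x1 + x2 - 2 = 0; the point (2, 2) sits sqrt(2) away
w = np.array([1.0, 1.0])
d = distance_to_hyperplane(np.array([2.0, 2.0]), w, -2.0)
```

Points on the hyperplane itself get distance 0, and points on the opposite side get a negative sign.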
Distance from a point to decision boundary
The unsigned distance from a point x to the decision boundary (hyperplane)
H is
dH(x) = |w>x + b| / ‖w‖2
We can remove the absolute value | · | by exploiting the fact that the
decision boundary classifies every point in the training dataset correctly.
Namely, (w>x + b) and x’s label y must have the same sign, so:
dH(x) = y[w>x + b] / ‖w‖2
Notation change from Logistic Regression
• Change of notation y = 0→ y = −1
• Separate the bias term b from w
Defining the Margin
Margin
Smallest distance between the hyperplane and all training points
margin(w, b) = minn yn[w>xn + b] / ‖w‖2
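This definition can be sketched on a separable toy set (the data and the chosen hyperplane are illustrative):

```python
import numpy as np

def margin(w, b, X, y):
    """margin(w, b) = min_n y_n (w^T x_n + b) / ||w||_2 over the training set."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Separable toy data around the boundary x1 = 0, labels in {-1, +1}
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-3.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0
m = margin(w, b, X, y)   # the closest points sit at distance 1
```

Note that rescaling (w, b) by a constant leaves this quantity unchanged, a fact the slides return to shortly.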
Optimizing the Margin
How should we pick (w , b) based on its margin?
We want a decision boundary that is as far away from all training points
as possible, so we want to maximize the margin!
max w,b ( minn yn[w>xn + b] / ‖w‖2 ) = max w,b ( (1/‖w‖2) minn yn[w>xn + b] )
Only involves points near the boundary (more on this later).
Scale of w
Margin
Smallest distance between the hyperplane and all training points
    margin(w, b) = minₙ yₙ(wᵀxₙ + b) / ‖w‖₂

Consider three hyperplanes:

    (w, b)    (2w, 2b)    (0.5w, 0.5b)
Which one has the largest margin?
• The margin doesn't change if we scale (w, b) by a constant c > 0
• wᵀx + b = 0 and (cw)ᵀx + (cb) = 0 define the same decision boundary!
• Can we further constrain the problem so as to get a unique solution
(w , b)?
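The scale-invariance claim is easy to confirm numerically (a small sketch; the toy data here is my own illustration):

```python
import math

def margin(w, b, data):
    """Smallest signed distance from the hyperplane w.x + b = 0 to the data."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in data)

data = [([1.0, 0.0], -1), ([4.0, 0.0], +1), ([5.0, 0.0], +1)]
w, b = [1.0, 0.0], -2.5
m = margin(w, b, data)  # 1.5
# Scaling (w, b) by any c > 0 leaves the margin unchanged:
assert math.isclose(margin([2 * wi for wi in w], 2 * b, data), m)
assert math.isclose(margin([0.5 * wi for wi in w], 0.5 * b, data), m)
```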
Rescaled Margin
We can further constrain the problem by scaling (w, b) such that

    minₙ yₙ(wᵀxₙ + b) = 1

We've fixed the numerator in the margin(w, b) equation, and we have:

    margin(w, b) = minₙ yₙ(wᵀxₙ + b) / ‖w‖₂ = 1 / ‖w‖₂

Hence the points closest to the decision boundary are at distance 1/‖w‖₂.

[Figure: hyperplane H : wᵀφ(x) + b = 0 with margin boundaries wᵀφ(x) + b = 1 and wᵀφ(x) + b = −1, each at distance 1/‖w‖₂ from H]
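The canonical rescaling can be sketched as follows (assuming the given (w, b) already classifies every point correctly; the data reuses the 1-D example that appears later in these slides):

```python
def rescale(w, b, data):
    """Rescale (w, b) by c = min_n yn*(w.xn + b) so that the new minimum is 1.
    Assumes the boundary classifies every point correctly, i.e. c > 0."""
    c = min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in data)
    return [wi / c for wi in w], b / c

data = [([1.0, 0.0], -1), ([4.0, 0.0], +1), ([5.0, 0.0], +1)]
w, b = rescale([1.0, 0.0], -2.5, data)
# w -> [2/3, 0], b -> -5/3: same boundary, closest points now at margin 1
```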
SVM: max margin formulation for separable data
Assuming separable training data, we thus want to solve:
    max_{w,b}  1/‖w‖₂                       (margin)
    s.t.  yₙ(wᵀxₙ + b) ≥ 1, ∀n              (scaling of w, b)

This is equivalent to

    min_{w,b}  ½‖w‖₂²
    s.t.  yₙ(wᵀxₙ + b) ≥ 1, ∀n
Given our geometric intuition, SVM is called a max margin (or large
margin) classifier. The constraints are called large margin constraints.
Support vectors – a first look
SVM formulation for separable data
    min_{w,b}  ½‖w‖₂²
    s.t.  yₙ(wᵀxₙ + b) ≥ 1, ∀n

[Figure: hyperplane H : wᵀφ(x) + b = 0 with margin boundaries wᵀφ(x) + b = ±1 at distance 1/‖w‖₂; the points on the margin boundaries are the support vectors]
• "=": yₙ(wᵀxₙ + b) = 1; these training points are "support vectors"
• ">": yₙ(wᵀxₙ + b) > 1; removing them does not affect the optimal solution
SVM for non-separable data
SVM formulation for separable data
    min_{w,b}  ½‖w‖₂²
    s.t.  yₙ(wᵀxₙ + b) ≥ 1, ∀n
Non-separable setting
In practice our training data will not be separable. What issues arise with
the optimization problem above when data is not separable?
• For every (w, b) there exists a misclassified training point xᵢ, i.e., one
with yᵢ(wᵀxᵢ + b) ≤ 0
• There is no feasible (w , b) as at least one of our constraints is
violated!
SVM for non-separable data
Constraints in separable setting
    yₙ(wᵀxₙ + b) ≥ 1, ∀n
Constraints in non-separable setting
Idea: modify our constraints to account for non-separability! Specifically,
we introduce slack variables ξn ≥ 0:
    yₙ(wᵀxₙ + b) ≥ 1 − ξₙ, ∀n
• For “hard” training points, we can increase ξn until the above
inequalities are met
• What does it mean when ξn is very large?
Soft-margin SVM formulation
We do not want ξn to grow too large, and we can control their size by
incorporating them into our optimization problem:
    min_{w,b,ξ}  ½‖w‖₂² + C Σₙ ξₙ
    s.t.  yₙ(wᵀxₙ + b) ≥ 1 − ξₙ, ∀n
          ξₙ ≥ 0, ∀n
What is the role of C?
• User-defined hyperparameter
• Trades off between the two terms in our objective
• Same idea as the regularization term in ridge regression, i.e., C = 1/λ
How to solve this problem?
    min_{w,b,ξ}  ½‖w‖₂² + C Σₙ ξₙ
    s.t.  yₙ(wᵀxₙ + b) ≥ 1 − ξₙ, ∀n
          ξₙ ≥ 0, ∀n
• This is a convex quadratic program: the objective function is
quadratic in w and linear in ξ and the constraints are linear
(inequality) constraints in w , b and ξn.
• We can solve the optimization problem using general-purpose
solvers, e.g., Matlab’s quadprog() function.
Meaning of “support vectors” in SVMs
• The SVM solution is only determined by a subset of the training
samples (as we will see in more detail in the next lecture)
• These samples are called support vectors
• All other training points do not affect the optimal solution, i.e., if we
remove the other points and construct another SVM classifier on the
reduced dataset, the optimal solution will be the same
These properties allow us to be more efficient than logistic regression or
naive Bayes.
Visualization of how training data points are categorized
[Figure: hyperplane H : wᵀφ(x) + b = 0 with margin boundaries wᵀφ(x) + b = ±1; points are annotated with their slack: ξₙ = 0, ξₙ < 1, and ξₙ > 1]
Support vectors are highlighted by the dotted orange lines
Recall the constraints yₙ(wᵀxₙ + b) ≥ 1 − ξₙ.
Visualization of how training data points are categorized
Recall the constraints yₙ(wᵀxₙ + b) ≥ 1 − ξₙ. Three types of support
vectors:

• ξₙ = 0: the point lies exactly on the margin boundary
• 0 < ξₙ ≤ 1: on the correct side of the decision boundary, but inside the margin
• ξₙ > 1: on the wrong side of the decision boundary
Example of SVM
[Figure: points on the x₁-axis from 0 to 5: the y = −1 class, whose closest point is at x₁ = 1, and the y = +1 class, whose closest point is at x₁ = 4]
What will be the decision boundary learnt by solving the SVM
optimization problem?
Example of SVM
[Figure: same data with the decision boundary x₁ − 2.5 = 0]

Margin = 1.5; the decision boundary has w = [1, 0]ᵀ and b = −2.5.
Is this the right scaling of w and b? We need the support vectors to
satisfy yₙ(wᵀxₙ + b) = 1.

Not quite. For example, for xₙ = [1, 0]ᵀ we have
yₙ(wᵀxₙ + b) = (−1)[1 − 2.5] = 1.5.
Example of SVM: scaling
[Figure: same data with the rescaled decision boundary (2x₁ − 5)/3 = 0]
Thus, our optimization problem will rescale w and b to obtain this equation
for the same decision boundary.

Margin = 1.5; the decision boundary has w = [2/3, 0]ᵀ and b = −5/3.

For example, for xₙ = [1, 0]ᵀ we now have
yₙ(wᵀxₙ + b) = (−1)[2/3 − 5/3] = 1.
Example of SVM: support vectors
[Figure: same data and boundary (2x₁ − 5)/3 = 0, with the support vectors highlighted]
The solution to our optimization problem is the same as on the reduced
dataset containing only the support vectors.
Example of SVM: support vectors
[Figure: a larger dataset with many points per class; the boundary (2x₁ − 5)/3 = 0 and its support vectors are unchanged]
There can be many more data points than support vectors, so in effect we
could train on a much smaller dataset.
Example of SVM: resilience to outliers
[Figure: same data, but with one y = −1 "outlier" close to the y = +1 class]
• Still linearly separable, but one of the orange dots is an “outlier”.
Example of SVM: resilience to outliers
[Figure: the hard-margin boundary forced by the outlier, with a much smaller margin]
• Naively applying the hard-margin SVM will result in a classifier with
small margin.
• So, better to use the soft-margin formulation.
Example of SVM: resilience to outliers
[Figure: hard-margin boundary (left) vs. soft-margin boundary (right); the soft-margin solution keeps a large margin and assigns slack ξₙ to the outlier]

    min_{w,b,ξ}  ½‖w‖₂² + C Σₙ ξₙ
    s.t.  yₙ(wᵀxₙ + b) ≥ 1 − ξₙ, ∀n
          ξₙ ≥ 0, ∀n
• C = ∞ corresponds to the hard-margin SVM
• Thanks to the flexibility in choosing C, the soft-margin SVM is less
sensitive to outliers
Example of SVM
[Figure: data that is not linearly separable]
• Similar reasons apply to the case when the data is not linearly
separable.
• The value of C determines how much the boundary shifts: a trade-off
between training accuracy and robustness (sensitivity to outliers).
Outline
1. Review of Non-linear classification boundary
2. Review of Multi-class Logistic Regression
3. Support Vector Machines (SVM): Intuition
4. SVM: Max Margin Formulation
5. SVM: Hinge Loss Formulation
6. Equivalence of These Two Formulations
SVM: Hinge Loss Formulation
![Page 80: 18-661 Introduction to Machine Learning · 2020. 4. 27. · 18-661 Introduction to Machine Learning Support Vector Machines (SVM) { I Spring 2020 ECE { Carnegie Mellon University](https://reader034.vdocuments.mx/reader034/viewer/2022051904/5ff5b7b43641220d7868742b/html5/thumbnails/80.jpg)
Logistic Regression Loss: Illustration
    L(w) = −Σₙ { yₙ log σ(wᵀxₙ) + (1 − yₙ) log[1 − σ(wᵀxₙ)] }
• Loss grows approx. linearly as we move away from the boundary
• Alternative: Hinge Loss Function
Hinge Loss: Illustration
[Figure: the hinge loss as a function of the signed distance from the boundary]
• Loss grows linearly as we move away from the boundary
• No penalty if a point is more than 1 unit from the boundary
• Makes the search for the boundary easier (as we will see later)
Hinge Loss: Mathematical Expression
    L(w) = Σₙ max(0, 1 − yₙ(wᵀxₙ + b))
• Change of notation y = 0→ y = −1
• Separate the bias term b from w
• Makes the mathematical expression more compact
Hinge loss
Definition
Assume y ∈ {−1, 1} and the decision rule is h(x) = sign(f(x)) with
f(x) = wᵀx + b. Then

    ℓ_hinge(f(x), y) = 0 if yf(x) ≥ 1, and 1 − yf(x) otherwise.
Intuition
• No penalty if raw output, f (x), has same sign and is far enough
from decision boundary (i.e., if ‘margin’ is large enough)
• Otherwise pay a growing penalty, between 0 and 1 if signs match,
and greater than one otherwise
Convenient shorthand

    ℓ_hinge(f(x), y) = max(0, 1 − yf(x)) = (1 − yf(x))₊
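This piecewise definition translates directly into Python (a one-line sketch; the function name is my own):

```python
def hinge_loss(f_x, y):
    """Hinge loss for raw score f_x = w.x + b and label y in {-1, +1}."""
    return max(0.0, 1.0 - y * f_x)

hinge_loss(2.0, +1)   # 0.0: correct side, outside the margin
hinge_loss(0.5, +1)   # 0.5: correct side, but inside the margin
hinge_loss(-1.0, +1)  # 2.0: wrong side of the boundary
```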
Optimization Problem of support vector machines (SVM)
Minimizing the total hinge loss on all the training data
    min_{w,b}  Σₙ max(0, 1 − yₙ(wᵀxₙ + b)) + (λ/2)‖w‖₂²

where the first term is the hinge loss for sample n and the second is the
regularizer.
Analogous to regularized least squares, as we balance between two terms
(the loss and the regularizer).
• Can solve using gradient descent to get the optimal w and b
• The (sub)gradient of the first term for sample n is either 0, xₙ, or −xₙ,
depending on yₙ and wᵀxₙ + b
• Much easier to compute than in logistic regression, where we need to
evaluate the sigmoid function σ(wᵀxₙ + b) in each iteration
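A minimal pure-Python subgradient-descent sketch of this objective (the toy 1-D data echoes the earlier example; the learning rate, λ, and iteration count are arbitrary choices of mine, not tuned values):

```python
# Minimize sum_n max(0, 1 - yn*(w*xn + b)) + (lam/2)*w**2 by subgradient descent
# on toy 1-D data: class -1 near x1 = 1, class +1 near x1 = 4.
data = [(0.5, -1), (1.0, -1), (4.0, +1), (5.0, +1)]
w, b = 0.0, 0.0
lam, lr = 0.01, 0.01  # regularization strength and step size (arbitrary)

for _ in range(5000):
    gw, gb = lam * w, 0.0          # gradient of the regularizer
    for x, y in data:
        if y * (w * x + b) < 1:    # hinge active: subgradient is (-y*x, -y)
            gw -= y * x
            gb -= y
    w -= lr * gw
    b -= lr * gb

# The learned boundary w*x + b = 0 separates the classes, crossing near x1 = 2.5.
```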
Outline
1. Review of Non-linear classification boundary
2. Review of Multi-class Logistic Regression
3. Support Vector Machines (SVM): Intuition
4. SVM: Max Margin Formulation
5. SVM: Hinge Loss Formulation
6. Equivalence of These Two Formulations
Equivalence of These Two
Formulations
Recovering our previous SVM formulation
Rewrite the geometric formulation as the hinge loss formulation:
    min_{w,b}  Σₙ max(0, 1 − yₙ(wᵀxₙ + b)) + (λ/2)‖w‖₂²

Here's the geometric formulation again:

    min_{w,b,ξ}  ½‖w‖₂² + C Σₙ ξₙ
    s.t.  yₙ(wᵀxₙ + b) ≥ 1 − ξₙ, ξₙ ≥ 0, ∀n

Now since yₙ(wᵀxₙ + b) ≥ 1 − ξₙ ⟺ ξₙ ≥ 1 − yₙ(wᵀxₙ + b), combining with
ξₙ ≥ 0 gives:

    min_{w,b,ξ}  C Σₙ ξₙ + ½‖w‖₂²
    s.t.  max(0, 1 − yₙ(wᵀxₙ + b)) ≤ ξₙ, ∀n

Since each ξₙ should be as small as possible at the optimum, the
constraints hold with equality, and we obtain:

    min_{w,b}  C Σₙ max(0, 1 − yₙ(wᵀxₙ + b)) + ½‖w‖₂²

Dividing the objective by C recovers the hinge loss formulation with λ = 1/C.
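The crux of this equivalence is that, for a fixed (w, b), the smallest feasible slack for each point is exactly its hinge loss. A small numeric check using the earlier worked example (the extra point at x₁ = 2 is hypothetical, not part of the slides' dataset):

```python
def hinge(f_x, y):
    """Smallest xi >= 0 satisfying y*f_x >= 1 - xi: exactly the hinge loss."""
    return max(0.0, 1.0 - y * f_x)

w, b = 2 / 3, -5 / 3   # rescaled boundary (2*x1 - 5)/3 = 0 from the worked example
points = [(1.0, -1), (4.0, +1), (5.0, +1)]
slacks = [hinge(w * x + b, y) for x, y in points]
# Support vectors (x1 = 1 and x1 = 4) and the far point x1 = 5 all get
# (numerically) zero slack; a hypothetical point at x1 = 2 with y = +1
# would sit on the wrong side and need slack 1 - (2*2 - 5)/3 = 4/3.
```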
Advantages of SVM
We’ve seen that the geometric formulation of SVM is equivalent to
minimizing the empirical hinge loss. This explains why SVM:
1. Is less sensitive to outliers.
2. Maximizes the distance of training data from the boundary.
3. Generalizes well to many nonlinear models.
4. Only requires a subset of the training points.
5. Scales better with high-dimensional data.
We will need to use duality (in the next lecture) to show the last three properties.