Machine Learning: Chenhao Tan
University of Colorado Boulder
Lecture 10
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen
Machine Learning: Chenhao Tan | Boulder | 1 of 52
Roadmap
• Last time: linear SVM formulation when data is linearly separable
• This time:
  ◦ Introduce duality
  ◦ Make linear SVM work when data is not linearly separable
  ◦ Introduce an efficient algorithm for finding weights
• Next time: Kernel trick
Overview
Duality
Slack variables
Sequential Minimal Optimization
Recap
Duality
Binary classification
Given: training examples $S_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{m}$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$.

Goal: find a hypothesis function $h: \mathcal{X} \to \mathcal{Y}$.

Linear SVM: learn a linear decision rule of the form $w \cdot x + b$.
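As a concrete sketch (the function name is mine), the learned rule classifies a point by the sign of $w \cdot x + b$:

```python
import numpy as np

def predict(w, b, X):
    """Linear SVM decision rule: sign(w . x + b) for each row of X."""
    return np.sign(X @ w + b)
```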
Optimizing the objective function
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \qquad (1)$$
$$\text{subject to } y_i(w \cdot x_i + b) \ge 1,\quad i \in [1, m]$$
Optimizing Constrained Functions
The Method of Lagrange Multipliers
Constrained problem (primal problem):
$$\min_x\ f(x) \quad \text{s.t. } g_i(x) \ge 0,\ i \in [1, n]$$

Lagrangian with multipliers $\alpha$:
$$\mathcal{L}(x, \alpha) = f(x) - \sum_{i=1}^{n} \alpha_i g_i(x), \quad \alpha_i \ge 0,\ i \in [1, n]$$
Lagrange Multiplier
Let $p^*$ be the optimal value of the primal problem. We claim that
$$p^* = \min_x \max_{\alpha} \mathcal{L}(x, \alpha) = \min_x \max_{\alpha}\ f(x) - \sum_{i=1}^{n} \alpha_i g_i(x)$$
This is because
$$\max_{\alpha \ge 0}\ -\alpha y = \begin{cases} 0 & y \ge 0 \\ +\infty & \text{otherwise} \end{cases}$$
What happens if we reverse min and max?
$$\max_{\alpha} \min_x \mathcal{L}(x, \alpha)\ \le\ \min_x \max_{\alpha} \mathcal{L}(x, \alpha)$$
The left leads to the dual problem.
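A tiny numeric check of this weak-duality inequality, using a made-up table of $\mathcal{L}$ values over discrete choices of $x$ (rows) and $\alpha$ (columns):

```python
import numpy as np

# Made-up values of L(x, alpha): rows index x, columns index alpha.
L = np.array([[3.0, 1.0],
              [0.0, 2.0]])

max_min = L.min(axis=0).max()  # max over alpha of (min over x)
min_max = L.max(axis=1).min()  # min over x of (max over alpha)
assert max_min <= min_max      # weak duality holds, here with a strict gap
```

Here max-min is 1.0 and min-max is 2.0, so the inequality can be strict in general; for the SVM problem the two sides coincide.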
Primal vs. Dual
Primal problem:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t. } y_i(w \cdot x_i + b) \ge 1,\ i \in [1, m]$$
Derive the dual by replacing $w, b$ with the stationarity conditions.
Dual problem:
$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_j \cdot x_i)$$
$$\text{s.t. } \alpha_i \ge 0,\ i \in [1, m], \qquad \sum_i \alpha_i y_i = 0$$
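As an illustrative sketch (function names are mine), the dual objective and the recovery of $w$ from the stationarity condition translate directly into NumPy:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j), using the Gram matrix X X^T."""
    u = alpha * y
    return alpha.sum() - 0.5 * u @ (X @ X.T) @ u

def primal_w(alpha, X, y):
    """Recover w = sum_i a_i y_i x_i (the stationarity condition)."""
    return (alpha * y) @ X
```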
Karush-Kuhn-Tucker (KKT) conditions
Primal and dual feasibility:
$$y_i(w \cdot x_i + b) \ge 1, \quad \alpha_i \ge 0 \qquad (2)$$
Stationarity:
$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \quad \sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (3)$$
Complementary slackness:
$$\alpha_i = 0\ \vee\ y_i(w \cdot x_i + b) = 1 \qquad (4)$$
Slack variables
Old objective function
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \qquad (5)$$
$$\text{subject to } y_i(w \cdot x_i + b) \ge 1,\ i \in [1, m]$$
Can SVMs Work Here?
$$y_i(w \cdot x_i + b) \ge 1 \qquad (6)$$
Trick: Allow for a few bad apples
Relaxing the constraint
$$y_i(w \cdot x_i + b) \ge 1 - \xi_i$$

• $\xi_i = 0$: at least one margin on the correct side of the decision boundary
• $\xi_i = 1/2$: at least one-half margin on the correct side of the decision boundary
• $\xi_i = 2$: at most one margin on the wrong side of the decision boundary
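The slack a point needs under a candidate $(w, b)$ is just the hinge shortfall $\xi_i = \max(0,\ 1 - y_i(w \cdot x_i + b))$; a minimal sketch (the function name is mine):

```python
import numpy as np

def slacks(w, b, X, y):
    """Minimal feasible slack per point: xi_i = max(0, 1 - y_i (w . x_i + b))."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))
```

With $w = (1, 0)$ and $b = 0$, the positive points $(2, 0)$, $(0.5, 0)$, and $(-1, 0)$ get $\xi = 0$, $0.5$, and $2$, matching the three cases above.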
New objective function
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i^p \qquad (8)$$
$$\text{subject to } y_i(w \cdot x_i + b) \ge 1 - \xi_i\ \wedge\ \xi_i \ge 0,\ i \in [1, m]$$

• Standard margin: the $\frac{1}{2}\|w\|^2$ term
• How wrong a point is: the slack variables $\xi_i$
• Tradeoff between margin and slack variables: $C$
• How bad wrongness scales: the exponent $p$
Aside: Loss Functions
• Losses measure how bad a mistake is
• Important for slack as well
[Figure: 0/1 loss, linear hinge, and quadratic hinge as functions of the margin]
We'll focus on the linear hinge loss, i.e., set $p = 1$.
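For concreteness, the three losses as functions of the signed margin $z = y\,f(x)$ (a sketch; function names are mine):

```python
import numpy as np

def zero_one(z):
    """0/1 loss: 1 if the point is misclassified (z <= 0), else 0."""
    return (z <= 0).astype(float)

def linear_hinge(z):
    """Linear hinge (p = 1): max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

def quadratic_hinge(z):
    """Quadratic hinge (p = 2): max(0, 1 - z)^2."""
    return np.maximum(0.0, 1.0 - z) ** 2
```

Both hinges upper-bound the 0/1 loss, which is what makes them usable convex surrogates for it.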
What is the role of C?
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \qquad (9)$$
$$\text{subject to } y_i(w \cdot x_i + b) \ge 1 - \xi_i\ \wedge\ \xi_i \ge 0,\ i \in [1, m]$$

A. $C \uparrow\ \Rightarrow$ low bias, low variance
B. $C \uparrow\ \Rightarrow$ low bias, high variance
C. $C \uparrow\ \Rightarrow$ high bias, low variance
D. $C \uparrow\ \Rightarrow$ high bias, high variance
New Lagrangian
$$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \qquad (10)$$
$$-\ \sum_{i=1}^{m} \alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] \qquad (11)$$
$$-\ \sum_{i=1}^{m} \beta_i \xi_i \qquad (12)$$

Taking the gradients ($\nabla_w \mathcal{L}$, $\nabla_b \mathcal{L}$, $\nabla_{\xi_i} \mathcal{L}$) and solving for zero gives us
$$w = \sum_{i=1}^{m} \alpha_i y_i x_i \qquad (13)$$
$$\sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (14) \qquad \alpha_i + \beta_i = C \qquad (15)$$
Simplifying dual objective
Substituting the stationarity conditions
$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0, \qquad \alpha_i + \beta_i = C$$
into
$$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} \beta_i \xi_i$$
eliminates $w$, $b$, and $\xi$.
Dual Problem
$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_j \cdot x_i)$$
$$\text{s.t. } 0 \le \alpha_i \le C,\ i \in [1, m], \qquad \sum_i \alpha_i y_i = 0$$
Karush-Kuhn-Tucker (KKT) conditions
Primal and dual feasibility:
$$y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad 0 \le \alpha_i \le C, \quad \beta_i \ge 0 \qquad (16)$$
Stationarity:
$$w = \sum_{i=1}^{m} \alpha_i y_i x_i, \quad \sum_{i=1}^{m} \alpha_i y_i = 0, \quad \alpha_i + \beta_i = C \qquad (17)$$
Complementary slackness:
$$\alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] = 0, \quad \beta_i \xi_i = 0 \qquad (18)$$
More on Complementary Slackness
$$\alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] = 0, \quad \beta_i \xi_i = 0 \qquad (19)$$

• $x_i$ satisfies the margin, $y_i(w \cdot x_i + b) > 1 \Rightarrow \alpha_i = 0$
• $x_i$ does not satisfy the margin, $y_i(w \cdot x_i + b) < 1 \Rightarrow \alpha_i = C$
• $x_i$ is on the margin, $y_i(w \cdot x_i + b) = 1 \Rightarrow 0 \le \alpha_i \le C$
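The three cases can be codified as a small lookup (a sketch; the function name is mine):

```python
def expected_alpha(margins, C, tol=1e-8):
    """Map each margin y_i (w . x_i + b) to the alpha_i it forces:
    0 (strictly outside the margin), C (violating it), or None (on it: any value in [0, C])."""
    out = []
    for m in margins:
        if m > 1 + tol:
            out.append(0.0)   # slack inactive, not a support vector
        elif m < 1 - tol:
            out.append(C)     # margin violated, alpha at the box bound
        else:
            out.append(None)  # on the margin, 0 <= alpha_i <= C
    return out
```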
Sequential Minimal Optimization
Trivia:
• Invented by John Platt in 1998 at Microsoft Research
• Called "Minimal" because it solves very small sub-problems
Brief Interlude: Coordinate Ascent
$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_j \cdot x_i)$$
$$\text{s.t. } 0 \le \alpha_i \le C,\ i \in [1, m], \qquad \sum_i \alpha_i y_i = 0$$

Coordinate ascent: loop over each training example and change $\alpha_i$ to maximize the above function.

Although coordinate ascent works well for many problems, here we have the constraint $\sum_i \alpha_i y_i = 0$, so no single $\alpha_i$ can be changed on its own.
Outline for SVM Optimization (SMO)
1. Select two examples $i$, $j$
2. Update $\alpha_j$, $\alpha_i$ to maximize the dual objective
Because $\sum_i \alpha_i y_i = 0$ must be preserved, the pair update keeps
$$y_i \alpha_i + y_j \alpha_j = y_i \alpha_i^{\text{old}} + y_j \alpha_j^{\text{old}} = \gamma$$
Step 2: Optimize αj
1. Compute upper ($H$) and lower ($L$) bounds that ensure $0 \le \alpha_j \le C$.

If $y_i \ne y_j$:
$$L = \max(0, \alpha_j - \alpha_i) \qquad (23)$$
$$H = \min(C, C + \alpha_j - \alpha_i) \qquad (24)$$

If $y_i = y_j$:
$$L = \max(0, \alpha_i + \alpha_j - C) \qquad (25)$$
$$H = \min(C, \alpha_j + \alpha_i) \qquad (26)$$
The two cases arise because the update to $\alpha_i$ depends on the product $y_i y_j$: the sign matters.
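Equations (23)-(26) translate directly into code (a sketch; the function name is mine):

```python
def clip_bounds(alpha_i, alpha_j, y_i, y_j, C):
    """[L, H] box keeping 0 <= alpha_j <= C while y_i*alpha_i + y_j*alpha_j stays fixed."""
    if y_i != y_j:
        return max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)  # Eqs. (23)-(24)
    return max(0.0, alpha_i + alpha_j - C), min(C, alpha_i + alpha_j)      # Eqs. (25)-(26)
```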
Step 2: Optimize αj

Compute the errors for i and j:
Ek ≡ f(xk) − yk (27)

Compute the step size:
η = 2 xi · xj − xi · xi − xj · xj (28)

Then the new value for αj is:
αj* = αj^(old) − yj(Ei − Ej)/η (29)
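Equation (29), plus the clipping into [L, H] from the bounds above, can be sketched as (names mine):

```python
def alpha_j_step(alpha_j, y_j, E_i, E_j, eta, L, H):
    """Step alpha_j against the error difference (eq. 29), then clip to [L, H]."""
    if eta >= 0:
        return alpha_j              # degenerate direction; SMO skips this pair
    unclipped = alpha_j - y_j * (E_i - E_j) / eta
    return max(L, min(H, unclipped))
```

On the running example from the later slides (E_i = −1, E_j = +1, η = −13, y_j = −1) this gives 2/13.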
Step 3: Optimize αi

Set αi:
αi* = αi^(old) + yi yj (αj^(old) − αj*) (30)

This balances out the move that we made for αj.
Overall algorithm

Repeat until the KKT conditions are met:
• Iterate over i = {1, . . . , m}
• Choose j randomly from the m − 1 other options
• Update αi, αj

Find w, b based on the stationarity conditions.
Iterations / Details

• What if i doesn't violate the KKT conditions? Skip it!
• What if η ≥ 0? Skip it! (This should not happen, except from numerical instability.)
• When do we stop? When we make a full pass through the α's without changing anything.
SMO Algorithm

Positive points (indices 0–2): (−2, 2), (0, 4), (2, 1)
Negative points (indices 3–5): (−2, −3), (0, −1), (2, −3)

[Figure: the six points plotted in the plane, labeled 0–5]

• Initially, all alphas are zero: α = ⟨0, 0, 0, 0, 0, 0⟩
• The intercept b is also zero
• Capacity C = π
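The toy problem can be encoded directly. The index-to-point assignment (0–2 positive, 3–5 negative, in listing order) is inferred from the worked numbers on the following slides:

```python
import math

# The slide's toy problem
X = [(-2, 2), (0, 4), (2, 1), (-2, -3), (0, -1), (2, -3)]
y = [+1, +1, +1, -1, -1, -1]
C = math.pi            # the slide's (deliberately odd) capacity
alphas = [0.0] * 6     # all alphas start at zero
b = 0.0                # intercept starts at zero

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def f(x):
    """Dual-form decision function f(x) = sum_i alpha_i y_i <x_i, x> + b."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alphas, y, X)) + b

# with all alphas zero, f is identically zero, so every point starts as an error
print([f(x) for x in X])    # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```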
SMO Optimization for i = 0, j = 4: Predictions and Step

• Prediction: f(x0) = 0
• Prediction: f(x4) = 0
• Error: E0 = −1
• Error: E4 = +1

η = 2⟨x0, x4⟩ − ⟨x0, x0⟩ − ⟨x4, x4⟩ = 2 · (−2) − 8 − 1 = −13
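A quick check of these numbers (a sketch; the all-zero starting alphas make both predictions zero):

```python
x0, y0 = (-2, 2), +1    # i = 0, a positive point
x4, y4 = (0, -1), -1    # j = 4, a negative point

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# all alphas are still zero, so f(x0) = f(x4) = 0
E0 = 0 - y0             # f(x0) - y0 = -1
E4 = 0 - y4             # f(x4) - y4 = +1

eta = 2 * dot(x0, x4) - dot(x0, x0) - dot(x4, x4)
print(E0, E4, eta)      # -1 1 -13
```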
SMO Optimization for i = 0, j = 4: Bounds

• Lower and upper bounds for αj:
L = max(0, αj − αi) = 0 (31)
H = min(C, C + αj − αi) = π (32)
SMO Optimization for i = 0, j = 4: α update

New value for αj:
αj* = αj − yj(Ei − Ej)/η = −2/η = 2/13 (33)

New value for αi:
αi* = αi + yi yj (αj^(old) − αj*) = αj* = 2/13 (34)
[Figure: Margin]
Find weight vector and bias

• Weight vector:
w = ∑_i αi yi xi = (2/13) · (−2, 2) − (2/13) · (0, −1) = (−4/13, 6/13) (35)

• Bias:
b = b^(old) − Ei − yi(αi* − αi^(old)) xi · xi − yj(αj* − αj^(old)) xi · xj (36)
  = 1 − (2/13) · 8 + (2/13) · (−2) = −0.54 (37)
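Verifying (35)–(37) with exact arithmetic (a sketch using `fractions`; −7/13 ≈ −0.54):

```python
from fractions import Fraction

a = Fraction(2, 13)       # alpha_0 = alpha_4 after the first update
x0, y0 = (-2, 2), +1
x4, y4 = (0, -1), -1

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# eq. (35): only alpha_0 and alpha_4 are nonzero
w = tuple(a * y0 * p + a * y4 * q for p, q in zip(x0, x4))

# eqs. (36)-(37) with b_old = 0 and E_i = E_0 = -1
E_i = -1
b = 0 - E_i - y0 * a * dot(x0, x0) - y4 * a * dot(x0, x4)
print(w, float(b))        # w = (-4/13, 6/13), b = -7/13, about -0.54
```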
SMO Optimization for i = 2, j = 4

Let's skip the boring stuff:
• E2 = −1.69
• E4 = 0.00
• η = −8
• α4 = αj^(old) − yj(Ei − Ej)/η = 0.15 + (−1.69)/(−8) ≈ 0.37
• α2 = αi^(old) + yi yj (αj^(old) − αj*) = 0 − (0.15 − 0.37) ≈ 0.21
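The "boring stuff" can be reproduced exactly (a sketch; the alphas and bias come from the previous update):

```python
from fractions import Fraction

X = [(-2, 2), (0, 4), (2, 1), (-2, -3), (0, -1), (2, -3)]
y = [+1, +1, +1, -1, -1, -1]
alphas = [Fraction(2, 13), 0, 0, 0, Fraction(2, 13), 0]
b = Fraction(-7, 13)

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def f(x):
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alphas, y, X)) + b

i, j = 2, 4
E_i, E_j = f(X[i]) - y[i], f(X[j]) - y[j]        # -22/13 (about -1.69) and 0
eta = 2 * dot(X[i], X[j]) - dot(X[i], X[i]) - dot(X[j], X[j])   # -8

a_j = alphas[j] - y[j] * (E_i - E_j) / eta       # 19/52, about 0.37
a_i = alphas[i] + y[i] * y[j] * (alphas[j] - a_j)  # 11/52, about 0.21
print(float(E_i), eta, float(a_j), float(a_i))
```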
[Figure: Margin]
Weight vector and bias

• Bias b = −0.12
• Weight vector:
w = ∑_i αi yi xi = (0.12, 0.88) (38)
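Checking (38) and the new bias with exact arithmetic (a sketch; the alpha values come from the previous slide):

```python
from fractions import Fraction

X = [(-2, 2), (0, 4), (2, 1), (-2, -3), (0, -1), (2, -3)]
y = [+1, +1, +1, -1, -1, -1]
# alphas after the i = 2, j = 4 update: alpha_0 = 2/13, alpha_2 = 11/52, alpha_4 = 19/52
alphas = [Fraction(2, 13), 0, Fraction(11, 52), 0, Fraction(19, 52), 0]

# eq. (38): w = sum_i alpha_i y_i x_i
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alphas, y, X)) for d in range(2)]
print([float(c) for c in w])    # about [0.115, 0.885], i.e. (0.12, 0.88) rounded

# bias via eq. (36), with b_old = -7/13, E_i = -22/13, and both alphas moving by 11/52
b = Fraction(-7, 13) + Fraction(22, 13) - Fraction(11, 52) * 5 - Fraction(11, 52)
print(float(b))                 # -3/26, which rounds to -0.12
```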
Another Iteration (i = 0, j = 2)
SMO Algorithm

• A convenient approach for solving the vanilla, slack, and kernel formulations
• Convex problem
• Scalable to large datasets (implemented in scikit-learn)
• What we didn't do:
◦ Check the KKT conditions
◦ Randomly choose indices
Recap
Outline
Duality
Slack variables
Sequential Minimal Optimization
Recap
Recap

• Duality
• Slack variables
• SMO: optimize the objective function for two data points at a time
• Convex problem: will converge
• Relatively fast
• Gives good performance
Wrapup
• Adding slack variables doesn't break the SVM problem
• Very popular algorithm:
◦ SVMLight (many options)
◦ LIBSVM / LIBLINEAR (very fast)
◦ Weka (friendly)
◦ PyML (Python focused, from Colorado)
• Next up: the kernel trick