CS434a/541a: Pattern Recognition, Prof. Olga Veksler
CS840a Learning and Computer Vision
Prof. Olga Veksler
Lecture 8 SVM
Some pictures from C. Burges
SVM
• Said to start in 1979 with Vladimir Vapnik’s paper
• Major developments throughout the 1990s
• Elegant theory
• Good generalization properties
• Applied to diverse problems very successfully over the last 15-20 years
Linear Discriminant Functions
• g(x) = w^t x + w0
• decide x ∈ class 1 if g(x) > 0, and x ∈ class 2 if g(x) < 0
• which separating hyperplane should we choose?
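• As a tiny illustration (not part of the original slides), the decision rule above in numpy, with placeholder weights:

```python
import numpy as np

# g(x) = w^t x + w0; decide class 1 if g(x) > 0, class 2 if g(x) < 0
w, w0 = np.array([1.0, -2.0]), 0.5          # placeholder hyperplane, made up for the example
x = np.array([3.0, 1.0])
g = w @ x + w0
print("class 1" if g > 0 else "class 2")
```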
Margin Intuition
• Training data is just a subset of all possible data
• Suppose the hyperplane is close to sample xi
• If a new sample is close to xi, it is likely to fall on the wrong side of the hyperplane
• Poor generalization
Margin Intuition
• Place the hyperplane as far as possible from any sample
• New samples close to the old samples are then more likely to be classified correctly
• Good generalization
SVM
• Idea: maximize the distance to the closest example
• For the optimal hyperplane, the distance to the closest negative example equals the distance to the closest positive example
SVM: Linearly Separable Case
• SVM: maximize the margin
• the margin is twice the absolute value of the distance b from the closest example to the separating hyperplane
• Better generalization, in practice and in theory
SVM: Linearly Separable Case
• Support vectors are the samples closest to the separating hyperplane
• intuitively, they are the most difficult patterns to classify
• the optimal hyperplane is completely defined by the support vectors
• we do not know which samples are support vectors beforehand
SVM: Formula for the Margin
• g(x) = w^t x + w0
• the absolute distance between x and the boundary g(x) = 0 is |w^t x + w0| / ||w||
• the distance is unchanged for the scaled hyperplane g1(x) = α g(x):
   |α w^t x + α w0| / ||α w|| = |w^t x + w0| / ||w||
• Let xi be an example closest to the boundary. Set |w^t xi + w0| = 1
• Now the largest margin hyperplane is unique
SVM: Formula for the Margin
• now the distance from the closest sample xi to g(x) = 0 is
   |w^t xi + w0| / ||w|| = 1 / ||w||
• Thus the margin is m = 2 / ||w||
• For uniqueness, set |w^t xi + w0| = 1 for any example xi closest to the boundary
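• A quick numerical check of the distance formula above, with made-up numbers (not from the slides):

```python
import numpy as np

# absolute distance from x to the hyperplane g(x) = w^t x + w0 = 0
w, w0 = np.array([3.0, 4.0]), -5.0
x = np.array([1.0, 1.0])
dist = abs(w @ x + w0) / np.linalg.norm(w)
print(dist)   # |3 + 4 - 5| / 5 = 0.4
```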
SVM: Optimal Hyperplane
• Maximize the margin m = 2 / ||w||, subject to the constraints
   w^t xi + w0 ≥ 1   if xi is a positive example
   w^t xi + w0 ≤ -1   if xi is a negative example
• Let zi = 1 if xi is a positive example, and zi = -1 if xi is a negative example
• Convert our problem to:
   minimize J(w) = (1/2) ||w||^2
   constrained to zi (w^t xi + w0) ≥ 1 for all i
• J(w) is a convex function, thus it has a single global minimum
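• A minimal sketch (not the lecture's code) of this primal problem on a tiny toy set, using scipy's general-purpose constrained solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: rows of X are samples, z holds +1/-1 labels
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):
    w = v[:-1]                 # v packs (w, w0); only w is penalized
    return 0.5 * np.dot(w, w)

# one inequality constraint z_i (w^t x_i + w0) - 1 >= 0 per sample
constraints = [{'type': 'ineq',
                'fun': lambda v, i=i: z[i] * (X[i] @ v[:-1] + v[-1]) - 1.0}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
w, w0 = res.x[:-1], res.x[-1]
print("w =", w, "w0 =", w0, "margin =", 2.0 / np.linalg.norm(w))
```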
SVM: Optimal Hyperplane
• Use the Kuhn-Tucker theorem to convert our problem to:
   maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj zi zj xi^t xj
   constrained to αi ≥ 0 for all i and Σi αi zi = 0
• α = {α1,…, αn} are new variables, one for each sample
• Rewrite LD(α) using the n by n matrix H:
   LD(α) = Σi αi - (1/2) α^t H α
• where the value in the i-th row and j-th column of H is Hij = zi zj xi^t xj
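• For illustration (assumed toy data), H can be assembled with numpy broadcasting, where X is the n x d sample matrix and z the vector of ±1 labels:

```python
import numpy as np

# Toy data: X is n x d, z is the vector of +1/-1 labels
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])

# H_ij = z_i z_j x_i^t x_j, built from an outer product and the Gram matrix X X^t
H = np.outer(z, z) * (X @ X.T)
print(H)
```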
SVM: Optimal Hyperplane
• Use the Kuhn-Tucker theorem to convert our problem to:
   maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj zi zj xi^t xj
   constrained to αi ≥ 0 for all i and Σi αi zi = 0
• α = {α1,…, αn} are new variables, one for each sample
• LD(α) can be optimized by quadratic programming
• LD(α) is formulated in terms of α alone; the hyperplane parameters w and w0 are recovered from the optimal α afterwards
SVM: Optimal Hyperplane
• After finding the optimal α = {α1,…, αn}
• compute w = Σi αi zi xi
• for every sample i, one of the following must hold
   • αi = 0 (sample i is not a support vector)
   • αi ≠ 0 and zi (w^t xi + w0) - 1 = 0 (sample i is a support vector)
• solve for w0 using any αi > 0 and αi [zi (w^t xi + w0) - 1] = 0, which gives w0 = 1/zi - w^t xi
• Final discriminant function:
   g(x) = Σ_{xi ∈ S} αi zi xi^t x + w0
• where S is the set of support vectors, S = {xi | αi ≠ 0}
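• A small sketch (with assumed variable names) of recovering w, w0 and the support vectors from the dual solution α:

```python
import numpy as np

def recover_hyperplane(alpha, X, z, tol=1e-8):
    """Given the optimal dual variables alpha, return w, w0 and the support vectors."""
    support = alpha > tol                           # alpha_i > 0 marks support vectors
    w = (alpha[support] * z[support]) @ X[support]  # w = sum_i alpha_i z_i x_i
    i = int(np.argmax(support))                     # index of any support vector
    w0 = 1.0 / z[i] - X[i] @ w                      # from z_i (w^t x_i + w0) = 1
    return w, w0, np.where(support)[0]
```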
SVM: Optimal Hyperplane
   maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj zi zj xi^t xj
   constrained to αi ≥ 0 for all i and Σi αi zi = 0
• LD(α) depends on the number of samples, not on the dimension of the samples
• samples appear only through the dot products xi^t xj
• this will become important when looking for a nonlinear discriminant function
SVM: Non Separable Case
• A linear classifier may still be appropriate when the data is not linearly separable, but almost linearly separable (a few outliers)
• Can adapt SVM to the almost linearly separable case
SVM: Non Separable Case
• Introduce non-negative slack variables ξ1,…, ξn, one for each sample
• Change the constraints from zi (w^t xi + w0) ≥ 1 for all i to zi (w^t xi + w0) ≥ 1 - ξi for all i
• ξi measures the deviation from the ideal position for sample xi
   • ξi > 1: xi is on the wrong side of the hyperplane
   • 0 < ξi < 1: xi is on the right side of the hyperplane but within the region of maximum margin
SVM: Non Separable Case
• Wish to minimize
   J(w, ξ1,…, ξn) = (1/2) ||w||^2 + β Σi I(ξi > 0)
• where I(ξi > 0) = 1 if ξi > 0 and 0 if ξi ≤ 0, so the sum counts the samples not in the ideal location
• constrained to zi (w^t xi + w0) ≥ 1 - ξi and ξi ≥ 0 for all i
• β measures the relative weight of the first and second terms
   • if β is small, we allow a lot of samples not in ideal position
   • if β is large, we allow very few samples not in ideal position
   • choosing β appropriately is important
SVM: Non Separable Case
   J(w, ξ1,…, ξn) = (1/2) ||w||^2 + β Σi I(ξi > 0)   (the sum counts samples not in the ideal location)
• large β: few samples not in ideal position
• small β: many samples not in ideal position
SVM: Non Separable Case
   J(w, ξ1,…, ξn) = (1/2) ||w||^2 + β Σi I(ξi > 0)
• where I(ξi > 0) = 1 if ξi > 0 and 0 if ξi ≤ 0 (the sum counts samples not in the ideal location)
• constrained to zi (w^t xi + w0) ≥ 1 - ξi and ξi ≥ 0 for all i
• This minimization problem is NP-hard due to the discontinuity of I(ξi > 0)
SVM: Non Separable Case
• Instead we minimize
   J(w, ξ1,…, ξn) = (1/2) ||w||^2 + β Σi ξi   (Σi ξi is a measure of the number of misclassified examples)
• constrained to zi (w^t xi + w0) ≥ 1 - ξi and ξi ≥ 0 for all i
• Use the Kuhn-Tucker theorem to convert to:
   maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj zi zj xi^t xj
   constrained to 0 ≤ αi ≤ β for all i and Σi αi zi = 0
• find w using w = Σi αi zi xi
• solve for w0 using any 0 < αi < β and αi [zi (w^t xi + w0) - 1] = 0
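• For illustration, scikit-learn's SVC solves this soft-margin problem; its parameter C plays the role of β here (toy data made up for the example):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
z = np.hstack([np.ones(20), -np.ones(20)])
X[0] = [-2.5, -2.5]                      # an outlier, so the data is only almost separable

clf = SVC(kernel='linear', C=10.0).fit(X, z)      # C controls the slack penalty, like beta
print("w =", clf.coef_[0], "w0 =", clf.intercept_[0],
      "support vectors:", len(clf.support_))
```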
Non Linear Mapping
• Cover’s theorem: “A pattern-classification problem cast in a high-dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space”
• Example: data on the line that is not linearly separable in 1D
• Lift it to 2D space with h(x) = (x, x^2)
Non Linear Mapping
• To solve a non-linear problem with a linear classifier:
   1. Project the data x to a high dimension using a function ϕ(x)
   2. Find a linear discriminant function for the transformed data ϕ(x)
   3. The final nonlinear discriminant function is g(x) = w^t ϕ(x) + w0
• Example: ϕ(x) = (x, x^2)
• In 2D the discriminant function is linear:
   g([x^(1), x^(2)]) = [w1 w2] [x^(1), x^(2)]^t + w0
• In 1D the discriminant function is not linear:
   g(x) = w1 x + w2 x^2 + w0
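• A small sketch of this 1D example: after lifting with ϕ(x) = (x, x^2), a linear classifier separates the classes (toy numbers assumed):

```python
import numpy as np
from sklearn.svm import SVC

# 1D points: class 1 in the middle, class 2 on both sides (not separable on the line)
x = np.array([-3.0, -2.0, 0.0, 1.0, 2.0, 3.0, 4.0])
z = np.array([-1, -1, 1, 1, 1, -1, -1])

X_lifted = np.column_stack([x, x ** 2])           # phi(x) = (x, x^2)
clf = SVC(kernel='linear', C=1e3).fit(X_lifted, z)
print(clf.predict(X_lifted))                      # separable after lifting, matches z
```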
Non Linear Mapping: Another Example
Non Linear SVM
• Can use any linear classifier after lifting the data into a higher dimensional space
• However, we will have to deal with the “curse of dimensionality”:
   1. poor generalization to test data
   2. computationally expensive
• SVM avoids the “curse of dimensionality” because
   • enforcing the largest margin permits good generalization
   • computation in the higher dimensional case is performed only implicitly through the use of kernel functions
Non Linear SVM: Kernels
• Recall the SVM optimization:
   maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj zi zj xi^t xj
• The optimization depends on the samples xi only through the dot products xi^t xj
• If we lift xi to a high dimension using ϕ(x), we need to compute the high dimensional dot products ϕ(xi)^t ϕ(xj):
   maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj zi zj ϕ(xi)^t ϕ(xj)
• Idea: find a kernel function K(xi,xj) such that K(xi,xj) = ϕ(xi)^t ϕ(xj)
Non Linear SVM: Kernels
• Kernel trick
   • only need to compute K(xi,xj) instead of ϕ(xi)^t ϕ(xj)
   • no need to lift the data into the high dimension explicitly; computation is performed in the original dimension
   maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj zi zj K(xi,xj)
Non Linear SVM: Kernels
• Suppose we have 2 features and K(x,y) = (x^t y)^2
• Which mapping ϕ(x) does it correspond to?
   K(x,y) = (x^t y)^2 = (x^(1) y^(1) + x^(2) y^(2))^2
          = (x^(1) y^(1))^2 + 2 x^(1) y^(1) x^(2) y^(2) + (x^(2) y^(2))^2
          = [(x^(1))^2, √2 x^(1) x^(2), (x^(2))^2] [(y^(1))^2, √2 y^(1) y^(2), (y^(2))^2]^t
• Thus ϕ(x) = [(x^(1))^2, √2 x^(1) x^(2), (x^(2))^2]
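• A quick numerical check (illustration, not from the lecture) that K(x,y) = (x^t y)^2 equals ϕ(x)^t ϕ(y):

```python
import numpy as np

# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) for the kernel K(x, y) = (x^t y)^2
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, y = np.array([1.5, -0.5]), np.array([2.0, 3.0])
print((x @ y) ** 2, phi(x) @ phi(y))   # both print the same value
```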
Non Linear SVM: Kernels
• How to choose the kernel K(xi,xj)?
   • K(xi,xj) should correspond to a product ϕ(xi)^t ϕ(xj) in a higher dimensional space
   • Mercer’s condition states which kernel functions can be expressed as a dot product of two vectors
   • Kernels not satisfying Mercer’s condition can sometimes be used, but there is no geometrical interpretation
• Common choices satisfying Mercer’s condition:
   • Polynomial kernel: K(xi,xj) = (xi^t xj + 1)^p
   • Gaussian radial basis kernel (data is lifted into infinite dimensions):
     K(xi,xj) = exp(-(1/(2σ^2)) ||xi - xj||^2)
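• Minimal sketches of these two kernels for single sample vectors (parameter names assumed):

```python
import numpy as np

def polynomial_kernel(xi, xj, p=2):
    # K(xi, xj) = (xi^t xj + 1)^p
    return (xi @ xj + 1.0) ** p

def gaussian_rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    d = xi - xj
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))
```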
Non Linear SVM
• Choose ϕ(x) so that the first (“0”th) dimension is the augmented dimension, with feature value fixed to 1:
   ϕ(x) = [1, x^(1), x^(2), …]^t
• search for a separating hyperplane in the high dimension:
   w^t ϕ(x) + w0 = 0
• The threshold w0 gets folded into the vector w:
   [w0  w] [1  ϕ(x)]^t = 0
Non Linear SVM
• Thus we seek a hyperplane w^t ϕ(x) = 0
• Or, equivalently, a hyperplane that goes through the origin in the high dimensional space
   • this removes only one degree of freedom
   • but we introduced many new degrees of freedom when lifting the data into the high dimension
Non Linear SVM Recipe
• Start with x1,…, xn in the original feature space of dimension d
• Choose a kernel K(xi,xj)
   • this implicitly chooses a function ϕ(xi) that takes xi to a higher dimensional space
   • it gives the dot product in the high dimensional space
• Find the largest margin linear classifier in the higher dimensional space by using a quadratic programming package to solve:
   maximize LD(α) = Σi αi - (1/2) Σi Σj αi αj zi zj K(xi,xj)
   constrained to 0 ≤ αi ≤ β for all i and Σi αi zi = 0
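• A minimal sketch (not a production QP solver) of this step using scipy; fit_kernel_svm, K and beta are assumed names:

```python
import numpy as np
from scipy.optimize import minimize

def fit_kernel_svm(X, z, K, beta=10.0):
    """Solve the kernel dual above and return the dual variables alpha."""
    n = len(X)
    G = np.array([[z[i] * z[j] * K(X[i], X[j]) for j in range(n)] for i in range(n)])

    def neg_LD(a):                       # maximize LD  <=>  minimize -LD
        return -(a.sum() - 0.5 * a @ G @ a)

    res = minimize(neg_LD, x0=np.zeros(n),
                   bounds=[(0.0, beta)] * n,                        # 0 <= alpha_i <= beta
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ z}])  # sum alpha_i z_i = 0
    return res.x
```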
Non Linear SVM Recipe
• The linear discriminant function in the high dimensional space is
   g̃(ϕ(x)) = w^t ϕ(x), where w = Σ_{xi ∈ S} αi zi ϕ(xi) is the weight vector in the high dimensional space
• where S is the set of support vectors, S = {xi | αi ≠ 0}
• The nonlinear discriminant function in the original space is:
   g(x) = Σ_{xi ∈ S} αi zi ϕ(xi)^t ϕ(x) = Σ_{xi ∈ S} αi zi K(xi, x)
• Decide class 1 if g(x) > 0, otherwise decide class 2
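• Continuing the sketch above, the resulting decision function uses only the support vectors (assumed helper, not from the lecture):

```python
import numpy as np

def decision_function(x, X, z, alpha, K, tol=1e-8):
    """g(x) = sum over support vectors of alpha_i z_i K(x_i, x)."""
    S = np.where(alpha > tol)[0]                  # support vector indices
    return sum(alpha[i] * z[i] * K(X[i], x) for i in S)

# decide class 1 if decision_function(x, X, z, alpha, K) > 0, otherwise class 2
```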
Non Linear SVM
• Nonlinear discriminant function:
   g(x) = Σ_{xi ∈ S} αi zi K(xi, x)
• the sum is over the most important training samples, i.e. the support vectors
• αi is the weight of support vector xi
• K(xi, x) measures the similarity between x and support vector xi, e.g. for the Gaussian kernel
   K(xi, x) = exp(-(1/(2σ^2)) ||xi - x||^2)
SVM Example: XOR Problem
• Class 1: x1 = [1,-1], x2 = [-1,1]
• Class 2: x3 = [1,1], x4 = [-1,-1]
• Use the polynomial kernel of degree 2:
   K(xi,xj) = (xi^t xj + 1)^2
• This kernel corresponds to the mapping
   ϕ(x) = [1, √2 x^(1), √2 x^(2), √2 x^(1) x^(2), (x^(1))^2, (x^(2))^2]^t
• Need to maximize
   LD(α) = Σ_{i=1..4} αi - (1/2) Σ_{i=1..4} Σ_{j=1..4} αi αj zi zj (xi^t xj + 1)^2
   constrained to αi ≥ 0 for all i and α1 + α2 - α3 - α4 = 0
SVM Example: XOR Problem
• Rewrite LD(α) = Σ_{i=1..4} αi - (1/2) α^t H α
• where α = [α1 α2 α3 α4]^t and
   H = [  9   1  -1  -1
          1   9  -1  -1
         -1  -1   9   1
         -1  -1   1   9 ]
• Take the derivative with respect to α and set it to 0:
   d/dα LD(α) = [1 1 1 1]^t - H α = 0
• The solution is α1 = α2 = α3 = α4 = 1/8
   • all samples are support vectors
   • it satisfies the constraints αi ≥ 0 for all i and α1 + α2 - α3 - α4 = 0
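• A quick numerical check of this example (illustration): build H from the kernel, solve the stationarity condition H α = 1, and evaluate g at the four training points:

```python
import numpy as np

X = np.array([[1, -1], [-1, 1], [1, 1], [-1, -1]], dtype=float)
z = np.array([1, 1, -1, -1], dtype=float)

K = (X @ X.T + 1.0) ** 2                 # degree-2 polynomial kernel matrix
H = np.outer(z, z) * K
alpha = np.linalg.solve(H, np.ones(4))   # -> [0.125, 0.125, 0.125, 0.125]

g = (alpha * z) @ K                      # g(x_i) = sum_j alpha_j z_j K(x_j, x_i)
print(alpha, g)                          # g equals [1, 1, -1, -1] = z, so the margins are 1
```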
SVM Example: XOR Problem
• The weight vector w is:
   w = Σ_{i=1..4} αi zi ϕ(xi) = (1/8)(ϕ(x1) + ϕ(x2) - ϕ(x3) - ϕ(x4))
• by plugging in x1 = [1,-1], x2 = [-1,1], x3 = [1,1], x4 = [-1,-1] and
   ϕ(x) = [1, √2 x^(1), √2 x^(2), √2 x^(1) x^(2), (x^(1))^2, (x^(2))^2]^t
   we get w = [0, 0, 0, -1/√2, 0, 0]^t
• The nonlinear discriminant function is
   g(x) = w^t ϕ(x) = Σ_{i=1..6} wi ϕi(x) = -(1/√2)(√2 x^(1) x^(2)) = -x^(1) x^(2)
SVM Example: XOR Problem
• g(x) = -x^(1) x^(2)
• In the original space (axes x^(1), x^(2)) the decision boundaries are nonlinear
• In the transformed space the decision boundary is linear
Degree 3 Polynomial Kernel
• Left: in the linearly separable case, the decision boundary is roughly linear, indicating that the dimensionality is controlled
• Right: the nonseparable case is handled by a degree 3 polynomial boundary
SVM as Unconstrained Minimization
• SVM was formulated as constrained optimization: minimize
   J(w, ξ1,…, ξn) = (1/2) ||w||^2 + β Σi ξi
   constrained to zi (w^t xi + w0) ≥ 1 - ξi and ξi ≥ 0 for all i
• Let us name f(xi) = w^t xi + w0
• The constraints can be rewritten as zi f(xi) ≥ 1 - ξi and ξi ≥ 0 for all i
• Which implies ξi = max(0, 1 - zi f(xi))
• The SVM objective can therefore be rewritten as unconstrained optimization:
   J(w) = (1/2) ||w||^2 + β Σi max(0, 1 - zi f(xi))
   where (1/2) ||w||^2 is the weight regularization term and the sum is the loss function
SVM as Unconstrained Minimization
• SVM objective as unconstrained optimization:
   J(w) = (1/2) ||w||^2 + β Σi max(0, 1 - zi f(xi))   (regularization + loss)
• zi f(xi) > 1 : xi is on the right side of the hyperplane and outside the margin, no loss
• zi f(xi) = 1 : xi is on the margin, no loss
• zi f(xi) < 1 : xi is inside the margin, or on the wrong side of the hyperplane, and contributes to the loss
SVM: Hinge Loss
• SVM uses the hinge loss per sample xi:
   L(xi) = max(0, 1 - zi f(xi))
• Hinge loss encourages classification with a margin of 1
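• A one-line sketch of the per-sample hinge loss (illustrative values):

```python
# per-sample hinge loss; f_xi = w^t x_i + w0
def hinge_loss(z_i, f_xi):
    return max(0.0, 1.0 - z_i * f_xi)

print(hinge_loss(+1, 2.0), hinge_loss(+1, 0.5), hinge_loss(-1, 0.5))  # 0.0 0.5 1.5
```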
SVM: Hinge Loss
• Can optimize with gradient descent, since J is a convex function:
   J(w) = (1/2) ||w||^2 + β Σi max(0, 1 - zi f(xi)),  with f(xi) = w^t xi + w0
• Gradient for a single sample xi:
   ∂J/∂w = w - β zi xi   if zi f(xi) < 1
   ∂J/∂w = w             otherwise
• Gradient descent, single sample:
   w = w - α (w - β zi xi)   if zi f(xi) < 1
   w = w - α w               otherwise
SVM Summary
• Advantages:
   • nice theory
   • good generalization properties
   • the objective function has no local minima
   • can be used to find non linear discriminant functions
   • often works well in practice, even without a lot of training data
• Disadvantages:
   • tends to be slower than other methods
   • quadratic programming is computationally expensive