TRANSCRIPT
Learning Kernels - Tutorial. Part I: Introduction to Kernel Methods.
Corinna Cortes, Google Research
Mehryar Mohri, Courant Institute & Google Research
Afshin Rostami, UC Berkeley
Outline
Part I: Introduction to kernel methods.
Part II: Learning kernel algorithms.
Part III: Theoretical guarantees.
Part IV: Software tools.
Binary Classification Problem
Training data: sample drawn i.i.d. from set $X \subseteq \mathbb{R}^N$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times \{-1, +1\}.$$
Problem: find hypothesis $h\colon X \to \{-1, +1\}$ in $H$ (classifier) with small generalization error $R_D(h)$.
Linear classification:
• Hypotheses based on hyperplanes.
• Linear separation in high-dimensional space.
Linear Separation
Classifiers: $H = \{x \mapsto \operatorname{sgn}(w \cdot x + b) : w \in \mathbb{R}^N, b \in \mathbb{R}\}$.
[Figure: two samples of points separated by hyperplanes $w \cdot x + b = 0$.]
Optimal Hyperplane: Max. Margin
Canonical hyperplane: $w$ and $b$ chosen such that $|w \cdot x + b| = 1$ for the closest points (Vapnik and Chervonenkis, 1964).
Margin: $\rho = \min_{x \in S} \frac{|w \cdot x + b|}{\|w\|} = \frac{1}{\|w\|}.$
[Figure: maximum-margin hyperplane $w \cdot x + b = 0$ with marginal hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$; the margin is the distance from the closest points to the separating hyperplane.]
Optimization Problem
Constrained optimization:
$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1, \; i \in [1, m].$$
Properties:
• Convex optimization (strictly convex).
• Unique solution for linearly separable sample.
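To make the formulation concrete, here is a minimal sketch (not part of the tutorial) that hands the primal problem above to the generic convex solver cvxpy; the toy sample is hypothetical.

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy sample.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin rho = 1/||w|| =", 1 / np.linalg.norm(w.value))
```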
Support Vector Machines
Problem: data often not linearly separable in practice. For any hyperplane, there exists $x_i$ such that
$$y_i [w \cdot x_i + b] \not\geq 1.$$
Idea: relax the constraints using slack variables $\xi_i \geq 0$:
$$y_i [w \cdot x_i + b] \geq 1 - \xi_i.$$
(Cortes and Vapnik, 1995)
Soft-Margin Hyperplanes
Soft margin: $\rho = 1/\|w\|$.
Support vectors: points along the margin or outliers.
[Figure: hyperplane $w \cdot x + b = 0$ with marginal hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$; slack variables $\xi_i$, $\xi_j$ measure the margin violations of two points.]
Optimization Problem
Constrained optimization:
$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1 - \xi_i \wedge \xi_i \geq 0, \; i \in [1, m].$$
Properties:
• $C \geq 0$: trade-off parameter.
• Convex optimization (strictly convex).
• Unique solution.
(Cortes and Vapnik, 1995)
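The same solver handles the soft-margin version; in this sketch (again an illustration, with a hypothetical mislabeled point added to the sample) the slack values identify the margin violators.

```python
import numpy as np
import cvxpy as cp

# Toy sample with one mislabeled point, so it is not linearly separable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0  # trade-off parameter

m = len(y)
w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(m)

# min (1/2)||w||^2 + C sum_i xi_i  s.t.  y_i (w . x_i + b) >= 1 - xi_i, xi_i >= 0.
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("slacks =", np.round(xi.value, 3))  # nonzero only for margin violators
```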
Dual Optimization Problem
Constrained optimization:
$$\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to: } \alpha_i \geq 0 \wedge \sum_{i=1}^{m} \alpha_i y_i = 0, \; i \in [1, m].$$
Solution:
$$h(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b\Big), \quad \text{with } b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i) \text{ for any SV } x_i.$$
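For illustration, the dual is a quadratic program that any general-purpose constrained optimizer can handle; this sketch (not from the tutorial) uses scipy's SLSQP on a hypothetical separable sample and recovers $w$ and $b$ from the dual variables.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy sample.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T    # G_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):                          # negate to maximize via a minimizer
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(m),
               bounds=[(0, None)] * m,                              # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                    # index of a support vector
b = y[sv] - w @ X[sv]
print("alpha =", np.round(alpha, 3), " w =", w, " b =", b)
```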
Kernel Methods
Idea:
• Define $K\colon X \times X \to \mathbb{R}$, called a kernel, such that: $\Phi(x) \cdot \Phi(y) = K(x, y)$.
• $K$ often interpreted as a similarity measure.
Benefits:
• Efficiency: $K$ is often more efficient to compute than $\Phi$ and the dot product.
• Flexibility: $K$ can be chosen arbitrarily so long as the existence of $\Phi$ is guaranteed (Mercer's condition).
Example - Polynomial Kernels
Definition:
$$\forall x, y \in \mathbb{R}^N, \quad K(x, y) = (x \cdot y + c)^d, \quad c > 0.$$
Example: for $N = 2$ and $d = 2$,
$$K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ c \end{bmatrix} \cdot \begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\, y_1 y_2 \\ \sqrt{2c}\, y_1 \\ \sqrt{2c}\, y_2 \\ c \end{bmatrix}.$$
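A quick numerical check (not from the tutorial) that the explicit six-dimensional feature map above does reproduce the kernel value:

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """K(x, y) = (x . y + c)^d, computed directly in the input space."""
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    """Explicit feature map for N = 2, d = 2 (the six components above)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, y))   # 4.0, computed in R^2
print(phi(x) @ phi(y))     # 4.0, computed in R^6
```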
XOR Problem
Use second-degree polynomial kernel with $c = 1$:
[Figure: left, the four points $(1, 1)$, $(-1, -1)$, $(-1, 1)$, $(1, -1)$ in the $(x_1, x_2)$ plane: linearly non-separable. Right, their images $(1, 1, \pm\sqrt{2}, \pm\sqrt{2}, \pm\sqrt{2}, 1)$ plotted in the coordinates $(\sqrt{2}\, x_1 x_2, \sqrt{2}\, x_1)$: linearly separable by $x_1 x_2 = 0$.]
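As a sanity check, an off-the-shelf SVM reproduces this separation; scikit-learn's polynomial kernel is $(\gamma\, x \cdot y + c_0)^d$, so degree 2 with $\gamma = 1$ and $c_0 = 1$ matches the kernel used here. An illustrative sketch, not part of the tutorial:

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR points: not linearly separable in the input space.
X = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]], dtype=float)
y = np.array([1, 1, -1, -1])

# (x . y + 1)^2 kernel; large C approximates the hard-margin solution.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6)
clf.fit(X, y)
print(clf.predict(X))   # [ 1  1 -1 -1]: perfectly separated
```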
Other Standard PDS Kernels
Gaussian kernels:
$$K(x, y) = \exp\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big), \quad \sigma \neq 0.$$
Sigmoid kernels:
$$K(x, y) = \tanh(a(x \cdot y) + b), \quad a, b \geq 0.$$
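Both kernels are one-liners in numpy; a minimal sketch (not from the tutorial), keeping the parameterizations above:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), sigma != 0."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    """K(x, y) = tanh(a (x . y) + b), a, b >= 0."""
    return np.tanh(a * np.dot(x, y) + b)
```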
Consequence: SVMs with PDS Kernels
Constrained optimization:
$$\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to: } 0 \leq \alpha_i \leq C \wedge \sum_{i=1}^{m} \alpha_i y_i = 0, \; i \in [1, m].$$
Solution:
$$h(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\Big), \quad \text{with } b = y_i - \sum_{j=1}^{m} \alpha_j y_j K(x_j, x_i) \text{ for any } x_i \text{ with } 0 < \alpha_i < C.$$
(Boser, Guyon, and Vapnik, 1992)
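In practice, any PDS kernel can be plugged into an off-the-shelf SVM by supplying the Gram matrix directly. A sketch using scikit-learn's precomputed-kernel interface; the data and kernel parameters are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labels

def gram(A, B, c=1.0, d=2):
    """Polynomial PDS kernel K(x, y) = (x . y + c)^d, as a Gram matrix."""
    return (A @ B.T + c) ** d

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram(X, X), y)                       # training needs K(x_i, x_j)

X_test = rng.randn(5, 2)
print(clf.predict(gram(X_test, X)))          # prediction needs K(x, x_i)
```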
SVMs with PDS Kernels
Constrained optimization (matrix form, where $\mathbf{Y} = \operatorname{diag}(y)$):
$$\max_{\alpha} 2\, \alpha^\top \mathbf{1} - \alpha^\top \mathbf{Y}^\top \mathbf{K} \mathbf{Y} \alpha \quad \text{subject to: } 0 \leq \alpha \leq C \wedge \alpha^\top y = 0.$$
Solution:
$$h = \operatorname{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, \cdot) + b\Big), \quad \text{with } b = y_i - (\alpha \circ y)^\top \mathbf{K} e_i \text{ for any } x_i \text{ with } 0 < \alpha_i < C.$$
Regression Problem
Training data: sample drawn i.i.d. from set $X$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times Y,$$
with $Y \subseteq \mathbb{R}$ a measurable subset.
Loss function: $L\colon Y \times Y \to \mathbb{R}_+$ a measure of closeness, typically $L(y, y') = (y' - y)^2$ or $L(y, y') = |y' - y|^p$ for some $p \geq 1$.
Problem: find hypothesis $h\colon X \to \mathbb{R}$ in $H$ with small generalization error with respect to target $f$:
$$R_D(h) = \mathbb{E}_{x \sim D}\big[L\big(h(x), f(x)\big)\big].$$
Kernel Ridge Regression
Optimization problem:
$$\max_{\alpha \in \mathbb{R}^m} -\lambda\, \alpha^\top \alpha + 2\, \alpha^\top y - \alpha^\top \mathbf{K} \alpha, \quad \text{or} \quad \max_{\alpha \in \mathbb{R}^m} -\alpha^\top (\mathbf{K} + \lambda \mathbf{I}) \alpha + 2\, \alpha^\top y.$$
Solution:
$$h(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x), \quad \text{with } \alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} y.$$
(Saunders et al., 1998)
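The closed-form solution $\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} y$ is a few lines of numpy; a sketch assuming a Gaussian kernel and synthetic data (both illustrative choices):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix of the Gaussian kernel between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krr_fit(K, y, lam):
    """Kernel ridge regression: alpha = (K + lambda I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(50)

alpha = krr_fit(gaussian_gram(X, X), y, lam=0.1)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = gaussian_gram(X_test, X) @ alpha    # h(x) = sum_i alpha_i K(x_i, x)
print(np.round(y_pred, 3))
```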
Questions
How should the user choose the kernel?
• Problem similar to that of selecting features for other learning algorithms.
• Poor choice: learning made very difficult.
• Good choice: even poor learners could succeed.
This requirement on the user is thus critical.
• Can this requirement be lessened?
• Is a more automatic selection of features possible?