Algorithmic stability
Lecture of the course 'Foundations of Machine learning'
Y. Ryeznik, J. Vaicenavicius, T. Wiklund
Department of Mathematics, Uppsala University
October 7, 2015
Stability-based generalisation error bound
Notation
- Let X denote the set of examples and Y the set of target values.
- Let z = (x, y) ∈ X × Y.
- Let h ∈ H := {h : X → Y′} (Y′ might differ from Y).
- Let L : Y′ × Y → R_+ be a loss function and L_z(h) := L(h(x), y).
- Given a sample S = (z_1, …, z_m), the empirical error is

  R̂_S(h) := (1/m) ∑_{i=1}^m L_{z_i}(h).

- Let D denote the distribution from which the examples are drawn. Then the generalisation error is

  R(h) := E_{z∼D}[L_z(h)].

- Given an algorithm A and a sample S, let h_S ∈ H be the hypothesis returned by A.
Stability-based generalisation guarantee
Definition
A loss function L is said to be bounded by M ≥ 0 if

∀h ∈ H ∀z ∈ X × Y : L_z(h) ≤ M.
Definition (Uniform stability)
Let β ≥ 0. A learning algorithm A is called β-stable if for all training samples S and S′ of size m differing by a single point,

∀z ∈ X × Y : |L_z(h_S) − L_z(h_{S′})| ≤ β.
Theorem (Generalisation error bound from a training sample)
Let the loss function L be bounded by some M ≥ 0, let A be a β-stable learning algorithm, and let S be a sample of m points drawn i.i.d. from distribution D. Then with probability no less than 1 − δ,

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M) √(log(1/δ) / (2m)).
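To get a feel for the magnitude of this bound, one can evaluate its slack term numerically. The sketch below (in Python, with illustrative values for β, M, m and δ that are not from the lecture) computes β + (2mβ + M)√(log(1/δ)/(2m)); note the bound is only informative when β decreases roughly like 1/m.

```python
import math

def stability_slack(beta, M, m, delta):
    """Slack term of the stability-based bound:
    beta + (2*m*beta + M) * sqrt(log(1/delta) / (2*m))."""
    return beta + (2 * m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

# Illustrative (made-up) values: a hypothetical O(1/m)-stable algorithm.
for m in (100, 1_000, 10_000):
    beta = 1.0 / m
    print(m, stability_slack(beta, M=1.0, m=m, delta=0.05))
```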
Proof of generalisation error bound
- Define Φ(S) := R(h_S) − R̂_S(h_S), where S is a sample.
- Let S = (z_1, …, z_m) and S′ = (z_1, …, z_{m−1}, z′_m) be two samples of size m differing by a single point.
- A straightforward calculation using the boundedness of L and the β-stability of A yields

  |Φ(S) − Φ(S′)| ≤ 2β + M/m.

- Hence we can apply McDiarmid's inequality, which gives that

  Φ(S) ≤ E_{S∼D^m}[Φ(S)] + (2mβ + M) √(log(1/δ) / (2m))

  occurs with probability at least 1 − δ.
Proof cont.
It is straightforward to see that

E_{S∼D^m}[R(h_S)] = E_{S∼D^m}[E_{z∼D}[L_z(h_S)]] = E_{S,z∼D^{m+1}}[L_z(h_S)],
E_{S∼D^m}[R̂_S(h_S)] = E_{(z_1,…,z_m,z)∼D^{m+1}}[L_z(h_{(z_1,…,z_{m−1},z)})].

Then the absolute value of the expectation term is bounded:

|E_{S∼D^m}[Φ(S)]| = |E_{(z_1,…,z_m,z)∼D^{m+1}}[L_z(h_S)] − E_{(z_1,…,z_m,z)∼D^{m+1}}[L_z(h_{(z_1,…,z_{m−1},z)})]|
                  ≤ E_{(z_1,…,z_m,z)∼D^{m+1}}[|L_z(h_S) − L_z(h_{(z_1,…,z_{m−1},z)})|]
                  ≤ E_{(z_1,…,z_m,z)∼D^{m+1}}[β]
                  = β,

which finishes the proof of the claim.
Kernel methods
Non-linear separation
- In most problems linear separation is not possible.
- Maybe linear separation in a bigger space H after a non-linear transformation Φ : X → H is still possible?
Positive definite symmetric kernels
Definition
- A function K : X × X → R is called a kernel over X.
- It is said to be positive definite symmetric (PDS) if for any {x_1, …, x_m} ⊂ X, the matrix K := [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semidefinite (symmetric non-negative definite).
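As a quick illustration of what the PDS condition means computationally (my own sketch, with arbitrary sample points), one can form the Gram matrix of a candidate kernel on a finite set and check that its smallest eigenvalue is non-negative up to numerical error.

```python
import numpy as np

def gram_matrix(kernel, xs):
    """Gram matrix [K(x_i, x_j)]_ij for a list of points."""
    return np.array([[kernel(a, b) for b in xs] for a in xs])

rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(5)]

linear = lambda a, b: float(a @ b)                      # PDS kernel
neg_dist = lambda a, b: -float(np.linalg.norm(a - b))   # symmetric, but not PDS

for k in (linear, neg_dist):
    K = gram_matrix(k, xs)
    print(np.min(np.linalg.eigvalsh(K)))   # >= -1e-10 only for the PDS kernel
```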
Polynomial kernel
Definition
The kernel K : R^N × R^N → R defined by

K(x, x′) := (x · x′ + c)^d   (c > 0, d ∈ N)

is called the polynomial kernel of degree d.
Example (XOR classification)
Figure: Polynomial kernel with c = 1 used.
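The XOR example can be reproduced numerically. The four XOR points are not linearly separable in R², but for the degree-2 polynomial kernel with c = 1 one can write out an explicit feature map, and the x₁x₂ coordinate alone separates the classes. The sketch below is my own illustration, not code from the lecture.

```python
import numpy as np

def poly2_features(x):
    """Explicit feature map for the degree-2 polynomial kernel (x.x' + 1)^2 on R^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

# XOR-style data: opposite corners share a label.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])

# Sanity check: the explicit features reproduce the kernel matrix.
K = (X @ X.T + 1) ** 2
Phi = np.array([poly2_features(x) for x in X])
assert np.allclose(K, Phi @ Phi.T)

# The sqrt(2)*x1*x2 coordinate separates the two classes linearly.
print(Phi[:, 5] * y)   # all entries positive => separable in feature space
```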
Reproducing Kernel Hilbert Space
Theorem
Let K : X × X → R be a PDS kernel. Then there exists a Hilbert space H and a map Φ : X → H such that

K(x, y) = ⟨Φ(x), Φ(y)⟩

for all x, y ∈ X.

- The space H is called a feature space associated to K.
- The map Φ is called a feature map.
Proof
- Define Φ : X → R^X by

  Φ(x)(y) = K(x, y)   (x, y ∈ X).

- Let H_0 := Span({Φ(x) : x ∈ X}).
- Define a map ⟨·, ·⟩ : H_0 × H_0 → R by specifying that for any

  f = ∑_{i∈I} a_i Φ(x_i),   g = ∑_{j∈J} b_j Φ(y_j)   in H_0,

  the value

  ⟨f, g⟩ := ∑_{i∈I, j∈J} a_i b_j K(x_i, y_j) = ∑_{j∈J} b_j f(y_j) = ∑_{i∈I} a_i g(x_i).

- From the above, the map ⟨·, ·⟩ is well-defined, symmetric, and bilinear.
Proof cont.
- ⟨·, ·⟩ is positive semidefinite.
  Proof: ⟨f, f⟩ = ∑_{i,j∈I} a_i a_j K(x_i, x_j) ≥ 0.
- ⟨·, ·⟩ is a PDS kernel on H_0.
  Proof: for any f_1, …, f_m ∈ H_0 and any c_1, …, c_m ∈ R, we have

  ∑_{1≤i,j≤m} c_i c_j ⟨f_i, f_j⟩ = ⟨∑_{i=1}^m c_i f_i, ∑_{j=1}^m c_j f_j⟩ ≥ 0.

- Cauchy-Schwarz for PDS kernels: if K is a PDS kernel, then

  K(x, y)² ≤ K(x, x) K(y, y)

  for all x, y ∈ X.
  Proof: the 2×2 matrix [K(x_i, x_j)]_{1≤i,j≤2} with x_1 = x, x_2 = y is symmetric positive semidefinite, so its determinant K(x, x)K(y, y) − K(x, y)² is non-negative.
Proof cont.
- By Cauchy-Schwarz for PDS kernels,

  ⟨f, Φ(x)⟩² ≤ ⟨f, f⟩ ⟨Φ(x), Φ(x)⟩.

- From the definition of ⟨·, ·⟩, we have

  f(x) = ⟨f, Φ(x)⟩   (reproducing property)

  for all f ∈ H_0 and all x ∈ X.
- Combining the two above,

  f(x)² ≤ ⟨f, f⟩ ⟨Φ(x), Φ(x)⟩

  for all x ∈ X. Consequently, ⟨·, ·⟩ is definite: ⟨f, f⟩ = 0 forces f = 0.
- Also, for any fixed x ∈ X, the map f ↦ ⟨f, Φ(x)⟩ is Lipschitz (by Cauchy-Schwarz) and so continuous.
- Finally, we define H := H̄_0 (the completion of H_0) and call it the reproducing kernel Hilbert space associated to K.
Q.E.D.
Normalised kernels
Definition
Given a kernel K, we define the normalised kernel K′ by

K′(x, x′) = 0 if K(x, x) = 0 or K(x′, x′) = 0,
K′(x, x′) = K(x, x′) / √(K(x, x) K(x′, x′)) otherwise,

for all x, x′ ∈ X.

- By definition, K′(x, x) = 1 for all x ∈ X with K(x, x) ≠ 0.

Lemma
If K is a PDS kernel, then its normalised kernel K′ is also PDS.
- Let K′ be the normalised kernel of a PDS kernel K with feature map Φ. Then

  K′(x, x′) = ⟨Φ(x), Φ(x′)⟩ / (‖Φ(x)‖ ‖Φ(x′)‖)

  can be interpreted as the cosine of the angle between the feature vectors Φ(x) and Φ(x′) in a feature Hilbert space H.
PDS kernels - closure properties
Theorem
PDS kernels are closed under:

- sum,
- product,
- tensor product,
- pointwise limit,
- composition with a power series x ↦ ∑_{n=0}^∞ a_n x^n with a_n ≥ 0 for all n ∈ N.
Corollary
The Gaussian kernel K : R^N × R^N → R defined by

K(x, x′) := exp(−‖x′ − x‖² / (2σ²)),

where x, x′ ∈ R^N, is PDS.
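One way to see this corollary via the closure properties and normalisation: exp(⟨x, x′⟩/σ²) is PDS (a power series with non-negative coefficients applied to a scaled inner-product kernel), and the Gaussian kernel is exactly its normalised kernel. The sketch below (illustrative points and an arbitrary σ) checks the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))
sigma = 1.7   # arbitrary bandwidth

# Exponential of the scaled inner-product kernel (PDS by power-series closure).
K_exp = np.exp(X @ X.T / sigma ** 2)

# Its normalised kernel ...
d = np.sqrt(np.diag(K_exp))
K_norm = K_exp / np.outer(d, d)

# ... coincides with the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2)).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists / (2 * sigma ** 2))

print(np.allclose(K_norm, K_gauss))                     # True
print(np.min(np.linalg.eigvalsh(K_gauss)) >= -1e-10)    # PSD up to numerical error
```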
Regularised kernel methods
Hypothesis space equal to H for some RKHS associated to a kernel

K : X × X → R,

regularised estimator (on a sample S = ((x_1, y_1), …, (x_m, y_m)))

h_S = argmin_h F_S(h),   where F_S(h) := R̂_S(h) + λ‖h‖².

Recall the empirical risk (the average badness of the choices made by h):

R̂_S(h) = (1/m) ∑_{i=1}^m L(h(x_i), y_i).
Theorem (Stability of regularised kernel methods)
If L is convex, differentiable and (in its first argument) σ-Lipschitz, i.e.

|L(v, y) − L(v′, y)| ≤ σ|v − v′|,

and the kernel K is bounded, i.e. for some r > 0

K(x, x) ≤ r²,

then the regularised estimator

h_S = argmin_h F_S(h)

is β-stable for β = σ²r²/(mλ).
Recall β-stability
For two samples

S = ((x_1, y_1), …, (x_m, y_m))

and

S′ = ((x′_1, y′_1), …, (x′_m, y′_m))

such that (x_k, y_k) ≠ (x′_k, y′_k) for at most one k, we must show that

|L(h_S(x), y) − L(h_{S′}(x), y)| ≤ β.
This one weird trick... (two really)

- Translate the minimisation property of h_S and h_{S′} using convexity.
- Bound the pointwise difference |h(x) − h′(x)| by the H-norm ‖h − h′‖.
Step One
Observe that h_S and h_{S′} are minima of the differentiable functions F_S and F_{S′}, so

∇F_S(h_S) = 0   and   ∇F_{S′}(h_{S′}) = 0.
Convexity and hyperplanes

Figure: a convex function G(h) lies above its tangent hyperplane at h_0.

For a convex, differentiable G and any h_0, h,

G(h_0) + ⟨∇G(h_0), h − h_0⟩ ≤ G(h),

equivalently

0 ≤ G(h) − G(h_0) − ⟨∇G(h_0), h − h_0⟩   (sometimes called the Bregman divergence).

Convexity and hyperplanes (for the risk!)

Applied to the empirical risk:

0 ≤ R̂_S(h) − R̂_S(h_0) − ⟨∇R̂_S(h_0), h − h_0⟩.
Simple RKHS property
|h(x)| = |⟨h, K(x, ·)⟩|                    (reproducing property)
       ≤ ‖h‖ ‖K(x, ·)‖                    (Cauchy-Schwarz-Bunyakovsky)
       = ‖h‖ √⟨K(x, ·), K(x, ·)⟩
       = ‖h‖ √K(x, x)                      (reproducing property)
       ≤ r ‖h‖.
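For h = ∑_i a_i K(x_i, ·) in the span of the kernel sections we have ‖h‖² = aᵀKa and h(x) = ∑_i a_i K(x_i, x), so the chain above can be checked numerically. The sketch below uses a Gaussian kernel, for which K(x, x) = 1 and hence r = 1; points and coefficients are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))   # K(x, x) = 1, so r = 1

Xpts = rng.standard_normal((8, 2))
a = rng.standard_normal(8)                      # h = sum_i a_i K(x_i, .)

K = np.array([[gauss(p, q) for q in Xpts] for p in Xpts])
h_norm = np.sqrt(a @ K @ a)                     # RKHS norm ||h||

for _ in range(5):
    x = rng.standard_normal(2)
    h_x = sum(a[i] * gauss(Xpts[i], x) for i in range(8))
    print(abs(h_x) <= h_norm + 1e-10)           # |h(x)| <= r*||h|| with r = 1: always True
```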
Back to β-stability
Let S and S′ be samples differing in one point, and let x and y be arbitrary. Then

|L(h_S(x), y) − L(h_{S′}(x), y)| ≤ σ|h_S(x) − h_{S′}(x)|     (σ-Lipschitz)
                                 = σ|(h_S − h_{S′})(x)|
                                 ≤ σr ‖h_S − h_{S′}‖,

so β-stability will follow from a bound on ‖h_S − h_{S′}‖.
Rewrite in terms of derivative
Denote N(h) = ‖h‖², so that ∇N(h) = 2h and F_S = R̂_S + λN. Then

‖h_S − h_{S′}‖² = ⟨h_S − h_{S′}, h_S − h_{S′}⟩
               = −⟨h_S, h_{S′} − h_S⟩ − ⟨h_{S′}, h_S − h_{S′}⟩
               = (1/2)(−⟨2h_S, h_{S′} − h_S⟩ − ⟨2h_{S′}, h_S − h_{S′}⟩)
               = (1/2)(−⟨∇N(h_S), h_{S′} − h_S⟩ − ⟨∇N(h_{S′}), h_S − h_{S′}⟩)
               = (1/(2λ))(−⟨∇λN(h_S), h_{S′} − h_S⟩ − ⟨∇λN(h_{S′}), h_S − h_{S′}⟩).
Rewrite in terms of derivative
For the first of these two terms,

−⟨∇λN(h_S), h_{S′} − h_S⟩ ≤ [R̂_S(h_{S′}) − R̂_S(h_S) − ⟨∇R̂_S(h_S), h_{S′} − h_S⟩]   (≥ 0 by convexity)
                              − ⟨∇λN(h_S), h_{S′} − h_S⟩
                            = R̂_S(h_{S′}) − R̂_S(h_S) − ⟨∇(R̂_S + λN)(h_S), h_{S′} − h_S⟩
                            = R̂_S(h_{S′}) − R̂_S(h_S) − ⟨∇F_S(h_S), h_{S′} − h_S⟩
                            = R̂_S(h_{S′}) − R̂_S(h_S),

using ∇F_S(h_S) = 0 in the last step; the second term is bounded analogously with S and S′ swapped.
Rewrite in terms of derivative
Combining,

‖h_S − h_{S′}‖² = (1/(2λ))(−⟨∇λN(h_S), h_{S′} − h_S⟩ − ⟨∇λN(h_{S′}), h_S − h_{S′}⟩)
               ≤ (1/(2λ))(R̂_S(h_{S′}) − R̂_S(h_S) + R̂_{S′}(h_S) − R̂_{S′}(h_{S′}))
               = (1/(2λ))(R̂_S(h_{S′}) − R̂_{S′}(h_{S′}) + R̂_{S′}(h_S) − R̂_S(h_S)).
Use that S ≈ S′

Using that S and S′ agree except at the k-th point,

R̂_S(h_{S′}) − R̂_{S′}(h_{S′})
  = (1/m) ∑_i L(h_{S′}(x_i), y_i) − (1/m) ∑_i L(h_{S′}(x′_i), y′_i)
  = (1/m) ∑_{i≠k} (L(h_{S′}(x_i), y_i) − L(h_{S′}(x′_i), y′_i)) + (1/m)(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k))
  = (1/m) ∑_{i≠k} (L(h_{S′}(x_i), y_i) − L(h_{S′}(x_i), y_i)) + (1/m)(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k))
  = (1/m)(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k)).
Put some finishing touches
‖h_S − h_{S′}‖² ≤ (1/(2λ))(R̂_S(h_{S′}) − R̂_{S′}(h_{S′}) + R̂_{S′}(h_S) − R̂_S(h_S))
              = (1/(2λm))(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k) + L(h_S(x′_k), y′_k) − L(h_S(x_k), y_k))
              = (1/(2λm))(L(h_{S′}(x_k), y_k) − L(h_S(x_k), y_k) + L(h_S(x′_k), y′_k) − L(h_{S′}(x′_k), y′_k))
              ≤ (1/(2λm))(σr‖h_{S′} − h_S‖ + σr‖h_S − h_{S′}‖)
              = (σr/(λm)) ‖h_S − h_{S′}‖.

Cancelling one factor of ‖h_S − h_{S′}‖ (the inequality is trivial if it is zero) gives

‖h_S − h_{S′}‖ ≤ σr/(λm).
Finally combine
|L(h_S(x), y) − L(h_{S′}(x), y)| ≤ σr‖h_S − h_{S′}‖ ≤ σr · σr/(λm) = σ²r²/(mλ),

so the regularised estimator is β-stable with β = σ²r²/(mλ).
Theorem (Stability of regularised kernel methods)
If L is convex, differentiable and (in its first argument) σ-Lipschitz, i.e.

|L(v, y) − L(v′, y)| ≤ σ|v − v′|,

and the kernel K is bounded, i.e. for some r > 0

K(x, x) ≤ r²,

then the regularised estimator

h_S = argmin_h F_S(h)

is β-stable for β = σ²r²/(mλ).
Differentiability
We only required supporting hyperplanes with

- ∇F_S(h_S) = ∇F_{S′}(h_{S′}) = 0,
- ∇F_S(h_S) = ∇R̂_S(h_S) + ∇λN(h_S).

So it is sufficient to pick subdifferentials (subgradients) instead of gradients!
Lipschitz
The Lipschitz condition is only required on elements of the type h(x), h′(x) for h, h′ ∈ H, i.e. it is sufficient that L satisfies

|L(h(x), y) − L(h′(x), y)| ≤ σ|h(x) − h′(x)|.

Such an L is said to be σ-admissible with respect to H.
Final observation (Lemma 11.1)
Recall that we showed |h(x)| ≤ r‖h‖ when K is bounded. If also sup_y L(0, y) = B < ∞, then

λ‖h_S‖² ≤ R̂_S(h_S) + λ‖h_S‖² = min_h (R̂_S(h) + λ‖h‖²) ≤ R̂_S(0) + λ‖0‖² ≤ B,

so

|h_S(x)| ≤ r‖h_S‖ ≤ r√(B/λ).
Applications
Application to classification algorithm: SVMs
Standard hinge loss Lhinge : R × {−1, +1} → R:

Lhinge(y′, y) = { 0,        if 1 − yy′ ≤ 0,
                { 1 − yy′,  otherwise.
Stability-based learning bound for SVMs

- Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0.
- Let h_S denote the hypothesis returned by SVMs when trained on an i.i.d. sample S of size m.

Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h_S) ≤ R̂_S(h_S) + r²/(mλ) + (2r²/λ + r/√λ + 1) √(log(1/δ) / (2m)).
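The rate is easiest to appreciate on concrete numbers. The helper below just transcribes the right-hand side's slack for made-up values of r, λ, m and δ (not values from the lecture).

```python
import math

def svm_stability_slack(r, lam, m, delta):
    """r^2/(m*lam) + (2*r^2/lam + r/sqrt(lam) + 1) * sqrt(log(1/delta)/(2*m))."""
    conf = math.sqrt(math.log(1 / delta) / (2 * m))
    return r ** 2 / (m * lam) + (2 * r ** 2 / lam + r / math.sqrt(lam) + 1) * conf

# Illustrative values only: r = 1 (e.g. a normalised kernel), delta = 0.05.
for m in (10_000, 100_000, 1_000_000):
    print(m, svm_stability_slack(r=1.0, lam=0.1, m=m, delta=0.05))
```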
Proof
By its definition, Lhinge(·, y) is Lipschitz with constant 1 for any (x, y) ∈ X × Y and h, h′ ∈ H, i.e.

|Lhinge(h(x), y) − Lhinge(h′(x), y)| ≤ |h(x) − h′(x)|,

which means that the hinge loss is σ-admissible with σ = 1.
Proof cont.

- According to the Theorem (Stability of regularised kernel methods),

  β ≤ r²/(mλ).

- Since |L(0, y)| ≤ 1 for any y ∈ Y, Lemma 11.1 gives

  |h_S(x)| ≤ r/√λ   ⇒   Lhinge(h_S(x), y) ≤ M = r/√λ + 1.

Plugging these into the stability-based generalisation bound:

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M) √(log(1/δ) / (2m))
       ≤ R̂_S(h_S) + r²/(mλ) + (2r²/λ + r/√λ + 1) √(log(1/δ) / (2m)).
Stability of linear regression: kernel ridge regression (KRR)
Kernel ridge regression is defined by the minimisation of the following objective function:

min_w F(w) = ∑_{i=1}^m (w · Φ(x_i) − y_i)² + λ‖w‖².
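By the representer theorem the minimiser has the form w = ∑_i α_i Φ(x_i), which turns the objective into ‖Kα − y‖² + λαᵀKα with solution α = (K + λI)⁻¹y. The sketch below (Gaussian kernel and synthetic data chosen purely for illustration) implements this dual solution.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    """Gaussian Gram matrix between the rows of A and the rows of B."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    """Dual solution alpha = (K + lam*I)^{-1} y of min_w sum_i (w.Phi(x_i) - y_i)^2 + lam*||w||^2."""
    K = gauss_gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X, alpha, Xnew, sigma=1.0):
    """h(x) = sum_i alpha_i K(x_i, x)."""
    return gauss_gram(Xnew, X, sigma) @ alpha

# Synthetic 1-d regression data (illustrative only).
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

alpha = krr_fit(X, y, lam=1.0)
Xgrid = np.linspace(-3, 3, 5)[:, None]
print(krr_predict(X, alpha, Xgrid))
```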
Stability of linear regression: kernel ridge regression (KRR)

Square loss L₂ : R × R → R:

L₂(y′, y) = (y′ − y)².
Stability of linear regression: kernel ridge regression (KRR)

Stability-based learning bound for KRR

- Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0 and that L₂(y′, y) is bounded by M ≥ 0.
- Let h_S denote the hypothesis returned by KRR when trained on an i.i.d. sample S of size m.

Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h_S) ≤ R̂_S(h_S) + 4Mr²/(mλ) + (8Mr²/λ + M) √(log(1/δ) / (2m)).
Proof
By its definition, L₂(·, y) is Lipschitz with constant 2√M for any (x, y) ∈ X × Y and h, h′ ∈ H, i.e.

|L₂(h(x), y) − L₂(h′(x), y)| ≤ 2√M |h(x) − h′(x)|,

which means that the square loss is σ-admissible with σ = 2√M.
Proof cont.

- According to the Theorem (Stability of regularised kernel methods),

  β ≤ 4Mr²/(mλ).

Plugging this into the stability-based generalisation bound:

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M) √(log(1/δ) / (2m))
       ≤ R̂_S(h_S) + 4Mr²/(mλ) + (8Mr²/λ + M) √(log(1/δ) / (2m)).
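The β-stability behind this bound can also be probed empirically; the rough sketch below is my own illustration, not part of the lecture. It fits KRR in the averaged-risk form F_S(h) = R̂_S(h) + λ‖h‖² (dual solution α = (K + mλI)⁻¹y) on a sample S and on S′ obtained by replacing one training point, then records the largest change in the pointwise squared loss over a test grid; the theorem predicts an O(1/(mλ)) decay as m grows.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def krr_avg(X, y, lam):
    """Minimiser of (1/m) sum_i (h(x_i) - y_i)^2 + lam*||h||^2 in dual form."""
    m = len(X)
    alpha = np.linalg.solve(gauss_gram(X, X) + m * lam * np.eye(m), y)
    return lambda Z: gauss_gram(Z, X) @ alpha

rng = np.random.default_rng(4)
lam = 0.1
Xtest = np.linspace(-3, 3, 200)[:, None]
ytest = np.sin(Xtest[:, 0])

for m in (50, 200, 800):
    X = rng.uniform(-3, 3, size=(m, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(m)
    Xp, yp = X.copy(), y.copy()
    Xp[0], yp[0] = rng.uniform(-3, 3), 5.0      # replace one training point
    hS, hSp = krr_avg(X, y, lam), krr_avg(Xp, yp, lam)
    loss_diff = np.abs((hS(Xtest) - ytest) ** 2 - (hSp(Xtest) - ytest) ** 2)
    print(m, loss_diff.max())                   # shrinks roughly like 1/m
```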