Algorithmic stability
Lecture of the course 'Foundations of Machine learning'
Y. Ryeznik, J. Vaicenavicius, T. Wiklund
Department of Mathematics, Uppsala University
October 7, 2015
Stability-based generalisation error bound
Notation
- Let X denote the set of examples and Y the set of target values.
- Let z = (x, y) ∈ X × Y.
- Let h ∈ H := {h : X → Y′} (Y′ might differ from Y).
- Let L : Y′ × Y → R_+ be a loss function and L_z(h) := L(h(x), y).
- Given a sample S = (z_1, …, z_m), the empirical error is

  R̂_S(h) := (1/m) ∑_{i=1}^m L_{z_i}(h).

- Let D denote the distribution from which the examples are drawn. Then the generalisation error is

  R(h) := E_{z∼D}[L_z(h)].

- Given an algorithm A and a sample S, let h_S ∈ H be the hypothesis returned by A.
Stability-based generalisation guarantee
Definition
A loss function L is said to be bounded by M ≥ 0 if

∀h ∈ H ∀z ∈ X × Y : L_z(h) ≤ M.
Definition (Uniform stability)
Let β ≥ 0. A learning algorithm A is called β-stable if for all training samples S and S′ of size m differing by a single point,

∀z ∈ X × Y : |L_z(h_S) − L_z(h_{S′})| ≤ β.
Theorem (Generalisation error bound from a training sample)
Let the loss function L be bounded by some M ≥ 0, let A be a β-stable learning algorithm, and let S be a sample of m points drawn i.i.d. from distribution D. Then with probability no less than 1 − δ,

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M) √(log(1/δ) / (2m)).
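To get a feel for the magnitude of this bound, one can evaluate its slack term numerically. The sketch below (in Python, with illustrative values for β, M, m and δ that are not from the lecture) computes β + (2mβ + M)√(log(1/δ)/(2m)); note the bound is only informative when β decreases roughly like 1/m.

```python
import math

def stability_slack(beta, M, m, delta):
    """Slack term of the stability-based bound:
    beta + (2*m*beta + M) * sqrt(log(1/delta) / (2*m))."""
    return beta + (2 * m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

# Illustrative (made-up) values: a hypothetical O(1/m)-stable algorithm.
for m in (100, 1_000, 10_000):
    beta = 1.0 / m
    print(m, stability_slack(beta, M=1.0, m=m, delta=0.05))
```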
Proof of generalisation error bound
- Define Φ(S) := R(h_S) − R̂_S(h_S), where S is a sample.
- Let S = (z_1, …, z_m) and S′ = (z_1, …, z_{m−1}, z′_m) be two samples of size m differing by a single point.
- A straightforward calculation using the boundedness of L and the β-stability of A yields

  |Φ(S) − Φ(S′)| ≤ 2β + M/m.

- Hence we can apply McDiarmid's inequality, which gives that

  Φ(S) ≤ E_{S∼D^m}[Φ(S)] + (2mβ + M) √(log(1/δ) / (2m))

  occurs with probability at least 1 − δ.
Proof cont.
It is straightforward to see that

E_{S∼D^m}[R(h_S)] = E_{S∼D^m}[E_{z∼D}[L_z(h_S)]] = E_{S,z∼D^{m+1}}[L_z(h_S)],
E_{S∼D^m}[R̂_S(h_S)] = E_{(z_1,…,z_m,z)∼D^{m+1}}[L_z(h_{(z_1,…,z_{m−1},z)})].

Then the absolute value of the expectation term is bounded:

|E_{S∼D^m}[Φ(S)]| = |E_{(z_1,…,z_m,z)∼D^{m+1}}[L_z(h_S)] − E_{(z_1,…,z_m,z)∼D^{m+1}}[L_z(h_{(z_1,…,z_{m−1},z)})]|
                  ≤ E_{(z_1,…,z_m,z)∼D^{m+1}}[|L_z(h_S) − L_z(h_{(z_1,…,z_{m−1},z)})|]
                  ≤ E_{(z_1,…,z_m,z)∼D^{m+1}}[β]
                  = β,

which finishes the proof of the claim.
Kernel methods
Non-linear separation
- In most problems linear separation is not possible.
- Maybe linear separation in a bigger space H after a non-linear transformation Φ : X → H is still possible?
Positive definite symmetric kernels
Definition
- A function K : X × X → R is called a kernel over X.
- It is said to be positive definite symmetric (PDS) if for any {x_1, …, x_m} ⊂ X, the matrix K := [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semidefinite (symmetric non-negative definite).
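As a quick illustration of what the PDS condition means computationally (my own sketch, with arbitrary sample points), one can form the Gram matrix of a candidate kernel on a finite set and check that its smallest eigenvalue is non-negative up to numerical error.

```python
import numpy as np

def gram_matrix(kernel, xs):
    """Gram matrix [K(x_i, x_j)]_ij for a list of points."""
    return np.array([[kernel(a, b) for b in xs] for a in xs])

rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(5)]

linear = lambda a, b: float(a @ b)                      # PDS kernel
neg_dist = lambda a, b: -float(np.linalg.norm(a - b))   # symmetric, but not PDS

for k in (linear, neg_dist):
    K = gram_matrix(k, xs)
    print(np.min(np.linalg.eigvalsh(K)))   # >= -1e-10 only for the PDS kernel
```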
Polynomial kernel
Definition
The kernel K : R^N × R^N → R defined by

K(x, x′) := (x · x′ + c)^d   (c > 0, d ∈ N)

is called the polynomial kernel of degree d.
Example (XOR classification)
Figure: Polynomial kernel with c = 1 used.
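The XOR example can be reproduced numerically. The four XOR points are not linearly separable in R², but for the degree-2 polynomial kernel with c = 1 one can write out an explicit feature map, and the x₁x₂ coordinate alone separates the classes. The sketch below is my own illustration, not code from the lecture.

```python
import numpy as np

def poly2_features(x):
    """Explicit feature map for the degree-2 polynomial kernel (x.x' + 1)^2 on R^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

# XOR-style data: opposite corners share a label.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])

# Sanity check: the explicit features reproduce the kernel matrix.
K = (X @ X.T + 1) ** 2
Phi = np.array([poly2_features(x) for x in X])
assert np.allclose(K, Phi @ Phi.T)

# The sqrt(2)*x1*x2 coordinate separates the two classes linearly.
print(Phi[:, 5] * y)   # all entries positive => separable in feature space
```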
Reproducing Kernel Hilbert Space
Theorem
Let K : X × X → R be a PDS kernel. Then there exists a Hilbert space H and a map Φ : X → H such that

K(x, y) = ⟨Φ(x), Φ(y)⟩

for all x, y ∈ X.

- The space H is called a feature space associated to K.
- The map Φ is called a feature map.
Proof
- Define Φ : X → R^X by

  Φ(x)(y) = K(x, y)   (x, y ∈ X).

- Let H_0 := Span({Φ(x) : x ∈ X}).
- Define a map ⟨·, ·⟩ : H_0 × H_0 → R by specifying that for any

  f = ∑_{i∈I} a_i Φ(x_i),   g = ∑_{j∈J} b_j Φ(y_j)   in H_0,

  the value

  ⟨f, g⟩ := ∑_{i∈I, j∈J} a_i b_j K(x_i, y_j) = ∑_{j∈J} b_j f(y_j) = ∑_{i∈I} a_i g(x_i).

- From the above, the map ⟨·, ·⟩ is well-defined, symmetric, and bilinear.
Proof cont.
- ⟨·, ·⟩ is positive semidefinite.
  Proof: ⟨f, f⟩ = ∑_{i,j∈I} a_i a_j K(x_i, x_j) ≥ 0.
- ⟨·, ·⟩ is a PDS kernel on H_0.
  Proof: for any f_1, …, f_m ∈ H_0 and any c_1, …, c_m ∈ R, we have

  ∑_{1≤i,j≤m} c_i c_j ⟨f_i, f_j⟩ = ⟨∑_{i=1}^m c_i f_i, ∑_{j=1}^m c_j f_j⟩ ≥ 0.

- Cauchy-Schwarz for PDS kernels: if K is a PDS kernel, then

  K(x, y)² ≤ K(x, x) K(y, y)

  for all x, y ∈ X.
  Proof: the 2×2 matrix [K(x_i, x_j)]_{1≤i,j≤2} with x_1 = x, x_2 = y is symmetric positive semidefinite, so its determinant K(x, x)K(y, y) − K(x, y)² is non-negative.
Proof cont.
- By Cauchy-Schwarz for PDS kernels,

  ⟨f, Φ(x)⟩² ≤ ⟨f, f⟩ ⟨Φ(x), Φ(x)⟩.

- From the definition of ⟨·, ·⟩, we have

  f(x) = ⟨f, Φ(x)⟩   (reproducing property)

  for all f ∈ H_0 and all x ∈ X.
- Combining the two above,

  f(x)² ≤ ⟨f, f⟩ ⟨Φ(x), Φ(x)⟩

  for all x ∈ X. Consequently, ⟨·, ·⟩ is definite: ⟨f, f⟩ = 0 forces f = 0.
- Also, for any fixed x ∈ X, the map f ↦ ⟨f, Φ(x)⟩ is Lipschitz (by Cauchy-Schwarz) and so continuous.
- Finally, we define H := H̄_0 (the completion of H_0) and call it the reproducing kernel Hilbert space associated to K.
Q.E.D.
Normalised kernels
Definition
Given a kernel K, we define the normalised kernel K′ by

K′(x, x′) = 0 if K(x, x) = 0 or K(x′, x′) = 0,
K′(x, x′) = K(x, x′) / √(K(x, x) K(x′, x′)) otherwise,

for all x, x′ ∈ X.

- By definition, K′(x, x) = 1 for all x ∈ X with K(x, x) ≠ 0.

Lemma
If K is a PDS kernel, then its normalised kernel K′ is also PDS.
- Let K′ be the normalised kernel of a PDS kernel K with feature map Φ. Then

  K′(x, x′) = ⟨Φ(x), Φ(x′)⟩ / (‖Φ(x)‖ ‖Φ(x′)‖)

  can be interpreted as the cosine of the angle between the feature vectors Φ(x) and Φ(x′) in a feature Hilbert space H.
PDS kernels - closure properties
Theorem
PDS kernels are closed under:

- sum,
- product,
- tensor product,
- pointwise limit,
- composition with a power series x ↦ ∑_{n=0}^∞ a_n x^n with a_n ≥ 0 for all n ∈ N.
Corollary
The Gaussian kernel K : R^N × R^N → R defined by

K(x, x′) := exp(−‖x′ − x‖² / (2σ²)),

where x, x′ ∈ R^N, is PDS.
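One way to see this corollary via the closure properties and normalisation: exp(⟨x, x′⟩/σ²) is PDS (a power series with non-negative coefficients applied to a scaled inner-product kernel), and the Gaussian kernel is exactly its normalised kernel. The sketch below (illustrative points and an arbitrary σ) checks the identity numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))
sigma = 1.7   # arbitrary bandwidth

# Exponential of the scaled inner-product kernel (PDS by power-series closure).
K_exp = np.exp(X @ X.T / sigma ** 2)

# Its normalised kernel ...
d = np.sqrt(np.diag(K_exp))
K_norm = K_exp / np.outer(d, d)

# ... coincides with the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2)).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_gauss = np.exp(-sq_dists / (2 * sigma ** 2))

print(np.allclose(K_norm, K_gauss))                     # True
print(np.min(np.linalg.eigvalsh(K_gauss)) >= -1e-10)    # PSD up to numerical error
```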
Regularised kernel methods
Hypothesis space equal to H for some RKHS associated to a kernel

K : X × X → R,

regularised estimator (on a sample S = ((x_1, y_1), …, (x_m, y_m)))

h_S = argmin_h F_S(h),   where F_S(h) := R̂_S(h) + λ‖h‖².

Recall the empirical risk (the average badness of the choices made by h):

R̂_S(h) = (1/m) ∑_{i=1}^m L(h(x_i), y_i).
Theorem (Stability of regularised kernel methods)
If L is convex, differentiable and (in its first argument) σ-Lipschitz, i.e.

|L(v, y) − L(v′, y)| ≤ σ|v − v′|,

and the kernel K is bounded, i.e. for some r > 0

K(x, x) ≤ r²,

then the regularised estimator

h_S = argmin_h F_S(h)

is β-stable for β = σ²r²/(mλ).
Recall β-stability
For two samples

S = ((x_1, y_1), …, (x_m, y_m))

and

S′ = ((x′_1, y′_1), …, (x′_m, y′_m))

such that (x_k, y_k) ≠ (x′_k, y′_k) for at most one k, we must show that

|L(h_S(x), y) − L(h_{S′}(x), y)| ≤ β.
This one weird trick... (two really)

- Translate the minimisation property of h_S and h_{S′} using convexity.
- Bound the pointwise difference |h(x) − h′(x)| by the H-norm ‖h − h′‖.
Step One
Observe that h_S and h_{S′} are minima of the differentiable functions F_S and F_{S′}, so

∇F_S(h_S) = 0   and   ∇F_{S′}(h_{S′}) = 0.
Convexity and hyperplanes

Figure: a convex function G(h) lies above its tangent hyperplane at h_0.

For a convex, differentiable G and any h_0, h,

G(h_0) + ⟨∇G(h_0), h − h_0⟩ ≤ G(h),

equivalently

0 ≤ G(h) − G(h_0) − ⟨∇G(h_0), h − h_0⟩   (sometimes called the Bregman divergence).

Convexity and hyperplanes (for the risk!)

Applied to the empirical risk:

0 ≤ R̂_S(h) − R̂_S(h_0) − ⟨∇R̂_S(h_0), h − h_0⟩.
Simple RKHS property
|h(x)| = |⟨h, K(x, ·)⟩|                    (reproducing property)
       ≤ ‖h‖ ‖K(x, ·)‖                    (Cauchy-Schwarz-Bunyakovsky)
       = ‖h‖ √⟨K(x, ·), K(x, ·)⟩
       = ‖h‖ √K(x, x)                      (reproducing property)
       ≤ r ‖h‖.
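For h = ∑_i a_i K(x_i, ·) in the span of the kernel sections we have ‖h‖² = aᵀKa and h(x) = ∑_i a_i K(x_i, x), so the chain above can be checked numerically. The sketch below uses a Gaussian kernel, for which K(x, x) = 1 and hence r = 1; points and coefficients are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))   # K(x, x) = 1, so r = 1

Xpts = rng.standard_normal((8, 2))
a = rng.standard_normal(8)                      # h = sum_i a_i K(x_i, .)

K = np.array([[gauss(p, q) for q in Xpts] for p in Xpts])
h_norm = np.sqrt(a @ K @ a)                     # RKHS norm ||h||

for _ in range(5):
    x = rng.standard_normal(2)
    h_x = sum(a[i] * gauss(Xpts[i], x) for i in range(8))
    print(abs(h_x) <= h_norm + 1e-10)           # |h(x)| <= r*||h|| with r = 1: always True
```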
Back to β-stability
Let S and S′ be samples differing in one point, and let x and y be arbitrary. Then

|L(h_S(x), y) − L(h_{S′}(x), y)| ≤ σ|h_S(x) − h_{S′}(x)|     (σ-Lipschitz)
                                 = σ|(h_S − h_{S′})(x)|
                                 ≤ σr ‖h_S − h_{S′}‖,

so β-stability will follow from a bound on ‖h_S − h_{S′}‖.
Rewrite in terms of derivative
Denote N(h) = ‖h‖², so that ∇N(h) = 2h and F_S = R̂_S + λN. Then

‖h_S − h_{S′}‖² = ⟨h_S − h_{S′}, h_S − h_{S′}⟩
               = −⟨h_S, h_{S′} − h_S⟩ − ⟨h_{S′}, h_S − h_{S′}⟩
               = (1/2)(−⟨2h_S, h_{S′} − h_S⟩ − ⟨2h_{S′}, h_S − h_{S′}⟩)
               = (1/2)(−⟨∇N(h_S), h_{S′} − h_S⟩ − ⟨∇N(h_{S′}), h_S − h_{S′}⟩)
               = (1/(2λ))(−⟨∇λN(h_S), h_{S′} − h_S⟩ − ⟨∇λN(h_{S′}), h_S − h_{S′}⟩).
Rewrite in terms of derivative
For the first of these two terms,

−⟨∇λN(h_S), h_{S′} − h_S⟩ ≤ [R̂_S(h_{S′}) − R̂_S(h_S) − ⟨∇R̂_S(h_S), h_{S′} − h_S⟩]   (≥ 0 by convexity)
                              − ⟨∇λN(h_S), h_{S′} − h_S⟩
                            = R̂_S(h_{S′}) − R̂_S(h_S) − ⟨∇(R̂_S + λN)(h_S), h_{S′} − h_S⟩
                            = R̂_S(h_{S′}) − R̂_S(h_S) − ⟨∇F_S(h_S), h_{S′} − h_S⟩
                            = R̂_S(h_{S′}) − R̂_S(h_S),

using ∇F_S(h_S) = 0 in the last step; the second term is bounded analogously with S and S′ swapped.
Rewrite in terms of derivative
Combining,

‖h_S − h_{S′}‖² = (1/(2λ))(−⟨∇λN(h_S), h_{S′} − h_S⟩ − ⟨∇λN(h_{S′}), h_S − h_{S′}⟩)
               ≤ (1/(2λ))(R̂_S(h_{S′}) − R̂_S(h_S) + R̂_{S′}(h_S) − R̂_{S′}(h_{S′}))
               = (1/(2λ))(R̂_S(h_{S′}) − R̂_{S′}(h_{S′}) + R̂_{S′}(h_S) − R̂_S(h_S)).
Use that S ≈ S′

Using that S and S′ agree except at the k-th point,

R̂_S(h_{S′}) − R̂_{S′}(h_{S′})
  = (1/m) ∑_i L(h_{S′}(x_i), y_i) − (1/m) ∑_i L(h_{S′}(x′_i), y′_i)
  = (1/m) ∑_{i≠k} (L(h_{S′}(x_i), y_i) − L(h_{S′}(x′_i), y′_i)) + (1/m)(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k))
  = (1/m) ∑_{i≠k} (L(h_{S′}(x_i), y_i) − L(h_{S′}(x_i), y_i)) + (1/m)(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k))
  = (1/m)(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k)).
Put some finishing touches
‖h_S − h_{S′}‖² ≤ (1/(2λ))(R̂_S(h_{S′}) − R̂_{S′}(h_{S′}) + R̂_{S′}(h_S) − R̂_S(h_S))
              = (1/(2λm))(L(h_{S′}(x_k), y_k) − L(h_{S′}(x′_k), y′_k) + L(h_S(x′_k), y′_k) − L(h_S(x_k), y_k))
              = (1/(2λm))(L(h_{S′}(x_k), y_k) − L(h_S(x_k), y_k) + L(h_S(x′_k), y′_k) − L(h_{S′}(x′_k), y′_k))
              ≤ (1/(2λm))(σr‖h_{S′} − h_S‖ + σr‖h_S − h_{S′}‖)
              = (σr/(λm)) ‖h_S − h_{S′}‖.

Cancelling one factor of ‖h_S − h_{S′}‖ (the inequality is trivial if it is zero) gives

‖h_S − h_{S′}‖ ≤ σr/(λm).
Finally combine
|L(h_S(x), y) − L(h_{S′}(x), y)| ≤ σr‖h_S − h_{S′}‖ ≤ σr · σr/(λm) = σ²r²/(mλ),

so the regularised estimator is β-stable with β = σ²r²/(mλ).
Theorem (Stability of regularised kernel methods)
If L is convex, differentiable and (in its first argument) σ-Lipschitz, i.e.

|L(v, y) − L(v′, y)| ≤ σ|v − v′|,

and the kernel K is bounded, i.e. for some r > 0

K(x, x) ≤ r²,

then the regularised estimator

h_S = argmin_h F_S(h)

is β-stable for β = σ²r²/(mλ).
Differentiability
We only required supporting hyperplanes with

- ∇F_S(h_S) = ∇F_{S′}(h_{S′}) = 0,
- ∇F_S(h_S) = ∇R̂_S(h_S) + ∇λN(h_S).

So it is sufficient to pick subdifferentials (subgradients) instead of gradients!
Lipschitz
The Lipschitz condition is only required on elements of the type h(x), h′(x) for h, h′ ∈ H, i.e. it is sufficient that L satisfies

|L(h(x), y) − L(h′(x), y)| ≤ σ|h(x) − h′(x)|.

Such an L is said to be σ-admissible with respect to H.
Final observation (Lemma 11.1)
Recall that we showed |h(x)| ≤ r‖h‖ when K is bounded. If also sup_y L(0, y) = B < ∞, then

λ‖h_S‖² ≤ R̂_S(h_S) + λ‖h_S‖² = min_h (R̂_S(h) + λ‖h‖²) ≤ R̂_S(0) + λ‖0‖² ≤ B,

so

|h_S(x)| ≤ r‖h_S‖ ≤ r√(B/λ).
Applications
Application to classification algorithm: SVMs
Standard hinge loss Lhinge : R × {−1, +1} → R:

Lhinge(y′, y) = { 0,        if 1 − yy′ ≤ 0,
                { 1 − yy′,  otherwise.
Stability-based learning bound for SVMs

- Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0.
- Let h_S denote the hypothesis returned by SVMs when trained on an i.i.d. sample S of size m.

Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h_S) ≤ R̂_S(h_S) + r²/(mλ) + (2r²/λ + r/√λ + 1) √(log(1/δ) / (2m)).
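The rate is easiest to appreciate on concrete numbers. The helper below just transcribes the right-hand side's slack for made-up values of r, λ, m and δ (not values from the lecture).

```python
import math

def svm_stability_slack(r, lam, m, delta):
    """r^2/(m*lam) + (2*r^2/lam + r/sqrt(lam) + 1) * sqrt(log(1/delta)/(2*m))."""
    conf = math.sqrt(math.log(1 / delta) / (2 * m))
    return r ** 2 / (m * lam) + (2 * r ** 2 / lam + r / math.sqrt(lam) + 1) * conf

# Illustrative values only: r = 1 (e.g. a normalised kernel), delta = 0.05.
for m in (10_000, 100_000, 1_000_000):
    print(m, svm_stability_slack(r=1.0, lam=0.1, m=m, delta=0.05))
```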
Proof
By its definition, Lhinge(·, y) is Lipschitz with constant 1 for any (x, y) ∈ X × Y and h, h′ ∈ H, i.e.

|Lhinge(h(x), y) − Lhinge(h′(x), y)| ≤ |h(x) − h′(x)|,

which means that the hinge loss is σ-admissible with σ = 1.
Proof cont.

- According to the Theorem (Stability of regularised kernel methods),

  β ≤ r²/(mλ).

- Since |L(0, y)| ≤ 1 for any y ∈ Y, Lemma 11.1 gives

  |h_S(x)| ≤ r/√λ   ⇒   Lhinge(h_S(x), y) ≤ M = r/√λ + 1.

Plugging these into the stability-based generalisation bound:

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M) √(log(1/δ) / (2m))
       ≤ R̂_S(h_S) + r²/(mλ) + (2r²/λ + r/√λ + 1) √(log(1/δ) / (2m)).
Stability of linear regression: kernel ridge regression (KRR)
Kernel ridge regression is defined by the minimisation of the following objective function:

min_w F(w) = ∑_{i=1}^m (w · Φ(x_i) − y_i)² + λ‖w‖².
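By the representer theorem the minimiser has the form w = ∑_i α_i Φ(x_i), which turns the objective into ‖Kα − y‖² + λαᵀKα with solution α = (K + λI)⁻¹y. The sketch below (Gaussian kernel and synthetic data chosen purely for illustration) implements this dual solution.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    """Gaussian Gram matrix between the rows of A and the rows of B."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    """Dual solution alpha = (K + lam*I)^{-1} y of min_w sum_i (w.Phi(x_i) - y_i)^2 + lam*||w||^2."""
    K = gauss_gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X, alpha, Xnew, sigma=1.0):
    """h(x) = sum_i alpha_i K(x_i, x)."""
    return gauss_gram(Xnew, X, sigma) @ alpha

# Synthetic 1-d regression data (illustrative only).
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

alpha = krr_fit(X, y, lam=1.0)
Xgrid = np.linspace(-3, 3, 5)[:, None]
print(krr_predict(X, alpha, Xgrid))
```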
Stability of linear regression: kernel ridge regression (KRR)

Square loss L₂ : R × R → R:

L₂(y′, y) = (y′ − y)².
Stability of linear regression: kernel ridge regression (KRR)

Stability-based learning bound for KRR

- Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0 and that L₂(y′, y) is bounded by M ≥ 0.
- Let h_S denote the hypothesis returned by KRR when trained on an i.i.d. sample S of size m.

Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h_S) ≤ R̂_S(h_S) + 4Mr²/(mλ) + (8Mr²/λ + M) √(log(1/δ) / (2m)).
Proof
By its definition, L₂(·, y) is Lipschitz with constant 2√M for any (x, y) ∈ X × Y and h, h′ ∈ H, i.e.

|L₂(h(x), y) − L₂(h′(x), y)| ≤ 2√M |h(x) − h′(x)|,

which means that the square loss is σ-admissible with σ = 2√M.
Proof cont.

- According to the Theorem (Stability of regularised kernel methods),

  β ≤ 4Mr²/(mλ).

Plugging this into the stability-based generalisation bound:

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M) √(log(1/δ) / (2m))
       ≤ R̂_S(h_S) + 4Mr²/(mλ) + (8Mr²/λ + M) √(log(1/δ) / (2m)).
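The β-stability behind this bound can also be probed empirically; the rough sketch below is my own illustration, not part of the lecture. It fits KRR in the averaged-risk form F_S(h) = R̂_S(h) + λ‖h‖² (dual solution α = (K + mλI)⁻¹y) on a sample S and on S′ obtained by replacing one training point, then records the largest change in the pointwise squared loss over a test grid; the theorem predicts an O(1/(mλ)) decay as m grows.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def krr_avg(X, y, lam):
    """Minimiser of (1/m) sum_i (h(x_i) - y_i)^2 + lam*||h||^2 in dual form."""
    m = len(X)
    alpha = np.linalg.solve(gauss_gram(X, X) + m * lam * np.eye(m), y)
    return lambda Z: gauss_gram(Z, X) @ alpha

rng = np.random.default_rng(4)
lam = 0.1
Xtest = np.linspace(-3, 3, 200)[:, None]
ytest = np.sin(Xtest[:, 0])

for m in (50, 200, 800):
    X = rng.uniform(-3, 3, size=(m, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(m)
    Xp, yp = X.copy(), y.copy()
    Xp[0], yp[0] = rng.uniform(-3, 3), 5.0      # replace one training point
    hS, hSp = krr_avg(X, y, lam), krr_avg(Xp, yp, lam)
    loss_diff = np.abs((hS(Xtest) - ytest) ** 2 - (hSp(Xtest) - ytest) ** 2)
    print(m, loss_diff.max())                   # shrinks roughly like 1/m
```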