
Statistical Learning: Algorithms and Theory

Sayan Mukherjee

Contents

Statistical Learning: Algorithms and Theory

Lecture 1. Course preliminaries and overview

Lecture 2. The learning problem
2.1. Key definitions in learning
2.2. Imposing well-posedness
2.3. Voting algorithms

Lecture 3. Reproducing Kernel Hilbert Spaces (rkhs)
3.1. Hilbert Spaces
3.2. Reproducing Kernel Hilbert Spaces (rkhs)
3.3. The rkhs norm, complexity, and smoothness
3.4. Gaussian processes and rkhs

Lecture 4. Three forms of regularization
4.1. A result of the representer theorem
4.2. Equivalence of the three forms

Lecture 5. Algorithms derived from Tikhonov regularization
5.1. Kernel ridge-regression
5.2. Support Vector Machines (SVMs) for classification
5.3. Regularized logistic regression
5.4. Spline models
5.5. Convex optimization

Lecture 6. Voting algorithms
6.1. Boosting
6.2. PAC learning
6.3. The hypothesis boosting problem
6.4. ADAptive BOOSTing (AdaBoost)
6.5. A statistical interpretation of AdaBoost
6.6. A margin interpretation of AdaBoost

Lecture 7. One dimensional concentration inequalities
7.1. Law of Large Numbers
7.2. Polynomial inequalities
7.3. Exponential inequalities
7.4. Martingale inequalities

Lecture 8. Vapnik-Cervonenkis theory
8.1. Uniform law of large numbers
8.2. Generalization bound for one function
8.3. Generalization bound for a finite number of functions
8.4. Generalization bound for compact hypothesis spaces
8.5. Generalization bound for hypothesis spaces of indicator functions
8.6. Kolmogorov chaining
8.7. Covering numbers and VC dimension
8.8. Symmetrization and Rademacher complexities

Lecture 9. Generalization bounds
9.1. Generalization bounds and stability
9.2. Uniform stability of Tikhonov regularization

Bibliography

Stat293 class notes

Statistical Learning: Algorithms and Theory

Sayan Mukherjee

LECTURE 1
Course preliminaries and overview

• Course summary

The problem of supervised learning will be developed in the framework of statistical learning theory. Two classes of machine learning algorithms that have been used successfully in a variety of applications will be studied in depth: regularization algorithms and voting algorithms. Support vector machines (SVMs) are an example of a popular regularization algorithm and AdaBoost is an example of a popular voting algorithm. The course will
(1) introduce these two classes of algorithms
(2) illustrate practical uses of the algorithms via problems in computational biology and computer graphics
(3) state theoretical results on the generalization and consistency of these algorithms.

• Grading

Three problem sets for 50% of the grade. A final project for 50% of the grade. Possible final projects are
(1) application of a learning algorithm to data
(2) critical review of topics in the class or related topics
(3) theoretical analysis of an algorithm

The student can pick a topic or select from predefined topics.

1 Institute of Statistics and Decision Sciences (ISDS) and Institute for Genome Sciences and Policy (IGSP), Duke University, Durham, 27708.
E-mail address: [email protected].

November 18, 2004

These lecture notes borrow heavily from the following courses at M.I.T.: 9.520 and 18.465.

©1993 American Mathematical Society

• Course outline

(1) The supervised learning problem
The problem of supervised learning will be introduced as function approximation given sparse data.
(2) Regularization algorithms
(a) Reproducing Kernel Hilbert Spaces
A function class with some very nice properties.
(b) Regularization methods
(c) Algorithms derived from Tikhonov regularization
(i) Kernel ridge-regression
(ii) Support vector machines (SVMs)
(iii) Regularized logistic regression
(iv) Splines
(d) Optimization
Lagrange multipliers and primal/dual problems.
(3) Voting algorithms
(a) Examples of voting algorithms
(b) Probably approximately correct (PAC) framework
A theoretical framework introduced by Leslie Valiant used extensively in learning theory.
(c) Strong and weak learners and the boosting hypothesis
Do there exist two kinds of algorithms, strong and weak, or are they equivalent in that one can boost weak algorithms into strong ones?
(d) Boosting by majority and adaptive boosting (AdaBoost).
(4) Applications
(a) Computational biology
(i) Classifying tumors using SVMs
Expression data from human tumors will be analyzed using SVMs. Both binary and multiclass problems will be addressed. The challenge here is high dimensionality and lack of samples.
(ii) Predicting genetic regulatory response using boosting
A boosting algorithm is used to predict gene regulatory response using expression and motif data from yeast.
(b) Computer graphics
A trainable videorealistic speech animation system is described that uses Tikhonov regularization to create an animation module. Given video of a pre-determined speech corpus from a human subject, the algorithm is capable of synthesizing the human subject's mouth uttering entirely novel utterances that were not recorded in the original video.
(5) Theory
(a) Generalization and consistency
How accurate will a function learnt from a finite dataset be on future data, and, as the number of data points goes to infinity, will the optimal function in the class be selected?
(b) Technical tools
(i) One-dimensional concentration inequalities
(A) Polynomial inequalities
(B) Exponential inequalities
(C) Martingale inequalities
(ii) Vapnik-Cervonenkis (VC) theory
(A) Covering numbers and VC dimension
(B) Growth functions and metric entropy
(C) Uniform law of large numbers
(D) Kolmogorov chaining and Dudley's entropy integral
(E) Rademacher averages
(c) Generalization bounds
(i) Bounds for Tikhonov regularization using algorithmic stability
(ii) Bounds for empirical risk minimization algorithms using VC theory
(iii) Generalization bounds for boosting


LECTURE 2
The learning problem

In this lecture the (supervised) learning problem is presented as the problem of function approximation from sparse data. The key ideas of loss functions, empirical error and expected error are introduced. The Empirical Risk Minimization (ERM) algorithm is introduced and three key requirements on this algorithm are described: generalization, consistency, and well-posedness. We then mention the regularization principle which ensures that the above conditions are satisfied. We close with a brief introduction to voting algorithms.

2.1. Key definitions in learning

Our dataset consists of two sets of random variables $X \subseteq \mathbb{R}^d$ and $Y \subseteq \mathbb{R}^k$. Our general assumption will be that $X$ is a compact Euclidean space and $Y$ is a closed subset. For most of this class $k = 1$. Given a dataset we would like to learn a function $f : X \to Y$.

For example, the space $X$ can be measurements for a given country:

x = (gdp, polity, infant mortality, population density, estimates of WMD, oil export);

we will focus on two types of spaces $Y$ in these lectures.

(1) Pattern recognition: the space $Y = \{0, 1\}$ corresponds to a binary variable indicating whether the country is bombed.
(2) Regression: the space $Y \subset \mathbb{R}$ corresponds to a real-valued variable indicating how much the country is bombed (we assume there exists negative bombing here).

The dataset $S$ is often called the "training set" and consists of $n$ samples, $(x, y)$ pairs, drawn i.i.d. from a probability distribution $\mu(z)$, a distribution on the product space of $X$ and $Y$ ($Z = X \times Y$):
$$S = \{z_1 = (x_1, y_1), \ldots, z_n = (x_n, y_n)\}.$$
An important concept is the conditional probability of $y$ given $x$, $p(y|x)$:
$$\mu(z) = p(y|x)\, p(x).$$
In the learning problem we assume $\mu(z)$ is fixed but unknown.


2.1.1. Algorithms, hypothesis spaces, and loss functions

Proposition. A learning algorithm A is a map from a data set S to a function f_S:
$$A : S \to f_S.$$

Proposition. A hypothesis space $\mathcal{H}$ is the space of functions that a learning algorithm A "searches" over.

The basic goal of supervised learning is to use the training set S to "learn" a function f_S that, given a value x_new not in the training set, will predict the associated value y_pred:
$$y_{\text{pred}} = f_S(x_{\text{new}}),$$
not only for one value x_new but for a set of these values. A loss function can be used to decide how well a function "fits" a training set or how well it predicts new observations.

Definition. A loss function will be defined as a nonnegative function of two variables $V(f, z) := V(f(x), y)$:
$$V : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+.$$

The loss functions we will encounter will be nondecreasing in either
$$|f(x) - y| \quad\text{or}\quad -y f(x).$$

Examples. The following are loss functions used in general for regression problems (a short numerical sketch follows the list).

(1) Square loss: the most common loss function, going back to Gauss, sometimes called the L2 loss
$$V(f(x), y) = (f(x) - y)^2.$$
(2) Absolute loss: this loss function is less sensitive to outliers than the square loss, sometimes called the L1 loss
$$V(f(x), y) = |f(x) - y|.$$
(3) Huber loss: less sensitive to outliers, like the L1 loss, but everywhere differentiable, unlike the L1 loss
$$V(f(x), y) = \begin{cases} \frac{1}{4\epsilon}(f(x) - y)^2 & \text{if } |f(x) - y| \le 2\epsilon, \\ |f(x) - y| - \epsilon & \text{otherwise.} \end{cases}$$
(4) Vapnik loss: less sensitive to outliers, like the L1 loss, and in particular algorithms using it can lead to sparse solutions
$$V(f(x), y) = \begin{cases} 0 & \text{if } |f(x) - y| \le \epsilon, \\ |f(x) - y| - \epsilon & \text{otherwise.} \end{cases}$$
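A minimal numerical sketch of these four losses, in Python with NumPy; the function names and the default value eps = 0.5 are ours, for illustration only.

    import numpy as np

    # Regression losses as functions of f(x) and y; `eps` is the epsilon
    # parameter of the Huber and Vapnik losses.
    def square_loss(f_x, y):
        return (f_x - y) ** 2

    def absolute_loss(f_x, y):
        return np.abs(f_x - y)

    def huber_loss(f_x, y, eps=0.5):
        r = np.abs(f_x - y)
        return np.where(r <= 2 * eps, r ** 2 / (4 * eps), r - eps)

    def vapnik_loss(f_x, y, eps=0.5):
        r = np.abs(f_x - y)
        return np.where(r <= eps, 0.0, r - eps)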

Examples. The following are loss functions used in general for classification problems (a sketch follows the list).

(1) Indicator loss: the most intuitive loss for binary classification
$$V(f(x), y) = \Theta(-y f(x)) := \begin{cases} 1 & \text{if } y f(x) \le 0, \\ 0 & \text{otherwise.} \end{cases}$$


Figure 1. Four loss functions for regression, plotted against y − f(x): square loss, absolute loss, Huber's loss function, and Vapnik's loss function (both with epsilon = 0.5).

(2) Hinge loss: unlike the indicator loss this loss function is convex and therefore leads to practical algorithms that may have sparse solutions
$$V(f(x), y) = (1 - y f(x))_+ := \begin{cases} 0 & \text{if } y f(x) \ge 1, \\ 1 - y f(x) & \text{otherwise.} \end{cases}$$
(3) Quadratic hinge loss: similar to the hinge loss but the deviation is squared
$$V(f(x), y) = [(1 - y f(x))_+]^2 := \begin{cases} 0 & \text{if } y f(x) \ge 1, \\ (1 - y f(x))^2 & \text{otherwise.} \end{cases}$$
(4) Logistic loss: this is also a convex loss function with the advantage that a probabilistic model can be associated with it
$$V(f(x), y) = \ln\big(1 + e^{-y f(x)}\big).$$
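The classification losses admit the same treatment, as functions of the margin m = y f(x); again the names are ours.

    import numpy as np

    # Classification losses as functions of the margin m = y * f(x).
    def indicator_loss(m):
        return np.where(m <= 0, 1.0, 0.0)

    def hinge_loss(m):
        return np.maximum(0.0, 1.0 - m)

    def quadratic_hinge_loss(m):
        return np.maximum(0.0, 1.0 - m) ** 2

    def logistic_loss(m):
        return np.log(1.0 + np.exp(-m))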


Figure 2. Four loss functions for classification, plotted against the margin y·f(x): indicator loss, hinge loss, square hinge loss, and logistic loss.


2.1.2. Empirical error, expected error, and generalization

Two key concepts in learning theory are the empirical error of a function and the expected error of a function.

Definition. The empirical error of a function f given a loss function V and a training set S of n points is
$$I_S[f] = n^{-1}\sum_{i=1}^n V(f, z_i).$$

Definition. The expected error of a function f given a loss function V and a distribution μ is
$$I[f] = \mathbb{E}_z V(f, z) := \int V(f, z)\, d\mu(z).$$

Note that the expected error can almost never be computed since we almost never know μ(z). However, we often are given S (n points drawn from μ(z)) so we can compute the empirical error. This observation motivates the principle of generalization which is fundamental to learning theory.

Proposition. An algorithm generalizes, A_gen, if its empirical error is close to its expected error:
$$A_{\text{gen}} = \{A : |I[f_S] - I_S[f_S]| \text{ is small}\}.$$

The advantage of algorithms that generalize is that if the empirical error is small then we have faith that the algorithm will be accurate on future observations. However, an algorithm that generalizes but has a large empirical error is not of much use.

2.1.3. Empirical Risk Minimization

A very common algorithm is the empirical risk minimization (ERM) algorithm.

Definition. Given a hypothesis space $\mathcal{H}$, a function f_S is a minimizer of the empirical risk if
$$f_S \in \arg\min_{f\in\mathcal{H}} I_S[f] := \arg\min_{f\in\mathcal{H}} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big].$$

If no minimizer of the empirical risk exists then a variation of the algorithm, almost ERM, is used.

Definition. Given a hypothesis space $\mathcal{H}$, a function f_S is an ε-minimizer of the empirical risk if for any ε > 0
$$I_S[f_S] \le \inf_{f\in\mathcal{H}} I_S[f] + \varepsilon.$$
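A toy sketch of ERM in Python, assuming a finite hypothesis space of constant functions and the square loss; the data and the grid of constants are invented for illustration.

    import numpy as np

    def empirical_error(f, X, y, loss):
        # I_S[f] = n^{-1} sum_i V(f(x_i), y_i)
        return np.mean([loss(f(xi), yi) for xi, yi in zip(X, y)])

    def erm(X, y, hypotheses, loss):
        # return the hypothesis minimizing the empirical error
        errors = [empirical_error(f, X, y, loss) for f in hypotheses]
        return hypotheses[int(np.argmin(errors))]

    X = np.random.rand(20)
    y = 0.3 + 0.05 * np.random.randn(20)
    constants = [lambda x, c=c: c for c in np.linspace(0, 1, 101)]
    f_S = erm(X, y, constants, lambda fx, yi: (fx - yi) ** 2)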

2.1.4. Desired properties of ERM

A natural property for an algorithm such as ERM to have is that as the number of observations increases the algorithm should find the best function in the hypothesis space. This property is called consistency.


Definition. ERM is universally consistent if
$$\forall \varepsilon > 0 \quad \lim_{n\to\infty} \sup_{\mu} \mathbb{P}_S\Big\{ I[f_S] > \inf_{f\in\mathcal{H}} I[f] + \varepsilon \Big\} = 0.$$

One can observe (homework problem) that for ERM universal consistency and distribution independent generalization are equivalent.

Definition. ERM has distribution independent generalization if
$$\forall \varepsilon > 0 \quad \lim_{n\to\infty} \sup_{\mu} \mathbb{P}_S\big\{ |I_S[f_S] - I[f_S]| > \varepsilon \big\} = 0.$$

Another desirable property for ERM is that the mapping defined by ERM (A : S → f_S) be well-posed. The notion of well-posedness goes back to Hadamard, who defined this idea for operators.

Definition. A map is well-posed if its solution
(1) exists
(2) is unique
(3) is stable (the function output varies smoothly with respect to perturbations of the input, the training set S).

The key requirement for ERM is stability, since existence is ensured by the definition of ERM or almost ERM, and uniqueness is defined modulo empirical error: all functions that have the same empirical error are equivalent.

We will see that the requirements of stability and consistency are complementary and indeed equivalent for ERM.

The following two examples illustrate these issues.


Example. This example examines the stability of ERM when different function classes are used. The key point here is that ERM with a less complex or simpler function class is more stable.

The dataset is composed of 10 points and we fit the smoothest interpolating polynomial to this data. We then perturb the data slightly, refit the polynomial, and note how much the function changes. Repeating this procedure with a second degree polynomial results in a much smaller change in the function.

Figure 3. The first figure consists of 10 points sampled from a function. In the second figure we fit the smoothest interpolating polynomial. The third figure displays the original data and the perturbed data. The fourth figure plots the smoothest interpolating polynomial for the two datasets. The fifth figure plots the original dataset with a 2nd order polynomial fit. The sixth figure plots both datasets fit with 2nd order polynomials.
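This perturbation experiment can be sketched in a few lines of Python; np.polyfit's degree-9 fit (which interpolates 10 points) stands in for the smoothest interpolating polynomial, and the perturbation size is our choice.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
    y_pert = y + 0.05 * rng.standard_normal(10)   # perturbed dataset

    grid = np.linspace(0, 1, 200)
    for deg in (9, 2):                            # interpolating vs. 2nd order
        f = np.polyval(np.polyfit(x, y, deg), grid)
        f_pert = np.polyval(np.polyfit(x, y_pert, deg), grid)
        print(deg, np.max(np.abs(f - f_pert)))    # the degree-9 fit moves far more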


Example. This example is an illustration of how well-posedness and generalization are related. The key point here is that when ERM is well-posed the expected error will be smaller, and in the limit this implies consistency.

Ten points are sampled from a sinusoid. We will use two algorithms to fit the data:
(1) A1: ERM using a 7th order polynomial
(2) A2: ERM using a 15th order polynomial.

A1 is well-posed and A2 is not. Note that the expected error of A1 is much smaller than the expected error of A2.

Figure 4. The first figure consists of 10 points sampled from a function. The second figure is the underlying function, a sinusoid sin(2x). The third and fourth figures are a 7th and 15th order polynomial fit to the data.
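A sketch of this experiment, estimating the expected (square-loss) error of each fit on fresh samples from the sinusoid; the noise level is our assumption, and NumPy will warn that the 15th order fit is poorly conditioned.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 5, 10))
    y = np.sin(2 * x) + 0.1 * rng.standard_normal(10)

    x_test = rng.uniform(0, 5, 2000)              # fresh draws from mu
    for deg in (7, 15):
        f = np.polyval(np.polyfit(x, y, deg), x_test)
        print(deg, np.mean((f - np.sin(2 * x_test)) ** 2))  # estimate of I[f]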

2.2. Imposing well-posedness

The previous examples illustrated that ERM by itself is not well-posed. A standard approach to imposing well-posedness on a procedure is via the principle of regularization.

The principle of regularization involves constraining the hypothesis space. Two of the three basic methods of imposing constraints are

(1) Ivanov regularization: directly constrain the hypothesis space
$$\min_{f\in\mathcal{H}} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] \quad\text{subject to}\quad \Omega(f) \le \tau$$


(2) Tikhonov regularization: indirectly constrain the hypothesis space by adding a penalty term
$$\min_{f\in\mathcal{H}} \Big[n^{-1}\sum_{i=1}^n V(f, z_i) + \lambda\,\Omega(f)\Big].$$

The function Ω(f) is called the regularization or smoothness functional and is typically a Hilbert space norm, see lecture 3; the parameters τ and λ are the regularization parameters that control the trade-off between fitting the data and constraining the hypothesis space.

2.3. Voting algorithms

An alternative to imposing well-posedness using constraints is to use algorithms that by construction are well-posed. Voting algorithms implement this approach, see lecture 6.

The idea behind voting algorithms is that we use a weighted combination of very simple functions called weak learners to build a more complicated function
$$f(x) = \sum_{i=1}^p \alpha_i c_i(x) \quad\text{subject to}\quad \sum_{i=1}^p \alpha_i = 1 \ \text{ and } \ \alpha_i \ge 0;$$
the functions c_i(x) are the weak classifiers and the weights α_i can be predetermined or set based upon the observed data.
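A minimal sketch of such a voting function, using decision stumps as the weak classifiers c_i(x); the stumps and weights here are predetermined and purely illustrative.

    import numpy as np

    def stump(j, t):
        # weak classifier: sign of coordinate j thresholded at t
        return lambda X: np.sign(X[:, j] - t)

    stumps = [stump(0, 0.2), stump(0, 0.5), stump(1, -0.1)]
    alpha = np.array([0.5, 0.3, 0.2])             # sums to 1, all nonnegative

    def vote(X):
        votes = np.stack([c(X) for c in stumps])  # p x n matrix of +/-1 votes
        return np.sign(alpha @ votes)             # weighted majority

    print(vote(np.random.randn(5, 2)))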


LECTURE 3
Reproducing Kernel Hilbert Spaces (rkhs)

Reproducing Kernel Hilbert Spaces (rkhs) are hypothesis spaces with some very nice properties. The main property of these spaces is the reproducing property which relates norms in the Hilbert space to linear algebra. This class of functions also has a nice interpretation in the context of Gaussian processes. Thus, they are important for computational, statistical, and functional reasons.

3.1. Hilbert Spaces

A function space is a space of functions where each function can be thought of as a point.

Examples. The following are three examples of function spaces defined on a subset of the real line. In these examples the subset of the real line we consider is x ∈ [a, b], where for example a = 0 and b = 10.

(1) C[a, b] is the set of all real-valued continuous functions on x ∈ [a, b].
y = x^3 is in C[a, b] while y = ⌈x⌉ is not in C[a, b].
(2) L_2[a, b] is the set of all square integrable functions on x ∈ [a, b]. If
$$\Big(\int_a^b |f(x)|^2\, dx\Big)^{1/2} < \infty$$
then f ∈ L_2[a, b].
y = x^3 is in L_2[a, b] and so is y = x^3 + δ(x − c) where a < c < b; however, the second function is not defined at x = c.
(3) L_1[a, b] is the set of all functions whose absolute value is integrable on x ∈ [a, b].
y = x^3 is in L_1[a, b] and so is y = x^3 + δ(x − c) where a < c < b; however, the second function is not defined at x = c.

Definition. A normed vector space is a space F in which a norm is defined. A function ‖·‖ is a norm iff for all f, g ∈ F
(1) ‖f‖ ≥ 0 and ‖f‖ = 0 iff f = 0
(2) ‖f + g‖ ≤ ‖f‖ + ‖g‖
(3) ‖αf‖ = |α| ‖f‖.

Note, if all conditions are satisfied except "‖f‖ = 0 iff f = 0" then the space has a seminorm instead of a norm.


Definition. An inner product space is a linear vector space E in which an inner product is defined. A real valued function ⟨·, ·⟩ is an inner product iff ∀f, g, h ∈ E and α ∈ IR
(1) ⟨f, g⟩ = ⟨g, f⟩
(2) ⟨f + g, h⟩ = ⟨f, h⟩ + ⟨g, h⟩ and ⟨αf, g⟩ = α⟨f, g⟩
(3) ⟨f, f⟩ ≥ 0 and ⟨f, f⟩ = 0 iff f = 0.

Given an inner product space the norm is defined as $\|f\| = \sqrt{\langle f, f\rangle}$ and an angle between vectors can be defined.

Definition. For a normed space A, a subspace B ⊂ A is dense in A iff $A = \bar{B}$, where $\bar{B}$ is the closure of the set B.

Definition. A normed space F is separable iff F has a countable dense subset.

Example. The set of all rational points is dense in the real line and therefore the real line is separable. Note, the set of rational points is countable.

Counterexample. The space of right continuous functions on [0, 1] with the sup norm is not separable. For example, the family of step functions
$$f(x) = U(x - a), \quad a \in [0, 1],$$
cannot be approximated by a countable family of functions in the sup norm, since the jump must occur at a and the set of all a is uncountable.

Definition. A sequence {x_n} in a normed space F is called a Cauchy sequence if $\lim_{n\to\infty} \sup_{m\ge n} \|x_n - x_m\| = 0$.

Definition. A normed space F is called complete iff every Cauchy sequence in it converges.

Definition. A Hilbert space H is an inner product space that is complete, separable, and generally infinite dimensional. A Hilbert space has a countable basis.

Examples. The following are examples of Hilbert spaces.

(1) IR^n is the textbook example of a Hilbert space. Each point in the space x ∈ IR^n can be represented as a vector x = (x_1, ..., x_n) and the metric in this space is $\|x\| = \sqrt{\sum_{i=1}^n |x_i|^2}$. The space has a very natural basis composed of the n basis vectors e_1 = (1, 0, ..., 0), e_2 = (0, 1, ..., 0), ..., e_n = (0, 0, ..., 1). The inner product between a vector x and a basis vector e_i is simply the projection of x onto the ith coordinate, x_i = ⟨x, e_i⟩. Note, this is not an infinite dimensional Hilbert space.
(2) L_2 is also a Hilbert space. This Hilbert space is infinite dimensional.

3.2. Reproducing Kernel Hilbert Spaces (rkhs)

We will use two formulations to describe rkhs. The first is more general, abstract, and elegant; of course it is less intuitive. The second is less general and constructive; of course it is more intuitive.

For the remainder of this lecture we constrain the Hilbert spaces to a compact domain X.


3.2.1. Abstract formulation

Proposition. A linear evaluation function L_t evaluates each function in a Hilbert space f ∈ H at a point t. It associates f ∈ H to a number f(t) ∈ IR, L_t[f] = f(t):
(1) L_t[f + g] = f(t) + g(t)
(2) L_t[af] = a f(t).

Example. The delta function δ(x − t) is a linear evaluation function for C[a, b]:
$$f(t) = \int_a^b f(x)\,\delta(x - t)\, dx.$$

Proposition. A linear evaluation function is bounded if there exists an M such that for all functions in the Hilbert space f ∈ H
$$|L_t[f]| = |f(t)| \le M\|f\|,$$
where ‖f‖ is the Hilbert space norm.

Example. For the space C[a, b] with the sup norm the evaluation function L_t[f] : t → f(t) is bounded, since |f(t)| ≤ ‖f‖_∞ for all functions in C[a, b]; here M = 1. This is due to continuity and compactness of the domain.

Counterexample. For the Hilbert space L_2[a, b] there exists no bounded linear evaluation function. The following function is in L_2[a, b]:
$$y = x^3 + \delta(x - c) \quad\text{where } a < c < b.$$
At the point x = c there is no M such that |f(c)| ≤ M, since the function is evaluated as "∞". This is an example of a function in the space that is not even defined pointwise.

Definition. If a Hilbert space has a bounded linear evaluation function, L_t, then it is a Reproducing Kernel Hilbert Space (rkhs), H_K.

The following property of a rkhs is very important and is a result of the Riesz representation theorem.

Proposition. If H_K is a rkhs then there exists an element K_t in the space with the property that for all f ∈ H_K
$$L_t[f] = \langle K_t, f\rangle = f(t).$$
The inner product is in the rkhs norm and the element K_t is called the representer of evaluation of t.

Remark. The above property is somewhat amazing in that it says if a Hilbert space has a bounded linear evaluation function then there is an element in this space that evaluates all functions in the space by an inner product.
In the space L_2[a, b] we say that the delta function evaluates all functions in L_2[a, b]:
$$L_t[f] = \int_a^b f(x)\,\delta(x - t)\, dx.$$
However, the delta function is not in L_2[a, b].


Definition. The reproducing kernel (rk) is a symmetric real valued function of two variables s, t ∈ X:
$$K(s, t) : X \times X \to \mathbb{R}.$$
In addition K(s, t) must be positive definite, that is, for all real a_1, ..., a_n and t_1, ..., t_n ∈ X
$$\sum_{i,j=1}^n a_i a_j K(t_i, t_j) \ge 0.$$
If the above inequality is strict then K(s, t) is strictly positive definite.

Remark. Instead of characterizing quadratic forms of the function K(s, t) one can characterize the matrix K where K_ij = K(t_i, t_j) and use the notions of positive definite and positive-semidefinite matrices. The terminology for analogous concepts for functions versus matrices is unfortunately very confusing.

There is a deep relation between a rkhs and its reproducing kernel. This is characterized by the following theorem.

Theorem. For every Reproducing Kernel Hilbert Space (rkhs) there exists a unique reproducing kernel; conversely, given a positive definite function K on X × X we can construct a unique rkhs of real valued functions on X with K as its reproducing kernel (rk).

Proof. If H_K is a rkhs then there exists an element in the rkhs that is the representer of evaluation, by the Riesz representation theorem. We define the rk
$$K(s, t) := \langle K_s, K_t\rangle,$$
where K_s and K_t are the representers of evaluation at s and t. The following hold by the properties of Hilbert spaces and the representer property:
$$\Big\|\sum_j a_j K_{t_j}\Big\|^2 \ge 0, \qquad \Big\|\sum_j a_j K_{t_j}\Big\|^2 = \sum_{i,j} a_i a_j \langle K_{t_i}, K_{t_j}\rangle = \sum_{i,j} a_i a_j K(t_i, t_j).$$
Therefore K(s, t) is positive definite.

We now prove the converse. Given a rk K(·, ·) we construct H_K. For each t ∈ X we define the real valued function
$$K_t(\cdot) = K(t, \cdot).$$
We can show that the rkhs is simply the completion of the space of functions spanned by the set {K_{t_i}}:
$$\mathcal{H} = \Big\{ f \,\Big|\, f = \sum_i a_i K_{t_i} \text{ where } a_i \in \mathbb{R},\ t_i \in X,\ i \in \mathbb{N} \Big\}$$
with the following inner product
$$\Big\langle \sum_i a_i K_{t_i},\ \sum_j b_j K_{t_j} \Big\rangle = \sum_{i,j} a_i b_j \langle K_{t_i}, K_{t_j}\rangle = \sum_{i,j} a_i b_j K(t_i, t_j).$$
Since K(·, ·) is positive definite the above inner product is well defined. For any f ∈ H_K we can check that
$$\langle K_t, f\rangle = f(t),$$
because for any function in the above linear space norm convergence implies pointwise convergence:
$$|f_n(t) - f(t)| = |\langle f_n - f, K_t\rangle| \le \|f_n - f\|\,\|K_t\|,$$
the last step being due to Cauchy-Schwarz. Therefore every Cauchy sequence in this space converges pointwise, and the completion is a space of functions.

3.2.2. Constructive formulation

The development of rkhs in this subsection is seen in most formulations of Support Vector Machines (SVMs) and Kernel Machines. It is less general in that it relies on the reproducing kernel being a Mercer Kernel. It however requires less knowledge of functional analysis and is more intuitive for most people.

In this formulation we start with a continuous kernel K : X × X → IR. We define an integral operator L_K : L_2[X] → C[X] by the following integral transform
$$(3.1)\qquad (L_K f)(s) := \int_X K(s, t)\, f(t)\, dt = g(s).$$

If K is positive definite then L_K is positive definite (the converse is also true) and therefore the eigenvalues of (3.1) are nonnegative.

We denote the eigenvalues and eigenfunctions of (3.1) as λ_1, ..., λ_k and φ_1, ..., φ_k respectively, where
$$\int_X K(s, t)\,\varphi_k(t)\, dt = \lambda_k \varphi_k(s) \quad \forall k.$$

We now state Mercer’s theorem.

Theorem. Given the eigenfunctions and eigenvalues of the integral equation defined by a symmetric positive definite kernel K,
$$\int_X K(s, t)\,\varphi(s)\, ds = \lambda\,\varphi(t),$$
the kernel has the expansion
$$K(s, t) = \sum_j \lambda_j\,\varphi_j(s)\,\varphi_j(t),$$
where convergence is in the L_2[X] norm.

We can define the rkhs as the space of functions spanned by the eigenfunctions of the integral operator defined by the kernel
$$\mathcal{H}_K = \Big\{ f \,\Big|\, f(s) = \sum_k c_k \varphi_k(s) \text{ and } \|f\|_{\mathcal{H}_K} < \infty \Big\},$$
where the rkhs norm ‖·‖_{H_K} is defined as follows
$$\|f\|^2_{\mathcal{H}_K} = \Big\langle \sum_j c_j \varphi_j,\ \sum_j c_j \varphi_j \Big\rangle_{\mathcal{H}_K} := \sum_j \frac{c_j^2}{\lambda_j}.$$
Similarly the inner product is defined as follows
$$\langle f, g\rangle_{\mathcal{H}_K} = \Big\langle \sum_j c_j \varphi_j,\ \sum_j d_j \varphi_j \Big\rangle_{\mathcal{H}_K} := \sum_j \frac{c_j d_j}{\lambda_j}.$$

Part of a homework problem will be to prove the representer property
$$\langle f(\cdot), K(\cdot, x)\rangle_{\mathcal{H}_K} = f(x),$$
using Mercer's theorem and the above definition of the rkhs norm. A numerical sketch of this eigenfunction picture follows.

3.2.3. Kernels and feature space

The rkhs concept has been utilized in the SVM and kernel machines literature in what is unfortunately called the kernel trick.

Points in the domain x ∈ X ⊂ IR^d are mapped into a higher dimensional space by the eigenvalues and eigenfunctions of the reproducing kernel (the space is of the dimensionality of the number of nonzero eigenvalues of the integral operator defined by the kernel):
$$x \to \Phi(x) := \big(\sqrt{\lambda_1}\,\varphi_1(x), \ldots, \sqrt{\lambda_k}\,\varphi_k(x)\big).$$
A standard L_2 inner product of two points mapped into the feature space can be evaluated by the kernel, due to Mercer's theorem:
$$K(s, t) = \langle \Phi(s), \Phi(t)\rangle_{L_2}.$$
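A small sketch verifying K(s, t) = ⟨Φ(s), Φ(t)⟩ for a kernel whose feature map can be written out by hand, the homogeneous 2nd-degree polynomial kernel on IR².

    import numpy as np

    def K(s, t):
        return np.dot(s, t) ** 2                  # kernel evaluation

    def Phi(x):
        # explicit feature map for this kernel
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    s, t = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
    print(K(s, t), np.dot(Phi(s), Phi(t)))        # identical values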

3.3. The rkhs norm, complexity, and smoothness

We will measure the complexity of a hypothesis space using the rkhs norm. We restrict our function space to consist of functions f ∈ H_K where
$$\|f\|_{\mathcal{H}_K} \le A.$$

The next two examples illustrate how restricting the rkhs norm corresponds to enforcing some kind of "simplicity" or smoothness of the functions.

Example. Our function space consists of one dimensional lines
$$f(x) = w\,x \quad\text{and}\quad K(s, t) = s\,t.$$
For this kernel
$$\|f\|^2_{\mathcal{H}_K} = \|w\|^2,$$
so our measure of complexity is the slope of the line. Our objective is to find a function with the smallest slope such that
$$y f(x) \ge 1$$
for all observed (x, y) pairs. This can be thought of as separating the samples of the two classes by a margin of 2.

In the three following datasets the slope of the function that separates the two classes by a margin of 2 increases as the points of the opposite class get closer.


Figure 1. Three datasets with the points in the two classes getting nearer. Note that the slope of the separating functions gets steeper as the two classes approach each other.

Example. Consider any kernel of one variable that is continuous, symmetric, and periodic with positive Fourier coefficients over the interval. The rk K(s, t) can be rewritten as K(s − t) = K(z) and can be expanded in a uniformly convergent Fourier series (all normalization factors set to 1):
$$K(z) = \sum_{n=0}^{\infty} \lambda_n \cos(n z),$$
$$K(s - t) = 1 + \sum_{p=1}^{\infty} \lambda_p \sin(p s)\sin(p t) + \sum_{p=1}^{\infty} \lambda_p \cos(p s)\cos(p t),$$
showing that the eigenfunctions of K are
$$(1, \sin(z), \cos(z), \sin(2z), \cos(2z), \ldots, \sin(pz), \cos(pz), \ldots).$$
So the rkhs norm of a function is
$$\|f\|^2_{\mathcal{H}_K} = \sum_{p=0}^{\infty} \frac{\langle f, \cos(pz)\rangle^2 + \langle f, \sin(pz)\rangle^2}{\lambda_p}.$$
For the above rkhs norm to be finite the magnitude of the Fourier coefficients of the function f ∈ H_K has to decay fast enough to counteract the decrease in λ_p.

The figure below illustrates how the smoothness of a function can be characterized by the decay of its Fourier coefficients.

3.4. Gaussian processes and rkhs


Figure 2. The first function is smoother than the second function and the decay of the Fourier coefficients F(w) of the first function is quicker.


LECTURE 4
Three forms of regularization

The three standard approaches to regularization will be described. Setting the regularization functional to a rkhs norm in all three approaches results in the solution having a particular form due to the representer theorem. Lastly the equivalence between the three forms of regularization is specified.

4.1. A result of the representer theorem

The following are the three standard regularization methods for ERM:

(1) Tikhonov regularization: indirectly constrain the hypothesis space by adding a penalty term
$$\min_{f\in\mathcal{H}} \Big[n^{-1}\sum_{i=1}^n V(f, z_i) + \lambda\,\Omega(f)\Big].$$
(2) Ivanov regularization: directly constrain the hypothesis space
$$\min_{f\in\mathcal{H}} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] \quad\text{subject to}\quad \Omega(f) \le \tau.$$
(3) Phillips regularization: directly constrain the hypothesis space
$$\min_{f\in\mathcal{H}} \Omega(f) \quad\text{subject to}\quad \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] \le \kappa.$$

Throughout this text the rkhs norm will be used as the regularization functional
$$\Omega(f) = \|f\|^2_{\mathcal{H}_K}.$$
This defines the following optimization problems we will consider in this text:
$$(P1)\quad \min_{f\in\mathcal{H}} \Big[n^{-1}\sum_{i=1}^n V(f, z_i) + \lambda\|f\|^2_{\mathcal{H}_K}\Big],$$
$$(P2)\quad \min_{f\in\mathcal{H}} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] \quad\text{subject to}\quad \|f\|^2_{\mathcal{H}_K} \le \tau,$$
$$(P3)\quad \min_{f\in\mathcal{H}} \|f\|^2_{\mathcal{H}_K} \quad\text{subject to}\quad \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] \le \kappa.$$


All of the above optimization problems are over spaces that contain an infinite number of functions. Using the formulation in section 3.2.2 we can write any function in the rkhs as
$$\mathcal{H}_K = \Big\{ f \,\Big|\, f(x) = \sum_k c_k \varphi_k(x) \Big\},$$
so the optimization procedure is over the coefficients c_k. The number of nonzero coefficients in the expansion defines the dimensionality of the rkhs and this can be infinite, for example for the Gaussian kernel.

One of the amazing aspects of the above optimization problems is that the minimizer has the form
$$f(x) = \sum_{i=1}^n c_i K(x, x_i).$$
So the optimization procedure is over n real variables. This is formalized in the following "Representer Theorem."

Theorem. Given a set of points (x_i, y_i)_{i=1}^n, a function of the form
$$f(x) = \sum_{i=1}^n c_i K(x, x_i)$$
is a minimizer of the following optimization procedure
$$c\big((f(x_1), y_1), \ldots, (f(x_n), y_n)\big) + \lambda\, g(\|f\|_{\mathcal{H}_K}),$$
where ‖f‖_{H_K} is a rkhs norm, g(·) is monotonically increasing, and c is an arbitrary cost function.

Procedure (P1) is a special case of the optimization procedure stated in the above theorem.

Proof. For ease of notation all norms and inner products in the proof are rkhs norms and inner products.

Assume that the function f has the following form
$$f = \sum_{i=1}^n b_i \varphi(x_i) + v,$$
where
$$\langle \varphi(x_i), v\rangle = 0 \quad \forall i = 1, \ldots, n.$$
The orthogonality condition simply ensures that v is not in the span of {φ(x_i)}_{i=1}^n. So for any point x_j (j = 1, ..., n)
$$f(x_j) = \Big\langle \sum_{i=1}^n b_i \varphi(x_i) + v,\ \varphi(x_j) \Big\rangle = \sum_{i=1}^n b_i \langle \varphi(x_i), \varphi(x_j)\rangle,$$
so v has no effect on the cost function
$$c\big((f(x_1), y_1), \ldots, (f(x_n), y_n)\big).$$
We now look at the rkhs norm:
$$g(\|f\|) = g\Big(\Big\|\sum_{i=1}^n b_i \varphi(x_i) + v\Big\|\Big) = g\Bigg(\sqrt{\Big\|\sum_{i=1}^n b_i \varphi(x_i)\Big\|^2 + \|v\|^2}\Bigg) \ge g\Big(\Big\|\sum_{i=1}^n b_i \varphi(x_i)\Big\|\Big).$$
So the extra component v increases the rkhs norm while having no effect on the cost functional, and therefore must be zero; the function has the form
$$f = \sum_{i=1}^n b_i \varphi(x_i),$$
and by the reproducing property
$$f(x) = \sum_{i=1}^n a_i K(x, x_i).$$

Homework: proving a representer theorem for the other two regularization formulations.

4.2. Equivalence of the three forms

The three forms of regularization have a certain equivalence. The equivalence is that given a set of points (x_i, y_i)_{i=1}^n the parameters λ, τ, and κ can be set such that the same function f(x) minimizes (P1), (P2), and (P3). Given this equivalence and the representer theorem for (P1) it is clear that a representer theorem holds for (P2) and (P3).

Proposition. Given a convex loss function the following optimization procedures are equivalent:
$$(P1)\quad \min_{f\in\mathcal{H}_K} \Big[n^{-1}\sum_{i=1}^n V(f, z_i) + \lambda\|f\|^2_{\mathcal{H}_K}\Big],$$
$$(P2)\quad \min_{f\in\mathcal{H}_K} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] \quad\text{subject to}\quad \|f\|^2_{\mathcal{H}_K} \le \tau,$$
$$(P3)\quad \min_{f\in\mathcal{H}_K} \|f\|^2_{\mathcal{H}_K} \quad\text{subject to}\quad \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] \le \kappa.$$
By equivalent we mean that if f_0(x) is a solution of one of the problems then there exist parameters τ, κ, λ for which f_0(x) is a solution of the others.

Proof. Let f_0 be the solution of (P2) for a fixed τ and assume that the constraint under the optimization is tight ($\|f_0\|^2_{\mathcal{H}_K} = \tau$). Let $\big[n^{-1}\sum_{i=1}^n V(f_0, z_i)\big] = b$. By inspection the solution of (P3) with κ = b will be f_0.

Let f_0 be the solution of (P3) for a fixed κ and assume that the constraint under the optimization is tight ($\big[n^{-1}\sum_{i=1}^n V(f_0, z_i)\big] = \kappa$). Let $\|f_0\|^2_{\mathcal{H}_K} = b$. By inspection the solution of (P2) with τ = b will be f_0.

For both (P2) and (P3) the above argument can be adjusted for the case where the constraints are not tight, but then the solution f_0 is not necessarily unique.

Let f_0 be the solution of (P2) for a fixed τ. Using Lagrange multipliers we can rewrite (P2) as
$$(4.1)\quad \min_{f\in\mathcal{H}_K,\,\alpha} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] + \alpha\big(\|f\|^2_{\mathcal{H}_K} - \tau\big),$$


where α ≥ 0, with the optimal α = α_0. By the Karush-Kuhn-Tucker (KKT) conditions (complementary slackness) at optimality
$$\alpha_0\big(\|f_0\|^2_{\mathcal{H}_K} - \tau\big) = 0.$$
If α_0 = 0 then $\|f_0\|^2_{\mathcal{H}_K} < \tau$ and we can rewrite equation (4.1) as
$$\min_{f\in\mathcal{H}_K} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big],$$
which corresponds to (P1) with λ = 0, and the minimizer is f_0. If α_0 > 0 then $\|f_0\|^2_{\mathcal{H}_K} = \tau$ and we can rewrite equation (4.1) as the following equivalent optimization procedures
$$(P2)\quad \min_{f\in\mathcal{H}_K} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] + \alpha_0\big(\|f\|^2_{\mathcal{H}_K} - \tau\big),$$
$$(P2)\quad \min_{f\in\mathcal{H}_K} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] + \alpha_0\|f\|^2_{\mathcal{H}_K},$$
which corresponds to (P1) with λ = α_0, and the minimizer is f_0.

Let f_0 be the solution of (P3) for a fixed κ. Using Lagrange multipliers we can rewrite (P3) as
$$(4.2)\quad \min_{f\in\mathcal{H}_K,\,\alpha} \|f\|^2_{\mathcal{H}_K} + \alpha\Big(\Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] - \kappa\Big),$$
where α ≥ 0, with the optimal α = α_0. By the KKT conditions at optimality
$$\alpha_0\Big(\Big[n^{-1}\sum_{i=1}^n V(f_0, z_i)\Big] - \kappa\Big) = 0.$$
If α_0 = 0 then $\big[n^{-1}\sum_{i=1}^n V(f_0, z_i)\big] < \kappa$ and we can rewrite equation (4.2) as
$$\min_{f\in\mathcal{H}_K} \|f\|^2_{\mathcal{H}_K},$$
which corresponds to (P1) with λ = ∞, and the minimizer is f_0. If α_0 > 0 then $\big[n^{-1}\sum_{i=1}^n V(f_0, z_i)\big] = \kappa$ and we can rewrite equation (4.2) as the following equivalent optimization procedures
$$(P3)\quad \min_{f\in\mathcal{H}_K} \|f\|^2_{\mathcal{H}_K} + \alpha_0\Big(\Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] - \kappa\Big),$$
$$(P3)\quad \min_{f\in\mathcal{H}_K} \|f\|^2_{\mathcal{H}_K} + \alpha_0\Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big],$$
$$(P3)\quad \min_{f\in\mathcal{H}_K} \Big[n^{-1}\sum_{i=1}^n V(f, z_i)\Big] + \frac{1}{\alpha_0}\|f\|^2_{\mathcal{H}_K},$$
which corresponds to (P1) with λ = 1/α_0, and the minimizer is f_0.


LECTURE 5
Algorithms derived from Tikhonov regularization

From Tikhonov regularization we will derive three popular and extensively used algorithms: kernel ridge-regression, support vector machines, and spline models. For the first two algorithms we will use a RKHS norm as the regularization functional. For the spline models we will use a differential operator as the regularization functional, and close with an open problem regarding the relation between this functional and a RKHS norm.

5.1. Kernel ridge-regression

The Kernel ridge-regression (KRR) algorithm has been invented and reinvented many times and has been called a variety of names such as Regularization networks, Least Squares Support Vector Machine (LSSVM), and Regularized Least Squares Classification (RLSC).

We start with Tikhonov regularization
$$\min_{f\in\mathcal{H}_K} \Big[n^{-1}\sum_{i=1}^n V(f, z_i) + \lambda\,\Omega(f)\Big]$$
and then set the regularization functional to a RKHS norm
$$\Omega(f) = \|f\|^2_{\mathcal{H}_K}$$
and use the square loss functional
$$n^{-1}\sum_{i=1}^n V(f, z_i) = n^{-1}\sum_{i=1}^n (f(x_i) - y_i)^2.$$
The resulting optimization problem is
$$(5.1)\quad \min_{f\in\mathcal{H}_K} \Big[n^{-1}\sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda\|f\|^2_{\mathcal{H}_K}\Big],$$
the minimizer of which we know by the Representer theorem has the form
$$f(x) = \sum_{i=1}^n c_i K(x, x_i).$$


This implies that we only need to solve the optimization problem for the c_i. This turns the problem of optimizing over functions, which may be infinite-dimensional, into a problem of optimizing over n real numbers.

Using the representer theorem we derive the optimization problem actually solved for Kernel ridge-regression.

We first define some notation. We will use the symbol K to refer to either the kernel function K or the n × n matrix K where
$$K_{ij} \equiv K(x_i, x_j).$$
Using this definition the function f(x) evaluated at a training point x_j can be written in matrix notation as
$$f(x_j) = \sum_{i=1}^n K(x_i, x_j)\, c_i = [Kc]_j,$$
where [Kc]_j is the jth element of the vector obtained by multiplying the kernel matrix K with the vector c. In this notation we can rewrite equation (5.1) as
$$\min_{f\in\mathcal{H}_K}\ \frac{1}{n}(Kc - y)^T(Kc - y) + \lambda\|f\|^2_K,$$
where y is the vector of y values. Also by the representer theorem the RKHS norm can be evaluated using linear algebra:
$$\|f\|^2_K = c^T K c,$$
where c^T is the transpose of the vector c. Substituting the above norm into equation (5.1) results in an optimization problem on the vector c:
$$\arg\min_{c\in\mathbb{R}^n} \Big[\, g(c) := \frac{1}{n}(Kc - y)^T(Kc - y) + \lambda c^T K c \,\Big].$$
This is a convex, differentiable function of c, so we can minimize it simply by taking the derivative with respect to c, then setting this derivative to 0:
$$\frac{\partial g(c)}{\partial c} = \frac{2}{n}K(Kc - y) + 2\lambda K c = 0.$$
We show that the solution of the above equation is the following linear system
$$c = (K + \lambda n I)^{-1} y,$$
where I is the identity matrix:
$$\text{differentiation:}\quad 0 = \frac{2}{n}K(Kc - y) + 2\lambda K c$$
$$\text{multiplication:}\quad K(Kc) + \lambda n K c = K y$$
$$\text{"left multiplication by } K^{-1}\text{":}\quad (K + \lambda n I)c = y$$
$$\text{inversion:}\quad c = (K + \lambda n I)^{-1} y.$$
The matrix K + λnI is positive definite and will be well-conditioned if λ is not too small.

A few properties of the linear system are:

(1) The matrix (K + λnI) is guaranteed to be invertible if λ > 0. As λ → 0, the regularized least-squares solution goes to the standard least-squares solution which minimizes the empirical loss. As λ → ∞, the solution goes to f(x) = 0.
(2) In practice, we don't actually invert (K + λnI), but instead use an algorithm for solving linear systems.
(3) In order to use this approach, we need to compute and store the entire kernel matrix K. This makes it impractical for use with very large training sets.

Lastly, there is nothing to stop us from using the above algorithm for classification. By doing so, we are essentially treating our classification problem as a regression problem with y values of 1 or -1.
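A minimal sketch of the full KRR pipeline, under an assumed Gaussian kernel and synthetic data.

    import numpy as np

    def gaussian_kernel(A, B, sigma=0.5):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, (50, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)

    lam, n = 1e-3, len(X)
    K = gaussian_kernel(X, X)
    c = np.linalg.solve(K + lam * n * np.eye(n), y)   # solve, never invert

    X_new = rng.uniform(0, 1, (5, 1))
    y_pred = gaussian_kernel(X_new, X) @ c            # f(x) = sum_i c_i K(x, x_i)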

5.1.1. Solving for c

The conjugate gradient (CG) algorithm is a popular algorithm for solving positive definite linear systems. For the purposes of this class, we need to know that CG is an iterative algorithm. The major operation in CG is multiplying a vector v by the matrix A; note that the matrix A need not always be supplied explicitly, we just need some way to form the product Av.

For ordinary positive semidefinite systems, CG will be competitive with direct methods. CG can be much faster if there is a way to multiply by A quickly.

Example. Suppose our kernel K is linear:
$$K(x, y) = \langle x, y\rangle.$$
Then our solution f can be written as
$$f(x) = \sum_i c_i \langle x_i, x\rangle = \Big\langle \Big(\sum_i c_i x_i\Big),\ x \Big\rangle =: \langle w, x\rangle,$$
and we can apply our function to new examples in time d rather than time nd. This is a general property of Tikhonov regularization with a linear kernel, not related to the use of the square loss.

We can use the CG algorithm to get a huge savings for solving regularized least-squares regression with a linear kernel (K(x_1, x_2) = x_1 · x_2). With an arbitrary kernel, we must form the product Kv explicitly — we multiply a vector by K. With the linear kernel, we note that K = AA^T, where A is a matrix with the data points as row vectors. Using this:
$$(K + \lambda n I)v = (AA^T + \lambda n I)v = A(A^T v) + \lambda n v.$$
Suppose we have n points in d dimensions. Forming the kernel matrix K explicitly takes n^2 d time, and multiplying a vector by K takes n^2 time.

If we use the linear representation, we pay nothing to form the kernel matrix, and multiplying a vector by K takes 2dn time.

If d << n, each iteration is cheaper by approximately a factor of n/(2d). The memory savings are even more important, as we cannot store the kernel matrix at all for large training sets, and if we were to recompute the entries of the kernel matrix as needed, each iteration would cost n^2 d time.

Also note that if the training data are sparse (they consist of a large number of dimensions, but the majority of dimensions for each point are zero), the cost of multiplying a vector by K can be written as 2d̄n, where d̄ is the average number of nonzero entries per data point.

This is often the case for applications relating to text, where the dimensions correspond to the words in a "dictionary". There may be tens of thousands of words, but only a few hundred will appear in any given document.
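A sketch of this matrix-free trick with SciPy's conjugate gradient solver; the data, dimensions, and regularization parameter are invented.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    rng = np.random.default_rng(0)
    n, d, lam = 10000, 50, 1e-3
    A = rng.standard_normal((n, d))            # data points as rows
    y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

    def matvec(v):
        # (K + lam*n*I) v computed as A(A^T v) + lam*n*v, O(nd) per call
        return A @ (A.T @ v) + lam * n * v

    c, info = cg(LinearOperator((n, n), matvec=matvec), y)
    w = A.T @ c                                # f(x) = <w, x> for new points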

5.2. Support Vector Machines (SVMs) for classification

SVMs have been used in a multitude of applications and are one of the most popular machine learning algorithms. We will derive the SVM algorithm from two perspectives: Tikhonov regularization, and the more common geometric perspective.

5.2.1. SVMs from Tikhonov regularization

We start with Tikhonov regularization
$$\min_{f\in\mathcal{H}} \Big[n^{-1}\sum_{i=1}^n V(f, z_i) + \lambda\,\Omega(f)\Big]$$
and then set the regularization functional to a RKHS norm
$$\Omega(f) = \|f\|^2_{\mathcal{H}_K}$$
and use the hinge loss functional
$$n^{-1}\sum_{i=1}^n V(f, z_i) := n^{-1}\sum_{i=1}^n (1 - y_i f(x_i))_+,$$
where (k)_+ := max(k, 0).

Figure 1. Hinge loss as a function of the margin y·f(x).

The resulting optimization problem is
$$(5.2)\quad \min_{f\in\mathcal{H}_K} \Big[n^{-1}\sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda\|f\|^2_{\mathcal{H}_K}\Big],$$


which is non-differentiable at (1 − y_i f(x_i)) = 0, so we introduce slack variables and write the following constrained optimization problem:
$$\min_{f\in\mathcal{H}_K}\ n^{-1}\sum_{i=1}^n \xi_i + \lambda\|f\|^2_K$$
$$\text{subject to:}\quad y_i f(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$
By the Representer theorem we can rewrite the above constrained optimization problem as a constrained quadratic programming problem:
$$\min_{c\in\mathbb{R}^n}\ n^{-1}\sum_{i=1}^n \xi_i + \lambda c^T K c$$
$$\text{subject to:}\quad y_i \sum_{j=1}^n c_j K(x_i, x_j) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$
The SVM contains an unregularized bias term b, so the Representer theorem results in a function
$$f(x) = \sum_{i=1}^n c_i K(x, x_i) + b.$$
Plugging this form into the above constrained quadratic problem results in the "primal" SVM:
$$\min_{c\in\mathbb{R}^n,\,\xi\in\mathbb{R}^n}\ n^{-1}\sum_{i=1}^n \xi_i + \lambda c^T K c$$
$$\text{subject to:}\quad y_i \Big(\sum_{j=1}^n c_j K(x_i, x_j) + b\Big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$

We now derive the Wolfe dual quadratic program using Lagrange multiplier techniques:
$$L(c, \xi, b, \alpha, \zeta) = n^{-1}\sum_{i=1}^n \xi_i + \lambda c^T K c - \sum_{i=1}^n \alpha_i\Big[y_i\Big(\sum_{j=1}^n c_j K(x_i, x_j) + b\Big) - 1 + \xi_i\Big] - \sum_{i=1}^n \zeta_i \xi_i.$$
We want to minimize L with respect to c, b, and ξ, and maximize L with respect to α and ζ, subject to the constraints of the primal problem and nonnegativity constraints on α and ζ. We first eliminate b and ξ by taking partial derivatives:
$$\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^n \alpha_i y_i = 0,$$
$$\frac{\partial L}{\partial \xi_i} = 0 \implies \frac{1}{n} - \alpha_i - \zeta_i = 0 \implies 0 \le \alpha_i \le \frac{1}{n}.$$


The above two conditions will be constraints that have to be satisfied at optimality. This results in a reduced Lagrangian:
$$L_R(c, \alpha) = \lambda c^T K c - \sum_{i=1}^n \alpha_i\Big[y_i \sum_{j=1}^n c_j K(x_i, x_j) - 1\Big].$$
We now eliminate c:
$$\frac{\partial L_R}{\partial c} = 0 \implies 2\lambda K c - K Y \alpha = 0 \implies c_i = \frac{\alpha_i y_i}{2\lambda},$$
where Y is a diagonal matrix whose ith diagonal element is y_i; Yα is a vector whose ith element is α_i y_i. Substituting in our expression for c, we are left with the following "dual" program:
$$\max_{\alpha\in\mathbb{R}^n}\ \sum_{i=1}^n \alpha_i - \frac{1}{4\lambda}\alpha^T Q \alpha$$
$$\text{subject to:}\quad \sum_{i=1}^n y_i\alpha_i = 0, \quad 0 \le \alpha_i \le \frac{1}{n}, \quad i = 1, \ldots, n,$$
where Q is the matrix defined by
$$Q = Y K Y \iff Q_{ij} = y_i y_j K(x_i, x_j).$$
In most of the SVM literature, instead of the regularization parameter λ, regularization is controlled via a parameter C, defined using the relationship
$$C = \frac{1}{2\lambda n}.$$
Using this definition (after multiplying our objective function by the constant 1/(2λ)), the basic regularization problem becomes
$$\min_{f\in\mathcal{H}_K}\ C\sum_{i=1}^n V(y_i, f(x_i)) + \frac{1}{2}\|f\|^2_K.$$
Like λ, the parameter C also controls the trade-off between classification accuracy and the norm of the function. The primal and dual problems become respectively:
$$\min_{c\in\mathbb{R}^n,\,\xi\in\mathbb{R}^n}\ C\sum_{i=1}^n \xi_i + \frac{1}{2}c^T K c$$
$$\text{subject to:}\quad y_i\Big(\sum_{j=1}^n c_j K(x_i, x_j) + b\Big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n;$$
$$\max_{\alpha\in\mathbb{R}^n}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\alpha^T Q \alpha$$
$$\text{subject to:}\quad \sum_{i=1}^n y_i\alpha_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n.$$
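A sketch of solving this C-parameterized dual with a generic constrained solver (SciPy's SLSQP); this is only workable for small n, and real SVM codes use the decomposition methods discussed below.

    import numpy as np
    from scipy.optimize import minimize

    def svm_dual(K, y, C):
        n = len(y)
        Q = (y[:, None] * y[None, :]) * K
        fun = lambda a: 0.5 * a @ Q @ a - a.sum()     # minimize the negated dual
        jac = lambda a: Q @ a - np.ones(n)
        res = minimize(fun, np.zeros(n), jac=jac, method="SLSQP",
                       bounds=[(0, C)] * n,
                       constraints=[{"type": "eq", "fun": lambda a: a @ y}])
        alpha = res.x
        # recover b from a point with 0 < alpha_i < C (assumed to exist)
        i = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
        b = y[i] - (alpha * y) @ K[i]
        return alpha, b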

5.2.2. SVMs from a geometric perspective

The "traditional" approach to developing the mathematics of SVM is to start with the concepts of separating hyperplanes and margin. The theory is usually developed in a linear space, beginning with the idea of a perceptron, a linear hyperplane that separates the positive and the negative examples. Defining the margin as the distance from the hyperplane to the nearest example, the basic observation is that, intuitively, we expect a hyperplane with larger margin to generalize better than one with smaller margin.

Figure 2. Two hyperplanes (a) and (b) perfectly separate the data. However, hyperplane (b) has a larger margin and intuitively would be expected to be more accurate on new observations.

We denote our hyperplane by w, and we will classify a new point x via the function
$$(5.3)\quad f(x) = \operatorname{sign}[\langle w, x\rangle].$$
Given a separating hyperplane w we let x be a datapoint closest to w, and we let x_w be the unique point on w that is closest to x. Obviously, finding a maximum margin w is equivalent to maximizing ‖x − x_w‖. So for some k (assume k > 0 for convenience),
$$\langle w, x\rangle = k, \qquad \langle w, x_w\rangle = 0, \qquad \langle w, (x - x_w)\rangle = k.$$
Noting that the vector x − x_w is parallel to the normal vector w,
$$\langle w, (x - x_w)\rangle = \Big\langle w,\ \frac{\|x - x_w\|}{\|w\|}\, w \Big\rangle = \|w\|^2\,\frac{\|x - x_w\|}{\|w\|} = \|w\|\,\|x - x_w\|,$$
so
$$k = \|w\|\,\|x - x_w\| \quad\text{and}\quad \|x - x_w\| = \frac{k}{\|w\|}.$$
k is a "nuisance parameter" and without any loss of generality we fix k to 1, and see that maximizing ‖x − x_w‖ is equivalent to maximizing 1/‖w‖, which in turn is equivalent to minimizing ‖w‖ or ‖w‖². We can now define the margin as the distance between the hyperplanes ⟨w, x⟩ = 0 and ⟨w, x⟩ = 1.


So if the data is linearly separable and the hyperplane runs through the origin, the maximum margin hyperplane is the one for which
$$\min_{w\in\mathbb{R}^d}\ \|w\|^2$$
$$\text{subject to:}\quad y_i\langle w, x_i\rangle \ge 1, \quad i = 1, \ldots, n.$$
The SVM introduced by Vapnik includes an unregularized bias term b, leading to classification via a function of the form
$$f(x) = \operatorname{sign}[\langle w, x\rangle + b].$$
In addition, we need to work with datasets that are not linearly separable, so we introduce slack variables ξ_i, just as before. We can still define the margin as the distance between the hyperplanes ⟨w, x⟩ = 0 and ⟨w, x⟩ = 1, but the geometric intuition is no longer as clear or compelling.

With the bias term and slack variables the primal SVM problem becomes
$$\min_{w\in\mathbb{R}^d,\,b\in\mathbb{R}}\ C\sum_{i=1}^n \xi_i + \frac{1}{2}\|w\|^2$$
$$\text{subject to:}\quad y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$
Using Lagrange multipliers we can derive the same dual as in the previous section. Historically, most developments began with the geometric form, derived a dual program identical to the dual we derived above, and only then observed that the dual program required only dot products and that these dot products could be replaced with a kernel function. In the linearly separable case, we can also derive the separating hyperplane as a vector parallel to the vector connecting the closest two points in the positive and negative classes, passing through the perpendicular bisector of this vector. This was the "Method of Portraits", derived by Vapnik in the 1970's, and recently rediscovered (with non-separable extensions) by Keerthi.

5.2.3. Optimality conditions

The primal and the dual are both feasible convex quadratic programs. Therefore, they both have optimal solutions, and optimal solutions to the primal and the dual have the same objective value.

We derived the dual from the primal using the (now reparameterized) Lagrangian:
$$L(c, \xi, b, \alpha, \zeta) = C\sum_{i=1}^n \xi_i + \frac{1}{2}c^T K c - \sum_{i=1}^n \alpha_i\Big[y_i\Big(\sum_{j=1}^n c_j K(x_i, x_j) + b\Big) - 1 + \xi_i\Big] - \sum_{i=1}^n \zeta_i \xi_i.$$


We now consider the dual variables associated with the primal constraints:
$$\alpha_i \implies y_i\Big(\sum_{j=1}^n c_j K(x_i, x_j) + b\Big) - 1 + \xi_i \ge 0,$$
$$\zeta_i \implies \xi_i \ge 0.$$
Complementary slackness tells us that at optimality, either the primal inequality is satisfied at equality or the dual variable is zero. In other words, if c, ξ, b, α and ζ are optimal solutions to the primal and dual, then
$$\alpha_i\Big[y_i\Big(\sum_{j=1}^n c_j K(x_i, x_j) + b\Big) - 1 + \xi_i\Big] = 0,$$
$$\zeta_i \xi_i = 0.$$

All optimal solutions must satisfy:
$$\sum_{j=1}^n c_j K(x_i, x_j) - \sum_{j=1}^n y_j \alpha_j K(x_i, x_j) = 0 \quad i = 1, \ldots, n$$
$$\sum_{i=1}^n \alpha_i y_i = 0$$
$$C - \alpha_i - \zeta_i = 0 \quad i = 1, \ldots, n$$
$$y_i\Big(\sum_{j=1}^n y_j \alpha_j K(x_i, x_j) + b\Big) - 1 + \xi_i \ge 0 \quad i = 1, \ldots, n$$
$$\alpha_i\Big[y_i\Big(\sum_{j=1}^n y_j \alpha_j K(x_i, x_j) + b\Big) - 1 + \xi_i\Big] = 0 \quad i = 1, \ldots, n$$
$$\zeta_i \xi_i = 0 \quad i = 1, \ldots, n$$
$$\xi_i,\ \alpha_i,\ \zeta_i \ge 0 \quad i = 1, \ldots, n$$

The above optimality conditions are both necessary and sufficient. If we have c, ξ, b, α and ζ satisfying the above conditions, we know that they represent optimal solutions to the primal and dual problems. These optimality conditions are also known as the Karush-Kuhn-Tucker (KKT) conditions.

Suppose we have the optimal α_i's. Also suppose (this "always" happens in practice) that there exists an i satisfying 0 < α_i < C. Then
$$\alpha_i < C \implies \zeta_i > 0 \implies \xi_i = 0 \implies y_i\Big(\sum_{j=1}^n y_j \alpha_j K(x_i, x_j) + b\Big) - 1 = 0$$
$$\implies b = y_i - \sum_{j=1}^n y_j \alpha_j K(x_i, x_j).$$
So if we know the optimal α's, we can determine b.


Defining our classification function f(x) as
$$f(x) = \sum_{i=1}^n y_i \alpha_i K(x, x_i) + b,$$
we can derive "reduced" optimality conditions. For example, consider an i such that y_i f(x_i) < 1:
$$y_i f(x_i) < 1 \implies \xi_i > 0 \implies \zeta_i = 0 \implies \alpha_i = C.$$
Conversely, suppose α_i = C:
$$\alpha_i = C \implies y_i f(x_i) - 1 + \xi_i = 0 \implies y_i f(x_i) \le 1.$$

Figure 3. A geometric interpretation of the reduced optimality conditions. The open squares and circles correspond to cases where α_i = 0. The dark circles and squares correspond to cases where y_i f(x_i) = 1 and α_i ≤ C; these are samples at the margin. The grey circles and squares correspond to cases where y_i f(x_i) < 1 and α_i = C.

5.2.4. Solving the SVM optimization problem

Our plan will be to solve the dual problem to find the α's, and use those to find b and our function f. The dual problem is easier to solve than the primal problem. It has simple box constraints and a single equality constraint; even better, we will see that the problem can be decomposed into a sequence of smaller problems.


We can solve QPs using standard software. Many codes are available. Mainproblem — the Q matrix is dense, and is n×n, so we cannot write it down. StandardQP software requires the Q matrix, so is not suitable for large problems.

To get around this memory issue we partition the dataset into a working set W and the remaining points R. We can rewrite the dual problem as:

max_{α_W ∈ IR^{|W|}, α_R ∈ IR^{|R|}}  ∑_{i∈W} α_i + ∑_{i∈R} α_i − (1/2) [α_W α_R] [ Q_WW  Q_WR ; Q_RW  Q_RR ] [α_W ; α_R]

subject to : ∑_{i∈W} y_i α_i + ∑_{i∈R} y_i α_i = 0
             0 ≤ α_i ≤ C, ∀i.

Suppose we have a feasible solution α. We can get a better solution by treating α_W as variable and α_R as constant. We can solve the reduced dual problem:

max_{α_W ∈ IR^{|W|}}  (1 − Q_WR α_R)^T α_W − (1/2) α_W^T Q_WW α_W

subject to : ∑_{i∈W} y_i α_i = −∑_{i∈R} y_i α_i
             0 ≤ α_i ≤ C, ∀i ∈ W.

The reduced problems are of fixed size and can be solved using a standard QP code. Convergence proofs are difficult, but this approach seems to always converge to an optimal solution in practice.

An important issue in the decomposition is selecting the working set. There are many different approaches. The basic idea is to examine points not in the working set, find points which violate the reduced optimality conditions, and add them to the working set; points which are in the working set but are far from violating the optimality conditions are removed. A sketch of such a violation check follows.
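The following minimal sketch (assuming the current dual variables alpha and the current function values f(x_i) are available; the tolerance tol is a hypothetical knob, not from the notes) flags the points that violate the reduced optimality conditions and are therefore candidates for the next working set.

import numpy as np

def kkt_violators(alpha, y, f_vals, C, tol=1e-3):
    # Reduced optimality conditions:
    #   alpha_i = 0      requires  y_i f(x_i) >= 1
    #   0 < alpha_i < C  requires  y_i f(x_i) == 1
    #   alpha_i = C      requires  y_i f(x_i) <= 1
    margins = y * f_vals
    at_zero = (alpha <= tol) & (margins < 1 - tol)
    at_C = (alpha >= C - tol) & (margins > 1 + tol)
    free = (alpha > tol) & (alpha < C - tol) & (np.abs(margins - 1) > tol)
    return np.where(at_zero | at_C | free)[0]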

5.3. Regularized logistic regression

One drawback of the SVM is that the method does not explicitly output a probability or likelihood of the labels; instead the output is a real value whose magnitude should be monotonic with respect to the probability

P(y = ±1|x) ∝ y f(x).

This issue can be addressed by using a loss function based upon logistic or binary regression. The main idea behind logistic regression is that we are trying to model the log likelihood ratio by the function f(x)

f(x) = log( P(y = 1|x) / P(y = −1|x) ).

Since P(y = 1|x) is a Bernoulli random variable we can rewrite the above equation as

f(x) = log( P(y = 1|x) / P(y = −1|x) ) = log( P(y = 1|x) / (1 − P(y = 1|x)) )


which implies

P(y = 1|x) = 1/(1 + exp(−f(x)))
P(y = −1|x) = 1/(1 + exp(f(x)))
P(y = ±1|x) = 1/(1 + exp(−y f(x))).

Given a data set D = {(x_i, y_i)}_{i=1}^n and a class of functions f ∈ H, the maximum likelihood estimator (MLE) is the function that maximizes the likelihood of observing the data set D

f*_MLE := arg max_{f∈H} [P(D|f)] = arg max_{f∈H} [ ∏_{i=1}^n 1/(1 + exp(−y_i f(x_i))) ].

As in the case of empirical risk minimization, the MLE estimate may overfit the data since there is no smoothness or regularization term. A classical way of imposing smoothness in this context is by placing a prior on the functions f ∈ H

P(f) ∝ e^{−‖f‖²_{H_K}}.

Given a prior and a likelihood we can use Bayes rule to compute the posterior distribution P(f|D)

P(f|D) = P(D|f)P(f) / P(D).

If we plug the prior and likelihood into Bayes rule we can compute the maximum a posteriori (MAP) estimator

f*_MAP := arg max_{f∈H} [ P(D|f)P(f)/P(D) ]
        = arg max_{f∈H} [ ∏_{i=1}^n 1/(1 + exp(−y_i f(x_i))) ] e^{−‖f‖²_{H_K}} / P(D)
        = arg max_{f∈H} [ ∑_{i=1}^n log( 1/(1 + exp(−y_i f(x_i))) ) − ‖f‖²_{H_K} ].

With some simple algebra the above MAP estimator can be rewritten in the form of Tikhonov regularization

f*_MAP = arg min_{f∈H_K} [ n^{−1} ∑_{i=1}^n log(1 + exp(−y_i f(x_i))) + λ‖f‖²_{H_K} ],

where λ is the regularization parameter. By the representer theorem the above equation has a solution of the form

f*(x) = ∑_{i=1}^n c_i K(x, x_i).

Given the above representer theorem we can solve for the variables c_i by the following optimization problem

min_{c∈IR^n} [ n^{−1} ∑_{i=1}^n log(1 + exp(−y_i (c^T K)_i)) + λ c^T Kc ],


where (c^T K)_i is the ith element of the vector c^T K. This optimization problem is convex and differentiable, so a classical approach for solving for c is the Newton-Raphson method.

5.3.1. Newton-Raphson

The Newton-Raphson method was initially used to solve for roots of polynomials, and its application to optimization problems is fairly straightforward. We first describe the Newton-Raphson method for the scalar case, where the optimization is in terms of one variable. We then describe the multivariate form and apply this to the optimization problem in logistic regression.

• Newton’s method for finding roots: Newton’s method is primarily a methodfor finding roots of polynomials. It was proposed by Newton around 1669and Raphson improved on the method in 1690, therefore the Newton-Raphson method. Given a polynomial f(x) the Taylor series expansion off(x) around the point x = x0 + ε is given by

f(x0 + ε) = f(x0) + f ′(x0)ε +1

2f ′′(x0)ε

2 + ...

truncating the expansion after first order terms results in

f(x0 + ε) ≈ f(x0) + f ′(x0)ε.

From the above expression we can estimate the offset ε needed to getcloser to the root (x : f(x) = 0) starting from the intial guess x0. This isdone by setting f(x0 + ε) = 0 and solving for ε.

0 = f(x0 + ε)

0 ≈ f(x0) + f ′(x0)ε

−f(x0) ≈ f ′(x0)ε

ǫ0 ≈ − f(x0)

f ′(x0).

This is the first order or linear adjustment to the root’s position. This canbe turned into an iterative procedure by setting x1 = x0 + ε0, calculatinga new ε1 and then iterating until converegence:

xn+1 = xn −f(xn)

f ′(xn).

• The Newton-Raphson method as an optimization method for scalars: We are given a convex minimization problem

min_{x∈[a,b]} g(x),

where g(x) is a convex function. An extremum of g(x) will occur at a value x_m such that g′(x_m) = 0, and since the function is convex this extremum will be a minimum. If g(x) is a polynomial then g′(x) is also a polynomial and we can apply Newton's method for root finding to g′(x). If g(x) is not a polynomial then we apply the root finding method to a polynomial approximation of g(x). We now describe the steps involved.


(1) Taylor expand g(x): A truncated Taylor expansion of g(x) results in a second order polynomial approximation of g(x)

g(x) ≈ g(x₀) + g′(x₀)(x − x₀) + (1/2)(x − x₀)²g″(x₀).

(2) Set derivative to zero: Take the derivative of the Taylor expansion and set it equal to zero

dg/dx = f(x) = g′(x₀) + g″(x₀)(x − x₀) = 0.

This leaves us with a root finding problem, find the root of f(x), for which we use Newton's method for finding roots.

(3) Update rule: The update rule reduces to

x_{n+1} = x_n − f(x_n)/f′(x_n) = x_n − g′(x_n)/g″(x_n).

A key point in the above procedure is the convexity of g(x). To be sure that the procedure converges, the second derivative g″(x) must be positive in the domain of optimization, the interval [a, b]. Convexity of g(x) ensures this.

• The Newton-Raphson method as an optimization method for vectors: We are given a convex minimization problem

min_{x∈X} g(x),

where X ⊆ IR^n is convex and g(x) is a convex function. We follow the logic of the scalar case except using vector calculus.

(1) Taylor expand g(x): A truncated Taylor expansion of g(x) results in a second order polynomial approximation of g(x)

g(x) ≈ g(x₀) + (x − x₀)^T · ∇g(x₀) + (1/2)(x − x₀)^T · H(x₀) · (x − x₀),

where x is a column vector of length n, ∇g(x₀) is the gradient of g evaluated at x₀ and is also a column vector of length n, and H(x₀) is the Hessian matrix evaluated at x₀

H_{i,j}(x₀) = ∂²g(x)/∂x_i∂x_j |_{x₀},   i, j = 1, . . . , n.

(2) Set derivative to zero: Take the derivative of the Taylor expansion and set it equal to zero

∇g(x) = ∇g(x₀) + (1/2)H(x₀) · (x − x₀) + (1/2)(x − x₀)^T · H(x₀) = 0;

since g is twice differentiable and convex, the Hessian matrix is symmetric, so we can reduce the above to

∇g(x) = ∇g(x₀) + H(x₀) · (x − x₀) = 0.

Setting the gradient of the approximation to zero at the minimum x* gives the linear system

0 = H(x₀) · (x* − x₀) + ∇g(x₀).


(3) Update rule: Solving the above linear system of equations for x* leads to the following update rule

x_{n+1} = x_n − H^{−1}(x_n) · ∇g(x_n),

where −H^{−1}(x_n) · ∇g(x_n) is called the Newton direction. For the above procedure to converge to a minimum, the Newton direction must be a direction of descent

∇g^T(x_n) · (x_{n+1} − x_n) < 0.

If the Hessian matrix is positive definite then the Newton direction will be a direction of descent; this is the matrix analog of a positive second derivative. Convexity of g(x) in the domain X ensures that the Hessian is positive definite. If the function g(x) is quadratic the procedure will converge in one iteration.

• The Newton-Raphson method for regularized logistic regression: The optimization problem for regularized logistic regression is

f*_MAP = arg min_{f∈H_K} [ n^{−1} ∑_{i=1}^n log(1 + exp(−y_i f(x_i))) + λ‖f‖²_{H_K} ];

by the representer theorem

f*(x) = ∑_{i=1}^n c_i K(x, x_i) + b,

where ‖f‖_{H_K} is a seminorm that does not penalize constants, as in the SVM case. The optimization problem can be rewritten as

min_{c∈IR^n, b∈IR} [ L[c, b] = n^{−1} ∑_{i=1}^n log(1 + exp(−y_i ((c^T K)_i + b))) + λ c^T Kc ],

where (c^T K)_i is the ith element of the vector c^T K. A sketch of the resulting Newton iteration follows this list.
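The following minimal sketch (omitting the unpenalized offset b for brevity, and assuming a precomputed kernel matrix K; the small ridge added to the Hessian is a hypothetical numerical safeguard, not part of the notes) applies the Newton update above to L[c].

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def kernel_logistic_newton(K, y, lam, iters=20):
    # Newton-Raphson for L[c] = n^{-1} sum_i log(1+exp(-y_i (Kc)_i)) + lam c'Kc.
    n = K.shape[0]
    c = np.zeros(n)
    for _ in range(iters):
        f = K @ c
        p = sigmoid(-y * f)                     # p_i = sigma(-y_i f_i)
        grad = -(K @ (y * p)) / n + 2 * lam * (K @ c)
        W = p * (1.0 - p)                       # second derivative of the loss
        H = (K * W) @ K / n + 2 * lam * K       # K diag(W) K / n + 2 lam K
        H += 1e-10 * np.eye(n)                  # hypothetical ridge for stability
        c -= np.linalg.solve(H, grad)
    return c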

5.4. Spline models

One can also derive spline models from Tikhonov regularization, and one often finds this derivation in the same formulation as that for SVMs or kernel ridge regression. This is somewhat misleading: while the two derivations are very similar in spirit, there are significant technical differences in the regularization functional used and in the mathematical tools.

In general, for spline models the domain of the RKHS is unbounded and therefore a countable basis may not exist. In addition, the regularization functional is defined as the integral of a differential operator. For example, for one-dimensional linear splines the regularization functional is

‖f‖²_K := ∫ |∂f(x)/∂x|² dx,

and the kernel function is a piecewise linear function

K(x − y) = c₁|x − y|,

where c₁ is an arbitrary constant and

f(x) = ∑_{i=1}^n c_i|x − x_i| + d₁.

We will show that the regularization functional corresponding to a differential operator can be written as an integral operator with a kernel function, and relate the kernel to the differential operator via Fourier analysis.

We assume the kernel is symmetric but defined over an unbounded domain. The eigenvalues of the equation

∫_{−∞}^{∞} K(s, t)φ(s)ds = λφ(t)

are not necessarily countable and Mercer's theorem does not apply. Let us assume that the kernel is translation invariant, or

K(s, t) = K(s − t).

We will see this implies that we will have to consider hypothesis spaces defined via Fourier transforms.

Definition. The Fourier transform of a real valued function f ∈ L₁ is the complex valued function f̃(ω) defined as

f̃(ω) = ∫_{−∞}^{+∞} f(x) e^{−jωx} dx.

Proposition. The original function f can be obtained through the inverse Fourier transform

f(x) = (1/2π) ∫_{−∞}^{+∞} f̃(ω) e^{jωx} dω.

Comment. Periodic functions can be expanded in a Fourier series

f̃(ω) = ∑_n β_n δ(ω − nω₀).

This can be shown from the periodicity condition

f(x + T) = f(x)  for  T = 2π/ω₀.

Taking the Fourier transform of both sides yields

f̃(ω) e^{−jωT} = f̃(ω).

This is possible only if f̃(ω) ≠ 0 when ω = nω₀. This implies for nontrivial f that f̃(ω) = ∑_n β_n δ(ω − nω₀), which is a Fourier series.

Proposition. The eigenfunctions for translation invariant kernels are the Fourier bases and the eigenvalues are the Fourier transform of the kernel. If K(s, t) = K(‖s − t‖) then the solutions of

∫ K(s, t)φ(s)dµ(s) = λφ(t)

have the form

φ_ω(s) = e^{−i⟨s,ω⟩},

and

λ_ω = (1/2π) K̃(ω),

where K̃(ω) is the Fourier transform of K(‖s − t‖).

The following theorem relates the Fourier transform of a kernel to it being positive definite.

Theorem (Bochner). A function K(s − t) is positive definite if and only if it is the Fourier transform of a symmetric, positive function K̃(ω) decreasing to 0 at infinity.

We now define the RKHS norm for shift invariant kernels, and we will see that this norm applies to regularizers that are differential operators.

Definition. For a positive definite function K(s − t) we define the RKHS norm using the following inner product:

⟨f(s), g(s)⟩_{H_K} := (1/2π) ∫ f̃(ω)g̃*(ω)/K̃(ω) dω,

where g̃*(ω) is the complex conjugate of the Fourier transform of g(s).

Given the above definition of the RKHS inner product, a direct result is that we can define the RKHS as a subspace of L₂(IR^n) with x ∈ IR^n.

Proposition. Given an RKHS norm defined by the above scalar product,

‖f‖²_{H_K} = (1/2π) ∫ |f̃(ω)|²/K̃(ω) dω.

The subspace of L₂(IR^n) for which the above integral is defined (finite) is the RKHS H_K.

We now verify the reproducing property:

⟨K(s − t), f(s)⟩_{H_K} := (1/2π) ∫ K̃(ω)f̃*(ω)e^{−jωt}/K̃(ω) dω = f(t).

We now relate the RKHS norm to regularizers that are differential operators, sometimes called smoothness functionals.

Example. The following differential operators

Φ₁[f] = ∫_{−∞}^{+∞} |f′(x)|² dx = (1/2π) ∫ ω²|f̃(ω)|² dω,
Φ₂[f] = ∫_{−∞}^{+∞} |f″(x)|² dx = (1/2π) ∫ ω⁴|f̃(ω)|² dω

are an RKHS norm of the form

|f|²_{H_K} = (1/2π) ∫ |f̃(ω)|²/K̃(ω) dω,

where

K̃(ω) = 1/ω² for Φ₁,
K̃(ω) = 1/ω⁴ for Φ₂.


Notice that K̃(ω) is a positive symmetric function decreasing to zero at infinity. Given the Fourier domain representation of the kernel, taking the inverse Fourier transform gives us the reproducing kernel

K(x) = −|x|/2 ⟺ K̃(ω) = 1/ω²,
K(x) = −|x|³/12 ⟺ K̃(ω) = 1/ω⁴.

For both kernels, the singularity of the Fourier transform at ω = 0 is due to the seminorm property and the fact that the kernel is only conditionally positive definite.

The fact that both functionals

Φ₁[f] = ∫_{−∞}^{+∞} |f′(x)|² dx = (1/2π) ∫ ω²|f̃(ω)|² dω,
Φ₂[f] = ∫_{−∞}^{+∞} |f″(x)|² dx = (1/2π) ∫ ω⁴|f̃(ω)|² dω

are seminorms is obvious, since f(x) = c will result in a zero norm for both functionals.

Examples. Other possible kernel functions and their Fourier transforms are

K(x) = e^{−x²/2σ²} ⟺ K̃(ω) = e^{−ω²σ²/2},
K(x) = (1/2)e^{−γ|x|} ⟺ K̃(ω) = 1/(1 + ω²),
K(x) = sin(Ωx)/(πx) ⟺ K̃(ω) = U(ω + Ω) − U(ω − Ω).

Example. The Gaussian kernel K(x) = e^{−x²/2σ²} corresponds to a differential operator that penalizes derivatives of all orders,

Φ[f] = ∑_{n=0}^{∞} (σ^{2n}/(2^n n!)) ∫_{−∞}^{+∞} |f^{(n)}(x)|² dx.

We now state the representer theorem for shift invariant kernels on unbounded domains. Note we cannot use the previous proof of the representer theorem since these kernels are not Mercer kernels.

Theorem. Let K̃(ω) be the Fourier transform of a kernel function K(x). The function minimizing the functional

(1/n) ∑_{i=1}^n (y_i − f(x_i))² + (λ/2π) ∫ |f̃(ω)|²/K̃(ω) dω

has the form

f(x) = ∑_{i=1}^n c_i K(x − x_i).

Proof. We rewrite the functional in terms of the Fourier transform f̃ and obtain

(1/n) ∑_{i=1}^n ( y_i − (1/2π) ∫ f̃(ω)e^{jωx_i} dω )² + (λ/2π) ∫ f̃(−ω)f̃(ω)/K̃(ω) dω.

Taking the functional derivative with respect to f̃(ξ) gives

−(1/nπ) ∑_{i=1}^n (y_i − f(x_i)) ∫ (Df̃(ω)/Df̃(ξ)) e^{jωx_i} dω + (λ/π) ∫ (f̃(−ω)/K̃(ω)) (Df̃(ω)/Df̃(ξ)) dω
 = −(1/nπ) ∑_{i=1}^n (y_i − f(x_i)) ∫ δ(ω − ξ)e^{jωx_i} dω + (λ/π) ∫ (f̃(−ω)/K̃(ω)) δ(ω − ξ) dω.

From the definition of δ we have

−(1/nπ) ∑_{i=1}^n (y_i − f(x_i))e^{jξx_i} + (λ/π) f̃(−ξ)/K̃(ξ).

Setting the derivative to zero and changing the sign of ξ,

f̃_λ(ξ) = K̃(ξ) ∑_{i=1}^n ((y_i − f(x_i))/(nλ)) e^{−jξx_i}.

Defining the coefficients

c_i = (y_i − f(x_i))/(nλ)

and taking the inverse Fourier transform, we obtain

f(x) = ∑_{i=1}^n c_i K(x − x_i).

We noted that some of the smoothness functionals were seminorms and therefore led to conditionally positive definite functions. We now formalize and explore this issue.

Definition. Let r = ‖x‖ with x ∈ IR^n. A continuous function K = K(r) is conditionally positive definite of order m on IR^n if and only if for any distinct points t₁, t₂, ..., t_ℓ ∈ IR^n and scalars c₁, c₂, ..., c_ℓ such that ∑_{i=1}^ℓ c_i p(t_i) = 0 for all p ∈ π_{m−1}(IR^n), the quadratic form is nonnegative:

∑_{i=1}^ℓ ∑_{j=1}^ℓ c_i c_j K(‖t_i − t_j‖) ≥ 0.

In the case of a strict inequality the function is strictly conditionally positive definite.

Example. The following class of functions is conditionally positive definite of order m:

K(x) = −|x|^{2m−1}/(2m(2m − 1)) → K̃(ω) = 1/ω^{2m}.

Proposition. If K is a conditionally positive definite function of order m, then

Φ[f] = (1/2π) ∫_{−∞}^{+∞} |f̃(ω)|²/K̃(ω) dω

is a seminorm whose null space is the set of polynomials of degree m − 1. If K is strictly positive definite, then Φ is a norm.


For a positive definite kernel, we have shown that

f(x) = ∑_{i=1}^n c_i K(x, x_i),

where the coefficients c_i can be found by solving the linear system

(K + nλI)c = y.

For a conditionally positive definite kernel of order m, it can be shown that

f(x) = ∑_{i=1}^n c_i K(x, x_i) + ∑_k d_k γ_k(x),

where the γ_k span the polynomials of degree m − 1, and the coefficients c and d are found by solving the linear system

(K + nλI)c + Γ^T d = y
Γc = 0,

with Γ_{ik} = γ_k(x_i). A minimal numerical sketch of this system is given below.
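As an illustration, the following minimal sketch (assuming one-dimensional data and the cubic spline kernel K(x − y) ∝ |x − y|³ from the examples below; the proportionality constant is absorbed into c) solves the block linear system above.

import numpy as np

def fit_cubic_spline(x, y, lam):
    # Solve (K + n lam I)c + Gamma^T d = y, Gamma c = 0 for a 1-D cubic spline.
    # Kernel: K(x, x') = |x - x'|^3, conditionally positive definite of order 2,
    # with polynomial part gamma_1(x) = 1, gamma_2(x) = x.
    n = len(x)
    K = np.abs(x[:, None] - x[None, :]) ** 3
    Gamma = np.vstack([np.ones(n), x])           # 2 x n
    A = np.block([[K + n * lam * np.eye(n), Gamma.T],
                  [Gamma, np.zeros((2, 2))]])
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(2)]))
    c, d = sol[:n], sol[n:]
    return c, d

# usage: predict at x_new via sum_i c_i |x_new - x_i|^3 + d[0] + d[1] * x_new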

Examples. We state a few regularization functionals and their appropriate solutions.

1) 1-D Linear Splines. The solution is in the space of piecewise linear polynomials:

Φ[f] = ‖f‖²_K = ∫ |f′(x)|² dx = (1/2π) ∫ ω²|f̃(ω)|² dω
K̃(ω) = 1/ω²
K(x − y) ∝ |x − y|
f(x) = ∑_{i=1}^n c_i|x − x_i| + d₁.

2) 1-D Cubic Splines. The solution is in the space of piecewise cubic polynomials:

Φ[f] = ‖f‖²_K = ∫ |f″(x)|² dx = (1/2π) ∫ ω⁴|f̃(ω)|² dω
K̃(ω) = 1/ω⁴
K(x − y) ∝ |x − y|³
f(x) = ∑_{i=1}^n c_i|x − x_i|³ + d₂x + d₁.

3) 2-D Thin Plate Splines.

Φ[f] = ‖f‖²_K = ∫∫ [ (∂²f/∂x₁²)² + 2(∂²f/∂x₁∂x₂)² + (∂²f/∂x₂²)² ] dx₁dx₂
     = (1/(2π)²) ∫ ‖ω‖⁴|f̃(ω)|² dω
K̃(ω) = 1/‖ω‖⁴
K(x) ∝ ‖x‖² ln ‖x‖
f(x) = ∑_{i=1}^n c_i‖x − x_i‖² ln ‖x − x_i‖ + ⟨d₂, x⟩ + d₁.

4) Gaussian Radial Basis Functions.

Φ[f] = ‖f‖²_K = (1/2π) ∫ e^{‖ω‖²σ²/2}|f̃(ω)|² dω
K̃(ω) = e^{−‖ω‖²σ²/2}
K(x) ∝ e^{−‖x‖²/2σ²}
f(x) = ∑_{i=1}^n c_i e^{−‖x−x_i‖²/2σ²}.
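Since the Gaussian kernel is strictly positive definite, no polynomial part is needed and the coefficients come directly from (K + nλI)c = y. A minimal sketch, assuming one-dimensional inputs:

import numpy as np

def fit_gaussian_rbf(x, y, lam, sigma):
    # Fit f(x) = sum_i c_i exp(-(x - x_i)^2 / (2 sigma^2)) via (K + n lam I)c = y.
    n = len(x)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict_rbf(x_new, x, c, sigma):
    K_new = np.exp(-(x_new[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
    return K_new @ c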

5.5. Convex Optimization

Concepts from convex optimization, such as the Karush-Kuhn-Tucker (KKT) conditions, were used in the previous sections of this lecture. In this section we give a brief introduction to and derivation of these conditions.

Definition. A set X ⊆ IR^n is convex if

∀x₁, x₂ ∈ X, ∀λ ∈ [0, 1], λx₁ + (1 − λ)x₂ ∈ X.

A set is convex if, given any two points in the set, the line segment connecting them lies entirely inside the set.

Definition. A function f : IR^n → IR is convex if:

(1) For any x₁ and x₂ in the domain of f, for any λ ∈ [0, 1],

f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂).

(2) The line segment connecting two points f(x₁) and f(x₂) lies entirely on or above the function f.

(3) The set of points lying on or above the function f is convex.

A function is strictly convex if we replace "on or above" with "above", or "≤" with "<".

Definition. A point x* is called a local minimum of f if there exists ε > 0 such that f(x*) < f(x) for all x ≠ x* with ‖x − x*‖ ≤ ε.

Definition. A point x* is called a global minimum of f if f(x*) < f(x) for all feasible x ≠ x*.

Unconstrained convex functions (convex functions whose domain is all of IR^n) are easy to minimize. Convex functions are differentiable almost everywhere and directional derivatives always exist. If we cannot find a direction of local improvement, we are at the optimum.

Convex functions over convex sets (a convex domain) are also easy to minimize: if the set and the function are both convex, and we cannot find a direction in which we are able to move that decreases the function, we are done. Local optima are global optima.

Figure 4. Examples of convex and nonconvex sets in IR².

Example. Linear programming is always a convex problem:

min_x ⟨c, x⟩
subject to : Ax = b
             Cx ≤ d.

Example. Quadratic programming is a convex problem if and only if the matrix Q is positive semidefinite:

min_x x′Qx + ⟨c, x⟩
subject to : Ax = b
             Cx ≤ d.

Definition. The following constrained optimization problem P will be called the primal problem:

min f(x)
subject to : g_i(x) ≥ 0  i = 1, . . . , m
             h_i(x) = 0  i = 1, . . . , n
             x ∈ X.

Here, f is our objective function, the g_i are inequality constraints, the h_i are equality constraints, and X is some set.


Figure 5. The top two figures (the L₂ loss as a function of y − f(x), and the hinge loss as a function of y·f(x)) are convex functions; the first is strictly convex. The bottom figures are nonconvex functions.

Figure 6. Optimizing a convex function (f(x, y) = −x − y) over convex and nonconvex sets. In the example on the left the set is convex and the function is convex, so a local minimum corresponds to a global minimum. In the example on the right the set is nonconvex, and although the function is convex one can find local minima that are not global minima.

Definition. We define a Lagrangian dual problem D:

max Θ(u, v)
subject to : u ≥ 0

where Θ(u, v) := inf { f(x) − ∑_{i=1}^m u_i g_i(x) − ∑_{j=1}^n v_j h_j(x) : x ∈ X }.

Theorem (Weak Duality). Suppose x is a feasible solution of P, so x ∈ X, g_i(x) ≥ 0 ∀i, and h_i(x) = 0 ∀i. Suppose u, v are a feasible solution of D, so u ≥ 0. Then

f(x) ≥ Θ(u, v).

Proof.

Θ(u, v) = inf { f(y) − ∑_{i=1}^m u_i g_i(y) − ∑_{j=1}^n v_j h_j(y) : y ∈ X }
        ≤ f(x) − ∑_{i=1}^m u_i g_i(x) − ∑_{j=1}^n v_j h_j(x)
        ≤ f(x).

Weak duality says that every feasible solution to P is at least as expensive as every feasible solution to D. It is a very general property of duality, and we did not rely on any convexity assumptions to show it.

Definition. Strong duality holds when the optima of the primal and dual problems are equivalent, Opt(P) = Opt(D).

If strong duality does not hold, we have the possibility of a duality gap. Strong duality is very useful, because it usually means that we may solve whichever of the dual or primal is more convenient computationally, and we can usually obtain the solution of one from the solution of the other.

Proposition. If the objective function f is convex and the feasible region is convex, then under mild technical conditions we have strong duality.

We now look at what are called saddle points of the Lagrangian function. In defining the dual problem we used the Lagrangian function

L(x, u, v) = f(x) − ∑_{i=1}^m u_i g_i(x) − ∑_{j=1}^n v_j h_j(x).

A set (x*, u*, v*) of feasible solutions to P and D is called a saddle point of the Lagrangian if

L(x*, u, v) ≤ L(x*, u*, v*) ≤ L(x, u*, v*)  ∀x ∈ X, ∀u ≥ 0:

x* minimizes L if u and v are fixed at u* and v*, and u* and v* maximize L if x is fixed at x*.

Definition. The points (x*, u*, v*) satisfy the Karush-Kuhn-Tucker (KKT) conditions, or are KKT points, if they are feasible to P and D and

∇f(x*) − ∇g(x*)′u* − ∇h(x*)′v* = 0
u*·g(x*) = 0.

In a convex, differentiable problem, with some minor technical conditions, points that satisfy the KKT conditions are equivalent to saddle points of the Lagrangian.

LECTURE 6
Voting algorithms

Voting algorithms, or algorithms where the final classification or regression function is a weighted combination of "simpler" or "weaker" classifiers, have been used extensively in a variety of applications.

We will study two examples of voting algorithms in greater depth: Bootstrap AGGregatING (bagging) and boosting.

6.1. Boosting

Boosting algorithms, especially AdaBoost (adaptive boosting), have had a significant impact on a variety of practical algorithms and have also been the focus of theoretical investigation in a variety of fields. The formal term boosting and the first boosting algorithm came out of the field of computational complexity in theoretical computer science. In particular, learning as formulated by boosting came from the concept of Probably Approximately Correct (PAC) learning.

6.2. PAC learning

The idea of Probably Approximately Correct (PAC) learning was formulated in 1984 by Leslie Valiant as an attempt to characterize what is learnable. Let X be a set. This set contains encodings of all objects of interest in the learning problem. The goal of the learning algorithm is to infer some unknown subset of X, called a concept, from a known class of concepts, C. Unlike the previous statistical formulations of the learning problem, the issue of representation arises in this formulation due to computational issues.

• Concept classes: A representation class over X is a pair (σ, C), where C ⊆ {0, 1}* and σ : C → 2^X. For c ∈ C, σ(c) is a concept over X and the image space σ(C) is the concept class represented by (σ, C). For c ∈ C the positive examples are pos(c) = σ(c) and the negative examples are neg(c) = X − σ(c). The notation c(x) = 1 is equivalent to x ∈ pos(c) and c(x) = 0 is equivalent to x ∈ neg(c). We assume that domain points x ∈ X and representations c ∈ C are efficiently encoded by codings of length |x| and |c| respectively.


• Parameterized representation: We will study representation classes parameterized by an index n, resulting in the domain X = ∪_{n≥1}X_n and representation class C = ∪_{n≥1}C_n. The index n serves as a measure of the complexity of concepts in C. For example, X_n may be the set {0, 1}^n and C_n the set of all Boolean formulae over n variables.

• Efficient evaluation of representations: If C is a representation class over X, then C is polynomially evaluatable if there is a (probabilistic) polynomial-time evaluation algorithm A that, given a representation c ∈ C and a domain point x ∈ X, outputs c(x).

• Samples: A labeled example from a domain X is a pair ⟨x, b⟩ where x ∈ X and b ∈ {0, 1}. A sample S = (⟨x₁, b₁⟩, ..., ⟨x_m, b_m⟩) is a finite sequence of labeled examples. A labeled example of c ∈ C has the form ⟨x, c(x)⟩. A representation h and an example ⟨x, b⟩ agree if h(x) = b. A representation h and a sample S are consistent if h agrees with each example in S.

• Distributions on examples: A learning algorithm for a representation class C will receive examples from a single representation c ∈ C, which we call the target representation. Examples of the target representation are generated probabilistically: D_c is a fixed but arbitrary distribution over pos(c) and neg(c). This is the target distribution. The learning algorithm will be given access to an oracle EX which returns in unit time an example of the target representation drawn according to the target distribution D_c.

• Measure of error: Given a target representation c ∈ C and a target distribution D, the error of a representation h ∈ H is

e_c(h) = D(h(x) ≠ c(x)).

In the above formulation C is the target class and H is the hypothesis class. The algorithm A is a learning algorithm for C and the output h_A ∈ H is the hypothesis of A.

We can now define learnability:

Definition (Strong learning). Let C and H be representation classes over X that are polynomially evaluatable. Then C is polynomially learnable by H if there is a (probabilistic) algorithm A with access to EX, taking inputs ε, δ, with the property that for any target representation c ∈ C, for any target distribution D, and for any input values 0 < ε, δ < 1, algorithm A halts in time polynomial in 1/ε, 1/δ, |c|, n and outputs a representation h_A ∈ H that with probability greater than 1 − δ satisfies e_c(h_A) < ε.

The parameter ε is the accuracy parameter and the parameter δ is the confidence parameter. These two parameters characterize the name Probably (δ) Approximately (ε) Correct (e_c(h)). The above definition is sometimes called distribution free learning since the property holds over any target representation and target distribution.

Considerable research in PAC learning has focused on which representation classes C are polynomially learnable.

So far we have defined learning as approximating the target concept arbitrarily closely. Another model of learning, called weak learning, considers the case where the learning algorithm is required to perform only slightly better than chance.


Definition (Weak learning). Let C and H be representation classes over X that are polynomially evaluatable. Then C is polynomially weakly learnable by H if there is a (probabilistic) algorithm A with access to EX, taking inputs ε, δ, with the property that for any target representation c ∈ C, for any target distribution D, and for any input values 0 < ε, δ < 1, algorithm A halts in time polynomial in 1/ε, 1/δ, |c|, n and outputs a representation h_A ∈ H that with probability greater than 1 − δ satisfies e_c(h_A) < 1/2 − 1/p(|c|), where p is a polynomial.

Definition (Sample complexity). Given a learning algorithm A for a representation class C, the number of calls s_A(ε, δ) to the oracle EX made by A on inputs ε, δ, in the worst case over all target representations c ∈ C and target distributions D, is the sample complexity of the algorithm A.

We now define some Boolean classes that we will use as positive or negative examples of learnability.

• The class M_n consists of monomials over the Boolean variables x₁, ..., x_n.
• For a constant k, the class kCNF_n (conjunctive normal forms) consists of Boolean formulae of the form C₁ ∧ · · · ∧ C_l where each C_i is a disjunction of at most k monomials over the Boolean variables x₁, ..., x_n.
• For a constant k, the class kDNF_n (disjunctive normal forms) consists of Boolean formulae of the form T₁ ∨ · · · ∨ T_l where each T_i is a conjunction of at most k monomials over the Boolean variables x₁, ..., x_n.
• Boolean threshold functions I(∑_{i=1}^n w_i x_i > t) where w_i ∈ {0, 1} and I is the indicator function.

Definition (Empirical risk minimization, ERM). A consistent algorithm A is one that outputs hypotheses h that are consistent with the sample S, where the hypotheses range over h ∈ C.

The above algorithm is ERM in the case of zero error with the target concept in the hypothesis space.

Theorem. If the hypothesis class is finite then C is learnable by the consistent algorithm A.

Theorem. Boolean threshold functions are not learnable.

Theorem. {f ∨ g : f ∈ kCNF, g ∈ kDNF} is learnable.

Theorem. {f ∧ g : f ∈ kDNF, g ∈ kCNF} is learnable.

Theorem. Let C be a concept class with finite VC dimension VC(C) = d < ∞. Then C is learnable by the consistent algorithm A.

6.3. The hypothesis boosting problem

An important question, both theoretically and practically, in the late 1980s was whether strong learnability and weak learnability were equivalent. This was the hypothesis boosting problem:

Conjecture. A concept class C is weakly learnable if and only if it is strongly learnable.

The above conjecture was proven true in 1990 by Robert Schapire.


Theorem. A concept class C is weakly learnable if and only if it is strongly learnable.

The proof of the above theorem was based upon a particular algorithm. The following algorithm takes as input a weak learner, an error parameter ε, a confidence parameter δ, and an oracle EX, and outputs a strong learner. At each iteration of the algorithm a weak learner with error rate ε gets boosted so that its error rate decreases to 3ε² − 2ε³.

Algorithm 1: Learn(ε, δ, EX)

input : error parameter ε, confidence parameter δ, examples oracle EX
return: h that is ε-close to the target concept c with probability ≥ 1 − δ

if ε ≥ 1/2 − 1/p(n, s) then return WeakLearn(δ, EX);
α ← g⁻¹(ε), where g(x) = 3x² − 2x³;

EX₁ ← EX;
h₁ ← Learn(α, δ/5, EX₁);
τ₁ ← ε/3;
let â₁ be an estimate of a₁ = Pr_{x∼D}[h₁(x) ≠ c(x)], choosing a sample sufficiently large that |â₁ − a₁| ≤ τ₁ with probability ≥ 1 − δ/5;
if â₁ ≤ ε − τ₁ then return h₁;

defun EX₂():
  flip coin;
  if heads then return first x : h₁(x) = c(x);
  else (tails) return first x : h₁(x) ≠ c(x);

h₂ ← Learn(α, δ/5, EX₂);
τ₂ ← (1 − 2α)ε/9;
let ê be an estimate of e = Pr_{x∼D}[h₂(x) ≠ c(x)], choosing a sample sufficiently large that |ê − e| ≤ τ₂ with probability ≥ 1 − δ/5;
if ê ≤ ε − τ₂ then return h₂;

defun EX₃():
  return first x : h₁(x) ≠ h₂(x);

h₃ ← Learn(α, δ/5, EX₃);

defun h(x):
  b₁ ← h₁(x), b₂ ← h₂(x);
  if b₁ = b₂ then return b₁ else return h₃(x);

return h

The above algorithm can be summarized as follows:

(1) Learn an initial classifier h₁ on the first N training points.
(2) Learn h₂ on a new sample of N points, half of which are misclassified by h₁.
(3) Learn h₃ on N points for which h₁ and h₂ disagree.
(4) The boosted classifier is h = Majority vote(h₁, h₂, h₃).

The basic result is that if the individual classifiers h₁, h₂, and h₃ each have error ε, the boosted classifier has error 3ε² − 2ε³.
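A quick numerical sketch of this guarantee (purely illustrative, mirroring Figure 1 below): iterating the map g(ε) = 3ε² − 2ε³ shows how the error of a weak learner with initial error below 1/2 is driven toward zero over successive boosting rounds.

def boosted_error(eps, rounds):
    # Iterate g(eps) = 3 eps^2 - 2 eps^3, the error after one round of boosting.
    for _ in range(rounds):
        eps = 3 * eps**2 - 2 * eps**3
    return eps

for r in range(5):
    print(r, boosted_error(0.4, r))
# errors shrink: 0.4, 0.352, 0.2844..., and tend to 0 as rounds increase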

To prove the theorem one needs to show that the algorithm is correct in the following sense.

Theorem. For 0 < ε < 1/2 and 0 < δ ≤ 1, the hypothesis returned by calling Learn(ε, δ, EX) is ε-close to the target concept with probability at least 1 − δ.

We first define a few quantities, where p_i(x) denotes the probability that h_i errs on x and q(x) the probability that h₁ and h₂ disagree on x (for deterministic hypotheses these are indicator functions):

p_i(x) = Pr[h_i(x) ≠ c(x)]
q(x) = Pr[h₁(x) ≠ h₂(x)]
w = Pr_{x∼D}[h₂(x) ≠ h₁(x) = c(x)]
v = Pr_{x∼D}[h₁(x) = h₂(x) = c(x)]
y = Pr_{x∼D}[h₁(x) ≠ h₂(x) = c(x)]
z = Pr_{x∼D}[h₁(x) = h₂(x) ≠ c(x)].

Given the above quantities,

(6.1) w + v = Pr_{x∼D}[h₁(x) = c(x)] = 1 − a₁
(6.2) y + z = Pr_{x∼D}[h₁(x) ≠ c(x)] = a₁.

We can explicitly express the chance that EX_i returns an instance x in terms of the above variables:

D₁(x) = D(x)

(6.3) D₂(x) = (D(x)/2) ( p₁(x)/a₁ + (1 − p₁(x))/(1 − a₁) )

D₃(x) = D(x)q(x)/(w + y).

From equation (6.3) we have

1 − a₂ = ∑_{x∈X_n} D₂(x)(1 − p₂(x))
       = (1/2a₁) ∑_{x∈X_n} D(x)p₁(x)(1 − p₂(x)) + (1/2(1 − a₁)) ∑_{x∈X_n} D(x)(1 − p₁(x))(1 − p₂(x))
       = y/(2a₁) + v/(2(1 − a₁)).

Combining the above equation with equations (6.1) and (6.2) we can solve for w and z in terms of y, a₁, a₂:

w = (2a₂ − 1)(1 − a₁) + y(1 − a₁)/a₁
z = a₁ − y.


We now control the quantity

Pr_{x∼D}[h(x) ≠ c(x)] = Pr_{x∼D}[(h₁(x) = h₂(x) ≠ c(x)) ∨ (h₁(x) ≠ h₂(x) ∧ h₃(x) ≠ c(x))]
 = z + ∑_{x∈X_n} D(x)q(x)p₃(x)
 = z + ∑_{x∈X_n} (w + y)D₃(x)p₃(x)
 = z + a₃(w + y)
 ≤ z + α(w + y)
 = α(2a₂ − 1)(1 − a₁) + a₁ + y(α − a₁)/a₁
 ≤ α(2a₂ − 1)(1 − a₁) + α
 ≤ α(2α − 1)(1 − α) + α = 3α² − 2α³ = ε;

the inequalities follow from the fact that a_i ≤ α < 1/2 and y ≤ a₁.

One also needs to show that the algorithm runs in polynomial time. The following lemma implies this; the proof of the lemma is beyond the scope of these lecture notes.

Lemma. On a good run the expected execution time of Learn(ε, δ/2, EX) is polynomial in m, 1/δ, 1/ε.

Figure 1. A plot of the boosted error rate as a function of the initial error rate for different numbers of boosting rounds (zero through four).

6.4. ADAptive BOOSTing (AdaBoost)

We will call the above formulation of boosting the boost-by-majority algorithm. Schapire's formulation of boosting by majority involved boosting by filtering, since one weak learner served as a filter for the other. Another formulation of boost by majority was developed by Yoav Freund, also based upon filtering. Both of these algorithms were later adjusted so that sampling weights could be used instead of filtering. However, all of these algorithms had the problem that the strength of the weak learner (its error 1/2 − γ) had to be known a priori.

Freund and Schapire developed the following adaptive boosting algorithm, AdaBoost, to address these issues.

Algorithm 2: AdaBoost

input : samples S = {(x_i, y_i)}_{i=1}^N, weak learner, number of iterations T
return: h(x) = sign[ ∑_{t=1}^T α_t h_t(x) ]

for i = 1 to N do w¹_i = 1/N;
for t = 1 to T do
  h_t ← call WeakLearn with weights w^t;
  ε_t = ∑_{j=1}^N w^t_j I_{y_j ≠ h_t(x_j)};
  α_t = (1/2) log((1 − ε_t)/ε_t);
  for j = 1 to N do w^{t+1}_j = w^t_j exp(−α_t y_j h_t(x_j));
  Z_t = ∑_{j=1}^N w^{t+1}_j;
  for j = 1 to N do w^{t+1}_j = w^{t+1}_j / Z_t;
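A minimal runnable sketch of this algorithm, using decision stumps (single-feature threshold classifiers) as the weak learner; the stump construction is an illustrative choice, not part of the notes:

import numpy as np

def stump_learn(X, y, w):
    # Weak learner: best threshold stump over all features, weighted 0-1 error.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, lambda Z: sign * np.where(Z[:, j] > thr, 1, -1)

def adaboost(X, y, T):
    N = len(y)
    w = np.full(N, 1.0 / N)
    stumps, alphas = [], []
    for _ in range(T):
        eps, h = stump_learn(X, y, w)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * h(X))         # up-weight misclassified points
        w /= w.sum()                           # normalize by Z_t
        stumps.append(h); alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, stumps)))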

For the above algorithm we can prove that the training error decreases over boosting iterations. The advantage of the above algorithm is that we do not need a uniform γ over all rounds; all we need is that for each boosting round there exists a γ_t > 0.

Theorem. Suppose WeakLearn, when called by AdaBoost, generates hypotheses with errors ε₁, ..., ε_T. Assume each ε_t ≤ 1/2 and let γ_t = 1/2 − ε_t; then the following upper bound holds on the error of the hypothesis h:

|{j : h(x_j) ≠ y_j}|/N ≤ ∏_{t=1}^T √(1 − 4γ_t²) ≤ exp( −2 ∑_{t=1}^T γ_t² ).

Proof. Let f(x) = ∑_{t=1}^T α_t h_t(x), so that h(x) = sign[f(x)]. If y_i ≠ h(x_i) then y_i f(x_i) ≤ 0 and e^{−y_i f(x_i)} ≥ 1. Therefore

|{j : h(x_j) ≠ y_j}|/N ≤ (1/N) ∑_{i=1}^N e^{−y_i f(x_i)} = ∑_{i=1}^N w^{T+1}_i ∏_{t=1}^T Z_t = ∏_{t=1}^T Z_t.

In addition, since α_t = (1/2) log((1 − ε_t)/ε_t) and 1 + x ≤ e^x,

Z_t = 2√(ε_t(1 − ε_t)) = √(1 − 4γ_t²) ≤ e^{−2γ_t²}.


6.5. A statistical interpretation of AdaBoost

In this section we will reinterpret boosting as a greedy algorithm for fitting an additive model. We first define our weak learners as a parameterized class of functions h_θ(x) = h(x; θ) where θ ∈ Θ. If we think of each weak learner as a basis function, then the boosted hypothesis h(x) can be thought of as a linear combination of the weak learners

h(x) = ∑_{t=1}^T α_t h_{θ_t}(x),

where h_{θ_t}(x) is the tth weak learner, parameterized by θ_t. One approach to setting the parameters θ_t and weights α_t is called forward stagewise modelling. In this approach we sequentially add new basis functions or weak learners without adjusting the parameters and coefficients of the current solution. The following algorithm implements forward stagewise additive modelling.

Algorithm 3: Forward stagewise additive modelling

input : samples S = {(x_i, y_i)}_{i=1}^N, weak learner, number of iterations T, loss function L
return: h(x) = ∑_{t=1}^T α_t h_{θ_t}(x)

h₀(x) = 0;
for t = 1 to T do
  (α_t, θ_t) = arg min_{α∈IR⁺, θ∈Θ} ∑_{i=1}^N L(y_i, h_{t−1}(x_i) + αh_θ(x_i));
  h_t(x) = h_{t−1}(x) + α_t h_{θ_t}(x);

We will now show that the above algorithm with exponential loss

L(y, f(x)) = e^{−yf(x)}

is equivalent to AdaBoost. At each iteration the following minimization is performed:

(α_t, θ_t) = arg min_{α∈IR⁺, θ∈Θ} ∑_{i=1}^N exp[−y_i(h_{t−1}(x_i) + αh_θ(x_i))]
           = arg min_{α∈IR⁺, θ∈Θ} ∑_{i=1}^N exp[−y_i h_{t−1}(x_i)] exp[−y_i αh_θ(x_i)]

(6.4)      = arg min_{α∈IR⁺, θ∈Θ} ∑_{i=1}^N w^t_i exp[−y_i αh_θ(x_i)],


where w^t_i = exp[−y_i h_{t−1}(x_i)] does not affect the optimization. For any α > 0 the objective function in equation (6.4) can be rewritten as

θ_t = arg min_{θ∈Θ} [ e^{−α} ∑_{y_i=h_θ(x_i)} w^t_i + e^{α} ∑_{y_i≠h_θ(x_i)} w^t_i ]
    = arg min_{θ∈Θ} [ (e^{α} − e^{−α}) ∑_{i=1}^N w^t_i I_{y_i≠h_θ(x_i)} + e^{−α} ∑_{i=1}^N w^t_i ]
    = arg min_{θ∈Θ} ∑_{i=1}^N w^t_i I_{y_i≠h_θ(x_i)}.

Therefore the weak learner that minimizes equation (6.4) minimizes the weighted error rate; plugging this back into equation (6.4), we can solve for α_t, which is

α_t = (1/2) log( (1 − ε_t)/ε_t ),

where

ε_t = ∑_{i=1}^N w^t_i I_{y_i≠h_t(x_i)}.

The last thing to show is that the updating of the linear model,

h_t(x) = h_{t−1}(x) + α_t h_{θ_t}(x),

is equivalent to the reweighting used in AdaBoost. Due to the exponential loss function and the additive updating, at each iteration the weights can be rewritten as

w^{t+1}_i = w^t_i e^{−α_t y_i h_{θ_t}(x_i)}.

So AdaBoost can be interpreted as an algorithm that minimizes the exponential loss criterion via forward stagewise additive modelling.

We now give some motivation for why the exponential loss is a reasonable loss function for the classification problem. The first argument is that, like the hinge loss for SVM classification, the exponential loss serves as an upper bound on the misclassification loss (see Figure 2).

Another simple motivation for using the exponential loss is that the minimizer of the expected loss with respect to some function class H,

f*(x) = arg min_{f∈H} IE_{Y|x}[ e^{−Y f(x)} ] = (1/2) log( Pr(Y = 1|x)/Pr(Y = −1|x) ),

estimates one-half the log-odds ratio, so

Pr(Y = 1|x) = 1/(1 + e^{−2f*(x)}).

6.6. A margin interpretation of AdaBoost

We developed a geometric formulation of support vector machines in the separable case via maximizing the margin. We will now formulate AdaBoost as a margin maximization problem.


Figure 2. A comparison of loss functions for classification, plotted as a function of y f(x): the misclassification loss, the exponential loss, and the hinge loss.

Recall that for the linearly separable SVM with points in IR^d, given a dataset S the following optimization problem characterizes the maximal margin classifier:

w = arg max_{w∈IR^d} min_{x_i∈S} y_i⟨w, x_i⟩/‖w‖_{L₂}.

In the case of AdaBoost we can construct a coordinate space with as many dimensions as weak classifiers, T, u ∈ IR^T, where the elements u₁, ..., u_T correspond to the outputs of the weak classifiers: u₁ = f₁(x), ..., u_T = f_T(x). We can show that AdaBoost is an iterative way to solve the following mini-max problem

w = arg max_{w∈IR^T} min_{u_i∈S} y_i⟨w, u_i⟩/‖w‖_{L₁},

where u_i = (f₁(x_i), ..., f_T(x_i)) and the final classifier has the form

h_T(x) = ∑_{t=1}^T w_t f_t(x).

This follows from the forward stagewise additive modelling interpretation: under separability, the addition at each iteration of a weak classifier to the linear expansion results in a boosted hypothesis h_t that, as a function of t, is nondecreasing in y_i h_t(x_i) ∀i with the L₁ norm of w_t constrained, ‖w‖_{L₁} = 1, which follows from the fact that the weights at each iteration must satisfy the distribution requirement.

An interesting geometry arises from the two different norms on the weights w in the two different optimization problems. The main idea is that we want to relate the norm on w to properties of norms on points in either IR^d in the SVM case or IR^T in the boosting case. By Hölder's inequality, for the dual norms ‖x‖_{L_q} and ‖w‖_{L_p} with 1/p + 1/q = 1 and p, q ∈ [1, ∞], the following holds:

|⟨x, w⟩| ≤ ‖x‖_{L_q}‖w‖_{L_p}.

The above inequality implies that minimizing the L₂ norm of w is equivalent to maximizing the L₂ distance between the hyperplane and the data. Similarly, minimizing the L₁ norm of w is equivalent to maximizing the L∞ norm between the hyperplane and the data.


LECTURE 7
One dimensional concentration inequalities

7.1. Law of Large Numbers

In this lecture, we will look at concentration inequalities, or laws of large numbers, for a fixed function. Let (Ω, L, µ) be a probability space and let x₁, ..., x_n be real random variables on Ω. A sequence of random variables y_n converges almost surely to a random variable Y iff IP(y_n → Y) = 1. A sequence of random variables y_n converges in probability to a random variable Y iff for every ǫ > 0, lim_{n→∞} IP(|y_n − Y| > ǫ) = 0. Let µ_n := n^{−1} ∑_{i=1}^n x_i. The sequence x₁, ..., x_n satisfies the strong law of large numbers if for some constant c, µ_n converges to c almost surely. The sequence x₁, ..., x_n satisfies the weak law of large numbers iff for some constant c, µ_n converges to c in probability. In general the constant c will be the expectation IEx of the random variable.

A given function f(x) of random variables x concentrates if the deviation between its empirical average, n^{−1} ∑_{i=1}^n f(x_i), and its expectation, IEf(x), goes to zero as n goes to infinity. That is, f(x) satisfies the law of large numbers.

7.2. Polynomial inequalities

Theorem (Jensen). If φ is a convex function then φ(IEx) ≤ IEφ(x).

Theorem (Bienaymé-Chebyshev). For any random variable x and ǫ > 0,

IP(|x| ≥ ǫ) ≤ IEx²/ǫ².

Proof.

IEx² ≥ IE(x²I_{|x|≥ǫ}) ≥ ǫ²IP(|x| ≥ ǫ).

Theorem (Markov). For any random variable x, ǫ > 0, and λ > 0,

IP(x ≥ ǫ) ≤ IEe^{λx}/e^{λǫ},

and hence

IP(x ≥ ǫ) ≤ inf_{λ>0} e^{−λǫ}IEe^{λx}.


Proof.

IP(x ≥ ǫ) = IP(e^{λx} ≥ e^{λǫ}) ≤ IEe^{λx}/e^{λǫ}.

7.3. Exponential inequalities

For sums or averages of independent random variables the above bounds can be improved from polynomial in 1/ǫ to exponential in ǫ.

Theorem (Bennett). Let x₁, ..., x_n be independent random variables with IEx_i = 0, IEx_i² = σ², and |x_i| ≤ M. For ǫ > 0,

IP( |∑_{i=1}^n x_i| > ǫ ) ≤ 2 exp( −(nσ²/M²) φ(ǫM/(nσ²)) ),

where

φ(z) = (1 + z) log(1 + z) − z.

Proof. We will prove a one-sided version of the above bound, on IP(∑_{i=1}^n x_i > ǫ). For any λ > 0,

IP( ∑_{i=1}^n x_i > ǫ ) ≤ e^{−λǫ}IEe^{λ∑x_i} = e^{−λǫ} ∏_{i=1}^n IEe^{λx_i} = e^{−λǫ}(IEe^{λx})^n.

For a single variable,

IEe^{λx} = IE ∑_{k=0}^∞ (λx)^k/k! = ∑_{k=0}^∞ λ^k IEx^k/k!
        = 1 + ∑_{k=2}^∞ (λ^k/k!) IE(x² x^{k−2}) ≤ 1 + ∑_{k=2}^∞ (λ^k/k!) M^{k−2}σ²
        = 1 + (σ²/M²) ∑_{k=2}^∞ (λM)^k/k! = 1 + (σ²/M²)(e^{λM} − 1 − λM)
        ≤ e^{(σ²/M²)(e^{λM}−λM−1)}.

The last line holds since 1 + x ≤ e^x. Therefore,

(7.1) IP( ∑_{i=1}^n x_i > ǫ ) ≤ e^{−λǫ} e^{(nσ²/M²)(e^{λM}−λM−1)}.

We now optimize with respect to λ by setting the derivative with respect to λ to zero:

0 = −ǫ + (nσ²/M²)(Me^{λM} − M),
e^{λM} = ǫM/(nσ²) + 1,
λ = (1/M) log( 1 + ǫM/(nσ²) ).

The theorem is proven by substituting λ into equation (7.1).

The problem with Bennett's inequality is that it is hard to get a simple expression for ǫ as a function of the probability of the sum exceeding ǫ.

Theorem (Bernstein). Let x₁, ..., x_n be independent random variables with IEx_i = 0, IEx_i² = σ², and |x_i| ≤ M. For ǫ > 0,

IP( |∑_{i=1}^n x_i| > ǫ ) ≤ 2 exp( −ǫ²/(2nσ² + (2/3)ǫM) ).

Proof. Take the proof of Bennett's inequality and notice that

φ(z) ≥ z²/(2 + (2/3)z).

Remark. With Bernstein’s inequality a simple expression for ǫ as a function ofthe probability of the sum exceeding ǫ can be computed

n∑

i=1

xi ≤2

3uM +

√2nσ2u.

Outline.

IP

(

n∑

i=1

xi > ǫ

)

≤ 2e− ǫ2

2nσ2+ 23

ǫM = e−u,

where

u =ǫ2

2nσ2 + 23ǫM

.

we now solve for ǫ

ǫ2 − 2

3ǫM − 2nσ2ǫ = 0

and

ǫ =1

3uM +

u2M2

9+ 2nσ2u.

Since√

a + b ≤ √a +√

b

ǫ =2

3uM +

√2nσ2u.

So with large probabilityn∑

i=1

xi ≤2

3uM +

√2nσ2u.

If we want to bound

|n^{−1} ∑_{i=1}^n f(x_i) − IEf(x)|,

we consider the centered variables f(x_i) − IEf(x), for which

|f(x_i) − IEf(x)| ≤ 2M.

Therefore

∑_{i=1}^n (f(x_i) − IEf(x)) ≤ (4/3)uM + √(2nσ²u)

and

n^{−1} ∑_{i=1}^n f(x_i) − IEf(x) ≤ (4/3)uM/n + √(2σ²u/n).

Similarly,

IEf(x) − n^{−1} ∑_{i=1}^n f(x_i) ≤ (4/3)uM/n + √(2σ²u/n).

In the above bound, when

√(2σ²u/n) ≥ 4uM/n,

which implies u ≤ nσ²/(8M²), the first term is dominated and

|n^{−1} ∑_{i=1}^n f(x_i) − IEf(x)| ≲ √(2σ²u/n) for u ≲ nσ²,

which corresponds to the tail probability of a Gaussian random variable and is predicted by the Central Limit Theorem (CLT) under the condition lim_{n→∞} nσ² = ∞. If lim_{n→∞} nσ² = C, where C is a fixed constant, then

|n^{−1} ∑_{i=1}^n f(x_i) − IEf(x)| ≲ C/n,

which corresponds to the tail probability of a Poisson random variable.

We now look at an even simpler exponential inequality, one where we do not need information on the variance.

Theorem (Hoeffding). Let x₁, ..., x_n be independent random variables with IEx_i = 0 and |x_i| ≤ M_i. For ǫ > 0,

IP( |∑_{i=1}^n x_i| > ǫ ) ≤ 2 exp( −ǫ² / (2 ∑_{i=1}^n M_i²) ).

Proof.

IP( ∑_{i=1}^n x_i > ǫ ) ≤ e^{−λǫ}IEe^{λ∑_{i=1}^n x_i} = e^{−λǫ} ∏_{i=1}^n IEe^{λx_i}.

It can be shown (homework problem) that

IE(e^{λx_i}) ≤ e^{λ²M_i²/2}.

The bound is proven by optimizing the following with respect to λ:

e^{−λǫ} ∏_{i=1}^n e^{λ²M_i²/2}.

Applying Hoeffding’s inequality to

n−1n∑

i=1

f(xi)− IEf(x)

we can state that with probability 1− e−u

n−1n∑

i=1

f(xi)− IEf(x) ≤√

2Mu

n,

Page 71: Statistical Learning: Algorithms and Theory

LECTURE 7. ONE DIMENSIONAL CONCENTRATION INEQUALITIES 67

which is a sub-Gaussian as in the CLT but without the variance information wecan never achieve the 1

n rate we achieved when the random variable has a Poissontail distribution.

We will use the following version of Hoeffding's inequality in later lectures on Kolmogorov chaining and Dudley's entropy integral.

Theorem (Hoeffding). Let x₁, ..., x_n be independent random variables with IP(x_i = M_i) = 1/2 and IP(x_i = −M_i) = 1/2. For ǫ > 0,

IP( |∑_{i=1}^n x_i| > ǫ ) ≤ 2 exp( −ǫ² / (2 ∑_{i=1}^n M_i²) ).

Proof.

IP( ∑_{i=1}^n x_i > ǫ ) ≤ e^{−λǫ}IEe^{λ∑_{i=1}^n x_i} = e^{−λǫ} ∏_{i=1}^n IEe^{λx_i}.

Here

IE(e^{λx_i}) = (1/2)e^{λM_i} + (1/2)e^{−λM_i} = ∑_{k=0}^∞ (M_iλ)^{2k}/(2k)! ≤ e^{λ²M_i²/2}.

The bound is proven by optimizing the following with respect to λ (the optimum is λ = ǫ/∑_{i=1}^n M_i²):

e^{−λǫ} ∏_{i=1}^n e^{λ²M_i²/2}.
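A quick illustrative simulation of this inequality (an assumption-free sanity check, not part of the notes): compare the empirical tail frequency of a Rademacher sum against the bound above.

import numpy as np

rng = np.random.default_rng(0)
n, trials, eps = 100, 100_000, 25.0
sums = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)   # M_i = 1
empirical = np.mean(np.abs(sums) > eps)
bound = 2 * np.exp(-eps**2 / (2 * n))
print(empirical, bound)   # the empirical tail frequency should sit below the bound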

7.4. Martingale inequalities

In the previous section we stated some concentration inequalities for sums of independent random variables. We now look at more complicated functions of independent random variables and introduce a particular martingale inequality to prove concentration.

Let (Ω, L, µ) be a probability space. Let x₁, ..., x_n be real random variables on Ω. Let the function Z(x₁, ..., x_n) : Ω^n → IR be a map from the random variables to a real number.

The function Z concentrates if the deviation between Z(x₁, .., x_n) and IE_{x₁,..,x_n}Z(x₁, .., x_n) goes to zero as n goes to infinity.

Theorem (McDiarmid). Let x₁, ..., x_n be independent random variables and let Z(x₁, ..., x_n) : Ω^n → IR be such that

∀x₁, ..., x_n, x′₁, ..., x′_n :  |Z(x₁, .., x_n) − Z(x₁, ..., x_{i−1}, x′_i, x_{i+1}, ..., x_n)| ≤ c_i;

then

IP(Z − IEZ > ǫ) ≤ e^{−ǫ²/(2∑_{i=1}^n c_i²)}.

Proof.

IP(Z − IEZ > ǫ) = IP(e^{λ(Z−IEZ)} > e^{λǫ}) ≤ e^{−λǫ}IEe^{λ(Z−IEZ)}.

We will use the following very useful telescoping decomposition:

Z(x₁, ..., x_n) − IE_{x′₁,..,x′_n}Z(x′₁, .., x′_n)
 = [Z(x₁, ..., x_n) − IE_{x′₁}Z(x′₁, x₂, ..., x_n)]
 + [IE_{x′₁}Z(x′₁, x₂, ..., x_n) − IE_{x′₁,x′₂}Z(x′₁, x′₂, x₃, ..., x_n)]
 + · · ·
 + [IE_{x′₁,...,x′_{n−1}}Z(x′₁, x′₂, ..., x′_{n−1}, x_n) − IE_{x′₁,...,x′_n}Z(x′₁, ..., x′_n)].

We denote the random variables

z_i(x_i, ..., x_n) := IE_{x′₁,...,x′_{i−1}}Z(x′₁, ..., x′_{i−1}, x_i, ..., x_n) − IE_{x′₁,...,x′_i}Z(x′₁, ..., x′_i, x_{i+1}, ..., x_n),

so that

Z(x₁, ..., x_n) − IE_{x′₁,..,x′_n}Z(x′₁, .., x′_n) = z₁ + · · · + z_n.

The following inequality is true (see the following lemma for a proof):

IE_{x_i}e^{λz_i} ≤ e^{λ²c_i²/2}  ∀λ ∈ IR.

Now

IEe^{λ(Z−IEZ)} = IEe^{λ(z₁+···+z_n)},
IE IE_{x₁}e^{λ(z₁+···+z_n)} = IEe^{λ(z₂+···+z_n)}IE_{x₁}e^{λz₁} ≤ IEe^{λ(z₂+···+z_n)}e^{λ²c₁²/2},

and by induction

IEe^{λ(Z−IEZ)} ≤ e^{λ²∑_{i=1}^n c_i²/2}.

To derive the bound we optimize the following with respect to λ:

e^{−λǫ+λ²∑_{i=1}^n c_i²/2}.

Lemma. For all λ ∈ IR,

IE_{x_i}e^{λz_i} ≤ e^{λ²c_i²/2}.

Proof. For any t ∈ [−1, 1] the function e^{λt} is convex with respect to t, so

e^{λt} = e^{λ((1+t)/2) − λ((1−t)/2)}
      ≤ ((1 + t)/2)e^{λ} + ((1 − t)/2)e^{−λ}
      = (e^{λ} + e^{−λ})/2 + t(e^{λ} − e^{−λ})/2
      ≤ e^{λ²/2} + t sinh(λ).

Set t = z_i/c_i and notice that z_i/c_i ∈ [−1, 1], so

e^{λz_i} = e^{λc_i(z_i/c_i)} ≤ e^{λ²c_i²/2} + (z_i/c_i) sinh(λc_i);

taking expectations and noting that IE_{x_i}z_i = 0 gives

IE_{x_i}e^{λz_i} ≤ e^{λ²c_i²/2}.

Example. We can use McDiarmid's inequality to prove the concentration of the empirical minimum. Given a dataset v₁ = (x₁, y₁), ..., v_n = (x_n, y_n), the empirical minimum is

Z(v₁, ..., v_n) = min_{f∈H_K} n^{−1} ∑_{i=1}^n V(f(x_i), y_i).

If the loss function is bounded, one can show that for all (v₁, ..., v_n, v′_i)

|Z(v₁, ..., v_n) − Z(v₁, ..., v_{i−1}, v′_i, ..., v_n)| ≤ k/n.

Therefore with probability 1 − e^{−u}

|Z − IEZ| ≤ k√(2u/n).

So the empirical minimum concentrates.


LECTURE 8
Vapnik-Cervonenkis theory

8.1. Uniform law of large numbers

In the previous lecture we considered laws of large numbers for a single or fixed function; we termed these one-dimensional concentration inequalities. We now look at uniform laws of large numbers, that is, laws of large numbers that hold uniformly over a class of functions.

The point of these uniform limit theorems is that if the law of large numbers holds for all functions in a hypothesis space then it holds for the empirical minimizer.

The reason this chapter is called Vapnik-Cervonenkis theory is that Vapnik and Cervonenkis provided some of the basic tools to study these classes.

8.2. Generalization bound for one function

Before looking at uniform results we prove generalization results when the hypothesis space H consists of one function, f₁. In this case the empirical risk minimizer is f₁:

f₁ = f_S := arg min_{f∈H} [ n^{−1} ∑_{i=1}^n V(f, z_i) ].

Theorem. Given 0 ≤ V(f₁, z) ≤ M for all z, and S = {z_i}_{i=1}^n drawn i.i.d., then with probability at least 1 − e^{−t} (t > 0)

IE_zV(f₁, z) ≤ n^{−1} ∑_{i=1}^n V(f₁, z_i) + √(M²t/n).

Proof.By Hoeffding’s inequality

IP

(

IEzV (f1, z)− n−1n∑

i=1

V (f1, zi) > ε

)

≤ e−nε2/M2

so

IP

(

IEzV (f1, z)− n−1n∑

i=1

V (f1, zi) ≤ ε

)

> 1− e−nε2/M2

.

Set t = nε2/M2 and the result follows.


8.3. Generalization bound for a finite number of functions

We now look at the case of ERM on a hypothesis space H with a finite number of functions, k = |H|. In this case, the empirical minimizer will be one of the k functions.

Theorem. Given 0 ≤ V(f_j, z) ≤ M for all f_j ∈ H and z, and S = {z_i}_{i=1}^n drawn i.i.d., then with probability at least 1 − e^{−t} (t > 0), for the empirical minimizer f_S,

IE_zV(f_S, z) < n^{−1} ∑_{i=1}^n V(f_S, z_i) + √( M²(log k + t)/n ).

Proof. The following implication of events holds:

{ max_{f_j∈H} [ IE_zV(f_j, z) − n^{−1} ∑_{i=1}^n V(f_j, z_i) ] < ε } ⟹ { IE_zV(f_S, z) − n^{−1} ∑_{i=1}^n V(f_S, z_i) < ε }.

Now

IP( max_{f_j∈H} [ IE_zV(f_j, z) − n^{−1} ∑_{i=1}^n V(f_j, z_i) ] ≥ ε )
 = IP( ∪_{f∈H} { IE_zV(f, z) − n^{−1} ∑_{i=1}^n V(f, z_i) ≥ ε } )
 ≤ ∑_{f_j∈H} IP( IE_zV(f_j, z) − n^{−1} ∑_{i=1}^n V(f_j, z_i) ≥ ε )
 ≤ ke^{−nε²/M²},

where the last step comes from our single function result. Set e^{−t} = ke^{−nε²/M²} and the result follows.

8.4. Generalization bound for compact hypothesis spaces

We now prove a sufficient condition for the generalization of hypothesis spaces with an infinite number of functions and then give some examples of such spaces.

We first assume that our hypothesis space is a subset of the space of continuous functions, H ⊂ C(X).

Definition. A metric space is compact if and only if it is totally bounded and complete.

Definition. Let R be a metric space and ǫ any positive number. A set A ⊂ R is said to be an ǫ-net for a set M ⊂ R if for every x ∈ M there is at least one point a ∈ A such that ρ(x, a) < ǫ. Here ρ(·, ·) is the metric.

Definition. Given a metric space R and a subset M ⊂ R, suppose M has a finite ǫ-net for every ǫ > 0. Then M is said to be totally bounded.

Proposition. A compact space has a finite ǫ-net for all ǫ > 0.

For the remainder of this section we will use the supnorm,

ρ(a, b) = sup_{x∈X} |a(x) − b(x)|.


Definition. Given a hypothesis space H and the supnorm, the covering number N(H, ǫ) is the minimal number ℓ ∈ IN such that there exist functions {g_i}_{i=1}^ℓ with the property that for every f ∈ H

sup_{x∈X} |f(x) − g_i(x)| ≤ ǫ for some i.

We now state a generalization bound for this case. In the bound we assume V(f, z) = (f(x) − y)², but the result can easily be extended to any Lipschitz loss

|V(f₁, z) − V(f₂, z)| ≤ C‖f₁(x) − f₂(x)‖_∞  ∀z.

Theorem. Let H be a compact subset of C(X). Given 0 ≤ |f(x) − y| ≤ M for all f ∈ H and z, and S = {z_i}_{i=1}^n drawn i.i.d., then with probability at least 1 − e^{−t} (t > 0), for the empirical minimizer f_S,

IE_{x,y}(f_S(x) − y)² < n^{−1} ∑_{i=1}^n (f_S(x_i) − y_i)² + √( M²(log N(H, ε/8M) + t)/n ).

We first prove two useful lemmas. Define

D(f, S) := IE_{x,y}(f(x) − y)² − n^{−1} ∑_{i=1}^n (f(x_i) − y_i)².

Lemma. If |f_j(x) − y| ≤ M for j = 1, 2 then

|D(f₁, S) − D(f₂, S)| ≤ 4M‖f₁ − f₂‖_∞.

Proof. Note that

(f₁(x) − y)² − (f₂(x) − y)² = (f₁(x) − f₂(x))(f₁(x) + f₂(x) − 2y),

so

|IE_{x,y}(f₁(x) − y)² − IE_{x,y}(f₂(x) − y)²| = | ∫ (f₁(x) − f₂(x))(f₁(x) + f₂(x) − 2y)dµ(x, y) |
 ≤ ‖f₁ − f₂‖_∞ ∫ |f₁(x) − y + f₂(x) − y|dµ(x, y)
 ≤ 2M‖f₁ − f₂‖_∞,

and

| n^{−1} ∑_{i=1}^n [(f₁(x_i) − y_i)² − (f₂(x_i) − y_i)²] | = | n^{−1} ∑_{i=1}^n (f₁(x_i) − f₂(x_i))(f₁(x_i) + f₂(x_i) − 2y_i) |
 ≤ ‖f₁ − f₂‖_∞ (1/n) ∑_{i=1}^n |f₁(x_i) − y_i + f₂(x_i) − y_i|
 ≤ 2M‖f₁ − f₂‖_∞.

The result follows from the above inequalities.

Lemma. Let H = B₁ ∪ ... ∪ B_ℓ and ε > 0. Then

IP( sup_{f∈H} D(f, S) ≥ ε ) ≤ ∑_{j=1}^ℓ IP( sup_{f∈B_j} D(f, S) ≥ ε ).


Proof. The result follows from the following equivalence and the union bound:

sup_{f∈H} D(f, S) ≥ ε ⟺ ∃j ≤ ℓ s.t. sup_{f∈B_j} D(f, S) ≥ ε.

We now prove Theorem 8.4. Let ℓ = N(H, ε/4M) and let the functions {g_j}_{j=1}^ℓ have the property that the disks B_j centered at g_j with radius ε/4M cover H. By the first lemma, for all f ∈ B_j,

|D(f, S) − D(g_j, S)| ≤ 4M ||f − g_j||_∞ ≤ 4M · ε/4M = ε,

and this implies that

sup_{f∈B_j} |D(f, S)| ≥ 2ε ⇒ |D(g_j, S)| ≥ ε.

So

IP( sup_{f∈B_j} |D(f, S)| ≥ 2ε ) ≤ IP( |D(g_j, S)| ≥ ε ) ≤ 2 e^{−ε²n/M²}.

This combined with the second lemma (after replacing ε by ε/2, so the cover is taken at scale ε/8M, and absorbing constant factors) implies

IP( sup_{f∈H} |D(f, S)| ≥ ε ) ≤ N(H, ε/8M) e^{−ε²n/M²}.

Since the following implication of events holds,

{ sup_{f∈H} [ IE_z V(f, z) − n^{-1} Σ_{i=1}^n V(f, z_i) ] < ε } ⊆ { IE_z V(f_S, z) − n^{-1} Σ_{i=1}^n V(f_S, z_i) < ε },

the result is obtained by setting e^{−t} = N(H, ε/8M) e^{−nε²/M²}

. A result of the above theorem is the following sufficient condition for uniform convergence and consistency of ERM.

Corollary. For a Lipschitz loss function, ERM is consistent if for all ε > 0

lim_{n→∞} log N(H, ε) / n = 0.

Proof. This follows directly from the statement

IP( sup_{f∈H} |D(f, S)| ≥ ε ) ≤ N(H, ε/8M) e^{−ε²n/M²}.

We now compute covering numbers for a few types of hypothesis spaces. We also need the definition of packing numbers.

Definition. Given a hypothesis space H and the supnorm, ℓ functions {g_i}_{i=1}^ℓ are ε-separated if

sup_{x∈X} |g_j(x) − g_i(x)| > ε  ∀ i ≠ j.

The packing number P(H, ε) is the maximum cardinality of ε-separated sets.

The following relationship between packing and covering numbers is very useful.


Lemma. Given a metric space (A, ρ), for all ε > 0 and for every W ⊂ A the covering numbers and packing numbers satisfy

P(W, 2ε, ρ) ≤ N(W, ε, ρ) ≤ P(W, ε, ρ).

Proof. For the second inequality, suppose P is an ε-packing of maximal cardinality P(W, ε, ρ). Then for any w ∈ W there must be a u ∈ P with ρ(u, w) < ε; otherwise w could be added to P, and P ∪ {w} would be an ε-packing, contradicting the assumption that P is a maximal ε-packing. So any maximal ε-packing is an ε-cover.

For the first inequality, suppose C is an ε-cover for W and that P is a 2ε-packing of W with maximum cardinality P(W, 2ε, ρ). We will show that |P| ≤ |C|. Assume that |P| > |C|. Then by the pigeonhole principle there are two points w_1, w_2 ∈ P covered by the same point u ∈ C, and

ρ(w_1, u) ≤ ε and ρ(w_2, u) ≤ ε ⟹ ρ(w_1, w_2) ≤ 2ε.

This contradicts the fact that the points in P are 2ε-separated.
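The maximal-packing construction in the proof is directly implementable. Below is a minimal greedy sketch (Python; the sine family is a hypothetical class, evaluated on a grid so the supnorm becomes a max over grid points) that produces an ε-separated set which, by the lemma, is also an ε-cover:

```python
import numpy as np

def greedy_packing(F, eps):
    """Greedily build a maximal eps-separated subset of the rows of F
    (functions evaluated on a grid) under the discretized sup-norm.
    By the packing/covering lemma this set is also an eps-cover."""
    centers = []
    for f in F:
        if all(np.max(np.abs(f - g)) > eps for g in centers):
            centers.append(f)
    return centers

x = np.linspace(0, 1, 200)
F = [np.sin(w * x) for w in np.linspace(0, 5, 500)]  # hypothetical class
for eps in (0.5, 0.25, 0.1):
    print(f"eps={eps}: packing size {len(greedy_packing(F, eps))}")
```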

In general we will compute packing numbers for hypothesis spaces and use the above lemma to obtain the covering number.

The following proposition will be useful.

Proposition. Given the ball B = {x ∈ IR^d : ||x|| ≤ M} and the standard Euclidean metric ρ(x, y) = ||x − y||, then for ε ≤ M

P(B, ε, ρ) ≤ (3M/ε)^d.

Proof. Let the ℓ points w_1, ..., w_ℓ form an optimal ε-packing, ℓ = P(B, ε, ρ). The balls of radius ε/2 centered at the w_i are disjoint and all contained in the ball of radius M + ε/2. Writing the volume of a ball of radius r in IR^d as C_d r^d,

Vol(M + ε/2) = C_d (M + ε/2)^d,    Vol(ε/2) = C_d (ε/2)^d,

so

ℓ [ C_d (ε/2)^d ] ≤ C_d (M + ε/2)^d  ⟹  ℓ ≤ ( (2M + ε)/ε )^d ≤ (3M/ε)^d

for all ε ≤ M.

Example. Covering numbers for a finite dimensional RKHS. Consider a finite dimensional bounded RKHS

H_K = { f : f(x) = Σ_{p=1}^m c_p φ_p(x) },

with ||f||_K ≤ M.


By the reproducing property and the Cauchy-Schwarz inequality, the supnorm can be bounded by the RKHS norm:

||f||_∞ = sup_{x∈X} |⟨K(x, ·), f(·)⟩_K| ≤ sup_{x∈X} ||K(x, ·)||_K ||f||_K = sup_{x∈X} √(K(x, x)) ||f||_K ≤ κ ||f||_K,

where κ := sup_{x∈X} √(K(x, x)). This means that if we can cover with the RKHS norm we can cover with the supnorm.

Each function in our cover, {g_i}_{i=1}^ℓ, can be written as

g_i(x) = Σ_{p=1}^m d_p^i φ_p(x).

So if we find ℓ vectors d^i such that for all c with Σ_{p=1}^m c_p²/λ_p ≤ M² there exists a d^i with

Σ_{p=1}^m (c_p − d_p^i)²/λ_p < ε²,

we have a cover at scale ε. The above is simply a weighted Euclidean norm, and after rescaling coordinates the problem reduces to covering a ball of radius M in IR^m with the Euclidean metric. Using Proposition 8.4 we can bound the packing numbers with respect to the RKHS norm and, since covering at scale ε/κ in the RKHS norm covers at scale ε in the supnorm, with respect to the supnorm:

P(H, ε, ||·||_{H_K}) ≤ (3M/ε)^m,    P(H, ε, ||·||_∞) ≤ (3κM/ε)^m.

Using Lemma 8.4 we can get a bound on the covering number

N(H, ε, ||·||_∞) ≤ (3κM/ε)^m.

We have shown that for H ⊂ C(X) that is compact with respect to the supnorm we have uniform convergence. This requirement is too strict to determine necessary conditions. A large class of functions to which these conditions do not apply are indicator functions, f(x) ∈ {0, 1}.

8.5. Generalization bound for hypothesis spaces of indicator functions

In this section we derive necessary and sufficient conditions for uniform convergence of indicator functions and, as a result, generalization bounds for indicator functions, f(x) ∈ {0, 1}.

As in the case of compact spaces of functions, we will take a class of indicator functions H and reduce it to some finite set of functions. In the case of indicator functions this is done via the notion of a growth function, which we now define.


Definition. Given a set of n points {x_i}_{i=1}^n and a class of indicator functions H, we say that a function f ∈ H picks out a certain subset of {x_i}_{i=1}^n if this set can be formed by the operation f ∩ {x_i}_{i=1}^n (identifying f with the subset of X on which f = 1). The cardinality of the collection of subsets that can be picked out is called the growth function:

Π_n(H, {x_i}_{i=1}^n) = # { f ∩ {x_i}_{i=1}^n : f ∈ H }.
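To make the growth function concrete, here is a minimal sketch (Python; one-dimensional threshold classifiers are a hypothetical example class) that enumerates the subsets picked out on a sample; for n distinct points it returns n + 1, the number of labelings {x ≥ b} can realize:

```python
import numpy as np

def growth_thresholds(xs):
    """Empirical growth function Pi_n(H, {x_i}) for the class
    H = { x -> 1[x >= b] : b real } on the given sample."""
    xs = np.sort(np.asarray(xs, dtype=float))
    # one threshold below all points, one between each pair, one above all
    cuts = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    labelings = {tuple((xs >= b).astype(int)) for b in cuts}
    return len(labelings)

print(growth_thresholds(np.random.rand(10)))  # 11 = n + 1 for distinct points
```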

We will now state a lemma which will look very much like the generalization results for the compact or finite dimensional case.

Lemma. Let H be a class of indicator functions and S = {z_i}_{i=1}^n drawn i.i.d. Then with probability at least 1 − e^{−t/8} (t > 0) for the empirical minimizer f_S,

IE_{x,y} I_{f_S(x)≠y} < n^{-1} Σ_{i=1}^n I_{f_S(x_i)≠y_i} + √( (8 log(8 Π_n(H, S)) + t) / n ),

where Π_n(H, S) is the growth function given the observations S.

Note: the above result depends upon the particular draw of samples through the growth function. We will remove this dependence soon.

We first prove two useful lemmas. Define

D(f, S) := IE_{x,y} I_{f(x)≠y} − n^{-1} Σ_{i=1}^n I_{f(x_i)≠y_i}.

The first lemma is based upon the idea of symmetrization and replaces the deviation between the empirical and expected errors with the difference between two empirical errors.

Lemma. Given two independent copies of the data, S = {z_i}_{i=1}^n and S' = {z'_i}_{i=1}^n, then for any fixed f ∈ H, if n ≥ 2/ε²,

IP( |D(f, S)| > ε ) ≤ 2 IP( |D(f, S) − D(f, S')| > ε/2 ),

where

|D(f, S) − D(f, S')| = | n^{-1} Σ_{i=1}^n I_{f(x_i)≠y_i} − n^{-1} Σ_{i=1}^n I_{f(x'_i)≠y'_i} |.

Proof. We first show that the assumption

IP( |D(f, S')| ≤ ε/2 | S ) ≥ 1/2

holds when n ≥ 2/ε². Since H is a class of indicator functions, the summands are Bernoulli random variables and the variance of their average is at most 1/4n. So by the Bienaymé-Chebyshev inequality

IP( |D(f, S')| > ε/2 ) ≤ (1/4n) / (ε/2)² = 1/(nε²) ≤ 1/2 when n ≥ 2/ε².

Now condition on S. Since S and S' are independent we can integrate out:

(1/2) IP( |D(f, S)| > ε ) ≤ IP( |D(f, S')| ≤ ε/2, |D(f, S)| > ε ).

By the triangle inequality, |D(f, S)| > ε and |D(f, S')| ≤ ε/2 imply

|D(f, S) − D(f, S')| ≥ ε/2,

so

IP( |D(f, S')| ≤ ε/2, |D(f, S)| > ε ) ≤ IP( |D(f, S) − D(f, S')| ≥ ε/2 ),

which completes the proof.

By symmetrizing we now have a term that depends only on samples. The problem is that it depends not only on the samples we have but also on an independent copy. This nuisance is removed by a second step of symmetrization.

Lemma. Let σ_i be Rademacher random variables (IP(σ_i = ±1) = 1/2). Then

IP( |D(f, S) − D(f, S')| > ε/2 ) ≤ 2 IP( | n^{-1} Σ_{i=1}^n σ_i I_{f(x_i)≠y_i} | > ε/4 ).

Proof. Since S and S' are independent copies, the distribution of D(f, S) − D(f, S') is unchanged when z_i and z'_i are swapped, so

IP( |D(f, S) − D(f, S')| > ε/2 ) = IP( | n^{-1} Σ_{i=1}^n σ_i I_{f(x_i)≠y_i} − n^{-1} Σ_{i=1}^n σ_i I_{f(x'_i)≠y'_i} | > ε/2 )
≤ IP( | n^{-1} Σ_{i=1}^n σ_i I_{f(x_i)≠y_i} | > ε/4 ) + IP( | n^{-1} Σ_{i=1}^n σ_i I_{f(x'_i)≠y'_i} | > ε/4 )
≤ 2 IP( | n^{-1} Σ_{i=1}^n σ_i I_{f(x_i)≠y_i} | > ε/4 ).

The second symmetrization step allows us to bound the deviation between the empirical and expected errors by a quantity computed on the empirical data alone.

We now prove Lemma 8.5. By the symmetrization lemmas, for n ≥ 8/ε²,

IP( |D(f, S)| > ε ) ≤ 4 IP( | n^{-1} Σ_{i=1}^n σ_i I_{f(x_i)≠y_i} | > ε/4 ).

By the Rademacher version of Hoeffding's inequality,

IP( | n^{-1} Σ_{i=1}^n σ_i I_{f(x_i)≠y_i} | > ε ) ≤ 2 e^{−2nε²}.

Combining the above, for a single function

IP( |D(f, S)| > ε ) ≤ 8 e^{−nε²/8}.

Given data S, the growth function characterizes the cardinality of the collection of subsets that can be "picked out", which is a bound on the number of possible labelings, or realizable functions, ℓ = Π_n(H, S). We index the possible labelings by f_j, where j = 1, ..., ℓ.


We now proceed as in the case of a finite number of functions:

IP( sup_{f∈H} |D(f, S)| ≥ ε ) = IP( ⋃_{f∈H} { |D(f, S)| ≥ ε } ) ≤ Σ_{j=1}^ℓ IP( |D(f_j, S)| ≥ ε ) ≤ 8 Π_n(H, S) e^{−nε²/8}.

Setting e^{−t/8} = 8 Π_n(H, S) e^{−nε²/8} completes the proof.

This bound is not uniform since the growth function depends on the data S. We can make the bound uniform by defining a uniform notion of the growth function.

Definition. The uniform growth function is

Π_n(H) = max_{x_1,...,x_n ∈ X} Π_n(H, {x_i}_{i=1}^n).

Corollary. Let H be a class of indicator functions and S = {z_i}_{i=1}^n drawn i.i.d. Then with probability at least 1 − e^{−t/8} (t > 0) for the empirical minimizer f_S,

IE_{x,y} I_{f_S(x)≠y} < n^{-1} Σ_{i=1}^n I_{f_S(x_i)≠y_i} + √( (8 log(8 Π_n(H)) + t) / n ),

where Π_n(H) is the uniform growth function.

Corollary. For a class of indicator functions, ERM is consistent if and only if

lim_{n→∞} log Π_n(H) / n = 0.

We now characterize conditions under which the uniform growth function grows polynomially. To do this we need a few definitions.

Definition. A hypothesis space, H, shatters a set {x_1, ..., x_n} if each of its 2^n subsets can be "picked out" by H. The Vapnik-Cervonenkis (VC) dimension, v(H), of a hypothesis space is the largest n for which some set of size n is shattered by H:

v(H) = sup { n : Π_n(H) = 2^n };

if Π_n(H) = 2^n for all n, then the VC dimension is infinite.

Definition. A hypothesis space of indicator functions H is a VC class if and only if it has finite VC dimension.


The VC dimension controls the growth function via the following lemma.

Lemma. For a hypothesis space H with VC dimension d and n > d,

Π_n(H) ≤ Σ_{i=0}^d C(n, i),

where C(n, i) denotes the binomial coefficient.


Proof. The proof is by induction on n + d. Define Φ_d(n) := Σ_{i=0}^d C(n, i), with the convention C(n, i) := 0 if i < 0 or i > n. In addition one can check the Pascal identity

C(n, i) = C(n−1, i−1) + C(n−1, i).

When d = 0, no points can be shattered, so H realizes a single labeling and for all n

Π_n(H) = 1 = C(n, 0) = Φ_0(n).

When n = 0 there is only one way to label 0 examples, so

Π_0(H) = 1 = Σ_{i=0}^d C(0, i) = Φ_d(0).

Assume the lemma holds for all n', d' such that n' + d' < n + d. Given S = {x_1, ..., x_n}, let S_{n−1} = {x_1, ..., x_{n−1}}. We now define three sets of labelings:

H_0 := { f_i : i = 1, ..., Π_n(H, S) }, the distinct labelings of S by H;
H_1 := { f_i : i = 1, ..., Π_{n−1}(H, S_{n−1}) }, the distinct labelings of S_{n−1} by H;
H_2 := the labelings in H_1 that are induced by two labelings in H_0 differing only on x_n,

so that |H_0| = |H_1| + |H_2|.

For the set H_1 over S_{n−1}: there is one fewer sample, and v(H_1) ≤ d since restriction cannot increase the VC dimension.

For the set H_2 over S_{n−1}: there is one fewer sample, and v(H_2) ≤ d − 1. Indeed, if S' ⊆ S_{n−1} is shattered by H_2, then every labeling of S' occurs with both labels on x_n, so S' ∪ {x_n}, which has cardinality |S'| + 1, is shattered by H; hence |S'| cannot be more than d − 1.

By induction, Π_{n−1}(H_1, S_{n−1}) ≤ Φ_d(n − 1) and Π_{n−1}(H_2, S_{n−1}) ≤ Φ_{d−1}(n − 1). By construction,

Π_n(H, S) = |H_1| + |H_2| ≤ Φ_d(n − 1) + Φ_{d−1}(n − 1)
= Σ_{i=0}^d C(n−1, i) + Σ_{i=0}^d C(n−1, i−1)
= Σ_{i=0}^d [ C(n−1, i) + C(n−1, i−1) ]
= Σ_{i=0}^d C(n, i).

Lemma. For n ≥ d ≥ 1,

Σ_{i=0}^d C(n, i) < (en/d)^d.


Proof. For 0 ≤ i ≤ d and n ≥ d,

(n/d)^d (d/n)^i ≥ 1,

so

Σ_{i=0}^d C(n, i) ≤ (n/d)^d Σ_{i=0}^d C(n, i)(d/n)^i ≤ (n/d)^d (1 + d/n)^n < (ne/d)^d.
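The two lemmas are easy to check numerically. A minimal sketch (Python) comparing Sauer's bound Φ_d(n) with 2^n and with the (en/d)^d estimate:

```python
from math import comb, e

def phi(d, n):
    """Sauer's bound Phi_d(n) = sum_{i=0}^{d} C(n, i) on the growth function."""
    return sum(comb(n, i) for i in range(d + 1))

d = 5
for n in (5, 10, 20, 50):
    print(n, phi(d, n), 2**n, round((e * n / d) ** d))
```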

This now lets us state the generalization bound in terms of the VC dimension.

Theorem. Let H be a class of indicator functions with VC dimension d and S = {z_i}_{i=1}^n drawn i.i.d. Then with probability at least 1 − e^{−t/8} (t > 0) for the empirical minimizer f_S,

IE_{x,y} I_{f_S(x)≠y} < n^{-1} Σ_{i=1}^n I_{f_S(x_i)≠y_i} + 2 √( (8d log(8en/d) + t) / n ).

Proof. From the proof of the lemma we have

IP( sup_{f∈H} |D(f, S)| ≥ ε ) ≤ 8 Π_n(H, S) e^{−nε²/8};

therefore, since Π_n(H, S) ≤ (en/d)^d, we have

IP( sup_{f∈H} |D(f, S)| ≥ ε ) ≤ 8 (en/d)^d e^{−nε²/8},

and setting e^{−t/8} = 8 (en/d)^d e^{−nε²/8} gives us

IE_{x,y} I_{f_S(x)≠y} < n^{-1} Σ_{i=1}^n I_{f_S(x_i)≠y_i} + √( (8d log(en/d) + t + 8 log 8) / n ).

For n > 2 and d ≥ 1, 8 log 8 < 8d log(8en/d), so

√( (8d log(en/d) + t + 8 log 8) / n ) < 2 √( (8d log(8en/d) + t) / n ),

which proves the theorem.
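A minimal sketch (Python) of how the deviation term in this bound scales with n, at confidence 1 − e^{−t/8} = 0.95 and a hypothetical VC dimension d = 10:

```python
import numpy as np

def vc_deviation(n, d, t):
    """Deviation term 2 * sqrt((8 d log(8 e n / d) + t) / n) from the VC bound."""
    return 2.0 * np.sqrt((8 * d * np.log(8 * np.e * n / d) + t) / n)

t = 8 * np.log(1 / 0.05)  # so that 1 - e^{-t/8} = 0.95
for n in (10**3, 10**4, 10**5, 10**6):
    print(f"n={n:>7d}: deviation {vc_deviation(n, d=10, t=t):.4f}")
```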

Theorem. For a class of indicator functions H, the following are equivalent:

(1) ERM is consistent;
(2) lim_{n→∞} log Π_n(H) / n = 0;
(3) the VC dimension v(H) is finite.

8.6. Kolmogorov chaining

In this section we introduce Kolmogorov chaining, which is a much more efficient way of constructing a cover. In the process we derive Dudley's entropy integral.

We first define an empirical norm.


Definition. Given S = {x_1, ..., x_n}, the empirical ℓ_2 norm is

ρ_S(f, g) = ( n^{-1} Σ_{i=1}^n (f(x_i) − g(x_i))² )^{1/2}.

We can define a cover with respect to the empirical norm.

Definition. Given a hypothesis space H and the above norm, the covering number N(H, ε, ρ_S) is the minimal number ℓ ∈ IN such that for every f ∈ H there exist functions {g_i}_{i=1}^ℓ such that

ρ_S(f, g_i) ≤ ε for some i.

The proof of the following theorem is identical to the proof of lemma 8.5.

Theorem. Given the square loss, let H be a set of functions such that −1 ≤ f(x) ≤ 1 and y ∈ [−1, 1], and let S = {z_i}_{i=1}^n be drawn i.i.d. Then with probability at least 1 − e^{−t/8} (t > 0) for the empirical minimizer f_S,

IE_{x,y}(f_S(x) − y)² < n^{-1} Σ_{i=1}^n (f_S(x_i) − y_i)² + √( (8 log N(H, ε/8M, ρ_S) + t) / n ),

where N(H, ε/8M, ρ_S) is the empirical covering number and M bounds |f(x) − y| (here M = 2).

The key idea in the proof of both Lemma 8.5 and the above theorem is that

IP( |D(f, S)| > ε ) ≤ 4 IP( | n^{-1} Σ_{i=1}^n σ_i f(x_i) | > ε/4 ),

where

D(f, S) := IE_{x,y}(f(x) − y)² − n^{-1} Σ_{i=1}^n (f(x_i) − y_i)²,

and the σ_i are Rademacher random variables. We now prove the chaining theorem.

Theorem. Given a hypothesis space H where −1 ≤ f(x) ≤ 1 for all f ∈ H, if we define

R(f) = n^{-1} Σ_{i=1}^n σ_i f(x_i),

then

IP( ∀f ∈ H, R(f) ≤ (2^{9/2}/√n) ∫_0^{d(0,f)} log^{1/2} P(H, ε, ρ_S) dε + 2^{7/2} d(0, f) √(u/n) ) ≥ 1 − e^{−u},

where d(0, f) = ρ_S(0, f), P(H, ε, ρ_S) is the empirical packing number, and ∫_0^{d(0,f)} log^{1/2} P(H, ε, ρ_S) dε is Dudley's entropy integral.

Proof. Without loss of generality we will assume that the zero function 0 is in H. We will construct a nested sequence of sets of functions

{0} = H_0 ⊆ H_1 ⊆ H_2 ⊆ ... ⊆ H_j ⊆ ... ⊆ H.

These subsets will have the following properties:

(1) ∀f, g ∈ H_j, ρ_S(f, g) > 2^{−j};
(2) ∀f ∈ H ∃ g ∈ H_j such that ρ_S(f, g) ≤ 2^{−j}.

Given a set H_j we can construct H_{j+1} via the following greedy procedure (a code sketch of this construction follows the list):

(1) Initialize H_{j+1} := H_j.
(2) Find an f ∈ H such that for all g ∈ H_{j+1}, ρ_S(f, g) > 2^{−(j+1)}.
(3) Add this f to H_{j+1}, and repeat steps (2)-(3) until no such f remains.
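For a finite class the construction is immediate. A minimal greedy sketch (Python; the linear family below is a hypothetical class that contains the zero function) building the levels H_j under the empirical ℓ_2 norm:

```python
import numpy as np

def chaining_levels(F, j_max):
    """Greedily build nested 2^{-j}-separated sets H_0 <= H_1 <= ... <= H
    from a finite class F (rows = functions evaluated on the sample),
    under the empirical l2 norm rho_S. Assumes F[0] is the zero function."""
    rho = lambda f, g: np.sqrt(np.mean((f - g) ** 2))
    levels = [[F[0]]]                                   # H_0 = {0}
    for j in range(1, j_max + 1):
        H = list(levels[-1])
        for f in F:
            if all(rho(f, g) > 2.0 ** (-j) for g in H):
                H.append(f)                             # keeps H 2^{-j}-separated
        levels.append(H)
    return levels

x = np.linspace(0, 1, 50)
F = np.array([w * x for w in np.linspace(0, 1, 200)])   # F[0] = 0
print([len(H) for H in chaining_levels(F, 5)])          # |H_j| grows with j
```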

We now define a projection operation π_j : H → H_j: given f ∈ H, π_j(f) := g, where g ∈ H_j is such that ρ_S(g, f) ≤ 2^{−j}.

For all f ∈ H the following chaining holds:

f = π_0(f) + (π_1(f) − π_0(f)) + (π_2(f) − π_1(f)) + ... = Σ_{j=1}^∞ (π_j(f) − π_{j−1}(f)),

and

ρ_S(π_{j−1}(f), π_j(f)) ≤ ρ_S(π_{j−1}(f), f) + ρ_S(π_j(f), f) ≤ 2^{−(j−1)} + 2^{−j} = 3 · 2^{−j} ≤ 2^{−j+2}.

R(f) is a linear function of f, so

R(f) = Σ_{j=1}^∞ R( π_j(f) − π_{j−1}(f) ).

The set of links in the chain between two levels is defined as follows:

L_{j−1,j} := { f − g : f ∈ H_j, g ∈ H_{j−1}, ρ_S(f, g) ≤ 2^{−j+2} }.

For a fixed link ℓ ∈ L_{j−1,j},

R(ℓ) = n^{-1} Σ_{i=1}^n σ_i ℓ(x_i),

and n^{-1} Σ_{i=1}^n ℓ²(x_i) = ρ_S(f, g)² ≤ 2^{−2j+4}, so by Hoeffding's inequality

IP( R(ℓ) ≥ t ) ≤ e^{ −nt² / ( 2 n^{-1} Σ_{i=1}^n ℓ²(x_i) ) } ≤ e^{ −nt² / 2^{−2j+5} }.

The cardinality of the set of links is

|L_{j−1,j}| ≤ |H_j| · |H_{j−1}| ≤ |H_j|².

So

IP( ∀ℓ ∈ L_{j−1,j}, R(ℓ) ≤ t ) ≥ 1 − |H_j|² e^{−nt²/2^{−2j+5}},

and setting

t = √( (2^{−2j+5}/n)(4 log|H_j| + u) ) ≤ √( 2^{−2j+5} · 4 log|H_j| / n ) + √( 2^{−2j+5} u / n )

gives us

IP( ∀ℓ ∈ L_{j−1,j}, R(ℓ) ≤ 2^{7/2} 2^{−j} log^{1/2}|H_j| / √n + 2^{5/2} 2^{−j} √(u/n) ) ≥ 1 − (1/|H_j|) e^{−u}.


If H_{j−1} = H_j then

π_{j−1}(f) = π_j(f) and L_{j−1,j} = {0},

so such levels contribute nothing. Over all levels and links, with probability at least 1 − Σ_{j≥1} (1/|H_j|) e^{−u},

∀j ≥ 1, ∀ℓ ∈ L_{j−1,j}, R(ℓ) ≤ 2^{7/2} 2^{−j} log^{1/2}|H_j| / √n + 2^{5/2} 2^{−j} √(u/n),

and, discarding the trivial levels,

1 − Σ_{j≥1} (1/|H_j|) e^{−u} ≥ 1 − Σ_{j≥2} (1/j²) e^{−u} = 1 − (π²/6 − 1) e^{−u} ≥ 1 − e^{−u}.

For some level k,

2^{−(k+1)} ≤ d(0, f) ≤ 2^{−k}

and

0 = π_0(f) = π_1(f) = · · · = π_k(f).

So, using |H_j| ≤ P(H, 2^{−j}, ρ_S) and Σ_{j>k} 2^{−j} = 2^{−k},

R(f) = Σ_{j=k+1}^∞ R( π_j(f) − π_{j−1}(f) )
≤ Σ_{j=k+1}^∞ ( (2^{7/2} 2^{−j}/√n) log^{1/2}|H_j| + 2^{5/2} 2^{−j} √(u/n) )
≤ Σ_{j=k+1}^∞ ( (2^{7/2} 2^{−j}/√n) log^{1/2} P(H, 2^{−j}, ρ_S) ) + 2^{5/2} 2^{−k} √(u/n).

Since 2^{−k} < 2 d(0, f), the second term in the theorem follows:

2^{7/2} d(0, f) √(u/n).

For the first term,

Σ_{j=k+1}^∞ (2^{7/2} 2^{−j}/√n) log^{1/2} P(H, 2^{−j}, ρ_S) = (2^{9/2}/√n) Σ_{j=k+1}^∞ 2^{−(j+1)} log^{1/2} P(H, 2^{−j}, ρ_S)
≤ (2^{9/2}/√n) ∫_0^{2^{−(k+1)}} log^{1/2} P(H, ε, ρ_S) dε
≤ (2^{9/2}/√n) ∫_0^{d(0,f)} log^{1/2} P(H, ε, ρ_S) dε;

the above quantity is Dudley's entropy integral.
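A minimal numerical sketch (Python) of the chaining bound; it assumes the parametric-class packing estimate P(H, ε, ρ_S) ≤ (3M/ε)^d from the earlier proposition, which is an assumption about the class, not the general case:

```python
import numpy as np
from scipy.integrate import quad

def chaining_bound(n, diam, d, M, u):
    """2^{9/2}/sqrt(n) * int_0^diam sqrt(log P(eps)) d(eps) + 2^{7/2} diam sqrt(u/n),
    with the packing estimate P(H, eps) <= (3M/eps)^d plugged in."""
    entropy = lambda eps: np.sqrt(d * np.log(3 * M / eps))
    integral, _ = quad(entropy, 1e-12, diam)   # integrable singularity at 0
    return 2**4.5 / np.sqrt(n) * integral + 2**3.5 * diam * np.sqrt(u / n)

for n in (10**3, 10**4, 10**5):
    print(n, round(chaining_bound(n, diam=1.0, d=5, M=1.0, u=np.log(20)), 4))
```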

8.7. Covering numbers and VC dimension

In this section we will show how to bound covering numbers via the VC dimension. Covering numbers as we have introduced them apply in general to real-valued functions, not indicator functions.

The notion of VC dimension and VC classes can be extended to real-valued functions in a variety of ways. The most standard extension is the notion of VC subgraph classes.


Definition. The subgraph of a function f : X → IR is the set

F_f = { (x, t) ∈ X × IR : 0 ≤ t ≤ f(x) or f(x) ≤ t ≤ 0 }.

Definition. The subgraphs of a class of functions H are the sets

F = { F_f : f ∈ H }.

Definition. If F is a VC class of sets, then H is a VC subgraph class of functions and v(H) = v(F).

We now show that, for a hypothesis space with finite VC dimension, the covering number in the empirical ℓ_1 norm can be upper-bounded by a function of the VC dimension.

Theorem. Given a VC subgraph class H where −1 ≤ f(x) ≤ 1 for all f ∈ H and x ∈ X, with v(H) = d and ρ_S(f, g) = n^{-1} Σ_{i=1}^n |f(x_i) − g(x_i)|, then

P(H, ε, ρ_S) ≤ ( (8e/ε) log(7/ε) )^d.

The bound in the above theorem can be improved to

P(H, ε, ρ_S) ≤ (K/ε)^d;

however, the proof is more complicated, so we prove the weaker statement.

Proof. Set m = P(H, ε, ρ_S), so that f_1, ..., f_m are ε-separated, and each function f_k has its respective subgraph F_{f_k}. Sample k elements z_1, ..., z_k uniformly from {x_1, ..., x_n} and k elements t_1, ..., t_k uniformly on [−1, 1].

We now bound the probability that the subgraphs of two ε-separated functions pick out different subsets of {(z_1, t_1), ..., (z_k, t_k)}:

IP( F_{f_k} and F_{f_l} pick out different subsets of {(z_1, t_1), ..., (z_k, t_k)} )
= IP( at least one (z_i, t_i) is picked out by either F_{f_k} or F_{f_l} but not the other )
= 1 − IP( all (z_i, t_i) are picked out by both or by neither ).

The probability that a given (z_i, t_i) is picked out by both F_{f_k} and F_{f_l} or by neither is 1 − |f_k(z_i) − f_l(z_i)|/2.

8.8. Symmetrization and Rademacher complexities

In the previous lectures we have considered various complexity measures, such as covering numbers. But what is the right notion of complexity for the learning problem we posed? Consider the covering numbers for a moment. Take a small function class and take its convex hull. The resulting class can be extremely large. Nevertheless, the supremum of the difference of expected and empirical errors will be attained at the vertices, i.e. at the base class. In some sense, the "inside" of the class does not matter. The covering numbers take into account the whole class, and therefore become very large for the convex hull, even though the essential complexity is that of the base class. This suggests that covering numbers are not the ideal complexity measure. In this lecture we introduce another notion (Rademacher averages), which can be claimed to be the "correct" one. In particular,


the Rademacher averages of a convex hull will be equal to those of the base class. This notion of complexity will be shown to have other nice properties.

Instead of jumping right to the definition of Rademacher averages, we will take a longer route and show how these averages arise. Results on this topic can be found in the theory of empirical processes, and so we will give some definitions from it.

Let F be a class of functions. Then (Z_i)_{i∈I} is a random process indexed by F if Z_i(f) is a random variable for any i and any f ∈ F.

As before, μ is a probability measure on Ω, and the data x_1, ..., x_n ∼ μ. Then μ_n is the empirical measure supported on x_1, ..., x_n:

μ_n = n^{-1} Σ_{i=1}^n δ_{x_i}.

Define Z_i(·) = (δ_{x_i} − μ)(·), i.e.

Z_i(f) = f(x_i) − IE_μ(f).

Then Z_1, ..., Z_n is an i.i.d. process with mean 0. In the previous lectures we looked at the quantity

(8.1)  sup_{f∈F} | n^{-1} Σ_{i=1}^n f(x_i) − IEf |,

which can be written as sup_{f∈F} | n^{-1} Σ_{i=1}^n Z_i(f) |. Recall that the difficulty with (8.1) is that we do not know μ and therefore cannot calculate IEf. The classical approach of covering F and using the union bound is too loose.

Proposition (Symmetrization). If n^{-1} Σ_{i=1}^n f(x_i) is close to IEf for data x_1, ..., x_n, then n^{-1} Σ_{i=1}^n f(x_i) is close to n^{-1} Σ_{i=1}^n f(x'_i), the empirical average on x'_1, ..., x'_n (an independent copy of x_1, ..., x_n). Therefore, if the two empirical averages are far from each other, then the empirical error is far from the expected error.

Now fix one function f. Let ε_1, ..., ε_n be i.i.d. Rademacher random variables (taking the values ±1, each with probability 1/2). Then

IP( | Σ_{i=1}^n (f(x_i) − f(x'_i)) | ≥ t ) = IP( | Σ_{i=1}^n ε_i (f(x_i) − f(x'_i)) | ≥ t )
≤ IP( | Σ_{i=1}^n ε_i f(x_i) | ≥ t/2 ) + IP( | Σ_{i=1}^n ε_i f(x'_i) | ≥ t/2 )
= 2 IP( | Σ_{i=1}^n ε_i f(x_i) | ≥ t/2 ).

Together with symmetrization, this suggests that controlling IP( |Σ_{i=1}^n ε_i f(x_i)| ≥ t/2 ) is enough to control IP( | n^{-1} Σ_{i=1}^n f(x_i) − IEf | ≥ t ). Of course, this is a very simple example. Can we do the same with quantities that are uniform over the class?

Definition. Supremum of an empirical process:

Z(x_1, ..., x_n) = sup_{f∈F} [ IEf − n^{-1} Σ_{i=1}^n f(x_i) ].


Definition. Supremum of a Rademacher process:

R(x_1, ..., x_n, ε_1, ..., ε_n) = sup_{f∈F} [ n^{-1} Σ_{i=1}^n ε_i f(x_i) ].

Proposition. The expectation of the Rademacher process bounds the expectation of the empirical process:

IE Z ≤ 2 IE R.

The quantity IE R is called a Rademacher average.

Proof.

IE Z = IE_x sup_{f∈F} [ IEf − n^{-1} Σ_{i=1}^n f(x_i) ]
= IE_x sup_{f∈F} [ IE_{x'}( n^{-1} Σ_{i=1}^n f(x'_i) ) − n^{-1} Σ_{i=1}^n f(x_i) ]
≤ IE_{x,x'} sup_{f∈F} n^{-1} Σ_{i=1}^n ( f(x'_i) − f(x_i) )
= IE_{x,x',ε} sup_{f∈F} n^{-1} Σ_{i=1}^n ε_i ( f(x'_i) − f(x_i) )
≤ IE_{x,x',ε} sup_{f∈F} n^{-1} Σ_{i=1}^n ε_i f(x'_i) + IE_{x,x',ε} sup_{f∈F} n^{-1} Σ_{i=1}^n (−ε_i) f(x_i)
= 2 IE R.
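For a finite class the Rademacher average can be estimated directly by Monte Carlo. A minimal sketch (Python; the family of sign functions on random data is a hypothetical class):

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_average(F, n_mc=2000):
    """Monte Carlo estimate of IE_eps sup_{f in F} n^{-1} sum_i eps_i f(x_i).
    F has shape (num_functions, n): functions evaluated on the data."""
    k, n = F.shape
    eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
    sups = (eps @ F.T / n).max(axis=1)   # sup over the class, per draw of signs
    return sups.mean()

x = rng.uniform(-1, 1, size=200)
F = np.array([np.sign(x - w) for w in np.linspace(-1, 1, 50)])  # hypothetical class
print(rademacher_average(F))
```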

As we discussed previously, we would like to bound the empirical process Z, since this will imply "generalization" for any function in F. We will bound Z by the Rademacher average IE R, which we will see has some nice properties.

Theorem. If the functions in F are uniformly bounded between [a, b], then with probability 1 − e^{−u},

Z ≤ 2 IE R + (b − a) √(2u/n).

Proof. The inequality involves two steps:

(1) the concentration of Z around its mean IE Z;
(2) applying the bound IE Z ≤ 2 IE R.

We will use McDiarmid's inequality for the first step. We define the following two variables: Z := Z(x_1, ..., x_i, ..., x_n) and Z^i := Z(x_1, ..., x'_i, ..., x_n). Since a ≤ f(x) ≤ b for all x and f ∈ F:

|Z^i − Z| = | sup_{f∈F} [ IEf − n^{-1} Σ_{j=1}^n f(x_j) + n^{-1}( f(x_i) − f(x'_i) ) ] − sup_{f∈F} [ IEf − n^{-1} Σ_{j=1}^n f(x_j) ] |
≤ sup_{f∈F} n^{-1} | f(x_i) − f(x'_i) | ≤ (b − a)/n = c_i.


This bounds the martingale differences for the empirical process. Given the difference bound, McDiarmid's inequality states

IP( Z − IE Z > t ) ≤ exp( −t² / ( 2 Σ_{i=1}^n (b−a)²/n² ) ) = exp( −nt² / ( 2(b − a)² ) ).

Therefore, with probability at least 1 − e^{−u},

Z − IE Z < (b − a) √(2u/n).

So as the number of samples, n, grows, Z becomes more and more concentrated around IE Z.

Applying symmetrization proves the theorem: with probability at least 1 − e^{−u},

Z ≤ IE Z + (b − a) √(2u/n) ≤ 2 IE R + (b − a) √(2u/n).

McDiarmid's inequality does not incorporate a notion of variance, so it is possible to obtain a sharper result using Talagrand's inequality for the suprema of empirical processes.

We are now left with bounding the Rademacher average. Implicit in the earlier section on Kolmogorov chaining was such a bound. Before we restate that result and give some examples, we state some nice and useful properties of Rademacher averages.

Properties. Let F, G be classes of real-valued functions. Then for any n:

(1) If F ⊆ G, then IE R(F) ≤ IE R(G).
(2) IE R(F) = IE R(conv F).
(3) ∀c ∈ IR, IE R(cF) = |c| IE R(F).
(4) If φ : IR → IR is L-Lipschitz and φ(0) = 0, then IE R(φ ∘ F) ≤ 2L IE R(F).
(5) For RKHS balls, c (Σ_{i=1}^∞ λ_i)^{1/2} ≤ IE R(F_K) ≤ C (Σ_{i=1}^∞ λ_i)^{1/2}, where the λ_i are the eigenvalues of the corresponding linear operator in the RKHS.

Theorem. The Rademacher average is bounded by Dudley's entropy integral:

IE_ε R ≤ (c_1/√n) ∫_0^D log^{1/2} N(ε, F, L_2(μ_n)) dε,

where N denotes the covering number.

Example. Let F be a class with finite VC dimension V. Then

N(ε, F, L_2(μ_n)) ≤ (2/ε)^{kV}

for some constant k. The entropy integral above is bounded as

∫_0^1 log^{1/2} N(ε, F, L_2(μ_n)) dε ≤ ∫_0^1 √( kV log(2/ε) ) dε ≤ k' √V ∫_0^1 √( log(2/ε) ) dε ≤ k'' √V.

Therefore, IE_ε R ≤ k √(V/n) for some constant k.

LECTURE 9
Generalization bounds

9.1. Generalization bounds and stability

We define an algorithm A to be a mapping from a training set S = {z_1, . . . , z_n} to a function f_S. Here, z_i := (x_i, y_i).

How do we measure the performance of A? First, we introduce a loss function V, so that V(f(x), y) measures the penalty of predicting the value f(x) at point x while the true value is y. One goal of learning is to minimize the expected error at a new point:

I[f] := IE_{(x,y)} [ V(f(x), y) ] = ∫ V(f(x), y) dμ(x, y).

Unfortunately, we cannot compute the above quantity because the measure μ is unknown. We therefore try to upper bound it. The natural approach is to approximate the expectation by the average (the empirical error¹) on the given sample S:

I_S[f] = n^{-1} Σ_{i=1}^n V(f(x_i), y_i).

Generalization bounds are probabilistic bounds on the difference I[f] − I_S[f]. But how far is I_S[f] from I[f]? For a fixed function f, the difference between these two quantities is small (law of large numbers). What if the algorithm chooses a different f? Then f is itself a random function, denoted by f_S, and in general we cannot control the difference I_S[f_S] − I[f_S].

The classical uniform generalization bounds use some notion of the "size" of the function class and hold for all functions in the class:

IP_S { sup_{f∈F} |I[f] − I_S[f]| > ε } ≤ φ(F, n, ε).

It doesn't matter what the algorithm is doing, because these bounds give a guarantee for all functions at the same time! These types of bounds ignore the fact that we are dealing with a specific algorithm. The only knowledge about the algorithm used in these types of bounds is that "functions are chosen from a fixed function class F". Once the function class is fixed, no matter what algorithm we are using, uniform bounds provide the same bound on the generalization error.

¹The true and empirical risks are denoted in Bousquet & Elisseeff as R(A, S) and R_emp(A, S), respectively, to emphasize the algorithm that produced f_S.

Now, imagine that the algorithm actually outputs only one function. The generalization bound then follows, as mentioned above, from the law of large numbers, and is tighter than the uniform bound over a large function class. Of course, this is an extreme example, but the main message is: to get better bounds on the performance of the algorithm, one should use knowledge about the specific algorithm.

Such knowledge can come in different forms. One useful (as we will see now) assumption is "algorithmic stability". It turns out that if the function output by the algorithm does not change much when one training point is perturbed, generalization bounds follow quite easily.

9.1.1. Uniform stability

In this section we introduce the notion of uniform stability and show how generalization bounds can be obtained for uniformly stable algorithms. In the next section we will show that the Tikhonov regularization algorithm exhibits such stability, and therefore we get guarantees on its performance.

We assume that A is deterministic, and that A does not depend on the ordering of the points in the training set.

Define D[S] = I[f_S] − I_S[f_S], the defect, or generalization error. It measures the discrepancy between the expected loss and the empirical estimate. Since we can measure I_S[f_S], bounds on D[S] can be translated into bounds on I[f_S]. We would like to show that with high probability D[S] is small. Then, if we observe that I_S[f_S] is small, it will follow that I[f_S] must be small. Note that in this approach we are not concerned with "good performance" of all functions, but only of the one produced by our algorithm:

IP_S ( |I_S[f_S] − I[f_S]| > ε ) < δ.

Given a training set S, we define S^{i,z} to be the new training set obtained when point i of S is replaced by the new point z ∈ Z. We will overload notation by writing the loss function as V(f, z) instead of V(f(x), y), where z = (x, y). To obtain the results of this section, we also need to assume that the loss function V is positive and upper-bounded by M.

Definition. We say that an algorithm A has uniform stability β (is β-stable) if

∀(S, z) ∈ Z^{n+1}, ∀i, sup_u |V(f_S, u) − V(f_{S^{i,z}}, u)| ≤ β.

An algorithm is β-stable if, for any possible training set, we can replace an arbitrary training point with any other possible training point, and the loss at any point will change by no more than β. Uniform stability is a strong requirement because it ignores the fact that the points are drawn from a probability distribution. For uniform stability, the function still has to change very little even when a very unlikely ("bad") training set is drawn.

In general, the stability β is a function of n, and should perhaps be written β_n.

Given that an algorithm A has stability β, how can we get bounds on its performance? The answer is: we will use concentration inequalities, introduced in the previous lectures. In particular, McDiarmid's inequality will prove to be useful for obtaining exponential bounds on the defect D[S].

Recall McDiarmid's inequality: given random variables z_1, . . . , z_n = S, and a function F : Z^n → IR satisfying

sup_{z_1,...,z_n, z'_i} |F(z_1, . . . , z_n) − F(z_1, . . . , z_{i−1}, z'_i, z_{i+1}, . . . , z_n)| ≤ c_i,

the following statement holds:

IP( |F(z_1, . . . , z_n) − IE_S F(z_1, . . . , z_n)| > ε ) ≤ 2 exp( −2ε² / Σ_{i=1}^n c_i² ).
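As a quick sanity check, a minimal Monte Carlo sketch (Python) for the special case F = sample mean of [0, 1]-valued variables, where c_i = 1/n and McDiarmid reduces to Hoeffding's inequality:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, eps = 100, 20000, 0.1

# F = mean of n i.i.d. Uniform[0,1] variables; changing one sample moves F
# by at most c_i = 1/n, so McDiarmid gives IP(|F - IE F| > eps) <= 2 e^{-2 n eps^2}.
means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
observed = np.mean(np.abs(means - 0.5) > eps)
print(f"observed tail {observed:.4f} <= bound {2 * np.exp(-2 * n * eps**2):.4f}")
```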

We will apply McDiarmid's inequality to the function F(z_1, . . . , z_n) = D[S] = I[f_S] − I_S[f_S]. We will show two things: that D[S] is close to its expectation IE D[S], and that this expectation is small. Both of these will follow from uniform β-stability.

IE_S D[S] = IE_S [ I[f_S] − I_S[f_S] ]
= IE_{S,z} [ V(f_S(x), y) − n^{-1} Σ_{i=1}^n V(f_S(x_i), y_i) ]
= IE_{S,z} [ n^{-1} Σ_{i=1}^n ( V(f_S(x), y) − V(f_{S^{i,z}}(x), y) ) ]
≤ β.

The second equality follows by exploiting the "symmetry" of expectation: the expected loss of a trained function at one of its own training points does not change when we "rename" the points, so IE_S V(f_S(x_i), y_i) = IE_{S,z} V(f_{S^{i,z}}(x), y).

Now we need to show that the requirements of McDiarmid's theorem are satisfied:

|D[f_S] − D[f_{S^{i,z}}]| = | I_S[f_S] − I[f_S] − I_{S^{i,z}}[f_{S^{i,z}}] + I[f_{S^{i,z}}] |
≤ |I[f_S] − I[f_{S^{i,z}}]| + |I_S[f_S] − I_{S^{i,z}}[f_{S^{i,z}}]|
≤ β + n^{-1} |V(f_S(x_i), y_i) − V(f_{S^{i,z}}(x), y)| + n^{-1} Σ_{j≠i} |V(f_S(x_j), y_j) − V(f_{S^{i,z}}(x_j), y_j)|
≤ β + M/n + β = 2β + M/n.

By McDiarmid's inequality, for any ε,

IP( |D[f_S] − IE D[f_S]| > ε ) ≤ 2 exp( −2ε² / Σ_{i=1}^n ( 2(β + M/n) )² )
= 2 exp( −ε² / ( 2n(β + M/n)² ) )
= 2 exp( −nε² / ( 2(nβ + M)² ) ).


Note that, since IE D[f_S] ≤ β,

IP( D[f_S] > β + ε ) ≤ IP( D[f_S] − IE D[f_S] > ε ) ≤ IP( |D[f_S] − IE D[f_S]| > ε ).

Hence,

IP( I[f_S] − I_S[f_S] > β + ε ) ≤ 2 exp( −nε² / ( 2(nβ + M)² ) ).

If we define

δ := 2 exp( −nε² / ( 2(nβ + M)² ) )

and solve for ε in terms of δ, we find that

ε = (nβ + M) √( 2 ln(2/δ) / n ).

By varying δ (and ε), we can say that for any δ ∈ (0, 1), with probability 1 − δ,

I[f_S] ≤ I_S[f_S] + β + (nβ + M) √( 2 ln(2/δ) / n ).

Note that if β = k/n for some k, we can restate our bounds as

IP( |I[f_S] − I_S[f_S]| ≥ k/n + ε ) ≤ 2 exp( −nε² / ( 2(k + M)² ) ),

and with probability 1 − δ,

I[f_S] ≤ I_S[f_S] + k/n + (2k + M) √( 2 ln(2/δ) / n ).
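A minimal sketch (Python) evaluating this bound for hypothetical constants k, M and a 5% confidence level; it shows the O(1/√n) decay of the deviation term:

```python
import numpy as np

def stability_bound(emp_err, n, k, M, delta):
    """I[f_S] <= I_S[f_S] + k/n + (2k + M) sqrt(2 ln(2/delta) / n)
    for a beta-stable algorithm with beta = k/n."""
    return emp_err + k / n + (2 * k + M) * np.sqrt(2 * np.log(2 / delta) / n)

for n in (10**3, 10**4, 10**5, 10**6):
    print(f"n={n:>7d}: I[f_S] <= {stability_bound(0.1, n, k=1.0, M=1.0, delta=0.05):.4f}")
```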

What is the best rate of convergence that can be achieved with this method? Obviously, the best possible stability would be β = 0: the function can't change at all when you change the training set. An algorithm that always picks the same function, regardless of its training set, is maximally stable and has β = 0. Using β = 0 in the last bound, with probability 1 − δ,

I[f_S] ≤ I_S[f_S] + M √( 2 ln(2/δ) / n ).

The convergence is still O(1/√n). So once β = O(1/n), further increases in stability don't change the rate of convergence.

With only relatively few papers on stability in machine learning, there is a proliferation of names and kinds of stability, which can cause lots of confusion. Here we mention a few to give a taste of the kinds of perturbations considered in the literature.

Definition. An algorithm is (β, δ) error stable if, with probability at least 1 − δ over the draw of (S, u), ∀i, |I[f_S] − I[f_{S^{i,u}}]| ≤ β.

Definition. An algorithm is β L1-stable at S if

∀i, ∀u ∈ Z, IE_z [ |V(f_S, z) − V(f_{S^{i,u}}, z)| ] ≤ β.

Definition. An algorithm is (β, δ) cross-validation stable if, with probability at least 1 − δ over the draw of (S, u) ∈ Z^{n+1}, ∀i, |V(f_S, u) − V(f_{S^{i,u}}, u)| ≤ β.


We have used McDiarmid's inequality to prove a generalization bound for a uniformly β-stable algorithm. Note that this bound cannot tell us a priori that the expected error will be low; it can only tell us that with high probability the expected error will be close to the empirical error. We have to actually observe a low empirical error to conclude that we have a low expected error.

Uniform stability of O(1/n) seems to be a strong requirement. In the next section, we will show that Tikhonov regularization possesses this property.

9.2. Uniform stability of Tikhonov regularization

Recall the definition of Tikhonov regularization:

f_S = arg min_{f∈H} [ n^{-1} Σ_{i=1}^n V(f(x_i), y_i) + λ ||f||²_K ],

where H is an RKHS with kernel K.

Proposition. Under mild conditions, Tikhonov regularization is uniformly stable with

β = L²κ² / (λn),

where the constants L and κ will be defined together with the mild conditions.

Given the above stability condition and the results from the previous section, we have the following bound, with k = L²κ²/λ (so that β = k/n):

IP( |I[f_S] − I_S[f_S]| ≥ k/n + ε ) ≤ 2 exp( −nε² / ( 2(k + M)² ) ).

Equivalently, with probability 1 − δ,

I[f_S] ≤ I_S[f_S] + k/n + (2k + M) √( 2 ln(2/δ) / n ).
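The stability can be probed empirically. Below is a minimal sketch (Python, assuming kernel ridge regression, i.e. Tikhonov regularization with the square loss and a Gaussian kernel; the data and parameters are illustrative) that replaces one training point and measures how much the learned function moves in sup-norm:

```python
import numpy as np

rng = np.random.default_rng(0)

def krr(X, y, lam, gamma=1.0):
    """Kernel ridge regression: minimizes n^{-1} sum (f(x_i)-y_i)^2 + lam ||f||_K^2
    for the Gaussian kernel K(x, x') = exp(-gamma (x - x')^2)."""
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    c = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
    return lambda t: np.exp(-gamma * (t[:, None] - X[None, :]) ** 2) @ c

n, lam = 100, 0.1
X = rng.uniform(-1, 1, n)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(n)

f_S = krr(X, y, lam)
Xp, yp = X.copy(), y.copy()
Xp[0], yp[0] = 0.5, 0.0          # replace point i = 0 by z = (0.5, 0.0)
f_Siz = krr(Xp, yp, lam)

grid = np.linspace(-1, 1, 1000)
print("sup |f_S - f_{S^{i,z}}| =", np.max(np.abs(f_S(grid) - f_Siz(grid))))
```

Rerunning with larger n (or larger λ) shrinks the measured change, consistent with β = L²κ²/(λn).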

We now prove the proposition and state the mild conditions.

Proof. The first condition is on the loss function:

Definition. A loss function (over a possibly bounded domain X) is Lipschitz with Lipschitz constant L if

∀ y_1, y_2, y' ∈ Y, |V(y_1, y') − V(y_2, y')| ≤ L |y_1 − y_2|.

Put differently, if we have two functions f_1 and f_2, under an L-Lipschitz loss function,

sup_{(x,y)} |V(f_1(x), y) − V(f_2(x), y)| ≤ L ||f_1 − f_2||_∞.

Yet another way to write it:

||V(f_1, ·) − V(f_2, ·)||_∞ ≤ L ||f_1(·) − f_2(·)||_∞.

If a loss function is L-Lipschitz, then closeness of two functions (in the L∞ norm) implies that they are close in loss. The converse is false: it is possible for the difference in loss of two functions to be small, yet the functions to be far apart (in L∞):

Example. Consider the constant loss V = const. The difference of the losses of any two functions is zero, while the functions can be far apart from each other.


The hinge loss and the ε-insensitive loss are both L-Lipschitz with L = 1. The square loss function is L-Lipschitz if we can bound the y values and the f(x) values generated. The 0−1 loss function is not L-Lipschitz at all; an arbitrarily small change in the function can change the loss by 1:

f_1 = 0, f_2 = ε, V(f_1(x), 0) = 0, V(f_2(x), 0) = 1.

Assuming an L-Lipschitz loss, we have transformed the problem of bounding

sup_{u∈Z} |V(f_S, u) − V(f_{S^{i,z}}, u)|

into the problem of bounding

||f_S − f_{S^{i,z}}||_∞.

As the next step, we bound the L∞ norm by the norm in the RKHS associated with the kernel K. We now impose the second condition: there exists a κ satisfying

∀x ∈ X, √(K(x, x)) ≤ κ.

Under the above assumption, using the reproducing property and the Cauchy-Schwarz inequality, we can derive the following:

∀x, |f(x)| = |⟨K(x, ·), f(·)⟩_K| ≤ ||K(x, ·)||_K ||f||_K = √(⟨K(x, ·), K(x, ·)⟩) ||f||_K = √(K(x, x)) ||f||_K ≤ κ ||f||_K.

Since the above inequality holds for all x, we have ||f||_∞ ≤ κ ||f||_K. Hence, if we can bound the RKHS norm, we can bound the L∞ norm. We have now reduced the problem to bounding ||f_S − f_{S^{i,z}}||_K.

Lemma. For Tikhonov regularization, under the above mild conditions,

||f_S − f_{S^{i,z}}||²_K ≤ L ||f_S − f_{S^{i,z}}||_∞ / (λn).

This lemma says that when we replace a point in the training set, the change in the (squared) RKHS norm of the difference between the two functions cannot be too large compared to the change in L∞. Using the lemma and the relation between ||·||_K and ||·||_∞,

||f_S − f_{S^{i,z}}||²_K ≤ L ||f_S − f_{S^{i,z}}||_∞ / (λn) ≤ Lκ ||f_S − f_{S^{i,z}}||_K / (λn).

Dividing through by ||f_S − f_{S^{i,z}}||_K, we derive

||f_S − f_{S^{i,z}}||_K ≤ κL / (λn).


Using again the relationship between ||·||_K and ||·||_∞, and the L-Lipschitz condition,

sup_u |V(f_S, u) − V(f_{S^{i,z}}, u)| ≤ L ||f_S − f_{S^{i,z}}||_∞ ≤ Lκ ||f_S − f_{S^{i,z}}||_K ≤ L²κ² / (λn) = β.

We now prove Lemma 9.2.

Proof. The proof involves comparing the norms of f_S and f_{S^{i,z}} and uses the notion of divergences. Suppose we have a convex, differentiable functional F, and we know F(f_1) for some f_1. We can "guess" F(f_2) by considering a linear approximation to F at f_1:

F(f_2) ≈ F(f_1) + ⟨f_2 − f_1, ∇F(f_1)⟩.

The Bregman divergence is the error in this linearized approximation:

d_F(f_2, f_1) = F(f_2) − F(f_1) − ⟨f_2 − f_1, ∇F(f_1)⟩.

We will need the following key facts about divergences:

• d_F(f_2, f_1) ≥ 0.
• If f_1 minimizes F, then the gradient ∇F(f_1) is zero, and d_F(f_2, f_1) = F(f_2) − F(f_1).
• If F = A + B, where A and B are also convex and differentiable, then d_F(f_2, f_1) = d_A(f_2, f_1) + d_B(f_2, f_1) (the derivatives add).

Consider the Tikhonov functional

T_S(f) = n^{-1} Σ_{i=1}^n V(f(x_i), y_i) + λ ||f||²_K,

as well as the component functionals

V_S(f) = n^{-1} Σ_{i=1}^n V(f(x_i), y_i)

and

N(f) = ||f||²_K.

Hence, T_S(f) = V_S(f) + λ N(f). If the loss function is convex (in the first variable), then all three functionals are convex.

[Figure: sketch over the function space of the functionals T_S, T_{S'}, V_S, V_{S'}, and N, with the minimizers f_S and f_{S'} marked.]

This picture illustrates the relation between the various functionals at the minimizers over the original set S and the perturbed set S^{i,z} = S', where (x_i, y_i) is replaced by a new point z = (x, y). Let f_S be the minimizer of T_S, and let f_{S^{i,z}} be the minimizer of T_{S^{i,z}}. Then

λ ( d_N(f_{S^{i,z}}, f_S) + d_N(f_S, f_{S^{i,z}}) ) ≤ d_{T_S}(f_{S^{i,z}}, f_S) + d_{T_{S^{i,z}}}(f_S, f_{S^{i,z}})
= n^{-1} ( V(f_{S^{i,z}}, z_i) − V(f_S, z_i) + V(f_S, z) − V(f_{S^{i,z}}, z) )
≤ 2L ||f_S − f_{S^{i,z}}||_∞ / n.

The first inequality holds since d_T = d_V + λ d_N and d_V ≥ 0; the equality uses the facts that f_S minimizes T_S and f_{S^{i,z}} minimizes T_{S^{i,z}} (so each divergence is a difference of functional values) and that the terms common to T_S and T_{S^{i,z}} cancel.

We conclude that

d_N(f_{S^{i,z}}, f_S) + d_N(f_S, f_{S^{i,z}}) ≤ 2L ||f_S − f_{S^{i,z}}||_∞ / (λn).

But what is d_N(f_{S^{i,z}}, f_S)? Let's express our functions as sums of the orthogonal eigenfunctions of the RKHS:

f_S(x) = Σ_{p=1}^∞ c_p φ_p(x),    f_{S^{i,z}}(x) = Σ_{p=1}^∞ c'_p φ_p(x).


Once we express a function in this form, we recall that

||f||²_K = Σ_{p=1}^∞ c_p² / λ_p.

Using this notation, we re-express the divergence in terms of the c_p and c'_p:

d_N(f_{S^{i,z}}, f_S) = ||f_{S^{i,z}}||²_K − ||f_S||²_K − ⟨f_{S^{i,z}} − f_S, ∇||f_S||²_K⟩
= Σ_{p=1}^∞ c'²_p/λ_p − Σ_{p=1}^∞ c_p²/λ_p − Σ_{p=1}^∞ (c'_p − c_p)(2c_p/λ_p)
= Σ_{p=1}^∞ (c'²_p + c_p² − 2c'_p c_p)/λ_p
= Σ_{p=1}^∞ (c'_p − c_p)²/λ_p
= ||f_{S^{i,z}} − f_S||²_K.

We conclude that

d_N(f_{S^{i,z}}, f_S) + d_N(f_S, f_{S^{i,z}}) = 2 ||f_{S^{i,z}} − f_S||²_K.

Combining these results proves our lemma:

||f_{S^{i,z}} − f_S||²_K = ( d_N(f_{S^{i,z}}, f_S) + d_N(f_S, f_{S^{i,z}}) ) / 2 ≤ L ||f_S − f_{S^{i,z}}||_∞ / (λn).

We have shown that Tikhonov regularization with an L-Lipschitz loss is β-stable with β = L²κ²/(λn). If we want to actually apply the theorems and get the generalization bound, we need to bound the loss (note that this was a requirement in the previous section). So, is the loss bounded?

Let C_0 be the maximum value of the loss when we predict a value of zero. If we have bounds on X and Y, we can find C_0. Noting that the "all-0" function 0 is always in the RKHS, we see that

λ ||f_S||²_K ≤ T_S(f_S) ≤ T_S(0) = n^{-1} Σ_{i=1}^n V(0(x_i), y_i) ≤ C_0.

Therefore,

||f_S||²_K ≤ C_0/λ  ⟹  ||f_S||_∞ ≤ κ ||f_S||_K ≤ κ √(C_0/λ).

Since the loss is L-Lipschitz, a bound on ||f_S||_∞ implies boundedness of the loss function.

A few final notes:


• If we keep λ fixed as we increase n, the generalization bound will tighten as O(1/√n). However, keeping λ fixed is equivalent to keeping our hypothesis space fixed. As we get more data, we want λ to get smaller. If λ gets smaller too fast, the bounds become trivial.

• It is worth noting that Ivanov regularization,

f_{H,S} = arg min_{f∈H} n^{-1} Σ_{i=1}^n V(f(x_i), y_i)  s.t. ||f||²_K ≤ τ,

is not uniformly stable with β = O(1/n). This is an important distinction between Tikhonov and Ivanov regularization.
