TRANSCRIPT
Machine Learning for Engineers: Chapter 8. Elements of Statistical Learning Theory
Osvaldo Simeone
King’s College London
January 1, 2021
Osvaldo Simeone ML4Engineers 1 / 65
This Chapter

As we have discussed in the previous chapters, a learning problem set-up is defined by
- the inductive bias, consisting of the model class H, the loss function ℓ, and the training algorithm;
- and a training set D = {(x_n, t_n)}_{n=1}^N, consisting of N data points generated i.i.d. from the unknown population distribution p(x, t).
In this chapter, we will always view the training set D as a set of rvs drawn i.i.d. from p(x, t), rather than as a fixed data set.
The performance of a learned hypothesis, or (hard) predictor t̂(·), in H is defined by the population loss

Lp(t̂(·)) = E_{(x,t)∼p(x,t)}[ℓ(t, t̂(x))].
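The population loss is an expectation under p(x, t), so in simulation it can be approximated by Monte Carlo sampling. A minimal Python sketch (the Bernoulli population and the constant predictor below are illustrative choices, not part of the slides):

```python
import random

def population_loss_mc(predictor, sample_pair, loss, num_samples=100_000, seed=0):
    """Monte Carlo estimate of Lp(t_hat) = E_{(x,t) ~ p(x,t)}[loss(t, t_hat(x))].

    `sample_pair` draws one (x, t) pair from the population distribution; in a
    real learning problem p(x, t) is unknown, so this is only a conceptual
    check available in simulation.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x, t = sample_pair(rng)
        total += loss(t, predictor(x))
    return total / num_samples

def sample_pair(rng):
    # Illustrative population: x ~ Bern(0.3) and t = x (noiseless labels).
    x = int(rng.random() < 0.3)
    return x, x

detection_loss = lambda t, t_hat: float(t != t_hat)

# The constant predictor t_hat(x) = 0 errs exactly when t = 1, i.e. w.p. 0.3.
est = population_loss_mc(lambda x: 0, sample_pair, detection_loss)
```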
This Chapter

Training algorithms, such as ERM, are generally dependent on the training loss

L_D(t̂(·)) = (1/N) Σ_{n=1}^N ℓ(t_n, t̂(x_n)).

In this chapter, we discuss basic elements of statistical learning theory, with the main goal of addressing the following basic questions:
- Given an inductive bias, how many training samples N are needed to learn to a given level of population loss?
  - This is known as the sample complexity.
- Conversely, how should we choose the inductive bias in order to ensure a suitably small population loss for a given data set size N?
Overview
Benchmarks and decomposition of the optimality error
Probably Approximately Correct (PAC) learning
Sample complexity of ERM for finite model classes
Proof (and generalization error)
Sample complexity of ERM for continuous model classes
Structural Risk Minimization
Appendix: PAC Bayes learning
Benchmarks and Decomposition of the Optimality Error
Benchmarks

How well does a learning algorithm perform? We have two key benchmarks.
1) Population-optimal unconstrained predictor: the predictor that minimizes the population loss without any constraint on the model class, i.e.,

t̂*(·) = arg min_{t̂(·)} Lp(t̂(·)).

2) Population-optimal within-class predictor: the predictor that minimizes the population loss within a given model class H, i.e.,

t̂*_H(·) ∈ arg min_{t̂(·)∈H} Lp(t̂(·)).
Benchmarks
The unconstrained predictor t̂∗(·) can be instantiated in a richerdomain than the within-class predictor t̂∗H(·), whose domain is limitedto the model class H.
Hence, the unconstrained minimum population loss Lp(t̂∗(·)) cannotbe larger than the minimum within-class population loss Lp(t̂∗H(·)),i.e., we have the inequality
Lp(t̂∗(·)) ≤ Lp(t̂∗H(·)).
Furthermore, we have the equality
Lp(t̂∗(·)) = Lp(t̂∗H(·))
if and only if the model class H is large enough so that thepopulation-optimal unconstrained predictor is in it, i.e.,t̂∗(·) = t̂∗H(·) ∈ H.
Osvaldo Simeone ML4Engineers 7 / 65
Example

Consider the problem of binary classification with a binary input, i.e., x, t ∈ {0, 1}. The population distribution is such that x ∼ Bern(p), with 0 < p < 0.5, and t = x. Recall that this distribution is unknown to the learner.
Under the detection-error loss ℓ(t, t̂) = 1(t̂ ≠ t), the population-optimal predictor is clearly t̂*(x) = x, yielding the minimum unconstrained population loss Lp(t̂*(·)) = 0.
Consider now the "constant-predictor" model class

H = {t̂(x) = t̂ ∈ {0, 1} for all x ∈ {0, 1}}.

The population-optimal within-class predictor is t̂*_H(x) = 0, yielding the minimum within-class population loss

Lp(t̂*_H(·)) = (1 − p) × 0 + p × 1 = p > Lp(t̂*(·)) = 0.
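Since the input takes only two values, the population losses in this example can be evaluated exactly by summing over the support. A small check, with an illustrative value of p:

```python
p = 0.2  # any value with 0 < p < 0.5 works; 0.2 is an illustrative choice

def pop_loss(predictor):
    # E[1(t != t_hat(x))] with x ~ Bern(p) and t = x: sum over x in {0, 1}.
    return (1 - p) * float(predictor(0) != 0) + p * float(predictor(1) != 1)

loss_unconstrained = pop_loss(lambda x: x)  # population-optimal t_hat*(x) = x
loss_const0 = pop_loss(lambda x: 0)         # best constant predictor
loss_const1 = pop_loss(lambda x: 1)         # worse constant predictor
```

As on the slide, the unconstrained optimum attains zero loss, while the best constant predictor pays exactly p.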
Decomposition of the Optimality Error

From now on, we will write predictors t̂(·) as t̂ in order to simplify the notation.
We have seen in Chapter 4 that, for any predictor t̂, we can decompose the optimality error Lp(t̂) − Lp(t̂*) as

Lp(t̂) − Lp(t̂*)  [optimality error]
  = (Lp(t̂*_H) − Lp(t̂*))  [bias]  +  (Lp(t̂) − Lp(t̂*_H))  [estimation error].

The bias Lp(t̂*_H) − Lp(t̂*), also known as the approximation error, depends on the choice of the model class H (see, e.g., the previous example).
The estimation error Lp(t̂) − Lp(t̂*_H) depends on the model class H, on the training algorithm producing the predictor t̂ from the training data, and on the training data itself.
ERM

Given the training data set

D = {(x_n, t_n)}_{n=1}^N ∼ i.i.d. p(x, t),

a training algorithm returns a predictor t̂_D ∈ H.
Note that the selected model t̂_D is random due to the randomness of the data set D.
As we have seen, a standard learning algorithm is empirical risk minimization (ERM), which minimizes the training loss L_D(t̂) as

t̂_D^ERM = arg min_{t̂∈H} L_D(t̂),  with  L_D(t̂) = (1/N) Σ_{n=1}^N ℓ(t_n, t̂(x_n)).

We generally have the inequalities

Lp(t̂_D^ERM) ≥ Lp(t̂*_H) ≥ Lp(t̂*).
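For a finite model class, ERM can be implemented by exhaustive search over the hypotheses. A minimal sketch (the constant-predictor class and the N = 3 data set mirror the worked example on the following slides):

```python
def erm(model_class, data, loss):
    """Return a hypothesis minimizing the training loss L_D over a finite class."""
    def train_loss(h):
        return sum(loss(t, h(x)) for x, t in data) / len(data)
    return min(model_class, key=train_loss)

detection_loss = lambda t, t_hat: float(t != t_hat)
H = [lambda x: 0, lambda x: 1]   # finite "constant-predictor" class
D = [(1, 0), (0, 1), (0, 1)]     # (x, t) pairs, N = 3
t_hat_erm = erm(H, D, detection_loss)
```

Here ERM selects the constant predictor 1, whose training loss is 1/3: the constant 0 misclassifies two of the three examples, the constant 1 only one.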
Example

Consider the problem of binary classification with a binary input, i.e., x, t ∈ {0, 1}. The population distribution is such that x ∼ Bern(p), with p < 0.5; and t = x with probability q > 0.5, while t ≠ x with probability 1 − q.
Under the detection-error loss, the population-optimal unconstrained predictor is t̂*(x) = x, with corresponding unconstrained minimum population loss Lp(t̂*) = 1 − q.
For the "constant-predictor" model class H, the population-optimal within-class predictor is t̂*_H(x) = 0, with minimum within-class population loss

Lp(t̂*_H) = (1 − p) Pr[t ≠ 0 | x = 0] + p Pr[t ≠ 0 | x = 1]
         = (1 − p)(1 − q) + p q ≥ 1 − q = Lp(t̂*).
Example

Given the training data set D = {(1, 0), (0, 1), (0, 1)}, where each pair is (x, t) and N = 3, ERM trains a model in class H by minimizing the training loss

L_D(t̂) = (1/3) (1(t̂ ≠ 0) + 2 × 1(t̂ ≠ 1)).

The ERM predictor is hence t̂_D^ERM = 1, which gives the training loss L_D(t̂_D^ERM) = 1/3.
The population loss of the ERM predictor is

Lp(t̂_D^ERM) = (1 − p) Pr[t ≠ 1 | x = 0] + p Pr[t ≠ 1 | x = 1]
            = (1 − p) q + p (1 − q).
Example

For instance, if p = 0.3 and q = 0.7, we have
- minimum unconstrained population loss Lp(t̂*) = 0.3;
- minimum within-class population loss Lp(t̂*_H) = 0.42;
- ERM population loss Lp(t̂_D^ERM) = 0.58.
We also have the decomposition of the optimality error

Lp(t̂_D^ERM) − Lp(t̂*) = (Lp(t̂*_H) − Lp(t̂*)) + (Lp(t̂_D^ERM) − Lp(t̂*_H)),
i.e.,  0.58 − 0.3 = [bias = 0.12] + [estimation error = 0.16].
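These numbers can be verified mechanically from the formulas on the previous slides:

```python
p, q = 0.3, 0.7
L_star = 1 - q                        # unconstrained optimum t_hat*(x) = x
L_class = (1 - p) * (1 - q) + p * q   # best constant predictor t_hat = 0
L_erm = (1 - p) * q + p * (1 - q)     # the ERM choice t_hat = 1 on that data set
bias = L_class - L_star
est_err = L_erm - L_class
# The two gaps add up to the optimality error L_erm - L_star.
```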
Probably Approximately Correct (PAC) Learning
Statistical Learning Theory

Statistical learning theory studies the estimation error Lp(t̂_D) − Lp(t̂*_H), which depends on how the data is used.
The bias Lp(t̂*_H) − Lp(t̂*) is generally difficult to quantify, since it depends on the population-optimal unconstrained predictor t̂*.
- In practice, if the loss Lp(t̂*_H) is too large, one needs to increase the capacity of the model class or choose a different class of models that is more suitable for the problem.
If the estimation error is small, the training algorithm obtains close to the best possible within-class population loss:
- How much data is needed to ensure a small estimation error?
The estimation error is random, since it depends on the training set D.
- We first need to specify what we mean by a "small" estimation error.
Approximately and Probably

Due to the randomness of D, a learning rule t̂_D can only minimize the generalization loss Lp(t̂)
- approximately, i.e., Lp(t̂_D) ≤ Lp(t̂*_H) + ε for some ε > 0;
- and with probability at least 1 − δ for some δ > 0:
  - with probability no larger than δ, we may have Lp(t̂_D) > Lp(t̂*_H) + ε.
Probably Approximately Correct (PAC) Learning Rule

When operating on data sets D of N examples, a training algorithm A that produces a predictor t̂_D^A(·) is (N, ε, δ) PAC for a model class H, loss function ℓ, and set of population distributions p(x, t) if it has an estimation error no larger than the accuracy parameter ε,

Lp(t̂_D^A) ≤ Lp(t̂*_H) + ε,

with probability no smaller than 1 − δ for the confidence parameter δ > 0; that is, if we have

Pr_{D ∼ i.i.d. p(x,t)}[Lp(t̂_D^A) ≤ Lp(t̂*_H) + ε] ≥ 1 − δ

for any true distribution p(x, t) in the set.
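The PAC probability can be estimated empirically by drawing many data sets and checking how often ERM lands within ε of the best within-class loss. A sketch for the running binary example (the values of p, q, ε, and the trial counts are illustrative choices):

```python
import random

def pac_probability(eps, N, trials=2000, p=0.3, q=0.7, seed=0):
    """Fraction of i.i.d. data sets for which ERM over the constant-predictor
    class {0, 1} satisfies Lp(erm) <= Lp(best in class) + eps."""
    rng = random.Random(seed)
    pop_loss = {0: (1 - p) * (1 - q) + p * q,   # population loss of t_hat = 0
                1: (1 - p) * q + p * (1 - q)}   # population loss of t_hat = 1
    best = min(pop_loss.values())
    hits = 0
    for _ in range(trials):
        data = []
        for _ in range(N):
            x = int(rng.random() < p)
            t = x if rng.random() < q else 1 - x   # label flipped w.p. 1 - q
            data.append((x, t))
        # ERM over the two constants: pick the one with fewer training errors.
        erm = min((0, 1), key=lambda c: sum(t != c for _, t in data))
        hits += pop_loss[erm] - best <= eps
    return hits / trials

conf_small_N = pac_probability(eps=0.05, N=5)
conf_large_N = pac_probability(eps=0.05, N=200)
```

At a fixed accuracy ε, the achievable confidence 1 − δ grows with N, as the PAC definition anticipates.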
Probably Approximately Correct (PAC) Learning Rule

The PAC requirement is a worst-case constraint in the sense that it imposes the inequality

min_{p(x,t)} Pr_{D ∼ i.i.d. p(x,t)}[Lp(t̂_D^A) ≤ Lp(t̂*_H) + ε] ≥ 1 − δ,

where the minimum is taken over all possible population distributions in the set of interest.
The set of population distributions is typically taken to include all possible population distributions.
Sample Complexity

For a training algorithm A, the amount of data needed to achieve a certain accuracy-confidence level (ε, δ) is known as the sample complexity.
For a model class H, loss function ℓ, and class of population distributions p(x, t), a training algorithm A has sample complexity N_H^A(ε, δ) if, for the given ε, δ ∈ (0, 1), it is (N, ε, δ) PAC for all N ≥ N_H^A(ε, δ).

[Figure: population loss as a function of the data set size N.]
Is ERM PAC?

Intuitively, ERM is PAC by the law of large numbers.
The law of large numbers says that, for i.i.d. rvs u_1, u_2, ..., u_M ∼ p(u) such that E[u_m] = µ, the empirical mean (1/M) Σ_{m=1}^M u_m tends to the ensemble mean µ as M → ∞:
- We write this as

(1/M) Σ_{m=1}^M u_m → µ for M → ∞ in probability;

- which means that, for any ε > 0, we have the limit

Pr[ |(1/M) Σ_{m=1}^M u_m − µ| > ε ] → 0 for M → ∞.
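The law of large numbers is easy to observe numerically; the uniform distribution below is an arbitrary illustrative choice:

```python
import random

def deviation_probability(M, eps=0.1, trials=1000, seed=0):
    """Empirical Pr[|(1/M) sum of M Uniform(0,1) draws - 0.5| > eps]."""
    rng = random.Random(seed)
    mu = 0.5  # mean of Uniform(0, 1)
    exceed = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(M)) / M
        exceed += abs(mean - mu) > eps
    return exceed / trials

p_small_M = deviation_probability(M=5)
p_large_M = deviation_probability(M=500)  # deviation probability shrinks with M
```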
Is ERM PAC?

By the law of large numbers, we hence have

L_D(t̂) = (1/N) Σ_{n=1}^N ℓ(t_n, t̂(x_n)) → Lp(t̂) in probability

separately for each predictor t̂.
Hence, intuitively, the ERM solution t̂_D^ERM, which minimizes L_D(t̂), should with high probability also approximately minimize Lp(t̂), i.e., be close to t̂*_H, as the data set size N increases.
But how large does N need to be, i.e., what is the sample complexity?
Example

Consider the model class of threshold functions

H = { t̂_θ(x) = 1(x ≥ θ) },  i.e.,  t̂_θ(x) = 0 if x < θ and t̂_θ(x) = 1 if x ≥ θ,

where x and θ are real numbers (D = 1).
Assume that the population distribution p(x, t) is given as

p(x, t) = p(x) 1(t = t̂_0(x))

for some p(x).
Since the conditional p(t|x) of the population distribution is included in H, we say that the class of probability distributions is "feasible" for H.
For the detection loss, the optimal predictor is t̂* = t̂*_H = t̂_0, i.e., the bias is zero and the population-optimal threshold (both unconstrained and within-class) is θ* = 0.
Example

Assume that p(x) is uniform in the interval [−0.5, 0.5]. The population loss is shown in the figure as a function of θ as a dashed line.
With the data in the figure, ERM may return any value of θ in the interval between the two closest positive and negative samples.
With N large enough, we see that t̂_D^ERM tends to the optimal predictor t̂_0.

[Figure: population loss (dashed) versus θ ∈ [−0.5, 0.5], with training data.]
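This behavior can be reproduced with a direct ERM implementation for the threshold class. The candidate-midpoint search below is one simple way to minimize the training loss exactly (an illustrative sketch, not a construction from the slides):

```python
import random

def erm_threshold(data):
    """ERM for t_hat_theta(x) = 1(x >= theta): the training loss is piecewise
    constant in theta, so it suffices to check midpoints between sorted samples
    (plus one candidate below and one above all samples)."""
    xs = sorted(x for x, _ in data)
    cands = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    def train_errors(theta):
        return sum(t != (x >= theta) for x, t in data)
    return min(cands, key=train_errors)

rng = random.Random(0)
# Population of the example: x uniform in [-0.5, 0.5], t = 1(x >= 0), theta* = 0.
xs = [rng.uniform(-0.5, 0.5) for _ in range(500)]
data = [(x, int(x >= 0)) for x in xs]
theta_hat = erm_threshold(data)  # close to theta* = 0 for N = 500
```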
Sample Complexity of ERM for Finite Model Classes
Finite Model Classes

To address the sample complexity of ERM, we will first consider model classes with a finite number of models.
For example, consider the class of threshold classifiers

H = { t̂_θ(x) = 1(x ≥ θ) : θ ∈ {θ_1, ..., θ_|H|} },

where θ can only take a finite set of |H| values.
We write the number of models in H as |H| (the cardinality of the set H).
Capacity of a Finite Model Class

For a given finite model class H, we define the

model capacity = log(|H|) (nats)
               = log₂(|H|) (bits)

as the number of nats or bits required to index the hypotheses in H.
Intuitively, a larger model capacity entails a larger sample complexity.
This intuition can be made precise by the following theorem.
Sample Complexity of ERM for Finite Model Classes

Theorem: For the detection-error loss (or any other loss bounded in the interval [0, 1]) and any finite hypothesis class H, ERM is (N, ε, δ) PAC with estimation error

ε = √( (2 log(|H|) + log(1/δ)) / N ).

Equivalently, ERM achieves the estimation error

Lp(t̂_D^ERM) − Lp(t̂*_H) ≤ √( (2 log(|H|) + log(1/δ)) / N )

with probability no smaller than 1 − δ.
Sample Complexity of ERM for Finite Model Classes

The theorem above can be equivalently stated in terms of sample complexity.
Theorem: For the detection-error loss (or any other loss bounded in the interval [0, 1]) and any finite hypothesis class H, ERM has sample complexity

N_H^ERM(ε, δ) = ⌈ (2 log(|H|) + log(1/δ)) / ε² ⌉.

So the sample complexity is
- proportional to the model capacity log |H|;
- proportional to the "number of confidence digits" log(1/δ);
- and inversely proportional to the square of the accuracy ε.
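The formula is simple to tabulate; the values of |H|, ε, and δ below are illustrative choices:

```python
from math import ceil, log

def erm_sample_complexity(num_models, eps, delta):
    """N_H^ERM(eps, delta) = ceil((2 log|H| + log(1/delta)) / eps^2),
    valid for losses bounded in [0, 1] as in the theorem above."""
    return ceil((2 * log(num_models) + log(1 / delta)) / eps ** 2)

# Halving the accuracy parameter roughly quadruples the data requirement.
n_eps_01 = erm_sample_complexity(num_models=1000, eps=0.10, delta=0.05)
n_eps_005 = erm_sample_complexity(num_models=1000, eps=0.05, delta=0.05)
```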
Proof of the Theorem (and Generalization Error)
Proof

The proof of the theorem is based on
- a decomposition of the estimation error that includes the generalization error Lp(t̂_D) − L_D(t̂_D)
  - the generalization error measures the difference between the population and training losses for the trained predictor t̂_D;
- the law of large numbers (in a stronger form known as Hoeffding's inequality);
- and the union bound.
Decomposition of the Estimation Error

The estimation error can be decomposed into the following contributions:

Lp(t̂_D) − Lp(t̂*_H)  [estimation error]
  = (Lp(t̂_D) − L_D(t̂_D))   [generalization error]
  + (L_D(t̂_D) − L_D(t̂*_H)) [training-based estimation error]
  + (L_D(t̂*_H) − Lp(t̂*_H)) [empirical average error].

As a visual aid, we have

Lp(t̂_D)  −  Lp(t̂*_H)
   ↓            ↑
L_D(t̂_D) →  L_D(t̂*_H)

All these gap terms have to be small with high probability with respect to the selection of the data set D.
Decomposition of the Estimation Error

By the law of large numbers, the empirical average error L_D(t̂*_H) − Lp(t̂*_H) goes to zero in probability as N → ∞. Note that the law of large numbers is applicable since t̂*_H is a fixed predictor that does not depend on D.
For ERM, the training-based estimation error satisfies L_D(t̂_D^ERM) − L_D(t̂*_H) ≤ 0, since ERM minimizes the training loss, and hence no other predictor in H can yield a smaller training loss.
For the generalization error Lp(t̂_D^ERM) − L_D(t̂_D^ERM), we cannot use the law of large numbers, since the ERM predictor t̂_D^ERM is a random variable dependent on D and not a fixed predictor.
Generalization Error

Given the discussion above, statistical learning theory is largely devoted to the study of the generalization error Lp(t̂_D) − L_D(t̂_D) for given training algorithms:
- If the generalization error is small, the training loss is a reliable estimate of the population loss.
The generalization error Lp(t̂_D) − L_D(t̂_D) generally depends on the "capacity" of the model class H, on the loss function ℓ, and on the assumptions made on the data distribution.
Hoeffding's Inequality

For i.i.d. rvs u_1, u_2, ..., u_M ∼ p(u) such that E[u_m] = µ and Pr[a ≤ u_m ≤ b] = 1 for some a ≤ b, we have the large-deviation inequality

Pr[ |(1/M) Σ_{m=1}^M u_m − µ| > ε ] ≤ 2 exp( −2Mε² / (b − a)² ).

Note that Hoeffding's inequality implies the limit Pr[ |(1/M) Σ_{m=1}^M u_m − µ| > ε ] → 0 for M → ∞, and hence it recovers the law of large numbers.
We now use this inequality to obtain bounds on the empirical average and generalization errors.
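Hoeffding's inequality can be checked numerically against the empirical deviation probability; the Bernoulli(0.5) choice below is illustrative:

```python
import random
from math import exp

def hoeffding_check(M, eps, trials=2000, seed=0):
    """Return (empirical deviation probability, Hoeffding bound) for the mean
    of M i.i.d. Bernoulli(0.5) draws, which are bounded in [a, b] = [0, 1]."""
    rng = random.Random(seed)
    mu = 0.5
    exceed = 0
    for _ in range(trials):
        mean = sum(rng.random() < mu for _ in range(M)) / M
        exceed += abs(mean - mu) > eps
    return exceed / trials, 2 * exp(-2 * M * eps ** 2)

empirical, bound = hoeffding_check(M=100, eps=0.1)  # empirical <= bound
```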
Empirical Average Error

Let us consider the empirical average error L_D(t̂*_H) − Lp(t̂*_H).
The training loss L_D(t̂*_H) can be written as (1/M) Σ_{m=1}^M u_m by setting u_n = ℓ(t_n, t̂*_H(x_n)) and M = N.
With this choice, we have µ = E[ℓ(t_n, t̂*_H(x_n))] = Lp(t̂*_H).
By assumption, the loss function is bounded between 0 and 1, and hence we can set a = 0 and b = 1.
Applying Hoeffding's inequality, we have, for any ε₁ > 0,

Pr[ L_D(t̂*_H) − Lp(t̂*_H) > ε₁ ] ≤ 2 exp(−2Nε₁²).
Generalization Error

For the generalization error Lp(t̂_D^ERM) − L_D(t̂_D^ERM) of ERM, we would similarly like to bound the probability

Pr[ Lp(t̂_D^ERM) − L_D(t̂_D^ERM) > ε₂ ].

This probability can be upper bounded by the probability that there is at least one predictor in H that satisfies the inequality, i.e.,

Pr[ Lp(t̂_D^ERM) − L_D(t̂_D^ERM) > ε₂ ] ≤ Pr[ ∃ t̂ ∈ H : Lp(t̂) − L_D(t̂) > ε₂ ]
  = Pr[ ∪_{t̂∈H} { Lp(t̂) − L_D(t̂) > ε₂ } ],

where the last equality follows from the interpretation of a union of events as a logical OR.
Generalization Error
Now, we can use first the union bound and then again Hoeffding's inequality to obtain

Pr[ ⋃_{t̂∈H} { Lp(t̂) − LD(t̂) > ε₂ } ] ≤ ∑_{t̂∈H} Pr[ Lp(t̂) − LD(t̂) > ε₂ ] ≤ 2|H| exp( −2Nε₂² ).
We note that, by this derivation, the bound Pr[ Lp(t̂D) − LD(t̂D) > ε₂ ] ≤ 2|H| exp( −2Nε₂² ) applies to any learning algorithm and not just to ERM. We will use this fact later.
Osvaldo Simeone ML4Engineers 38 / 65
Proof

Putting it all together, and using LD(t̂ERMD) ≤ LD(t̂∗H) (which holds by the definition of ERM), we obtain the inequality

Pr[ Lp(t̂ERMD) − Lp(t̂∗H) > ε ]   (estimation error)
≤ Pr[ (Lp(t̂ERMD) − LD(t̂ERMD)) + (LD(t̂∗H) − Lp(t̂∗H)) > ε ]   (generalization error + empirical average error)
≤ Pr[ Lp(t̂ERMD) − LD(t̂ERMD) > ε/2 ] + Pr[ LD(t̂∗H) − Lp(t̂∗H) > ε/2 ]
≤ 2(|H| + 1) exp( −Nε²/2 ).
The bound can be easily tightened to the value stated in the theorem by noting that the probability term for t̂∗H is counted twice. This is left as an exercise.
Osvaldo Simeone ML4Engineers 39 / 65
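The PAC guarantee can also be illustrated numerically. The sketch below uses an assumed population distribution (uniform inputs with noisy threshold labels, our choice, not from the slides), runs ERM over a small finite class of threshold classifiers on many independently drawn training sets, and compares the empirical probability of a large estimation error with the 2(|H| + 1) exp(−Nε²/2) bound:

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite class of threshold classifiers t_hat(x) = 1(x >= theta)
thetas = np.linspace(0.1, 0.9, 9)   # |H| = 9
N, eps, trials, noise = 2_000, 0.1, 1_000, 0.1

# Population loss under the assumed distribution: x ~ U(0,1),
# t = 1(x >= 0.5) with labels flipped w.p. `noise`
Lp = noise + (1 - 2 * noise) * np.abs(thetas - 0.5)
Lp_star = Lp.min()

excess = np.empty(trials)
for i in range(trials):
    x = rng.random(N)
    t = (x >= 0.5) ^ (rng.random(N) < noise)   # noisy labels
    preds = x[:, None] >= thetas               # (N, |H|) predictions
    train_losses = (preds != t[:, None]).mean(axis=0)
    erm = int(np.argmin(train_losses))         # ERM picks the best empirical threshold
    excess[i] = Lp[erm] - Lp_star

prob = (excess > eps).mean()
bound = 2 * (len(thetas) + 1) * np.exp(-N * eps**2 / 2)
print(f"Pr[estimation error > {eps}] ~ {prob:.4f}, bound = {bound:.6f}")
```

With these parameters the estimation error concentrates well below ε, consistently with the (typically conservative) bound.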
Sample Complexity of ERM for Continuous Model Classes
Osvaldo Simeone ML4Engineers 40 / 65
Continuous Model Classes
What does the theorem proved above say about infinite models such as linear classifiers or neural networks?

A simple approach would be to quantize the hypothesis class H, obtaining a finite class Hb, by representing each element of the model parameter vector θ with b bits.
Osvaldo Simeone ML4Engineers 41 / 65
Continuous Model Classes
Example: Consider a neural network with D weights:
I the number of hypotheses in the quantized model class Hb is |Hb| = (2^b)^D = 2^{bD}, and the capacity of the hypothesis class is log2(|Hb|) = bD (bits) or log(|Hb|) = bD log 2 (nats);
I by the theorem, the ERM sample complexity is

NERMHb(ε, δ) = ⌈ (2bD log(2) + 2 log(2/δ)) / ε² ⌉,

which scales proportionally to the number of parameters D and to the bit resolution b.
Osvaldo Simeone ML4Engineers 42 / 65
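The ceiling formula above can be evaluated directly; a minimal sketch (the helper name is ours):

```python
import math

def erm_sample_complexity(b: int, D: int, eps: float, delta: float) -> int:
    """N = ceil((2 b D log(2) + 2 log(2/delta)) / eps^2) for |H_b| = 2^(b D)."""
    return math.ceil((2 * b * D * math.log(2) + 2 * math.log(2 / delta)) / eps**2)

# Example: D = 1000 weights quantized to b = 32 bits, accuracy 0.1, confidence 1 - 0.05
print(erm_sample_complexity(b=32, D=1000, eps=0.1, delta=0.05))
# Doubling the bit resolution (or the number of weights) roughly doubles the requirement
print(erm_sample_complexity(b=64, D=1000, eps=0.1, delta=0.05))
```

Note how the log(2/δ) term is negligible next to the bD log(2) capacity term for any realistically sized network.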
Continuous Model Classes
The previous theorem seems to imply that, in order to learn a continuous model class, an infinite number of samples is required. This is because a continuous model is only described exactly if we set b → ∞.

This conclusion is generally not true, but the theory needs to be extended to introduce a more refined notion of capacity of a model class: the Vapnik–Chervonenkis (VC) dimension.
We will see that, in a nutshell, the result proved above still holds forgeneral model classes by substituting log(|H|) with the VC dimension.
We focus on binary classification.
Osvaldo Simeone ML4Engineers 43 / 65
Capacity of a Model Revisited
Consider again a finite class of models – what does it mean that its capacity is log2(|H|) bits?

To start with some background, we say that an information source has a capacity of b bits if it can produce any binary vector of b bits (possibly after remapping of the alphabet to bits).

Ex.: A source has capacity b = 2 bits if it can produce four messages, e.g., {00, 01, 10, 11}, four hand gestures, four words, etc.

Consider a data set of N inputs X = (x1, ..., xN). A given model t̂(·) ∈ H produces a set of N binary predictions (t̂(x1), ..., t̂(xN)).

We can now think of the model class as a source of capacity N bits if, for some data set X = (x1, ..., xN), it can produce all possible 2^N binary vectors (t̂(x1), ..., t̂(xN)) of predictions by running over all models t̂(·) ∈ H.
Osvaldo Simeone ML4Engineers 44 / 65
Capacity of a Model Revisited
For a finite model class, there are only |H| models to choose from, and hence the maximum capacity is N = log2(|H|).

Example: Assume that we only have two models in the class H = {t̂1(·), t̂2(·)}:
I unless the two models are equivalent, for some input x, we can produce both one-bit messages if t̂1(x) = 0 and t̂2(x) = 1 (or vice versa);
I but we can never produce all messages of two bits, since we can only choose among two models.
But what about the case of a continuous model?
Osvaldo Simeone ML4Engineers 45 / 65
Example
Consider the set of all linear binary classifiers on the plane, with decision boundary passing through the origin, i.e.,

H = { t̂θ(x) = 1(θᵀx ≥ 0) : θ ∈ ℝ² },

where t̂θ(x) = 0 if θᵀx < 0 and t̂θ(x) = 1 if θᵀx ≥ 0.

Clearly we can obtain all messages of size N = 1: we can label any, and hence also some, point x1 as either 0 or 1...
[Figure: two panels showing the decision line of θ through the origin, labeling the single point x1 either as t̂ = 1 (message: 1) or as t̂ = 0 (message: 0).]
Osvaldo Simeone ML4Engineers 46 / 65
Example
... and also N = 2: i.e., we can label some pair of points x1 and x2 with any pair of binary labels...
[Figure: four panels showing orientations of θ whose decision lines assign the label pairs (1,0), (0,1), (0,0), and (1,1) to the two points x1 and x2.]
Osvaldo Simeone ML4Engineers 47 / 65
Example
... but there is no data set of three data points for which we can obtain all eight messages of N = 3 bits.
Therefore, the capacity of this model is 2 bits.
Note that 2 is also the number of free parameters in this case.
We will now make this concept of capacity more formal through the introduction of the VC dimension.
Osvaldo Simeone ML4Engineers 48 / 65
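This capacity argument can be explored numerically. The sketch below (our illustration, not from the slides) sweeps many random directions θ and collects the label vectors that the class 1(θᵀx ≥ 0) can produce on a given set of points; it confirms that a particular pair of points receives all four labelings, and exhibits a particular triple for which some of the eight labelings are missing. The general claim that no triple can be shattered still requires the geometric argument above:

```python
import numpy as np

rng = np.random.default_rng(0)

def realizable_labelings(X, n_theta=20_000):
    """Set of label vectors (t(x1), ..., t(xN)) produced by the classifiers
    1(theta^T x >= 0) as theta sweeps over random directions."""
    thetas = rng.standard_normal((n_theta, 2))
    labels = (X @ thetas.T >= 0).T        # (n_theta, N) binary predictions
    return {tuple(row) for row in labels.astype(int)}

X2 = np.array([[1.0, 0.0], [0.0, 1.0]])              # two points: shattered
X3 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three points: not shattered

print(len(realizable_labelings(X2)))  # 4 = 2^2: all labelings realizable
print(len(realizable_labelings(X3)))  # < 8: e.g. (1, 1, 0) is unobtainable
```

For X3, the labeling (1, 1, 0) would require θᵀx1 ≥ 0 and θᵀx2 ≥ 0 but θᵀ(x1 + x2) < 0, which is impossible.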
VC Dimension
A hypothesis class H is said to shatter a set of inputs X = (x1, ..., xN) if, no matter how the corresponding labels (t1, ..., tN) are selected, there exists a hypothesis t̂ ∈ H that ensures t̂(xn) = tn for all n = 1, ..., N.

This is the same idea explained above: The set of inputs X = (x1, ..., xN) is shattered by H if the models in H can produce all possible 2^N messages of N bits when applied to X.

The VC dimension VCdim(H) (measured in bits) of the model H is the size of the largest set X that can be shattered by H.

The VC dimension VCdim(H) is hence the capacity of the model as explained above.
Osvaldo Simeone ML4Engineers 49 / 65
VC Dimension
Based on the definitions above, to prove that a model has VCdim(H) = N, we need to carry out the following two steps:
I Step 1) Demonstrate the existence of a set X with |X| = N that is shattered by H; and
I Step 2) Prove that no set X of size N + 1 exists that is shattered by H.

For finite classes, we have the inequality VCdim(H) ≤ log2(|H|), since |H| hypotheses can create at most |H| different label configurations.
Osvaldo Simeone ML4Engineers 50 / 65
Examples
The threshold function model class

H = { t̂θ(x) = 1(x ≥ θ) : θ ∈ {θ1, ..., θ|H|} },

with x ∈ ℝ, has VCdim(H) = 1:
I Step 1) any set X of one sample (N = 1) can be shattered – and hence there is clearly at least one such point;
I Step 2) there are no sets of N = 2 points that can be shattered:
F for any set X = (x1, x2) of two points with x1 ≤ x2, the label assignment (t1, t2) = (1, 0) cannot be realized by any choice of the threshold θ.
Osvaldo Simeone ML4Engineers 51 / 65
Examples
The model

H = { t̂a,b(x) = 1(a ≤ x ≤ b) : a ≤ b },

which assigns the label t = 1 within an interval [a, b] and the label t = 0 outside it, has VCdim(H) = 2:
I Step 1) any set of N = 2 distinct points can be shattered – and hence there also exists one such set;
I Step 2) there are no sets X of N = 3 points that can be shattered:
F for any set X = (x1, x2, x3) of three points with x1 ≤ x2 ≤ x3, the label assignment (t1, t2, t3) = (1, 0, 1) cannot be realized, since an interval containing x1 and x3 must also contain x2.
I It can also be proved that the linear classifier in dimension D has VCdim(H) = D + 1.
Osvaldo Simeone ML4Engineers 52 / 65
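The two steps can be checked by brute force for the interval class; the sketch below (our illustration) enumerates all labelings of a point set and searches candidate intervals over a finite grid, which suffices here since only the relative order of the points matters:

```python
import itertools

def can_shatter(points):
    """Check whether the interval class {1(a <= x <= b)} realizes every
    labeling of the given 1D points (brute force over candidate intervals)."""
    points = sorted(points)
    # Candidate endpoints: the points themselves, midpoints, and outer sentinels
    grid = [points[0] - 1] + [(p + q) / 2 for p, q in zip(points, points[1:])] + [points[-1] + 1]
    grid = points + grid
    for labels in itertools.product([0, 1], repeat=len(points)):
        ok = any(all((1 if a <= x <= b else 0) == t
                     for x, t in zip(points, labels))
                 for a in grid for b in grid if a <= b)
        if not ok:
            return False
    return True

print(can_shatter([0.0, 1.0]))        # True: VCdim >= 2
print(can_shatter([0.0, 1.0, 2.0]))   # False: (1, 0, 1) is not realizable
```

The failing case is exactly the (1, 0, 1) assignment from Step 2: any interval containing the two outer points also contains the middle one.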
VC Dimension

Theorem: Under the detection-error loss, for a model class H with finite VCdim(H) < ∞, ERM is (N, ε, δ) PAC with accuracy

ε = √( C(2VCdim(H) + log(1/δ))/N )

for some constant C > 0.

Therefore, ERM achieves the estimation error

Lp(t̂ERMD) − Lp(t̂∗H) ≤ √( C(2VCdim(H) + log(1/δ))/N )

with probability no smaller than 1 − δ.

Equivalently, the sample complexity of ERM is

NERMH(ε, δ) = ⌈ C(2VCdim(H) + log(1/δ))/ε² ⌉.

This theorem generalizes the case of finite model classes by substituting VCdim(H) for log2(|H|).
Osvaldo Simeone ML4Engineers 53 / 65
Fundamental Theorem of Learning
So, we have seen that ERM achieves an estimation error that scales with the ratio VCdim(H)/N, i.e., a sample complexity proportional to VCdim(H). Is there any training algorithm that is more efficient?
Theorem: For a model H with finite VCdim(H) < ∞, the sample complexity of any training algorithm A is lower bounded as

NAH(ε, δ) ≥ D(VCdim(H) + log(1/δ))/ε²

for some constant D > 0.

The theorem demonstrates that, if learning is possible for a given model H, then ERM allows us to learn with close-to-optimal sample complexity.
Osvaldo Simeone ML4Engineers 54 / 65
Structural Risk Minimization
Osvaldo Simeone ML4Engineers 55 / 65
Structural Risk Minimization
Assume that we need to select an inductive bias by choosing one of a nested set of hypothesis classes H1 ⊆ H2 ⊆ ... ⊆ HMmax.

An example is the problem studied in Chapter 4 of selecting the model order M ∈ {1, 2, ..., Mmax} for linear regression.

How do we choose M? We have seen that the standard approach is validation. SRM is an alternative approach that does not require setting aside validation data.
Osvaldo Simeone ML4Engineers 56 / 65
Structural Risk Minimization
Consider a training algorithm AM producing a predictor t̂AMD ∈ HM in class HM. Note that the algorithm, and its output, depend on the model order M.

From the bound on the generalization error derived above, we have the inequality

Lp(t̂AMD) ≤ LD(t̂AMD) + √( (log(|HM|) + log(2/δ))/(2N) )

with probability at least 1 − δ.

SRM minimizes this upper bound, which is a pessimistic estimate of the generalization loss, over the choice of the model order M. Note that the training loss LD(t̂AMD) generally decreases with M, while the model capacity log(|HM|) increases with it.

Assuming that the upper bound is reasonably tight, this allows one to avoid validation.
Osvaldo Simeone ML4Engineers 57 / 65
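A minimal sketch of the SRM rule follows; the training losses and capacities are hypothetical placeholders, since in practice the former come from running each algorithm AM and the latter from the chosen model classes:

```python
import numpy as np

def srm_select(train_losses, capacities, N, delta=0.05):
    """SRM: pick the model order minimizing the pessimistic estimate
    L_D + sqrt((log|H_M| + log(2/delta)) / (2N)).
    `capacities[m]` stands for log|H_M| of the m-th class."""
    train_losses = np.asarray(train_losses, dtype=float)
    capacities = np.asarray(capacities, dtype=float)
    penalties = np.sqrt((capacities + np.log(2 / delta)) / (2 * N))
    bounds = train_losses + penalties
    return int(np.argmin(bounds)), bounds

# Hypothetical nested classes: training loss decreases with M,
# while the capacity log|H_M| grows
train_losses = [0.30, 0.15, 0.10, 0.09, 0.088]
capacities = [2.0, 4.0, 8.0, 16.0, 32.0]
best, bounds = srm_select(train_losses, capacities, N=100)
print(best, np.round(bounds, 3))
```

With these placeholder numbers the rule selects an intermediate model order: the largest class has the smallest training loss, but its capacity penalty outweighs the improvement.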
Summary
Osvaldo Simeone ML4Engineers 58 / 65
Summary
In this chapter, we have discussed basic elements of statistical learning theory, with the main goal of addressing the following questions:
I Given an inductive bias, how many training examples (N) are needed to learn to a given level of generalization accuracy?
F Answer: The sample complexity of ERM depends on the capacity (or VC dimension) of the selected model class, as well as on accuracy and confidence parameters.
I Conversely, how should we choose the inductive bias in order to ensure generalization for a given data set size N?
F Answer: Structural risk minimization suggests choosing the model class H that minimizes LD(t̂AH) + √( (log(|H|) + log(2/δ))/(2N) ), where the second term is a bound on the generalization error.
Osvaldo Simeone ML4Engineers 59 / 65
Summary
The theory we have developed in this chapter bounds the estimation error by relating the generalization error to the capacity of the model class.
A different class of bounds can be obtained by studying the “sensitivity” of a training procedure to the training data set:
I a more sensitive training algorithm is expected to overfit more easily and hence to generalize less effectively.
PAC Bayes analysis follows this approach and provides generalization error bounds that are generally among the current state of the art (see Appendix).
Osvaldo Simeone ML4Engineers 60 / 65
Summary
Furthermore, in this chapter, we have considered a worst-case formulation of the learning objective in which, for the given fixed N, we are interested in the performance under the worst, i.e., population-loss maximizing, distribution p(x, t).

Other formulations apply to a fixed population distribution p(x, t) and offer bounds that either depend explicitly on it (e.g., information-theoretic bounds) or apply uniformly across all possible distributions p(x, t).
Osvaldo Simeone ML4Engineers 61 / 65
Appendix
Osvaldo Simeone ML4Engineers 62 / 65
PAC Bayes
The PAC Bayes formulation assumes a probabilistic choice t̂ ∼ q(t̂|D) for the predictor given the training set D (e.g., using Bayes rule).

The distribution q(t̂|D) is typically referred to as the posterior, although it may not correspond to a real posterior distribution.
Under this distribution, we define:
I the average population loss

Lp(q(t̂|D)) = E_{t̂∼q(t̂|D)} E_{(x,t)∼p(x,t)}[ ℓ(t, t̂(x)) ];

I and the average training loss

LD(q(t̂|D)) = E_{t̂∼q(t̂|D)}[ (1/N)∑_{n=1}^{N} ℓ(tn, t̂(xn)) ].
Osvaldo Simeone ML4Engineers 63 / 65
PAC Bayes
Theorem: Let p(t̂) be a prior distribution on the predictors in the hypothesis class H, and let q(t̂|D) be a posterior in the same space. The prior distribution must be selected before observing the data D, while the posterior can depend on it. Then, with probability no smaller than 1 − δ, for any δ ∈ (0, 1), the generalization error satisfies the inequality

Lp(q(t̂|D)) ≤ LD(q(t̂|D)) + √( (KL(q(t̂|D)||p(t̂)) + log(N/δ))/(2(N − 1)) )

for all population distributions p(x, t).
The term KL(q(t̂|D)||p(t̂)) measures the sensitivity of the trained model to the training set D.

The right-hand side of the inequality above can be used to define an optimization criterion for the posterior q(t̂|D) – this is known as information risk minimization.
Osvaldo Simeone ML4Engineers 64 / 65
PAC Bayes
If the prior is uniform and the hypothesis class is finite, i.e., if p(t̂) = 1/|H|, we obtain

KL(q(t̂|D)||p(t̂)) = E_{t̂∼q(t̂|D)}[ log q(t̂|D) ] + log(|H|) = −H(q(t̂|D)) + log(|H|) ≤ log(|H|),

and hence we have the inequality

Lp(q(t̂|D)) ≤ LD(q(t̂|D)) + √( (log(|H|) + log(N/δ))/(2(N − 1)) ).
This tends to the PAC bound derived above as N increases.
Osvaldo Simeone ML4Engineers 65 / 65
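The penalty terms of the two bounds can be compared numerically; the sketch below (with a hypothetical finite class of size |H| = 1000) shows that both decay roughly as 1/√N, with the PAC Bayes term carrying an extra log(N/δ) in place of log(2/δ):

```python
import numpy as np

def pac_bayes_penalty(kl, N, delta=0.05):
    # sqrt((KL + log(N/delta)) / (2 (N - 1)))
    return np.sqrt((kl + np.log(N / delta)) / (2 * (N - 1)))

def pac_penalty(log_H, N, delta=0.05):
    # sqrt((log|H| + log(2/delta)) / (2 N))
    return np.sqrt((log_H + np.log(2 / delta)) / (2 * N))

log_H = np.log(1000.0)   # uniform prior over a finite class: KL <= log|H|
for N in (100, 1_000, 10_000, 100_000):
    print(N,
          round(float(pac_bayes_penalty(log_H, N)), 4),
          round(float(pac_penalty(log_H, N)), 4))
```

A posterior more concentrated than the uniform prior (KL < log|H|) tightens the PAC Bayes penalty further, which is the sense in which it can improve on the finite-class PAC bound.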