
Machine Learning for Engineers: Chapter 8. Elements of Statistical Learning Theory

Osvaldo Simeone

King’s College London

January 1, 2021


This Chapter

As we have discussed in the previous chapters, a learning problem set-up is defined by

- the inductive bias, consisting of the model class H, the loss function ℓ, and the training algorithm;

- and a training set D = {(xn, tn)}_{n=1}^N, consisting of N data points generated i.i.d. from the unknown population distribution p(x, t).

In this chapter, we will always view the training set D as a set of rvs drawn i.i.d. from p(x, t), rather than as a fixed data set.

The performance of a learned hypothesis, or (hard) predictor t̂(·), in H is measured by the population loss

Lp(t̂(·)) = E_{(x,t)∼p(x,t)}[ℓ(t, t̂(x))].


This Chapter

Training algorithms, such as ERM, generally depend on the training loss

LD(t̂(·)) = (1/N) ∑_{n=1}^N ℓ(tn, t̂(xn)).

In this chapter, we discuss basic elements of statistical learning theory, with the main goal of addressing the following questions:

- Given an inductive bias, how many training samples N are needed to learn to a given level of population loss? This quantity is known as the sample complexity.

- Conversely, how should we choose the inductive bias in order to ensure a suitably small population loss for a given data set size N?
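As a point of reference for the two loss definitions above, here is a minimal Python sketch (toy data and hypothetical predictors, purely for illustration) of the training loss as an empirical average of per-sample losses:

```python
# Minimal sketch: training loss L_D(t_hat) as the empirical average of the
# per-sample losses l(t_n, t_hat(x_n)) over a toy data set (illustrative only).
def detection_error(t, t_hat):
    return float(t != t_hat)

def training_loss(data, predictor, loss=detection_error):
    return sum(loss(t, predictor(x)) for x, t in data) / len(data)

toy_data = [(1, 0), (0, 1), (0, 1)]          # pairs (x_n, t_n)
print(training_loss(toy_data, lambda x: x))  # 1.0: this predictor misses all three points
print(training_loss(toy_data, lambda x: 1))  # 0.333...: one error out of three
```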


Overview

Benchmarks and decomposition of the optimality error

Probably Approximately Correct (PAC) learning

Sample complexity of ERM for finite model classes

Proof (and generalization error)

Sample complexity of ERM for continuous model classes

Structural Risk Minimization

Appendix: PAC Bayes learning


Benchmarks and Decomposition of the Optimality Error


Benchmarks

How well does a learning algorithm perform? We have two key benchmarks.

1) Population-optimal unconstrained predictor: the predictor that minimizes the population loss without any constraint on the model class, i.e.,

t̂∗(·) = arg min_{t̂(·)} Lp(t̂(·)).

2) Population-optimal within-class predictor: the predictor that minimizes the population loss within a given model class H, i.e.,

t̂∗H(·) ∈ arg min_{t̂(·)∈H} Lp(t̂(·)).


Benchmarks

The unconstrained predictor t̂∗(·) can be chosen from a richer domain than the within-class predictor t̂∗H(·), whose domain is limited to the model class H.

Hence, the unconstrained minimum population loss Lp(t̂∗(·)) cannot be larger than the minimum within-class population loss Lp(t̂∗H(·)), i.e., we have the inequality

Lp(t̂∗(·)) ≤ Lp(t̂∗H(·)).

Furthermore, we have the equality

Lp(t̂∗(·)) = Lp(t̂∗H(·))

if and only if the model class H is large enough that the population-optimal unconstrained predictor is in it, i.e., t̂∗(·) = t̂∗H(·) ∈ H.


Example

Consider the problem of binary classification with a binary input, i.e., x, t ∈ {0, 1}. The population distribution is such that x ∼ Bern(p), with 0 < p < 0.5, and t = x. Recall that this distribution is unknown to the learner.

Under the detection-error loss ℓ(t, t̂) = 1(t̂ ≠ t), the population-optimal predictor is clearly t̂∗(x) = x, yielding the minimum unconstrained population loss Lp(t̂∗(·)) = 0.

Consider now the "constant-predictor" model class

H = {t̂(x) = t̂ ∈ {0, 1} for all x ∈ {0, 1}}.

The population-optimal within-class predictor is t̂∗H(x) = 0, yielding the minimum within-class population loss

Lp(t̂∗H(·)) = (1 − p) × 0 + p × 1 = p > Lp(t̂∗(·)) = 0.

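As a quick numerical check of this example, the following Python sketch (with a hypothetical value p = 0.2, chosen only for illustration) evaluates the population loss of each constant predictor under the detection-error loss and confirms that the best within-class choice is t̂ = 0, with loss p:

```python
# Minimal sketch: population losses of the constant-predictor class when
# x ~ Bern(p) and t = x, under the detection-error loss.
p = 0.2  # assumed Bernoulli parameter, for illustration only

def population_loss_constant(t_hat):
    # Pr[t != t_hat] = (1 - p) * 1(t_hat != 0) + p * 1(t_hat != 1), since t = x
    return (1 - p) * (t_hat != 0) + p * (t_hat != 1)

losses = {t_hat: population_loss_constant(t_hat) for t_hat in (0, 1)}
print(losses)                       # {0: 0.2, 1: 0.8}
print(min(losses, key=losses.get))  # 0, i.e., the within-class optimum with loss p
```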

Decomposition of Optimality Error

From now on, we will write predictors t̂(·) simply as t̂ in order to simplify the notation.

We have seen in Chapter 4 that, for any predictor t̂, we can decompose the optimality error Lp(t̂) − Lp(t̂∗) as

Lp(t̂) − Lp(t̂∗) [optimality error] = (Lp(t̂∗H) − Lp(t̂∗)) [bias] + (Lp(t̂) − Lp(t̂∗H)) [estimation error].

The bias (Lp(t̂∗H) − Lp(t̂∗)), also known as the approximation error, depends on the choice of the model class H (see, e.g., the previous example).

The estimation error (Lp(t̂) − Lp(t̂∗H)) depends on the model class H, on the training algorithm producing the predictor t̂ from the training data, and on the training data itself.


ERM

Given the training data set

D = {(xn, tn)}_{n=1}^N ∼ i.i.d. p(x, t),

a training algorithm returns a predictor t̂D ∈ H.

Note that the selected model t̂D is random due to the randomness of the data set D.

As we have seen, a standard learning algorithm is empirical risk minimization (ERM), which minimizes the training loss:

t̂ERMD = arg min_{t̂∈H} LD(t̂), with LD(t̂) = (1/N) ∑_{n=1}^N ℓ(tn, t̂(xn)).

We generally have the inequalities

Lp(t̂ERMD) ≥ Lp(t̂∗H) ≥ Lp(t̂∗).


Example

Consider the problem of binary classification with a binary input, i.e., x, t ∈ {0, 1}. The population distribution is such that x ∼ Bern(p), with p < 0.5; and t = x with probability q > 0.5, while t ≠ x with probability 1 − q.

Under the detection-error loss, the population-optimal unconstrained predictor is t̂∗(x) = x, with corresponding unconstrained minimum population loss Lp(t̂∗) = 1 − q.

For the "constant-predictor" model class H, the population-optimal within-class predictor is t̂∗H(x) = 0, with minimum within-class population loss

Lp(t̂∗H) = (1 − p) Pr[t ≠ 0 | x = 0] + p Pr[t ≠ 0 | x = 1] = (1 − p)(1 − q) + pq ≥ 1 − q = Lp(t̂∗).


Example

Given the training data set D = {(1, 0), (0, 1), (0, 1)} (N = 3), where each pair lists (x, t), ERM trains a model in the class H by minimizing the training loss

LD(t̂) = (1/3) (1(t̂ ≠ 0) + 2 × 1(t̂ ≠ 1)).

The ERM predictor is hence t̂ERMD = 1, which gives the training loss LD(t̂ERMD) = 1/3.

The population loss of the ERM predictor is

Lp(t̂ERMD) = (1 − p) Pr[t ≠ 1 | x = 0] + p Pr[t ≠ 1 | x = 1] = (1 − p)q + p(1 − q).


Example

For instance, if p = 0.3 and q = 0.7, we have
- minimum unconstrained population loss Lp(t̂∗) = 0.3;
- minimum within-class population loss Lp(t̂∗H) = 0.42;
- ERM population loss Lp(t̂ERMD) = 0.58.

We also have the decomposition of the optimality error

Lp(t̂ERMD) − Lp(t̂∗) [0.58 − 0.3] = (Lp(t̂∗H) − Lp(t̂∗)) [bias = 0.12] + (Lp(t̂ERMD) − Lp(t̂∗H)) [estimation error = 0.16].
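These numbers are easy to reproduce; a minimal Python sketch (assuming the same p = 0.3, q = 0.7, and the ERM predictor t̂ = 1 found above) is:

```python
# Minimal sketch: reproduce the loss values and the optimality-error
# decomposition for p = 0.3, q = 0.7 under the detection-error loss.
p, q = 0.3, 0.7

def pop_loss(predict):
    """Population loss E[1(t != predict(x))] for x ~ Bern(p), t = x w.p. q."""
    loss = 0.0
    for x, px in ((0, 1 - p), (1, p)):
        for t, pt in ((x, q), (1 - x, 1 - q)):
            loss += px * pt * (predict(x) != t)
    return loss

L_star   = pop_loss(lambda x: x)   # unconstrained optimum: 0.3
L_star_H = pop_loss(lambda x: 0)   # best constant predictor: 0.42
L_erm    = pop_loss(lambda x: 1)   # ERM predictor from the N = 3 data set: approx. 0.58

print(L_star, L_star_H, L_erm)
print("bias =", L_star_H - L_star)             # approx. 0.12
print("estimation error =", L_erm - L_star_H)  # approx. 0.16
```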


Probably Approximately Correct (PAC) Learning


Statistical Learning Theory

Statistical learning theory studies the estimation error Lp(t̂D) − Lp(t̂∗H), which depends on how the data is used.

The bias Lp(t̂∗H) − Lp(t̂∗) is generally difficult to quantify, since it depends on the population-optimal unconstrained predictor t̂∗.
- In practice, if the loss Lp(t̂∗H) is too large, one needs to increase the capacity of the model or choose a different class of models that is more suitable for the problem.

If the estimation error is small, the training algorithm obtains close to the best possible within-class population loss:
- How much data is needed to ensure a small estimation error?

The estimation error is random, since it depends on the training set D:
- We first need to specify what we mean by a "small" estimation error.


Approximately and Probably

Due to the randomness of D, a learning rule t̂D can only minimize the population loss Lp(·)
- approximately, i.e., Lp(t̂D) ≤ Lp(t̂∗H) + ε for some ε > 0;
- and with probability at least 1 − δ for some δ > 0: with probability no larger than δ, we may have Lp(t̂D) > Lp(t̂∗H) + ε.


Probably Approximately Correct (PAC) Learning Rule

When operating on data sets D of N examples, a training algorithm A that produces a predictor t̂AD(·) is (N, ε, δ) PAC for a model class H, loss function ℓ, and set of population distributions p(x, t) if it has an estimation error no larger than the accuracy parameter ε,

Lp(t̂AD) ≤ Lp(t̂∗H) + ε,

with probability no smaller than 1 − δ for the confidence parameter δ > 0, that is, if we have

Pr_{D ∼ i.i.d. p(x,t)}[Lp(t̂AD) ≤ Lp(t̂∗H) + ε] ≥ 1 − δ

for any true distribution p(x, t) in the set.


Probably Approximately Correct (PAC) Learning Rule

The PAC requirement is a worst-case constraint, in the sense that it imposes the inequality

min_{p(x,t)} Pr_{D ∼ i.i.d. p(x,t)}[Lp(t̂AD) ≤ Lp(t̂∗H) + ε] ≥ 1 − δ,

where the minimum is taken over all possible population distributions in the set of interest.

The set of population distributions is typically taken to include all possible population distributions.
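To make the definition concrete, the following Python sketch estimates the PAC probability by Monte Carlo for the running example (assumed values p = 0.3, q = 0.7, ε = 0.1, N = 50, all chosen for illustration), with A being ERM over the constant-predictor class:

```python
# Minimal sketch: Monte Carlo estimate of the PAC probability
# Pr_D[ Lp(t_hat^A_D) <= Lp(t_hat*_H) + eps ] for the running example
# (p = 0.3, q = 0.7), with A = ERM over the constant-predictor class.
import random

p, q, N, eps, trials = 0.3, 0.7, 50, 0.1, 2000

def sample_point():
    x = 1 if random.random() < p else 0
    t = x if random.random() < q else 1 - x
    return x, t

def pop_loss_constant(t_hat):
    # E[1(t != t_hat)] for x ~ Bern(p), t = x with probability q
    return (1 - p) * ((1 - q) if t_hat == 0 else q) + p * (q if t_hat == 0 else 1 - q)

L_star_H = min(pop_loss_constant(0), pop_loss_constant(1))  # = 0.42

hits = 0
for _ in range(trials):
    data = [sample_point() for _ in range(N)]
    erm = min((0, 1), key=lambda c: sum(t != c for _, t in data))  # ERM over {0, 1}
    hits += pop_loss_constant(erm) <= L_star_H + eps
print("estimated PAC probability:", hits / trials)
```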


Sample Complexity

For a training algorithm A, the amount of data needed to achieve a certain accuracy-confidence level (ε, δ) is known as the sample complexity.

For a model class H, loss function ℓ, and class of population distributions p(x, t), a training algorithm A has sample complexity NAH(ε, δ) if, for the given ε, δ ∈ (0, 1), it is (N, ε, δ) PAC for all

N ≥ NAH(ε, δ).

[Figure: population loss as a function of the number of training samples.]


Is ERM PAC?

Intuitively, ERM is PAC by the law of large numbers.

The law of large numbers says that, for i.i.d. rvs u1, u2, ..., uM ∼ p(u) such that E[ui] = µ, the empirical mean (1/M) ∑_{m=1}^M um tends to the ensemble mean µ as M → ∞:
- We write this as

(1/M) ∑_{m=1}^M um → µ for M → ∞ in probability;

- which means that, for any ε > 0, we have the limit

Pr[ |(1/M) ∑_{m=1}^M um − µ| > ε ] → 0 for M → ∞.



Is ERM PAC?

By the law of large numbers, we hence have

LD(t̂) = (1/N) ∑_{n=1}^N ℓ(tn, t̂(xn)) → Lp(t̂) in probability,

separately for each fixed predictor t̂.

Hence, intuitively, the ERM solution t̂ERMD, which minimizes LD(t̂), should with high probability also approximately minimize Lp(t̂), i.e., be close to t̂∗H, as the data set size N increases.

But how large does N need to be, i.e., what is the sample complexity?
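The following sketch illustrates this convergence for the running example (assumed values p = 0.3, q = 0.7) and the fixed predictor t̂(x) = x, whose population loss is 1 − q = 0.3:

```python
# Minimal sketch: the training loss of a *fixed* predictor concentrates around
# its population loss as N grows (law of large numbers). Running example with
# p = 0.3, q = 0.7 and the fixed predictor t_hat(x) = x (population loss 0.3).
import random

p, q = 0.3, 0.7

def sample_point():
    x = 1 if random.random() < p else 0
    t = x if random.random() < q else 1 - x
    return x, t

for N in (10, 100, 1000, 10000):
    data = [sample_point() for _ in range(N)]
    LD = sum(t != x for x, t in data) / N  # training loss of t_hat(x) = x
    print(N, round(LD, 3))                 # approaches 0.3 as N grows
```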


Example

Consider the model class of threshold functions

H = {t̂θ(x) = 1(x ≥ θ)}, with t̂θ(x) = 0 if x < θ and t̂θ(x) = 1 if x ≥ θ,

where x and θ are real numbers (D = 1).

Assume that the population distribution p(x, t) is given as

p(x, t) = p(x) 1(t = t̂0(x))

for some marginal p(x).

Since the conditional p(t|x) of the population distribution is included in H, we say that the class of population distributions is "feasible" for H.

For the detection-error loss, the optimal predictor is t̂∗ = t̂∗H = t̂0, i.e., the bias is zero, and the population-optimal threshold (both unconstrained and within-class) is θ∗ = 0.


Example

Assume that p(x) is uniform in the interval [−0.5, 0.5]. The population loss is shown in the figure as a function of θ as a dashed line.

With the data in the figure, ERM may return any value of θ in the interval between the two closest positive and negative samples.

With N large enough, we see that t̂ERMD tends to the optimal predictor t̂0.

[Figure: training loss and population loss (dashed) as a function of the threshold θ ∈ [−0.5, 0.5], shown in two panels.]
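A minimal simulation of this example is sketched below (assuming p(x) uniform on [−0.5, 0.5] and noiseless labels t = 1(x ≥ 0)); among the ERM solutions, the midpoint of the zero-training-loss interval is returned for concreteness:

```python
# Minimal sketch: ERM over threshold classifiers when x ~ Uniform[-0.5, 0.5]
# and t = 1(x >= 0). Any threshold between the largest negative example and the
# smallest positive example is an ERM solution; we return their midpoint.
import random

def erm_threshold(N):
    xs = [random.uniform(-0.5, 0.5) for _ in range(N)]
    ts = [1 if x >= 0 else 0 for x in xs]
    lo = max((x for x, t in zip(xs, ts) if t == 0), default=-0.5)
    hi = min((x for x, t in zip(xs, ts) if t == 1), default=0.5)
    return (lo + hi) / 2  # zero training loss for any theta in (lo, hi]

for N in (10, 100, 1000, 10000):
    theta = erm_threshold(N)
    # Population loss of t_hat_theta: Pr[x falls between 0 and theta] = |theta|
    print(N, round(theta, 4), "population loss ≈", round(abs(theta), 4))
```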


Sample Complexity of ERM for Finite Model Classes


Finite Model Classes

To address the sample complexity of ERM, we will first consider model classes with a finite number of models.

For example, consider the class of threshold classifiers

H = {t̂θ(x) = 1(x ≥ θ) : θ ∈ {θ1, ..., θ|H|}},

where θ can only take a finite set of |H| values.

We write the number of models in H as |H| (the cardinality of the set H).


Capacity of a Finite Model Class

For a given finite model class H, we define the

model capacity = log(|H|) (nats) = log2(|H|) (bits)

as the number of nats or bits required to index the hypotheses in H.

Intuitively, a larger model capacity entails a larger sample complexity.

This intuition can be made precise by the following theorem.


Sample Complexity of ERM for Finite Model Classes

Theorem: For the detection-error loss (or any other loss bounded in the interval [0, 1]) and any finite hypothesis class H, ERM is (N, ε, δ) PAC with estimation error

ε = √[ (2 log(|H|) + log(1/δ)) / N ].

Equivalently, ERM achieves the estimation error

Lp(t̂ERMD) − Lp(t̂∗H) ≤ √[ (2 log(|H|) + log(1/δ)) / N ]

with probability no smaller than 1 − δ.


Sample Complexity of ERM for Finite Model Classes

The theorem above can be equivalently stated in terms of sample complexity.

Theorem: For the detection-error loss (or any other loss bounded in the interval [0, 1]) and any finite hypothesis class H, ERM has sample complexity

NERMH(ε, δ) = ⌈ (2 log(|H|) + log(1/δ)) / ε² ⌉.

So the sample complexity is
- proportional to the model capacity log(|H|);
- proportional to the "number of confidence digits" log(1/δ);
- and inversely proportional to the square of the accuracy ε.
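The two statements are easy to evaluate numerically; a minimal sketch, using the formulas exactly as stated above with illustrative values |H| = 1000 and δ = 0.05, is:

```python
# Minimal sketch: evaluate the theorem's formulas for an illustrative finite
# class with |H| = 1000 hypotheses and confidence parameter delta = 0.05.
from math import log, sqrt, ceil

H_size, delta = 1000, 0.05

def eps_bound(N):
    """Estimation-error bound of the theorem as a function of N."""
    return sqrt((2 * log(H_size) + log(1 / delta)) / N)

def sample_complexity(eps):
    """ERM sample complexity N^ERM_H(eps, delta) from the equivalent statement."""
    return ceil((2 * log(H_size) + log(1 / delta)) / eps ** 2)

for N in (100, 1000, 10000):
    print(N, round(eps_bound(N), 3))
print(sample_complexity(0.1))  # number of samples needed for accuracy eps = 0.1
```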


Proof of the Theorem

(and Generalization Error)


Proof

The proof of the theorem is based on
- a decomposition of the estimation error that includes the generalization error Lp(t̂D) − LD(t̂D): the generalization error measures the difference between the population and training losses for the trained predictor t̂D;
- the law of large numbers (in a stronger form known as Hoeffding's inequality);
- and the union bound.


Decomposition of the Estimation Error

The estimation error can be decomposed into the following contributions:

Lp(t̂D) − Lp(t̂∗H) [estimation error]
  = (Lp(t̂D) − LD(t̂D)) [generalization error]
  + (LD(t̂D) − LD(t̂∗H)) [training-based estimation error]
  + (LD(t̂∗H) − Lp(t̂∗H)) [empirical average error].

As a visual aid, we have

Lp(t̂D)  −  Lp(t̂∗H)
  ↓            ↑
LD(t̂D)  →  LD(t̂∗H)

All of these gap terms have to be small with high probability with respect to the selection of the data set D.
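The identity above can be checked numerically on the running example; a minimal sketch (assumed values p = 0.3, q = 0.7, constant-predictor class, ERM as the training algorithm, one random data set of N = 20 points) is:

```python
# Minimal sketch: for one random data set in the running example (p = 0.3,
# q = 0.7, constant-predictor class, ERM), check that the three gap terms
# add up exactly to the estimation error Lp(t_hat_D) - Lp(t_hat*_H).
import random

p, q, N = 0.3, 0.7, 20
random.seed(0)

def sample_point():
    x = 1 if random.random() < p else 0
    t = x if random.random() < q else 1 - x
    return x, t

def pop_loss(c):
    # population loss of the constant predictor c
    return (1 - p) * ((1 - q) if c == 0 else q) + p * (q if c == 0 else 1 - q)

data = [sample_point() for _ in range(N)]

def train_loss(c):
    return sum(t != c for _, t in data) / N

t_star_H = min((0, 1), key=pop_loss)    # population-optimal constant predictor
t_erm    = min((0, 1), key=train_loss)  # ERM predictor for this data set

estimation = pop_loss(t_erm) - pop_loss(t_star_H)
gen        = pop_loss(t_erm) - train_loss(t_erm)        # generalization error
train_gap  = train_loss(t_erm) - train_loss(t_star_H)   # training-based estimation error
emp_avg    = train_loss(t_star_H) - pop_loss(t_star_H)  # empirical average error
print(estimation, gen + train_gap + emp_avg)            # the two numbers coincide
```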


Decomposition of the Estimation Error

By the law of large numbers, the empirical average error LD(t̂∗H) − Lp(t̂∗H) goes to zero in probability as N → ∞. Note that the law of large numbers is applicable since t̂∗H is a fixed predictor that does not depend on D.

For ERM, the training-based estimation error satisfies LD(t̂ERMD) − LD(t̂∗H) ≤ 0, since ERM minimizes the training loss, and hence no other predictor in H can yield a smaller training loss.

For the generalization error Lp(t̂ERMD) − LD(t̂ERMD), we cannot use the law of large numbers, since the ERM predictor t̂ERMD is a random variable that depends on D and not a fixed predictor.


Generalization Error

Given the discussion above, statistical learning theory is largely devoted to the study of the generalization error Lp(t̂D) − LD(t̂D) for given training algorithms:
- If the generalization error is small, the training loss is a reliable estimate of the population loss.

The generalization error Lp(t̂D) − LD(t̂D) generally depends on the "capacity" of the model class H, on the loss function ℓ, and on the assumptions made about the data distribution.


Hoeffding’s inequality

For i.i.d. rvs u1, u2, ..., uM ∼ p(u) such that E[ui] = µ and Pr[a ≤ ui ≤ b] = 1 for some a ≤ b, we have the large-deviation inequality

Pr[ |(1/M) ∑_{m=1}^M um − µ| > ε ] ≤ 2 exp( −2Mε² / (b − a)² ).

Note that Hoeffding's inequality implies the limit Pr[ |(1/M) ∑_{m=1}^M um − µ| > ε ] → 0 for M → ∞, and hence it recovers the law of large numbers.

We now use this inequality to obtain bounds on the empirical average and generalization errors.
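As a quick numerical illustration (a Bernoulli(0.5) source with M = 100 and ε = 0.1, values chosen only for illustration), the empirical deviation probability can be compared with the bound 2 exp(−2Mε²):

```python
# Minimal sketch: compare the empirical deviation probability of a Bernoulli
# sample mean with the Hoeffding bound 2*exp(-2*M*eps^2) (here a = 0, b = 1).
import random
from math import exp

mu, M, eps, trials = 0.5, 100, 0.1, 20000

count = 0
for _ in range(trials):
    mean = sum(random.random() < mu for _ in range(M)) / M
    count += abs(mean - mu) > eps
print("empirical probability:", count / trials)
print("Hoeffding bound:      ", 2 * exp(-2 * M * eps ** 2))  # approx. 0.27
```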


Empirical Average Error

Let us consider the empirical average error LD(t̂∗H) − Lp(t̂∗H).

The training loss can be written as (1/M) ∑_{m=1}^M um by setting um = ℓ(tn, t̂∗H(xn)) for the n-th data point and M = N.

With this choice, we have µ = E[ℓ(tn, t̂∗H(xn))] = Lp(t̂∗H).

By assumption, the loss function is bounded between 0 and 1, and hence we can set a = 0 and b = 1.

Applying Hoeffding's inequality, we have, for any ε1 > 0,

Pr[ LD(t̂∗H) − Lp(t̂∗H) > ε1 ] ≤ 2 exp(−2Nε1²).


Generalization Error

For the generalization error Lp(t̂ERMD) − LD(t̂ERMD) of ERM, we would similarly like to bound the probability

Pr[ Lp(t̂ERMD) − LD(t̂ERMD) > ε2 ].

This probability can be upper bounded by the probability that there is at least one predictor in H that satisfies the inequality, i.e.,

Pr[ Lp(t̂ERMD) − LD(t̂ERMD) > ε2 ] ≤ Pr[ ∃ t̂ ∈ H : Lp(t̂) − LD(t̂) > ε2 ] = Pr[ ∪_{t̂∈H} { Lp(t̂) − LD(t̂) > ε2 } ],

where the last equality follows from the interpretation of a union of events as a logical OR.


Generalization Error

Now, we can use first the union bound and then again Hoeffding's inequality to obtain

Pr[ ∪_{t̂∈H} { Lp(t̂) − LD(t̂) > ε2 } ] ≤ ∑_{t̂∈H} Pr[ Lp(t̂) − LD(t̂) > ε2 ] ≤ 2|H| exp(−2Nε2²).

We note that, by this derivation, the bound Pr[ Lp(t̂D) − LD(t̂D) > ε2 ] ≤ 2|H| exp(−2Nε2²) applies to any learning algorithm, and not just to ERM. We will use this fact later.
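To get a feel for this union bound, the following sketch compares a Monte Carlo estimate of the uniform-deviation probability with 2|H| exp(−2Nε²) for a small finite class of threshold classifiers (assumed set-up, for illustration only: 11 thresholds on a grid, x uniform on [−0.5, 0.5], noiseless labels t = 1(x ≥ 0), N = 200, ε = 0.1):

```python
# Minimal sketch: Monte Carlo estimate of Pr[ max over the class of
# (Lp(t_hat) - LD(t_hat)) > eps ] for a small finite class of threshold
# classifiers, compared with the union bound 2*|H|*exp(-2*N*eps^2).
import random
from math import exp

thetas = [k / 10 for k in range(-5, 6)]   # |H| = 11 thresholds
N, eps, trials = 200, 0.1, 2000

def pop_loss(theta):
    return abs(theta)  # Pr[x falls between 0 and theta] for x ~ Uniform[-0.5, 0.5]

count = 0
for _ in range(trials):
    xs = [random.uniform(-0.5, 0.5) for _ in range(N)]
    ts = [1 if x >= 0 else 0 for x in xs]
    worst = max(pop_loss(th) - sum((x >= th) != t for x, t in zip(xs, ts)) / N
                for th in thetas)
    count += worst > eps
print("empirical probability:", count / trials)
print("union bound:          ", 2 * len(thetas) * exp(-2 * N * eps ** 2))
```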


Proof

Putting it all together, and using the fact that the training-based estimation error is non-positive for ERM, we obtain the chain of inequalities

Pr[ Lp(t̂ERMD) − Lp(t̂∗H) > ε ]   [estimation error]
≤ Pr[ (Lp(t̂ERMD) − LD(t̂ERMD)) + (LD(t̂∗H) − Lp(t̂∗H)) > ε ]   [generalization error + empirical average error]
≤ Pr[ Lp(t̂ERMD) − LD(t̂ERMD) > ε/2 ] + Pr[ LD(t̂∗H) − Lp(t̂∗H) > ε/2 ]
≤ 2(|H| + 1) exp(−Nε²/2).

The bound can be easily tightened to the value stated in the theorem by noting that the probability term for t̂∗H is counted twice. This is left as an exercise.


Sample Complexity of ERM for Continuous Model Classes


Continuous Model Classes

What does the theorem proved above say about infinite model classes, such as linear classifiers or neural networks?

A simple approach would be to quantize the hypothesis class H, obtaining a finite class Hb, by representing each element of the model parameter vector θ with b bits.


Continuous Model Classes

Example: Consider a neural network with D weights.
- The number of hypotheses in the quantized model class Hb is |Hb| = (2^b)^D, and the capacity of the hypothesis class is log2(|Hb|) = bD (bits), or log(|Hb|) = bD log 2 (nats).
- By the theorem, the ERM sample complexity is

NERMHb(ε, δ) = ⌈ (2bD log(2) + 2 log(2/δ)) / ε² ⌉,

which scales proportionally to the number of parameters D and to the bit resolution b.
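A minimal sketch evaluating this expression for a few illustrative values of D and b (with ε = 0.1 and δ = 0.05, chosen for illustration) is:

```python
# Minimal sketch: sample complexity of ERM for a quantized model class with
# D parameters represented using b bits each (illustrative values of D and b).
from math import log, ceil

def n_erm(D, b, eps, delta):
    """N^ERM(eps, delta) = ceil((2*b*D*log(2) + 2*log(2/delta)) / eps^2)."""
    return ceil((2 * b * D * log(2) + 2 * log(2 / delta)) / eps ** 2)

for D in (10, 100, 1000):
    for b in (8, 16, 32):
        print(D, b, n_erm(D, b, eps=0.1, delta=0.05))
```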


Continuous Model Classes

The previous theorem seems to imply that, in order to learn a continuous model class, an infinite number of samples is required, since a continuous model is only described exactly if we let b → ∞.

This conclusion is generally not true, but the theory needs to be extended to introduce a more refined notion of the capacity of a model class: the Vapnik-Chervonenkis (VC) dimension.

We will see that, in a nutshell, the result proved above still holds for general model classes by substituting log(|H|) with the VC dimension.

We focus on binary classification.


Capacity of a Model Revisited

Consider again a finite class of models: what does it mean that its capacity is log2(|H|) bits?

To start with some background, we say that an information source has a capacity of b bits if it can produce any binary vector of b bits (possibly after remapping its alphabet to bits).

Ex.: A source has capacity b = 2 bits if it can produce four messages, e.g., {00, 01, 10, 11}, four hand gestures, four words, etc.

Consider a data set of N inputs X = (x1, ..., xN). A given model t̂(·) ∈ H produces a set of N binary predictions (t̂(x1), ..., t̂(xN)).

We can now think of the model class as a source of capacity N bits if, for some data set X = (x1, ..., xN), it can produce all possible 2^N binary vectors (t̂(x1), ..., t̂(xN)) of predictions by running over all models t̂(·) ∈ H.


Capacity of a Model Revisited

For a finite model class, there are only |H| models to choose from, and hence the maximum capacity is N = log2(|H|).

Example: Assume that we only have two models in the class, H = {t̂1(·), t̂2(·)}:
- unless the two models are equivalent, for some input x we can produce both one-bit messages, e.g., if t̂1(x) = 0 and t̂2(x) = 1 (or vice versa);
- but we can never produce all messages of two bits, since we can only choose among two models.

But what about the case of a continuous model class?


Example

Consider the set of all linear binary classifiers on the plane whose decision boundary passes through the origin, i.e.,

H = {t̂θ(x) = 1(θᵀx ≥ 0) : θ ∈ R²}, with t̂θ(x) = 0 if θᵀx < 0 and t̂θ(x) = 1 if θᵀx ≥ 0.

Clearly, we can obtain all messages of size N = 1: we can label any, and hence also some, point x1 as either 0 or 1...

[Figure: two panels showing a point x1 and a vector θ; in one panel x1 falls in the region t̂ = 1 (message: 1), in the other in the region t̂ = 0 (message: 0).]


Example

... and also N = 2: i.e., we can label some pair of points x1 and x2 with any pair of binary labels...

[Figure: four panels showing points x1 and x2 with choices of θ that realize the messages (0, 0), (0, 1), (1, 0), and (1, 1).]


Example

... but there is no data set of three points for which we can obtain all eight messages of N = 3 bits.

Therefore, the capacity of this model class is 2 bits.

Note that 2 is also the number of free parameters in this case.

We will now make this concept of capacity more formal through the introduction of the VC dimension.


VC Dimension

A hypothesis class H is said to shatter a set of inputs X = (x1, ..., xN) if, no matter how the corresponding labels (t1, ..., tN) are selected, there exists a hypothesis t̂ ∈ H that ensures t̂(xn) = tn for all n = 1, ..., N.

This is the same idea explained above: the set of inputs X = (x1, ..., xN) is shattered by H if the models in H can produce all possible 2^N messages of N bits when applied to X.

The VC dimension VCdim(H) (measured in bits) of the model class H is the size of the largest set X that can be shattered by H.

The VC dimension VCdim(H) is hence the capacity of the model class in the sense explained above.


VC Dimension

Based on the definitions above, to prove that a model has VCdim(H) = N, we need to carry out the following two steps:

I Step 1) Demonstrate the existence of a set X with |X| = N that is shattered by H; and

I Step 2) Prove that no set X of size N + 1 exists that is shattered by H.

For finite classes, we have the inequality VCdim(H) ≤ log2(|H|), since |H| hypotheses can create at most |H| different label configurations.
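For a small finite class, these two steps can be carried out by brute force. The following sketch (Python/NumPy assumed; the class of three threshold functions in the usage example is made up for illustration) enumerates candidate input sets of increasing size, checks whether every labeling of some set is realized by a hypothesis, and also illustrates the bound VCdim(H) ≤ log2(|H|).

    import numpy as np
    from itertools import combinations

    def vc_dimension(hypotheses, inputs):
        """Brute-force VC dimension of a finite hypothesis class, restricted
        to candidate points drawn from `inputs`; each hypothesis maps x -> {0, 1}."""
        # Pre-compute the label that each hypothesis assigns to each input.
        table = np.array([[h(x) for x in inputs] for h in hypotheses])  # (|H|, |X|)
        vcdim = 0
        for n in range(1, len(inputs) + 1):
            shattered_some_set = False
            for idx in combinations(range(len(inputs)), n):
                patterns = {tuple(row) for row in table[:, list(idx)]}
                if len(patterns) == 2 ** n:   # all 2^n labelings are realized
                    shattered_some_set = True
                    break
            if shattered_some_set:
                vcdim = n
            else:
                break   # if no set of size n is shattered, no larger set is either
        return vcdim

    # Hypothetical usage: three threshold functions applied to inputs on a grid.
    thresholds = [1.0, 2.0, 3.0]
    hypotheses = [lambda x, th=th: int(x >= th) for th in thresholds]
    inputs = np.linspace(0.0, 4.0, 9)
    print(vc_dimension(hypotheses, inputs))   # expected: 1
    print(np.log2(len(hypotheses)))           # upper bound log2(|H|) ≈ 1.58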


Examples

The threshold function model class

H = { t̂θ(x) = 1(x ≥ θ) : θ ∈ {θ1, ..., θ|H|} },

with x ∈ R, has VCdim(H) = 1:

I Step 1) any set X of one sample (N = 1) can be shattered – and hence there is clearly at least one such point;

I Step 2) there is no set of N = 2 points that can be shattered:
F for any set X = (x1, x2) of two points with x1 ≤ x2, the label assignment (t1, t2) = (1, 0) cannot be realized by any choice of the threshold θ.


Examples

The model

H = {t̂a,b(x) = 1(a ≤ x ≤ b) : a ≤ b},

which assigns the label t = 1 within an interval [a, b] and the label t = 0 outside it, has VCdim(H) = 2:

I Step 1) any set of N = 2 distinct points can be shattered – and hence there also exists one such set;

I Step 2) there is no set X of N = 3 points that can be shattered:
F for any set X = (x1, x2, x3) of three points with x1 ≤ x2 ≤ x3, the label assignment (t1, t2, t3) = (1, 0, 1) cannot be realized (a brute-force check is sketched below).

I It can also be proved that the linear classifier (with bias) operating in dimension D has VCdim(H) = D + 1.
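The interval example can be checked numerically with the same brute-force idea, approximating the continuous parameters (a, b) by a grid. A minimal sketch under these assumptions (Python/NumPy; grid and test points chosen only for illustration):

    import numpy as np

    # Grid approximation of the interval class {1(a <= x <= b) : a <= b}.
    grid = np.linspace(-1.0, 2.0, 61)
    intervals = [(a, b) for a in grid for b in grid if a <= b]

    def patterns(X):
        """Label patterns realizable on the points X by the (gridded) interval class."""
        return {tuple(int(a <= x <= b) for x in X) for (a, b) in intervals}

    X2 = np.array([0.2, 0.8])
    X3 = np.array([0.2, 0.5, 0.8])
    print(len(patterns(X2)) == 2 ** 2)   # True: a pair of distinct points is shattered
    print(len(patterns(X3)) == 2 ** 3)   # False: the labeling (1, 0, 1) is never realized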


VC Dimension

Theorem: Under the detection-error loss, for a model class H with finite VCdim(H) < ∞, ERM is (N, ε, δ) PAC with accuracy

ε = √( C · (2·VCdim(H) + log(1/δ)) / N )

for some constant C > 0.

Therefore, ERM achieves the estimation error

Lp(t̂^ERM_D) − Lp(t̂^*_H) ≤ √( C · (2·VCdim(H) + log(1/δ)) / N )

with probability no smaller than 1 − δ.

Equivalently, the sample complexity of ERM is

N^ERM_H(ε, δ) = ⌈ C · (2·VCdim(H) + log(1/δ)) / ε² ⌉.

This theorem generalizes the case of finite model classes by substituting VCdim(H) for log2(|H|).
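As an illustration of how the sample-complexity expression behaves, the snippet below evaluates it for a few accuracy levels. The constant C is left unspecified by the theorem, so C = 1 is used here purely for illustration (Python assumed):

    import math

    def erm_sample_complexity(vcdim, eps, delta, C=1.0):
        """Ceiling of C * (2*VCdim(H) + log(1/delta)) / eps^2, with an assumed constant C."""
        return math.ceil(C * (2 * vcdim + math.log(1 / delta)) / eps ** 2)

    for eps in [0.1, 0.05, 0.01]:
        n = erm_sample_complexity(vcdim=3, eps=eps, delta=0.05)
        print(f"eps = {eps}: N >= {n}")
    # Halving eps roughly quadruples the required number of samples,
    # while the dependence on the confidence level delta is only logarithmic.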


Fundamental Theorem of Learning

So, we have seen that ERM achieves an estimation error controlled by the ratio VCdim(H)/N, i.e., a sample complexity proportional to VCdim(H). Is there any training algorithm that is more efficient?

Theorem: For a model H with finite VCdim(H) < ∞, the sample complexity of any training algorithm A is lower bounded as

N^A_H(ε, δ) ≥ D · (VCdim(H) + log(1/δ)) / ε²

for some constant D > 0.

The theorem demonstrates that, if learning is possible for a given model H, then ERM allows us to learn with close-to-optimal sample complexity.
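The scaling of the estimation error with the ratio VCdim(H)/N can also be observed empirically. The sketch below (Python/NumPy assumed; the data-generating distribution and its parameters are invented for illustration) runs ERM over the threshold class 1(x ≥ θ), for which VCdim(H) = 1, on data with label noise and averages the resulting estimation error over independent trials:

    import numpy as np

    rng = np.random.default_rng(1)
    rho = 0.1   # label-flipping probability of the assumed population distribution

    def sample(N):
        x = rng.uniform(0.0, 1.0, N)
        t = (x >= 0.5).astype(int) ^ (rng.uniform(size=N) < rho).astype(int)
        return x, t

    def erm_threshold(x, t):
        """ERM over 1(x >= theta): it suffices to try thresholds at the data points (and 0)."""
        candidates = np.concatenate(([0.0], np.sort(x)))
        errors = [np.mean(t != (x >= th)) for th in candidates]
        return candidates[int(np.argmin(errors))]

    def estimation_error(theta):
        # Population loss of 1(x >= theta) minus the best-in-class loss rho
        # (closed form under the assumed distribution above).
        return (1 - 2 * rho) * abs(theta - 0.5)

    for N in [10, 100, 1000]:
        errs = [estimation_error(erm_threshold(*sample(N))) for _ in range(200)]
        print(f"N = {N}: average estimation error ~ {np.mean(errs):.3f}")
    # The average estimation error shrinks roughly as 1/sqrt(N),
    # in line with the VC-dimension-based bounds.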


Structural Risk Minimization


Assume that we need to select an inductive bias by choosing one of a nested set of hypothesis classes H1 ⊆ H2 ⊆ ... ⊆ H_Mmax.

An example is the problem studied in Chapter 4 of selecting the model order M ∈ {1, 2, ..., Mmax} for linear regression.

How do we choose M? We have seen that the standard approach is validation. SRM is an alternative approach that does not require setting aside validation data.


Structural Risk Minimization

Consider a training algorithm A_M producing a predictor t̂^{A_M}_D ∈ H_M in class H_M. Note that the algorithm, and its output, depend on the model order M.

From the bound on the generalization error derived above, we have the inequality

Lp(t̂^{A_M}_D) ≤ L_D(t̂^{A_M}_D) + √( (log(|H_M|) + log(2/δ)) / (2N) )

with probability at least 1 − δ.

SRM minimizes this upper bound, which is a pessimistic estimate of the generalization loss, over the choice of the model M. Note that the training loss L_D(t̂^{A_M}_D) generally decreases with M, while the model capacity log(|H_M|) increases with it.

Assuming that the upper bound is reasonably tight, this allows one to avoid validation.
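As an illustration, the following sketch implements the SRM selection rule for a family of finite model classes: it adds the capacity penalty √((log(|H_M|) + log(2/δ))/(2N)) to each training loss and returns the minimizing model order. (Python assumed; the training losses and cardinalities in the usage example are hypothetical.)

    import math

    def srm_select(train_losses, cardinalities, N, delta=0.05):
        """Return the model order M minimizing training loss + capacity penalty;
        train_losses[m] and cardinalities[m] refer to model class H_{m+1}."""
        def bound(loss, card):
            return loss + math.sqrt((math.log(card) + math.log(2 / delta)) / (2 * N))
        bounds = [bound(l, c) for l, c in zip(train_losses, cardinalities)]
        best = min(range(len(bounds)), key=bounds.__getitem__)
        return best + 1, bounds

    # Hypothetical numbers: the training loss shrinks with M while |H_M| grows.
    train_losses = [0.30, 0.18, 0.12, 0.10, 0.09]
    cardinalities = [2 ** 4, 2 ** 8, 2 ** 16, 2 ** 32, 2 ** 64]
    M, bounds = srm_select(train_losses, cardinalities, N=200)
    print(M, [round(b, 3) for b in bounds])
    # The penalty eventually outweighs the reduction in training loss,
    # so SRM selects an intermediate model order (here M = 3).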


Summary


In this chapter, we have discussed basic elements of statistical learning theory, with the main goal of addressing the following questions:

I Given an inductive bias, how many training examples (N) are needed to learn to a given level of generalization accuracy?

F Answer: The sample complexity of ERM depends on the capacity (or VC dimension) of the selected model class, as well as on accuracy and confidence parameters.

I Conversely, how should we choose the inductive bias in order to ensure generalization for a given data set size N?

F Answer: Structural risk minimization suggests choosing the model class H that minimizes L_D(t̂^A_H) + √( (log(|H|) + log(2/δ)) / (2N) ), where the second term is a bound on the generalization error.


Summary

The theory we have developed in this chapter bounds the estimation error by relating the generalization error to the capacity of the model class.

A different class of bounds can be obtained by studying the “sensitivity” of a training procedure to the training data set:

I a more sensitive training algorithm is expected to overfit more easily and hence to generalize less effectively.

PAC Bayes analysis follows this approach and generally provides state-of-the-art generalization error bounds (see Appendix).


Summary

Furthermore, in this chapter, we have considered a worst-case formulation of the learning objective in which, for the given fixed N, we are interested in the performance under the worst, i.e., estimation-error maximizing, distribution p(x, t).

Other formulations apply to a fixed population distribution p(x, t) and offer bounds that depend explicitly on it (e.g., information-theoretic bounds), or apply uniformly across all possible distributions p(x, t).


Appendix


PAC Bayes

The PAC Bayes formulation assumes a probabilistic choice t̂ ∼ q(t̂|D) for the predictor given the training set D (e.g., using Bayes rule).

The distribution q(t̂|D) is typically referred to as a posterior, although it may not correspond to a true posterior distribution.

Under this distribution, we define:

I the average population loss

Lp(q(t̂|D)) = E_{t̂∼q(t̂|D)} E_{(x,t)∼p(x,t)}[ℓ(t, t̂(x))];

I and the average training loss

L_D(q(t̂|D)) = E_{t̂∼q(t̂|D)}[ (1/N) ∑_{n=1}^N ℓ(tn, t̂(xn)) ].


PAC Bayes

Theorem: Let p(t̂) be a prior distribution on the predictors in the hypothesis class H and q(t̂|D) a posterior in the same space. The prior distribution must be selected before observing the data D, while the posterior can depend on it. Then, with probability no smaller than 1 − δ, for any δ ∈ (0, 1), the generalization error satisfies the inequality

Lp(q(t̂|D)) ≤ L_D(q(t̂|D)) + √( (KL(q(t̂|D)||p(t̂)) + log(N/δ)) / (2(N − 1)) )

for all population distributions p(x, t).

The term KL(q(t̂|D)||p(t̂)) measures the sensitivity of the trained model to the training set D.

The right-hand side of the inequality above can be used to define an optimization criterion for the posterior q(t̂|D) – this is known as information risk minimization.
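A minimal numerical sketch of the bound for a finite hypothesis class of thresholds, assuming a uniform prior and a Gibbs-style posterior that concentrates on hypotheses with low training loss (the synthetic data, the class, and the posterior are all illustrative assumptions, not prescribed by the theorem):

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic training set (invented for illustration).
    N = 200
    x = rng.uniform(0.0, 1.0, N)
    t = (x >= 0.5).astype(int) ^ (rng.uniform(size=N) < 0.1).astype(int)

    # Finite hypothesis class: thresholds on a grid; uniform prior p(t_hat).
    thresholds = np.linspace(0.0, 1.0, 51)
    train_losses = np.array([np.mean(t != (x >= th)) for th in thresholds])
    prior = np.full(len(thresholds), 1.0 / len(thresholds))

    # Gibbs-style posterior q(t_hat|D) proportional to prior * exp(-beta * N * training loss),
    # computed in log space for numerical stability.
    beta = 2.0
    log_q = np.log(prior) - beta * N * train_losses
    log_q -= log_q.max()
    q = np.exp(log_q)
    q /= q.sum()

    avg_train_loss = float(q @ train_losses)
    kl = float(np.sum(q * np.log(q / prior)))   # finite here since q > 0 everywhere
    delta = 0.05
    bound = avg_train_loss + np.sqrt((kl + np.log(N / delta)) / (2 * (N - 1)))
    print(f"average training loss ~ {avg_train_loss:.3f}, KL ~ {kl:.2f}, "
          f"PAC Bayes bound on the average population loss ~ {bound:.3f}")

Increasing β concentrates the posterior, which lowers the average training loss but raises the KL term; information risk minimization makes this trade-off explicit by optimizing the right-hand side of the bound over q(t̂|D).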


PAC Bayes

If the prior is uniform and the hypothesis class is finite, i.e., if p(t̂) = 1/|H|, we obtain

KL(q(t̂|D)||p(t̂)) = E_{t̂∼q(t̂|D)}[log q(t̂|D)] + log(|H|)
                = −H(q(t̂|D)) + log(|H|) ≤ log(|H|),

and hence we have the inequality

Lp(q(t̂|D)) ≤ L_D(q(t̂|D)) + √( (log(|H|) + log(N/δ)) / (2(N − 1)) ).

As N increases, this bound approaches the PAC bound derived above for finite model classes (up to the replacement of log(2/δ) with log(N/δ)).
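The identity KL(q(t̂|D)||p(t̂)) = −H(q(t̂|D)) + log(|H|) for a uniform prior can be verified numerically in a few lines (Python/NumPy assumed; the posterior is a randomly drawn probability vector used only for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    H_size = 8
    q = rng.dirichlet(np.ones(H_size))       # an arbitrary posterior over |H| hypotheses
    p = np.full(H_size, 1.0 / H_size)        # uniform prior

    kl = np.sum(q * np.log(q / p))
    entropy = -np.sum(q * np.log(q))
    print(np.isclose(kl, np.log(H_size) - entropy))   # True
    print(kl <= np.log(H_size))                       # True, since the entropy is non-negative

Since the entropy is non-negative, the KL term never exceeds log(|H|), which is what yields the inequality above.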
