TRANSCRIPT
Machine Learning for Engineers: Chapter 8. Elements of Statistical Learning Theory
Osvaldo Simeone
King’s College London
January 1, 2021
Osvaldo Simeone ML4Engineers 1 / 65
This Chapter

As we have discussed in the previous chapters, a learning problem set-up is defined by
- the inductive bias, consisting of the model class H, the loss function ℓ, and the training algorithm;
- and a training set D = {(x_n, t_n)}_{n=1}^N, consisting of N data points generated i.i.d. from the unknown population distribution p(x, t).
In this chapter, we will always view the training set D as a set of rvs drawn i.i.d. from p(x, t), rather than as a fixed data set.
The performance of a learned hypothesis, or (hard) predictor t̂(·), in H is defined by the population loss

Lp(t̂(·)) = E_{(x,t)∼p(x,t)}[ℓ(t, t̂(x))].
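The population loss is an expectation under p(x, t), so in simulation it can be approximated by Monte Carlo sampling. A minimal Python sketch (the Bernoulli population and the constant predictor below are illustrative choices, not part of the slides):

```python
import random

def population_loss_mc(predictor, sample_pair, loss, num_samples=100_000, seed=0):
    """Monte Carlo estimate of Lp(t_hat) = E_{(x,t) ~ p(x,t)}[loss(t, t_hat(x))].

    `sample_pair` draws one (x, t) pair from the population distribution; in a
    real learning problem p(x, t) is unknown, so this is only a conceptual
    check available in simulation.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x, t = sample_pair(rng)
        total += loss(t, predictor(x))
    return total / num_samples

def sample_pair(rng):
    # Illustrative population: x ~ Bern(0.3) and t = x (noiseless labels).
    x = int(rng.random() < 0.3)
    return x, x

detection_loss = lambda t, t_hat: float(t != t_hat)

# The constant predictor t_hat(x) = 0 errs exactly when t = 1, i.e. w.p. 0.3.
est = population_loss_mc(lambda x: 0, sample_pair, detection_loss)
```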
This Chapter

Training algorithms, such as ERM, are generally dependent on the training loss

L_D(t̂(·)) = (1/N) Σ_{n=1}^N ℓ(t_n, t̂(x_n)).

In this chapter, we discuss basic elements of statistical learning theory, with the main goal of addressing the following basic questions:
- Given an inductive bias, how many training samples N are needed to learn to a given level of population loss?
  - This is known as the sample complexity.
- Conversely, how should we choose the inductive bias in order to ensure a suitably small population loss for a given data set size N?
Overview
Benchmarks and decomposition of the optimality error
Probably Approximately Correct (PAC) learning
Sample complexity of ERM for finite model classes
Proof (and generalization error)
Sample complexity of ERM for continuous model classes
Structural Risk Minimization
Appendix: PAC Bayes learning
Benchmarks and Decomposition of the Optimality Error
Benchmarks

How well does a learning algorithm perform? We have two key benchmarks.
1) Population-optimal unconstrained predictor: the predictor that minimizes the population loss without any constraint on the model class, i.e.,

t̂*(·) = arg min_{t̂(·)} Lp(t̂(·)).

2) Population-optimal within-class predictor: the predictor that minimizes the population loss within a given model class H, i.e.,

t̂*_H(·) ∈ arg min_{t̂(·)∈H} Lp(t̂(·)).
Benchmarks
The unconstrained predictor t̂∗(·) can be instantiated in a richerdomain than the within-class predictor t̂∗H(·), whose domain is limitedto the model class H.
Hence, the unconstrained minimum population loss Lp(t̂∗(·)) cannotbe larger than the minimum within-class population loss Lp(t̂∗H(·)),i.e., we have the inequality
Lp(t̂∗(·)) ≤ Lp(t̂∗H(·)).
Furthermore, we have the equality
Lp(t̂∗(·)) = Lp(t̂∗H(·))
if and only if the model class H is large enough so that thepopulation-optimal unconstrained predictor is in it, i.e.,t̂∗(·) = t̂∗H(·) ∈ H.
Osvaldo Simeone ML4Engineers 7 / 65
Example

Consider the problem of binary classification with a binary input, i.e., x, t ∈ {0, 1}. The population distribution is such that x ∼ Bern(p), with 0 < p < 0.5, and t = x. Recall that this distribution is unknown to the learner.
Under the detection-error loss ℓ(t, t̂) = 1(t̂ ≠ t), the population-optimal predictor is clearly t̂*(x) = x, yielding the minimum unconstrained population loss Lp(t̂*(·)) = 0.
Consider now the "constant-predictor" model class

H = {t̂(x) = t̂ ∈ {0, 1} for all x ∈ {0, 1}}.

The population-optimal within-class predictor is t̂*_H(x) = 0, yielding the minimum within-class population loss

Lp(t̂*_H(·)) = (1 − p) × 0 + p × 1 = p > Lp(t̂*(·)) = 0.
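Since the input takes only two values, the population losses in this example can be evaluated exactly by summing over the support. A small check, with an illustrative value of p:

```python
p = 0.2  # any value with 0 < p < 0.5 works; 0.2 is an illustrative choice

def pop_loss(predictor):
    # E[1(t != t_hat(x))] with x ~ Bern(p) and t = x: sum over x in {0, 1}.
    return (1 - p) * float(predictor(0) != 0) + p * float(predictor(1) != 1)

loss_unconstrained = pop_loss(lambda x: x)  # population-optimal t_hat*(x) = x
loss_const0 = pop_loss(lambda x: 0)         # best constant predictor
loss_const1 = pop_loss(lambda x: 1)         # worse constant predictor
```

As on the slide, the unconstrained optimum attains zero loss, while the best constant predictor pays exactly p.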
Decomposition of the Optimality Error

From now on, we will write predictors t̂(·) as t̂ in order to simplify the notation.
We have seen in Chapter 4 that, for any predictor t̂, we can decompose the optimality error Lp(t̂) − Lp(t̂*) as

Lp(t̂) − Lp(t̂*)  [optimality error]
  = (Lp(t̂*_H) − Lp(t̂*))  [bias]  +  (Lp(t̂) − Lp(t̂*_H))  [estimation error].

The bias Lp(t̂*_H) − Lp(t̂*), also known as the approximation error, depends on the choice of the model class H (see, e.g., the previous example).
The estimation error Lp(t̂) − Lp(t̂*_H) depends on the model class H, on the training algorithm producing the predictor t̂ from the training data, and on the training data itself.
ERM

Given the training data set

D = {(x_n, t_n)}_{n=1}^N ∼ i.i.d. p(x, t),

a training algorithm returns a predictor t̂_D ∈ H.
Note that the selected model t̂_D is random due to the randomness of the data set D.
As we have seen, a standard learning algorithm is empirical risk minimization (ERM), which minimizes the training loss L_D(t̂) as

t̂_D^ERM = arg min_{t̂∈H} L_D(t̂),  with  L_D(t̂) = (1/N) Σ_{n=1}^N ℓ(t_n, t̂(x_n)).

We generally have the inequalities

Lp(t̂_D^ERM) ≥ Lp(t̂*_H) ≥ Lp(t̂*).
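For a finite model class, ERM can be implemented by exhaustive search over the hypotheses. A minimal sketch (the constant-predictor class and the N = 3 data set mirror the worked example on the following slides):

```python
def erm(model_class, data, loss):
    """Return a hypothesis minimizing the training loss L_D over a finite class."""
    def train_loss(h):
        return sum(loss(t, h(x)) for x, t in data) / len(data)
    return min(model_class, key=train_loss)

detection_loss = lambda t, t_hat: float(t != t_hat)
H = [lambda x: 0, lambda x: 1]   # finite "constant-predictor" class
D = [(1, 0), (0, 1), (0, 1)]     # (x, t) pairs, N = 3
t_hat_erm = erm(H, D, detection_loss)
```

Here ERM selects the constant predictor 1, whose training loss is 1/3: the constant 0 misclassifies two of the three examples, the constant 1 only one.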
Example

Consider the problem of binary classification with a binary input, i.e., x, t ∈ {0, 1}. The population distribution is such that x ∼ Bern(p), with p < 0.5; and t = x with probability q > 0.5, while t ≠ x with probability 1 − q.
Under the detection-error loss, the population-optimal unconstrained predictor is t̂*(x) = x, with corresponding unconstrained minimum population loss Lp(t̂*) = 1 − q.
For the "constant-predictor" model class H, the population-optimal within-class predictor is t̂*_H(x) = 0, with minimum within-class population loss

Lp(t̂*_H) = (1 − p) Pr[t ≠ 0 | x = 0] + p Pr[t ≠ 0 | x = 1]
         = (1 − p)(1 − q) + p q ≥ 1 − q = Lp(t̂*).
Example

Given the training data set D = {(1, 0), (0, 1), (0, 1)}, where each pair is (x, t) and N = 3, ERM trains a model in class H by minimizing the training loss

L_D(t̂) = (1/3) (1(t̂ ≠ 0) + 2 × 1(t̂ ≠ 1)).

The ERM predictor is hence t̂_D^ERM = 1, which gives the training loss L_D(t̂_D^ERM) = 1/3.
The population loss of the ERM predictor is

Lp(t̂_D^ERM) = (1 − p) Pr[t ≠ 1 | x = 0] + p Pr[t ≠ 1 | x = 1]
            = (1 − p) q + p (1 − q).
Example

For instance, if p = 0.3 and q = 0.7, we have
- minimum unconstrained population loss Lp(t̂*) = 0.3;
- minimum within-class population loss Lp(t̂*_H) = 0.42;
- ERM population loss Lp(t̂_D^ERM) = 0.58.
We also have the decomposition of the optimality error

Lp(t̂_D^ERM) − Lp(t̂*) = (Lp(t̂*_H) − Lp(t̂*)) + (Lp(t̂_D^ERM) − Lp(t̂*_H)),
i.e.,  0.58 − 0.3 = [bias = 0.12] + [estimation error = 0.16].
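These numbers can be verified mechanically from the formulas on the previous slides:

```python
p, q = 0.3, 0.7
L_star = 1 - q                        # unconstrained optimum t_hat*(x) = x
L_class = (1 - p) * (1 - q) + p * q   # best constant predictor t_hat = 0
L_erm = (1 - p) * q + p * (1 - q)     # the ERM choice t_hat = 1 on that data set
bias = L_class - L_star
est_err = L_erm - L_class
# The two gaps add up to the optimality error L_erm - L_star.
```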
Probably Approximately Correct (PAC) Learning
Statistical Learning Theory

Statistical learning theory studies the estimation error Lp(t̂_D) − Lp(t̂*_H), which depends on how the data is used.
The bias Lp(t̂*_H) − Lp(t̂*) is generally difficult to quantify, since it depends on the population-optimal unconstrained predictor t̂*.
- In practice, if the loss Lp(t̂*_H) is too large, one needs to increase the capacity of the model class or choose a different class of models that is more suitable for the problem.
If the estimation error is small, the training algorithm obtains close to the best possible within-class population loss:
- How much data is needed to ensure a small estimation error?
The estimation error is random, since it depends on the training set D.
- We first need to specify what we mean by a "small" estimation error.
Approximately and Probably

Due to the randomness of D, a learning rule t̂_D can only minimize the generalization loss Lp(t̂)
- approximately, i.e., Lp(t̂_D) ≤ Lp(t̂*_H) + ε for some ε > 0;
- and with probability at least 1 − δ for some δ > 0:
  - with probability no larger than δ, we may have Lp(t̂_D) > Lp(t̂*_H) + ε.
Probably Approximately Correct (PAC) Learning Rule

When operating on data sets D of N examples, a training algorithm A that produces a predictor t̂_D^A(·) is (N, ε, δ) PAC for a model class H, loss function ℓ, and set of population distributions p(x, t) if it has an estimation error no larger than the accuracy parameter ε,

Lp(t̂_D^A) ≤ Lp(t̂*_H) + ε,

with probability no smaller than 1 − δ for the confidence parameter δ > 0; that is, if we have

Pr_{D ∼ i.i.d. p(x,t)}[Lp(t̂_D^A) ≤ Lp(t̂*_H) + ε] ≥ 1 − δ

for any true distribution p(x, t) in the set.
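The PAC probability can be estimated empirically by drawing many data sets and checking how often ERM lands within ε of the best within-class loss. A sketch for the running binary example (the values of p, q, ε, and the trial counts are illustrative choices):

```python
import random

def pac_probability(eps, N, trials=2000, p=0.3, q=0.7, seed=0):
    """Fraction of i.i.d. data sets for which ERM over the constant-predictor
    class {0, 1} satisfies Lp(erm) <= Lp(best in class) + eps."""
    rng = random.Random(seed)
    pop_loss = {0: (1 - p) * (1 - q) + p * q,   # population loss of t_hat = 0
                1: (1 - p) * q + p * (1 - q)}   # population loss of t_hat = 1
    best = min(pop_loss.values())
    hits = 0
    for _ in range(trials):
        data = []
        for _ in range(N):
            x = int(rng.random() < p)
            t = x if rng.random() < q else 1 - x   # label flipped w.p. 1 - q
            data.append((x, t))
        # ERM over the two constants: pick the one with fewer training errors.
        erm = min((0, 1), key=lambda c: sum(t != c for _, t in data))
        hits += pop_loss[erm] - best <= eps
    return hits / trials

conf_small_N = pac_probability(eps=0.05, N=5)
conf_large_N = pac_probability(eps=0.05, N=200)
```

At a fixed accuracy ε, the achievable confidence 1 − δ grows with N, as the PAC definition anticipates.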
Probably Approximately Correct (PAC) Learning Rule

The PAC requirement is a worst-case constraint in the sense that it imposes the inequality

min_{p(x,t)} Pr_{D ∼ i.i.d. p(x,t)}[Lp(t̂_D^A) ≤ Lp(t̂*_H) + ε] ≥ 1 − δ,

where the minimum is taken over all possible population distributions in the set of interest.
The set of population distributions is typically taken to include all possible population distributions.
Sample Complexity

For a training algorithm A, the amount of data needed to achieve a certain accuracy-confidence level (ε, δ) is known as the sample complexity.
For a model class H, loss function ℓ, and class of population distributions p(x, t), a training algorithm A has sample complexity N_H^A(ε, δ) if, for the given ε, δ ∈ (0, 1), it is (N, ε, δ) PAC for all N ≥ N_H^A(ε, δ).

[Figure: population loss as a function of the data set size N.]
Is ERM PAC?

Intuitively, ERM is PAC by the law of large numbers.
The law of large numbers says that, for i.i.d. rvs u_1, u_2, ..., u_M ∼ p(u) such that E[u_m] = µ, the empirical mean (1/M) Σ_{m=1}^M u_m tends to the ensemble mean µ as M → ∞:
- We write this as

(1/M) Σ_{m=1}^M u_m → µ for M → ∞ in probability;

- which means that, for any ε > 0, we have the limit

Pr[ |(1/M) Σ_{m=1}^M u_m − µ| > ε ] → 0 for M → ∞.
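The law of large numbers is easy to observe numerically; the uniform distribution below is an arbitrary illustrative choice:

```python
import random

def deviation_probability(M, eps=0.1, trials=1000, seed=0):
    """Empirical Pr[|(1/M) sum of M Uniform(0,1) draws - 0.5| > eps]."""
    rng = random.Random(seed)
    mu = 0.5  # mean of Uniform(0, 1)
    exceed = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(M)) / M
        exceed += abs(mean - mu) > eps
    return exceed / trials

p_small_M = deviation_probability(M=5)
p_large_M = deviation_probability(M=500)  # deviation probability shrinks with M
```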
Is ERM PAC?

By the law of large numbers, we hence have

L_D(t̂) = (1/N) Σ_{n=1}^N ℓ(t_n, t̂(x_n)) → Lp(t̂) in probability

separately for each predictor t̂.
Hence, intuitively, the ERM solution t̂_D^ERM, which minimizes L_D(t̂), should with high probability also approximately minimize Lp(t̂), i.e., be close to t̂*_H, as the data set size N increases.
But how large does N need to be, i.e., what is the sample complexity?
Example

Consider the model class of threshold functions

H = { t̂_θ(x) = 1(x ≥ θ) },  i.e.,  t̂_θ(x) = 0 if x < θ and t̂_θ(x) = 1 if x ≥ θ,

where x and θ are real numbers (D = 1).
Assume that the population distribution p(x, t) is given as

p(x, t) = p(x) 1(t = t̂_0(x))

for some p(x).
Since the conditional p(t|x) of the population distribution is included in H, we say that the class of probability distributions is "feasible" for H.
For the detection loss, the optimal predictor is t̂* = t̂*_H = t̂_0, i.e., the bias is zero and the population-optimal threshold (both unconstrained and within-class) is θ* = 0.
Example

Assume that p(x) is uniform in the interval [−0.5, 0.5]. The population loss is shown in the figure as a function of θ as a dashed line.
With the data in the figure, ERM may return any value of θ in the interval between the two closest positive and negative samples.
With N large enough, we see that t̂_D^ERM tends to the optimal predictor t̂_0.

[Figure: population loss (dashed) versus θ ∈ [−0.5, 0.5], with training data.]
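This behavior can be reproduced with a direct ERM implementation for the threshold class. The candidate-midpoint search below is one simple way to minimize the training loss exactly (an illustrative sketch, not a construction from the slides):

```python
import random

def erm_threshold(data):
    """ERM for t_hat_theta(x) = 1(x >= theta): the training loss is piecewise
    constant in theta, so it suffices to check midpoints between sorted samples
    (plus one candidate below and one above all samples)."""
    xs = sorted(x for x, _ in data)
    cands = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    def train_errors(theta):
        return sum(t != (x >= theta) for x, t in data)
    return min(cands, key=train_errors)

rng = random.Random(0)
# Population of the example: x uniform in [-0.5, 0.5], t = 1(x >= 0), theta* = 0.
xs = [rng.uniform(-0.5, 0.5) for _ in range(500)]
data = [(x, int(x >= 0)) for x in xs]
theta_hat = erm_threshold(data)  # close to theta* = 0 for N = 500
```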
Sample Complexity of ERM for Finite Model Classes
Finite Model Classes

To address the sample complexity of ERM, we will first consider model classes with a finite number of models.
For example, consider the class of threshold classifiers

H = { t̂_θ(x) = 1(x ≥ θ) : θ ∈ {θ_1, ..., θ_|H|} },

where θ can only take a finite set of |H| values.
We write the number of models in H as |H| (the cardinality of the set H).
Capacity of a Finite Model Class

For a given finite model class H, we define the

model capacity = log(|H|) (nats)
               = log₂(|H|) (bits)

as the number of nats or bits required to index the hypotheses in H.
Intuitively, a larger model capacity entails a larger sample complexity.
This intuition can be made precise by the following theorem.
Sample Complexity of ERM for Finite Model Classes

Theorem: For the detection-error loss (or any other loss bounded in the interval [0, 1]) and any finite hypothesis class H, ERM is (N, ε, δ) PAC with estimation error

ε = √( (2 log(|H|) + log(1/δ)) / N ).

Equivalently, ERM achieves the estimation error

Lp(t̂_D^ERM) − Lp(t̂*_H) ≤ √( (2 log(|H|) + log(1/δ)) / N )

with probability no smaller than 1 − δ.
Sample Complexity of ERM for Finite Model Classes

The theorem above can be equivalently stated in terms of sample complexity.
Theorem: For the detection-error loss (or any other loss bounded in the interval [0, 1]) and any finite hypothesis class H, ERM has sample complexity

N_H^ERM(ε, δ) = ⌈ (2 log(|H|) + log(1/δ)) / ε² ⌉.

So the sample complexity is
- proportional to the model capacity log |H|;
- proportional to the "number of confidence digits" log(1/δ);
- and inversely proportional to the square of the accuracy ε.
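The formula is simple to tabulate; the values of |H|, ε, and δ below are illustrative choices:

```python
from math import ceil, log

def erm_sample_complexity(num_models, eps, delta):
    """N_H^ERM(eps, delta) = ceil((2 log|H| + log(1/delta)) / eps^2),
    valid for losses bounded in [0, 1] as in the theorem above."""
    return ceil((2 * log(num_models) + log(1 / delta)) / eps ** 2)

# Halving the accuracy parameter roughly quadruples the data requirement.
n_eps_01 = erm_sample_complexity(num_models=1000, eps=0.10, delta=0.05)
n_eps_005 = erm_sample_complexity(num_models=1000, eps=0.05, delta=0.05)
```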
Proof of the Theorem (and Generalization Error)
Proof

The proof of the theorem is based on
- a decomposition of the estimation error that includes the generalization error Lp(t̂_D) − L_D(t̂_D)
  - the generalization error measures the difference between the population and training losses for the trained predictor t̂_D;
- the law of large numbers (in a stronger form known as Hoeffding's inequality);
- and the union bound.
Decomposition of the Estimation Error

The estimation error can be decomposed into the following contributions:

Lp(t̂_D) − Lp(t̂*_H)  [estimation error]
  = (Lp(t̂_D) − L_D(t̂_D))   [generalization error]
  + (L_D(t̂_D) − L_D(t̂*_H)) [training-based estimation error]
  + (L_D(t̂*_H) − Lp(t̂*_H)) [empirical average error].

As a visual aid, we have

Lp(t̂_D)  −  Lp(t̂*_H)
   ↓            ↑
L_D(t̂_D) →  L_D(t̂*_H)

All these gap terms have to be small with high probability with respect to the selection of the data set D.
Decomposition of the Estimation Error

By the law of large numbers, the empirical average error L_D(t̂*_H) − Lp(t̂*_H) goes to zero in probability as N → ∞. Note that the law of large numbers is applicable since t̂*_H is a fixed predictor that does not depend on D.
For ERM, the training-based estimation error satisfies L_D(t̂_D^ERM) − L_D(t̂*_H) ≤ 0, since ERM minimizes the training loss, and hence no other predictor in H can yield a smaller training loss.
For the generalization error Lp(t̂_D^ERM) − L_D(t̂_D^ERM), we cannot use the law of large numbers, since the ERM predictor t̂_D^ERM is a random variable dependent on D and not a fixed predictor.
Generalization Error

Given the discussion above, statistical learning theory is largely devoted to the study of the generalization error Lp(t̂_D) − L_D(t̂_D) for given training algorithms:
- If the generalization error is small, the training loss is a reliable estimate of the population loss.
The generalization error Lp(t̂_D) − L_D(t̂_D) generally depends on the "capacity" of the model class H, on the loss function ℓ, and on the assumptions made on the data distribution.
Hoeffding's Inequality

For i.i.d. rvs u_1, u_2, ..., u_M ∼ p(u) such that E[u_m] = µ and Pr[a ≤ u_m ≤ b] = 1 for some a ≤ b, we have the large-deviation inequality

Pr[ |(1/M) Σ_{m=1}^M u_m − µ| > ε ] ≤ 2 exp( −2Mε² / (b − a)² ).

Note that Hoeffding's inequality implies the limit Pr[ |(1/M) Σ_{m=1}^M u_m − µ| > ε ] → 0 for M → ∞, and hence it recovers the law of large numbers.
We now use this inequality to obtain bounds on the empirical average and generalization errors.
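Hoeffding's inequality can be checked numerically against the empirical deviation probability; the Bernoulli(0.5) choice below is illustrative:

```python
import random
from math import exp

def hoeffding_check(M, eps, trials=2000, seed=0):
    """Return (empirical deviation probability, Hoeffding bound) for the mean
    of M i.i.d. Bernoulli(0.5) draws, which are bounded in [a, b] = [0, 1]."""
    rng = random.Random(seed)
    mu = 0.5
    exceed = 0
    for _ in range(trials):
        mean = sum(rng.random() < mu for _ in range(M)) / M
        exceed += abs(mean - mu) > eps
    return exceed / trials, 2 * exp(-2 * M * eps ** 2)

empirical, bound = hoeffding_check(M=100, eps=0.1)  # empirical <= bound
```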
Empirical Average Error

Let us consider the empirical average error L_D(t̂*_H) − Lp(t̂*_H).
The training loss L_D(t̂*_H) can be written as (1/M) Σ_{m=1}^M u_m by setting u_n = ℓ(t_n, t̂*_H(x_n)) and M = N.
With this choice, we have µ = E[ℓ(t_n, t̂*_H(x_n))] = Lp(t̂*_H).
By assumption, the loss function is bounded between 0 and 1, and hence we can set a = 0 and b = 1.
Applying Hoeffding's inequality, we have, for any ε₁ > 0,

Pr[ L_D(t̂*_H) − Lp(t̂*_H) > ε₁ ] ≤ 2 exp(−2Nε₁²).
Generalization Error

For the generalization error Lp(t̂_D^ERM) − L_D(t̂_D^ERM) of ERM, we would similarly like to bound the probability

Pr[ Lp(t̂_D^ERM) − L_D(t̂_D^ERM) > ε₂ ].

This probability can be upper bounded by the probability that there is at least one predictor in H that satisfies the inequality, i.e.,

Pr[ Lp(t̂_D^ERM) − L_D(t̂_D^ERM) > ε₂ ] ≤ Pr[ ∃ t̂ ∈ H : Lp(t̂) − L_D(t̂) > ε₂ ]
  = Pr[ ∪_{t̂∈H} { Lp(t̂) − L_D(t̂) > ε₂ } ],

where the last equality follows from the interpretation of a union of events as a logical OR.
Generalization Error
Now, we can use first the union bound and then again Hoeffding's inequality to obtain

Pr[ ⋃_{t̂∈H} { Lp(t̂) − LD(t̂) > ε₂ } ] ≤ ∑_{t̂∈H} Pr[ Lp(t̂) − LD(t̂) > ε₂ ] ≤ 2|H| exp( −2Nε₂² ).
We note that, by this derivation, the bound Pr[ Lp(t̂D) − LD(t̂D) > ε₂ ] ≤ 2|H| exp( −2Nε₂² ) applies to any learning algorithm and not just to ERM. We will use this fact later.
Osvaldo Simeone ML4Engineers 38 / 65
Proof

Putting it all together, and using LD(t̂ERMD) ≤ LD(t̂∗H) (which holds by the definition of ERM), we obtain the inequality

Pr[ Lp(t̂ERMD) − Lp(t̂∗H) > ε ]   (estimation error)
≤ Pr[ (Lp(t̂ERMD) − LD(t̂ERMD)) + (LD(t̂∗H) − Lp(t̂∗H)) > ε ]   (generalization error + empirical average error)
≤ Pr[ Lp(t̂ERMD) − LD(t̂ERMD) > ε/2 ] + Pr[ LD(t̂∗H) − Lp(t̂∗H) > ε/2 ]
≤ 2(|H| + 1) exp( −Nε²/2 ).
The bound can be easily tightened to the value stated in the theorem by noting that the probability term for t̂∗H is counted twice. This is left as an exercise.
Osvaldo Simeone ML4Engineers 39 / 65
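The PAC guarantee can also be illustrated numerically. The sketch below uses an assumed population distribution (uniform inputs with noisy threshold labels, our choice, not from the slides), runs ERM over a small finite class of threshold classifiers on many independently drawn training sets, and compares the empirical probability of a large estimation error with the 2(|H| + 1) exp(−Nε²/2) bound:

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite class of threshold classifiers t_hat(x) = 1(x >= theta)
thetas = np.linspace(0.1, 0.9, 9)   # |H| = 9
N, eps, trials, noise = 2_000, 0.1, 1_000, 0.1

# Population loss under the assumed distribution: x ~ U(0,1),
# t = 1(x >= 0.5) with labels flipped w.p. `noise`
Lp = noise + (1 - 2 * noise) * np.abs(thetas - 0.5)
Lp_star = Lp.min()

excess = np.empty(trials)
for i in range(trials):
    x = rng.random(N)
    t = (x >= 0.5) ^ (rng.random(N) < noise)   # noisy labels
    preds = x[:, None] >= thetas               # (N, |H|) predictions
    train_losses = (preds != t[:, None]).mean(axis=0)
    erm = int(np.argmin(train_losses))         # ERM picks the best empirical threshold
    excess[i] = Lp[erm] - Lp_star

prob = (excess > eps).mean()
bound = 2 * (len(thetas) + 1) * np.exp(-N * eps**2 / 2)
print(f"Pr[estimation error > {eps}] ~ {prob:.4f}, bound = {bound:.6f}")
```

With these parameters the estimation error concentrates well below ε, consistently with the (typically conservative) bound.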
Sample Complexity of ERM for Continuous Model Classes
Osvaldo Simeone ML4Engineers 40 / 65
Continuous Model Classes
What does the theorem proved above say about infinite models such as linear classifiers or neural networks?

A simple approach would be to quantize the hypothesis class H, obtaining a finite class Hb, by representing each element of the model parameter vector θ with b bits.
Osvaldo Simeone ML4Engineers 41 / 65
Continuous Model Classes
Example: Consider a neural network with D weights:
I the number of hypotheses in the quantized model class Hb is |Hb| = (2^b)^D = 2^{bD}, and the capacity of the hypothesis class is log2(|Hb|) = bD (bits) or log(|Hb|) = bD log 2 (nats);
I by the theorem, the ERM sample complexity is

NERMHb(ε, δ) = ⌈ (2bD log(2) + 2 log(2/δ)) / ε² ⌉,

which scales proportionally to the number of parameters D and to the bit resolution b.
Osvaldo Simeone ML4Engineers 42 / 65
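The ceiling formula above can be evaluated directly; a minimal sketch (the helper name is ours):

```python
import math

def erm_sample_complexity(b: int, D: int, eps: float, delta: float) -> int:
    """N = ceil((2 b D log(2) + 2 log(2/delta)) / eps^2) for |H_b| = 2^(b D)."""
    return math.ceil((2 * b * D * math.log(2) + 2 * math.log(2 / delta)) / eps**2)

# Example: D = 1000 weights quantized to b = 32 bits, accuracy 0.1, confidence 1 - 0.05
print(erm_sample_complexity(b=32, D=1000, eps=0.1, delta=0.05))
# Doubling the bit resolution (or the number of weights) roughly doubles the requirement
print(erm_sample_complexity(b=64, D=1000, eps=0.1, delta=0.05))
```

Note how the log(2/δ) term is negligible next to the bD log(2) capacity term for any realistically sized network.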
Continuous Model Classes
The previous theorem seems to imply that, in order to learn a continuous model class, an infinite number of samples is required. This is because a continuous model is only described exactly if we set b → ∞.

This conclusion is generally not true, but the theory needs to be extended to introduce a more refined notion of capacity of a model class: the Vapnik–Chervonenkis (VC) dimension.
We will see that, in a nutshell, the result proved above still holds forgeneral model classes by substituting log(|H|) with the VC dimension.
We focus on binary classification.
Osvaldo Simeone ML4Engineers 43 / 65
Capacity of a Model Revisited
Consider again a finite class of models – what does it mean that its capacity is log2(|H|) bits?

To start with some background, we say that an information source has a capacity of b bits if it can produce any binary vector of b bits (possibly after remapping of the alphabet to bits).

Ex.: A source has capacity b = 2 bits if it can produce four messages, e.g., {00, 01, 10, 11}, four hand gestures, four words, etc.

Consider a data set of N inputs X = (x1, ..., xN). A given model t̂(·) ∈ H produces a set of N binary predictions (t̂(x1), ..., t̂(xN)).

We can now think of the model class as a source of capacity N bits if, for some data set X = (x1, ..., xN), it can produce all possible 2^N binary vectors (t̂(x1), ..., t̂(xN)) of predictions by running over all models t̂(·) ∈ H.
Osvaldo Simeone ML4Engineers 44 / 65
Capacity of a Model Revisited
For a finite model class, there are only |H| models to choose from, and hence the maximum capacity is N = log2(|H|).

Example: Assume that we only have two models in the class H = {t̂1(·), t̂2(·)}:
I unless the two models are equivalent, for some input x, we can produce both one-bit messages if t̂1(x) = 0 and t̂2(x) = 1 (or vice versa);
I but we can never produce all messages of two bits, since we can only choose among two models.
But what about the case of a continuous model?
Osvaldo Simeone ML4Engineers 45 / 65
Example
Consider the set of all linear binary classifiers on the plane, with decision boundary passing through the origin, i.e.,

H = { t̂θ(x) = 1(θᵀx ≥ 0) : θ ∈ ℝ² },

where t̂θ(x) = 0 if θᵀx < 0 and t̂θ(x) = 1 if θᵀx ≥ 0.

Clearly we can obtain all messages of size N = 1: we can label any, and hence also some, point x1 as either 0 or 1...
[Figure: two panels showing the decision line of θ through the origin, labeling the single point x1 either as t̂ = 1 (message: 1) or as t̂ = 0 (message: 0).]
Osvaldo Simeone ML4Engineers 46 / 65
Example
... and also N = 2: i.e., we can label some pair of points x1 and x2 with any pair of binary labels...
[Figure: four panels showing orientations of θ whose decision lines assign the label pairs (1,0), (0,1), (0,0), and (1,1) to the two points x1 and x2.]
Osvaldo Simeone ML4Engineers 47 / 65
Example
... but there is no data set of three data points for which we can obtain all eight messages of N = 3 bits.
Therefore, the capacity of this model is 2 bits.
Note that 2 is also the number of free parameters in this case.
We will now make this concept of capacity more formal through the introduction of the VC dimension.
Osvaldo Simeone ML4Engineers 48 / 65
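This capacity argument can be explored numerically. The sketch below (our illustration, not from the slides) sweeps many random directions θ and collects the label vectors that the class 1(θᵀx ≥ 0) can produce on a given set of points; it confirms that a particular pair of points receives all four labelings, and exhibits a particular triple for which some of the eight labelings are missing. The general claim that no triple can be shattered still requires the geometric argument above:

```python
import numpy as np

rng = np.random.default_rng(0)

def realizable_labelings(X, n_theta=20_000):
    """Set of label vectors (t(x1), ..., t(xN)) produced by the classifiers
    1(theta^T x >= 0) as theta sweeps over random directions."""
    thetas = rng.standard_normal((n_theta, 2))
    labels = (X @ thetas.T >= 0).T        # (n_theta, N) binary predictions
    return {tuple(row) for row in labels.astype(int)}

X2 = np.array([[1.0, 0.0], [0.0, 1.0]])              # two points: shattered
X3 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three points: not shattered

print(len(realizable_labelings(X2)))  # 4 = 2^2: all labelings realizable
print(len(realizable_labelings(X3)))  # < 8: e.g. (1, 1, 0) is unobtainable
```

For X3, the labeling (1, 1, 0) would require θᵀx1 ≥ 0 and θᵀx2 ≥ 0 but θᵀ(x1 + x2) < 0, which is impossible.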
VC Dimension
A hypothesis class H is said to shatter a set of inputs X = (x1, ..., xN) if, no matter how the corresponding labels (t1, ..., tN) are selected, there exists a hypothesis t̂ ∈ H that ensures t̂(xn) = tn for all n = 1, ..., N.

This is the same idea explained above: The set of inputs X = (x1, ..., xN) is shattered by H if the models in H can produce all possible 2^N messages of N bits when applied to X.

The VC dimension VCdim(H) (measured in bits) of the model H is the size of the largest set X that can be shattered by H.

The VC dimension VCdim(H) is hence the capacity of the model as explained above.
Osvaldo Simeone ML4Engineers 49 / 65
VC Dimension
Based on the definitions above, to prove that a model has VCdim(H) = N, we need to carry out the following two steps:
I Step 1) Demonstrate the existence of a set X with |X| = N that is shattered by H; and
I Step 2) Prove that no set X of size N + 1 exists that is shattered by H.

For finite classes, we have the inequality VCdim(H) ≤ log2(|H|), since |H| hypotheses can create at most |H| different label configurations.
Osvaldo Simeone ML4Engineers 50 / 65
Examples
The threshold function model class

H = { t̂θ(x) = 1(x ≥ θ) : θ ∈ {θ1, ..., θ|H|} },

with x ∈ ℝ, has VCdim(H) = 1:
I Step 1) any set X of one sample (N = 1) can be shattered – and hence there is clearly at least one such point;
I Step 2) there are no sets of N = 2 points that can be shattered:
F for any set X = (x1, x2) of two points with x1 ≤ x2, the label assignment (t1, t2) = (1, 0) cannot be realized by any choice of the threshold θ.
Osvaldo Simeone ML4Engineers 51 / 65
Examples
The model

H = { t̂a,b(x) = 1(a ≤ x ≤ b) : a ≤ b },

which assigns the label t = 1 within an interval [a, b] and the label t = 0 outside it, has VCdim(H) = 2:
I Step 1) any set of N = 2 distinct points can be shattered – and hence there also exists one such set;
I Step 2) there are no sets X of N = 3 points that can be shattered:
F for any set X = (x1, x2, x3) of three points with x1 ≤ x2 ≤ x3, the label assignment (t1, t2, t3) = (1, 0, 1) cannot be realized, since an interval containing x1 and x3 must also contain x2.
I It can also be proved that the linear classifier in dimension D has VCdim(H) = D + 1.
Osvaldo Simeone ML4Engineers 52 / 65
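The two steps can be checked by brute force for the interval class; the sketch below (our illustration) enumerates all labelings of a point set and searches candidate intervals over a finite grid, which suffices here since only the relative order of the points matters:

```python
import itertools

def can_shatter(points):
    """Check whether the interval class {1(a <= x <= b)} realizes every
    labeling of the given 1D points (brute force over candidate intervals)."""
    points = sorted(points)
    # Candidate endpoints: the points themselves, midpoints, and outer sentinels
    grid = [points[0] - 1] + [(p + q) / 2 for p, q in zip(points, points[1:])] + [points[-1] + 1]
    grid = points + grid
    for labels in itertools.product([0, 1], repeat=len(points)):
        ok = any(all((1 if a <= x <= b else 0) == t
                     for x, t in zip(points, labels))
                 for a in grid for b in grid if a <= b)
        if not ok:
            return False
    return True

print(can_shatter([0.0, 1.0]))        # True: VCdim >= 2
print(can_shatter([0.0, 1.0, 2.0]))   # False: (1, 0, 1) is not realizable
```

The failing case is exactly the (1, 0, 1) assignment from Step 2: any interval containing the two outer points also contains the middle one.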
VC Dimension

Theorem: Under the detection-error loss, for a model class H with finite VCdim(H) < ∞, ERM is (N, ε, δ) PAC with accuracy

ε = √( C(2VCdim(H) + log(1/δ))/N )

for some constant C > 0.

Therefore, ERM achieves the estimation error

Lp(t̂ERMD) − Lp(t̂∗H) ≤ √( C(2VCdim(H) + log(1/δ))/N )

with probability no smaller than 1 − δ.

Equivalently, the sample complexity of ERM is

NERMH(ε, δ) = ⌈ C(2VCdim(H) + log(1/δ))/ε² ⌉.

This theorem generalizes the case of finite model classes by substituting VCdim(H) for log2(|H|).
Osvaldo Simeone ML4Engineers 53 / 65
Fundamental Theorem of Learning
So, we have seen that ERM achieves an estimation error that scales with the ratio VCdim(H)/N, i.e., a sample complexity proportional to VCdim(H). Is there any training algorithm that is more efficient?
Theorem: For a model H with finite VCdim(H) < ∞, the sample complexity of any training algorithm A is lower bounded as

NAH(ε, δ) ≥ D(VCdim(H) + log(1/δ))/ε²

for some constant D > 0.

The theorem demonstrates that, if learning is possible for a given model H, then ERM allows us to learn with close-to-optimal sample complexity.
Osvaldo Simeone ML4Engineers 54 / 65
Structural Risk Minimization
Osvaldo Simeone ML4Engineers 55 / 65
Structural Risk Minimization
Assume that we need to select an inductive bias by choosing one of a nested set of hypothesis classes H1 ⊆ H2 ⊆ ... ⊆ HMmax.

An example is the problem studied in Chapter 4 of selecting the model order M ∈ {1, 2, ..., Mmax} for linear regression.

How do we choose M? We have seen that the standard approach is validation. SRM is an alternative approach that does not require setting aside validation data.
Osvaldo Simeone ML4Engineers 56 / 65
Structural Risk Minimization
Consider a training algorithm AM producing a predictor t̂AMD ∈ HM in class HM. Note that the algorithm, and its output, depend on the model order M.

From the bound on the generalization error derived above, we have the inequality

Lp(t̂AMD) ≤ LD(t̂AMD) + √( (log(|HM|) + log(2/δ))/(2N) )

with probability at least 1 − δ.

SRM minimizes this upper bound, which is a pessimistic estimate of the generalization loss, over the choice of the model order M. Note that the training loss LD(t̂AMD) generally decreases with M, while the model capacity log(|HM|) increases with it.

Assuming that the upper bound is reasonably tight, this allows one to avoid validation.
Osvaldo Simeone ML4Engineers 57 / 65
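A minimal sketch of the SRM rule follows; the training losses and capacities are hypothetical placeholders, since in practice the former come from running each algorithm AM and the latter from the chosen model classes:

```python
import numpy as np

def srm_select(train_losses, capacities, N, delta=0.05):
    """SRM: pick the model order minimizing the pessimistic estimate
    L_D + sqrt((log|H_M| + log(2/delta)) / (2N)).
    `capacities[m]` stands for log|H_M| of the m-th class."""
    train_losses = np.asarray(train_losses, dtype=float)
    capacities = np.asarray(capacities, dtype=float)
    penalties = np.sqrt((capacities + np.log(2 / delta)) / (2 * N))
    bounds = train_losses + penalties
    return int(np.argmin(bounds)), bounds

# Hypothetical nested classes: training loss decreases with M,
# while the capacity log|H_M| grows
train_losses = [0.30, 0.15, 0.10, 0.09, 0.088]
capacities = [2.0, 4.0, 8.0, 16.0, 32.0]
best, bounds = srm_select(train_losses, capacities, N=100)
print(best, np.round(bounds, 3))
```

With these placeholder numbers the rule selects an intermediate model order: the largest class has the smallest training loss, but its capacity penalty outweighs the improvement.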
Summary
Osvaldo Simeone ML4Engineers 58 / 65
Summary
In this chapter, we have discussed basic elements of statistical learning theory, with the main goal of addressing the following questions:
I Given an inductive bias, how many training examples (N) are needed to learn to a given level of generalization accuracy?
F Answer: The sample complexity of ERM depends on the capacity (or VC dimension) of the selected model class, as well as on accuracy and confidence parameters.
I Conversely, how should we choose the inductive bias in order to ensure generalization for a given data set size N?
F Answer: Structural risk minimization suggests choosing the model class H that minimizes LD(t̂AH) + √( (log(|H|) + log(2/δ))/(2N) ), where the second term is a bound on the generalization error.
Osvaldo Simeone ML4Engineers 59 / 65
Summary
The theory we have developed in this chapter bounds the estimation error by relating the generalization error to the capacity of the model class.
A different class of bounds can be obtained by studying the “sensitivity” of a training procedure to the training data set:
I a more sensitive training algorithm is expected to overfit more easily and hence to generalize less effectively.
PAC Bayes analysis follows this approach and provides generalization error bounds that are generally among the current state of the art (see Appendix).
Osvaldo Simeone ML4Engineers 60 / 65
Summary
Furthermore, in this chapter, we have considered a worst-case formulation of the learning objective in which, for the given fixed N, we are interested in the performance under the worst, i.e., population-loss maximizing, distribution p(x, t).

Other formulations apply to a fixed population distribution p(x, t) and offer bounds that either depend explicitly on it (e.g., information-theoretic bounds) or apply uniformly across all possible distributions p(x, t).
Osvaldo Simeone ML4Engineers 61 / 65
Appendix
Osvaldo Simeone ML4Engineers 62 / 65
PAC Bayes
The PAC Bayes formulation assumes a probabilistic choice t̂ ∼ q(t̂|D) for the predictor given the training set D (e.g., using Bayes rule).

The distribution q(t̂|D) is typically referred to as the posterior, although it may not correspond to a real posterior distribution.
Under this distribution, we define:
I the average population loss

Lp(q(t̂|D)) = E_{t̂∼q(t̂|D)} E_{(x,t)∼p(x,t)}[ ℓ(t, t̂(x)) ];

I and the average training loss

LD(q(t̂|D)) = E_{t̂∼q(t̂|D)}[ (1/N)∑_{n=1}^{N} ℓ(tn, t̂(xn)) ].
Osvaldo Simeone ML4Engineers 63 / 65
PAC Bayes
Theorem: Let p(t̂) be a prior distribution on the predictors in the hypothesis class H, and let q(t̂|D) be a posterior in the same space. The prior distribution must be selected before observing the data D, while the posterior can depend on it. Then, with probability no smaller than 1 − δ, for any δ ∈ (0, 1), the generalization error satisfies the inequality

Lp(q(t̂|D)) ≤ LD(q(t̂|D)) + √( (KL(q(t̂|D)||p(t̂)) + log(N/δ))/(2(N − 1)) )

for all population distributions p(x, t).
The term KL(q(t̂|D)||p(t̂)) measures the sensitivity of the trained model to the training set D.

The right-hand side of the inequality above can be used to define an optimization criterion for the posterior q(t̂|D) – this is known as information risk minimization.
Osvaldo Simeone ML4Engineers 64 / 65
PAC Bayes
If the prior is uniform and the hypothesis class is finite, i.e., if p(t̂) = 1/|H|, we obtain

KL(q(t̂|D)||p(t̂)) = E_{t̂∼q(t̂|D)}[ log q(t̂|D) ] + log(|H|) = −H(q(t̂|D)) + log(|H|) ≤ log(|H|),

and hence we have the inequality

Lp(q(t̂|D)) ≤ LD(q(t̂|D)) + √( (log(|H|) + log(N/δ))/(2(N − 1)) ).
This tends to the PAC bound derived above as N increases.
Osvaldo Simeone ML4Engineers 65 / 65
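The penalty terms of the two bounds can be compared numerically; the sketch below (with a hypothetical finite class of size |H| = 1000) shows that both decay roughly as 1/√N, with the PAC Bayes term carrying an extra log(N/δ) in place of log(2/δ):

```python
import numpy as np

def pac_bayes_penalty(kl, N, delta=0.05):
    # sqrt((KL + log(N/delta)) / (2 (N - 1)))
    return np.sqrt((kl + np.log(N / delta)) / (2 * (N - 1)))

def pac_penalty(log_H, N, delta=0.05):
    # sqrt((log|H| + log(2/delta)) / (2 N))
    return np.sqrt((log_H + np.log(2 / delta)) / (2 * N))

log_H = np.log(1000.0)   # uniform prior over a finite class: KL <= log|H|
for N in (100, 1_000, 10_000, 100_000):
    print(N,
          round(float(pac_bayes_penalty(log_H, N)), 4),
          round(float(pac_penalty(log_H, N)), 4))
```

A posterior more concentrated than the uniform prior (KL < log|H|) tightens the PAC Bayes penalty further, which is the sense in which it can improve on the finite-class PAC bound.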