
Lecture 10: PAC learning

Introduction to Learning and Analysis of Big Data

Kontorovich and Sabato (BGU)

Sample complexity analysis

We mentioned some relationships between sample complexity and learning:

- Larger sample size ⟹ smaller overfitting.
- A richer class H ⟹ larger sample complexity.
- ERM on linear predictors: sample complexity is O(d).
- Hard-SVM: sample complexity is O(min(d, 1/γ∗²)).

Why does this relationship hold?

What properties of H make it “richer” or “simpler” to learn?

What is the actual statistical complexity of learning with H?

We want answers that hold for all distributions D over X × Y.

The question

Given a hypothesis class H, how many examples are needed to guarantee that an ERM algorithm over H will output a low-error predictor?


Simplifying assumptions

Assume that D is realizable by H.

Definition

D is realizable by H if there exists some h∗ ∈ H such that err(h∗,D) = 0.

For any x with non-zero probability in D, its label must be h∗(x).

So, for any training sample S ∼ D^m, err(h∗, S) = 0.

Suppose we run an ERM with H on S. Then

h_S ∈ argmin_{h∈H} err(h, S).

For any S ∼ D^m, err(h_S, S) = 0.

But h_S could be different from h∗.
- E.g. thresholds: the exact threshold is not known from S.

Can we guarantee that err(h_S, D) = 0?


What can we guarantee?

We cannot guarantee that for all training samples, err(h_S, D) = 0:

- There is a chance that S turns out very different from D.

- Even if S is quite good, we cannot always find h∗ exactly.

So, we will only require that for almost all samples, err(h_S, D) is low.
- Set a confidence parameter δ ∈ (0, 1). Allow a δ-fraction of the samples to cause the algorithm to choose a very bad h_S.

- Set an error parameter ε ∈ (0, 1). Require the rest of the (non-bad) samples to have err(h_S, D) ≤ ε.

- We want a training sample size m that will guarantee:

P_{S∼D^m}[err(h_S, D) ≤ ε] ≥ 1 − δ

(ε, δ)-sample complexity: The sample size m that is needed to get error ε with probability 1 − δ for any distribution.


What can we guarantee?

Assume that D is realizable by H, err(h∗,D) = 0.

Run an ERM algorithm. Get h_S such that err(h_S, S) = 0.

The algorithm may select any h ∈ H with err(h, S) = 0.

We need to guarantee that for any h ∈ H that the algorithm might select, err(h, D) ≤ ε.

S is a good sample if:

All h ∈ H with err(h,D) > ε have err(h,S) > 0.

We need to find a sample size m such that:

For any distribution D which is realizable by H,

P_{S∼D^m}[S is a good sample] ≥ 1 − δ.


What can we guarantee?

Fix some “bad” h_bad ∈ H, with err(h_bad, D) > ε.

What is the probability that err(h_bad, S) = 0 (i.e., h_bad “looks good”)?

err(h_bad, S) = 0 iff for all (x, y) ∈ S, h_bad(x) = y.

We have: P_{(X,Y)∼D}[h_bad(X) ≠ Y] = err(h_bad, D) > ε.

S ∼ D^m consists of independent random pairs from D, so

P[err(h_bad, S) = 0] = P_{S∼D^m}[∀(x, y) ∈ S, h_bad(x) = y]
                     = (P_{(X,Y)∼D}[h_bad(X) = Y])^m
                     ≤ (1 − ε)^m
                     ≤ e^{−εm}.

But we need that no bad h looks good on S.
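A quick numerical sanity check of this bound (a sketch, not from the slides; the values p = 0.15, ε = 0.1, m = 30 are arbitrary assumptions): each example independently disagrees with a fixed h_bad with probability p = err(h_bad, D) > ε, and we estimate how often h_bad looks perfect on a sample of size m.

import math
import random

def prob_bad_h_looks_good(p, m, trials=200_000, seed=0):
    """Estimate P[err(h_bad, S) = 0] when err(h_bad, D) = p and |S| = m."""
    rng = random.Random(seed)
    looks_good = 0
    for _ in range(trials):
        # each of the m examples agrees with h_bad independently with prob. 1 - p
        if all(rng.random() >= p for _ in range(m)):
            looks_good += 1
    return looks_good / trials

p, eps, m = 0.15, 0.1, 30
print(prob_bad_h_looks_good(p, m))   # ≈ (1 - p)^m ≈ 0.0076
print((1 - eps) ** m)                # ≈ 0.042, the bound (1 - ε)^m
print(math.exp(-eps * m))            # ≈ 0.050, the bound e^{-εm}

The estimate sits below both bounds, as the derivation predicts.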


Guarantee for a finite hypothesis class

Assume a finite H = {h_1, h_2, . . . , h_k}.
- If not finite (e.g. thresholds), can usually discretize.

[Figure: the set of all samples S ∼ D^m, with small circles marking the events err(h_1, S) = 0, err(h_7, S) = 0, and err(h_18, S) = 0.]

Suppose h_1, h_7, h_18 have err(h, D) > ε.

For samples outside the small circles, ERM cannot select a bad h.

Probability mass of the circle for h_i:

p_i := P_{S∼D^m}[err(h_i, S) = 0] ≤ e^{−εm}.

Size outside the small circles: at least

1 − ∑_{i: err(h_i, D) > ε} p_i.

This is an application of the union bound: P[A or B] ≤ P[A] + P[B].


Guarantee for a finite hypothesis class

Probability that ERM selects a bad h:

P_{S∼D^m}[err(h_S, D) > ε] ≤ P[∃h ∈ H s.t. err(h, D) > ε and err(h, S) = 0]
                           ≤ ∑_{h: err(h,D) > ε} P[err(h, S) = 0]
                           ≤ ∑_{h: err(h,D) > ε} e^{−εm}
                           ≤ |H| e^{−εm}.

Our confidence parameter is δ, so we want

P_{S∼D^m}[err(h_S, D) > ε] ≤ δ.

If m ≥ (log(|H|) + log(1/δ)) / ε, then

P_{S∼D^m}[err(h_S, D) > ε] ≤ |H| e^{−εm} ≤ δ.
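This condition is easy to turn into a concrete number. A minimal helper (a sketch, not part of the lecture; the values |H| = 1000, ε = 0.1, δ = 0.05 are arbitrary):

import math

def realizable_sample_size(h_size, eps, delta):
    """Smallest integer m with m >= (log|H| + log(1/delta)) / eps."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g. |H| = 1000, ε = 0.1, δ = 0.05  ->  m = 100
print(realizable_sample_size(1000, 0.1, 0.05))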


Probably Approximately Correct learning

Theorem

Let ε, δ ∈ (0, 1). For any finite hypothesis class H, and any distribution D over X × Y which is realizable by H, if the training sample size m satisfies

m ≥ (log(|H|) + log(1/δ)) / ε

then any ERM algorithm with training sample size m gets an error of at most ε, with a probability of at least 1 − δ over the random training samples.

The ERM alg. Probably finds an Approximately Correct hypothesis.

This is called PAC-learning.

“With high probability” (w.h.p.) ≡ with probability at least 1− δ.


Probably Approximately Correct learning

The (ε, δ)-sample complexity for learning H in the realizable setting is at most

(log(|H|) + log(1/δ)) / ε.

For better accuracy (= lower ε), need linearly more samples.

For higher confidence (= lower δ), need logarithmically more samples.

If H is larger, need more examples for same confidence and accuracy!

What happens if H includes all possible functions?

Overfitting: err(h_S, S) ≪ err(h_S, D).

With probability 1 − δ,

err(h_S, D) − err(h_S, S) ≤ (log(|H|) + log(1/δ)) / m.

Larger sample size, smaller H ⟹ less overfitting.


Example: Which diet allows living beyond 90?

Given a person’s diet, predict whether they will live beyond 90.

Suppose we consider d possible foods.

A person’s diet is encoded as a binary vector describing which foods they eat: X = {0, 1}^d.

Require 95% probability of getting prediction error less than 10%.

- δ = 0.05, ε = 0.1.

Sufficient training sample size: m ≥ (log(|H|) + log(1/δ)) / ε.

Set H = Boolean conjunctions of some features (foods) or their negations.

- E.g. h(x) = ¬x(2) ∧ x(14) ∧ x(17) ∧ ¬x(32).

Then |H| = 3^d.

Sufficient sample size: m ≥ (d log(3) + log(1/0.05)) / 0.1 ≈ 11d + 30.

Smaller d means m can be smaller, but the analysis holds only if D remains realizable by H.
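A quick numerical check of the ≈ 11d + 30 approximation (a sketch; the values of d are arbitrary):

import math

def conjunction_sample_size(d, eps=0.1, delta=0.05):
    """Realizable bound for Boolean conjunctions over d foods: |H| = 3^d."""
    return math.ceil((d * math.log(3) + math.log(1 / delta)) / eps)

for d in (10, 50, 100):
    print(d, conjunction_sample_size(d), 11 * d + 30)
# prints 140 vs 140, 580 vs 580, 1129 vs 1130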


PAC learning in the agnostic setting

The agnostic setting

Make no assumptions on D. Given ε, δ ∈ (0, 1), require that with probability at least 1 − δ over S ∼ D^m,

err(h_S, D) ≤ inf_{h∈H} err(h, D) + ε.

Try to get close to the best rule in H.

If D happens to be realizable by H, the agnostic requirement is the same as the requirement in the realizable setting.

What sample size do we need in the agnostic setting with ERM?


Sample size for the agnostic setting

Suppose we run ERM on S ∼ D^m and get h_S.

In the agnostic setting, err(h_S, S) might be non-zero.

Can we guarantee that err(h_S, S) is close to err(h_S, D)?

Fix some h ∈ H. We will bound

|err(h, S) − err(h, D)| = |(1/m) ∑_{i=1}^{m} I[h(x_i) ≠ y_i] − P_{(X,Y)∼D}[h(X) ≠ Y]|.

Define Z_i = I[h(x_i) ≠ y_i].

Z_1, . . . , Z_m are statistically independent.

∀i ≤ m, P[Z_i = 1] = err(h, D).

Hoeffding’s inequality

Let Z_1, . . . , Z_m be independent random variables over {0, 1}, where for all i ≤ m, P[Z_i = 1] = p. Then

P[|(1/m) ∑_{i=1}^{m} Z_i − p| ≥ ε] ≤ 2 exp(−2ε²m).

Conclusion: for any fixed h ∈ H, P[|err(h, S) − err(h, D)| ≥ ε] ≤ 2 exp(−2ε²m).
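The conclusion can be sanity-checked by simulation. A minimal sketch (not from the slides; p = 0.3, m = 200, ε = 0.05 are arbitrary assumptions) draws Z_1, . . . , Z_m as i.i.d. Bernoulli(p) and compares the empirical deviation probability with the Hoeffding bound:

import math
import random

def deviation_probability(p, m, eps, trials=100_000, seed=0):
    """Estimate P[|mean(Z_1..Z_m) - p| >= eps] for i.i.d. Bernoulli(p) Z_i."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        errors = sum(rng.random() < p for _ in range(m))
        if abs(errors / m - p) >= eps:
            bad += 1
    return bad / trials

p, m, eps = 0.3, 200, 0.05
print(deviation_probability(p, m, eps))   # empirical deviation probability; well below the bound
print(2 * math.exp(-2 * eps**2 * m))      # Hoeffding bound ≈ 0.74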


Sample size for the agnostic setting

We showed that for any ε ∈ (0, 1), and any h ∈ H,

P[|err(h, S) − err(h, D)| ≥ ε] ≤ 2 exp(−2ε²m).

For the ERM algorithm, h_S ∈ argmin_{h∈H} err(h, S).

S is a good sample if for all h ∈ H, |err(h, S) − err(h, D)| ≤ ε/2.

Let h∗ ∈ argmin_{h∈H} err(h, D). For a good sample S:

err(h_S, D) ≤ err(h_S, S) + ε/2 ≤ err(h∗, S) + ε/2 ≤ err(h∗, D) + ε.

What is the probability that a sample S ∼ D^m is not good?

P[∃h ∈ H, |err(h, S) − err(h, D)| ≥ ε/2] ≤ |H| · 2 exp(−ε²m/2).

Set

m ≥ (2 log(|H|) + 2 log(2/δ)) / ε².

Then P_{S∼D^m}[S is a good sample] ≥ 1 − δ.
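As in the realizable case, this condition can be turned into a concrete number. A minimal sketch (the values |H| = 1000, ε = 0.1, δ = 0.05 are arbitrary and match the earlier realizable-case sketch):

import math

def agnostic_sample_size(h_size, eps, delta):
    """Smallest integer m with m >= (2 log|H| + 2 log(2/delta)) / eps^2."""
    return math.ceil((2 * math.log(h_size) + 2 * math.log(2 / delta)) / eps**2)

# |H| = 1000, ε = 0.1, δ = 0.05  ->  m = 2120
# (the realizable bound for the same values gave m = 100)
print(agnostic_sample_size(1000, 0.1, 0.05))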


Agnostic PAC learning guarantees

Theorem

Let ε, δ ∈ (0, 1). For any finite hypothesis class H, and any distribution D over X × Y, if the training sample size m satisfies

m ≥ (2 log(|H|) + 2 log(2/δ)) / ε²

then any ERM algorithm with training sample size m gets an error of at most inf_{h∈H} err(h, D) + ε, with probability at least 1 − δ over the random training samples.

Compare to sample size for ERM in the realizable case:

m ≥ (log(|H|) + log(1/δ)) / ε

Main difference: in the agnostic setting, the dependence on ε is stronger (1/ε² instead of 1/ε).


The Bias-Complexity tradeoff

Let H ⊆ H′, both finite.

Approximation error: err_app(H, D) := inf_{h∈H} err(h, D).

For all D, err_app(H, D) ≥ err_app(H′, D).

Estimation error: Let h_{S,H} be the output of ERM for H on S.

err_est(S, H, D) := err(h_{S,H}, D) − inf_{h∈H} err(h, D).

With probability 1 − δ,

err_est(S, H, D) ≤ √((2 log(|H|) + 2 log(2/δ)) / m).

A bound on overfitting: With probability 1 − δ,

|err(h_{S,H}, S) − err(h_{S,H}, D)| ≤ √((log(|H|) + log(2/δ)) / (2m)).

Bounds for H are smaller than bounds for H′.

Trade-off: approximation error vs. estimation error/overfitting.


Computational complexity of ERM

ERM with a hypothesis class H

Given a training sample S ∼ D^m, output h_S such that

h_S ∈ argmin_{h∈H} err(h, S).

We showed a bound on the statistical complexity of ERM in the realizable and agnostic cases.

What about the computational complexity?

Naive algorithm (finite H): calculate err(h,S) for all h ∈ H, choose smallest.

If H is infinite, discretize it, or try all possible labelings.

But even a finite H might be very large:
- H = Boolean conjunctions.

- Sample size is O(d).

- But |H| = 3^d, so the naive algorithm is O(3^d).


The Computational complexity of ERM

The true computational complexity of ERM depends on H.

Sometimes can be much better than enumerating H.

Example for realizable setting: H = Boolean conjunctions over d features.

ERM algorithm for Boolean conjunctions (realizable setting)

input: a training sample S
output: a function h_S : X → Y
  X_pos ← {x | (x, 1) ∈ S}
  Start with the conjunction of all 2d literals (h always returns 0)
  for x ∈ X_pos do
    for i = 1 to d do
      if x(i) is positive then
        remove the negation of feature i from the conjunction
      else
        remove feature i from the conjunction
    end for
  end for
  return the final conjunction
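A runnable Python sketch of the pseudocode above (an illustration, not the lecture's code); a hypothesis is represented as a set of literals, with (i, True) standing for x(i) and (i, False) for ¬x(i).

def erm_conjunction(sample, d):
    """sample: list of (x, y) pairs, x a tuple of d bits, y in {0, 1} (realizable case)."""
    # start with all 2d literals: the conjunction is unsatisfiable, so h always returns 0
    literals = {(i, b) for i in range(d) for b in (True, False)}
    for x, y in sample:
        if y != 1:
            continue                           # only positive examples shrink the conjunction
        for i in range(d):
            if x[i] == 1:
                literals.discard((i, False))   # drop the negated literal ¬x(i)
            else:
                literals.discard((i, True))    # drop the positive literal x(i)
    def h(x):
        return int(all((x[i] == 1) == positive for i, positive in literals))
    return h

# toy usage: the (assumed) target is x(0) ∧ ¬x(2) over d = 3 features
S = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 0, 1), 0)]
h_S = erm_conjunction(S, d=3)
print([h_S(x) for x, _ in S])   # [1, 1, 0, 0] -- zero error on S

On a realizable sample the returned conjunction is consistent with S: the literals of h∗ are never removed, so every negative example is still rejected, and every positive example is accepted by construction.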


Computational complexity of ERM

This ERM algorithm for Boolean conjunctions is linear in d.
- For the agnostic (non-realizable) setting: NP-hard.

Some hypothesis classes don’t have an efficient ERM algorithm, even in the realizable setting.

H = 3-DNF: all disjunctions of 3 Boolean conjunctions:

h(x) := A_1(x) ∨ A_2(x) ∨ A_3(x), where the A_i(x) are Boolean conjunctions.

- |H| ≤ 3^{3d}. Sufficient sample size: (log(|H|) + log(1/δ))/ε ≤ (3d log(3) + log(1/δ))/ε.

- But no ERM algorithm polynomial in d, unless RP = NP.

For 3-DNF, there is a trick.
- There is a class H′ which contains the class 3-DNF and has an efficient ERM algorithm.

- H′ is richer than 3-DNF: higher sample complexity.

- Tradeoff between statistical complexity and computational complexity!


Computational complexity in agnostic setting

Many hypothesis classes have an efficient algorithm in the realizable setting, but not in the agnostic setting.

- Recall linear predictors.

A possible solution:
- Try to find an h_S ∈ H with a low err(h_S, S).

- If m ≥ (log(|H|) + log(2/δ))/(2ε²), then with probability 1 − δ,

∀h ∈ H, |err(h, S) − err(h, D)| ≤ ε.

- Use any heuristic to find an h_S.

- Get the guarantee err(h_S, D) ≤ err(h_S, S) + ε.

- No guarantee on the distance from min_{h∈H} err(h, D).

- Soft-SVM is based on the same idea (but with an infinite class).


A Heuristic learning algorithm for Boolean conjunctions

Boolean conjunctions: ERM is NP-hard in the agnostic case.

A greedy heuristic (a runnable sketch follows the list):
- Start with a function h that is always true.
- In each iteration t:
  - Add to h the literal that would decrease err(h, S) the most.
  - Stop when no literal decreases the error anymore.
- Return the last h.
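A minimal Python sketch of this greedy heuristic (an illustration with an arbitrary toy sample; it reuses the literal representation from the ERM sketch above, (i, True) for x(i) and (i, False) for ¬x(i)):

def empirical_error(literals, sample):
    def h(x):
        return int(all((x[i] == 1) == positive for i, positive in literals))
    return sum(h(x) != y for x, y in sample) / len(sample)

def greedy_conjunction(sample, d):
    literals = set()                              # empty conjunction: h is always true
    best = empirical_error(literals, sample)
    while True:
        candidates = [(i, b) for i in range(d) for b in (True, False)
                      if (i, b) not in literals]
        if not candidates:
            break
        err, lit = min((empirical_error(literals | {lit}, sample), lit)
                       for lit in candidates)
        if err >= best:                           # no literal decreases the error: stop
            break
        literals.add(lit)
        best = err
    return literals, best

# toy sample over d = 2 features with one noisy label
S = [((1, 1), 1), ((1, 0), 1), ((0, 1), 0), ((0, 0), 0), ((1, 1), 0)]
print(greedy_conjunction(S, d=2))   # ({(0, True)}, 0.2): the conjunction x(0), training error 0.2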

No guarantee that err(h_S, S) is close to min_{h∈H} err(h, S).

No guarantee that err(h_S, S) is low.

But if m ≥ (log(|H|) + log(2/δ))/(2ε²), then with high probability,

err(h_S, D) ≤ err(h_S, S) + ε.


Infinite hypothesis classes

For a finite H:

Realizable setting: err(h_S, D) ≤ ε with probability 1 − δ if

m ≥ (log(|H|) + log(1/δ)) / ε.

Agnostic setting: err(h_S, D) ≤ inf_{h∈H} err(h, D) + ε with probability 1 − δ if

m ≥ (2 log(|H|) + 2 log(2/δ)) / ε².

Required sample size depends on log(|H|).

What if H is infinite?


Infinite hypothesis classes

General H, could be infinite.

Need a property of H that measures its sample complexity.

VC(H): the VC-dimension of H.

- VC(H) is the size of the largest set of examples that can be labeled in all possible label combinations using hypotheses from H.

- VC(H) measures how much “variation” exists in the functions in H.

- For linear predictors: VC(H) = d.

Sample complexity bounds for infinite H use VC(H) instead of log(|H|).

Think of 2^{VC(H)} as the “effective size” of H.

2^{VC(H)} ≤ |H| for all H.
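A small illustration of the definition (a hypothetical helper, not from the slides): for threshold predictors h_t(x) = I[x ≥ t] on the real line, every single point is shattered but no pair of points is, so the VC dimension is 1.

from itertools import product

def shattered_by_thresholds(points):
    """True iff every labeling of `points` is realized by some h_t(x) = I[x >= t]."""
    # thresholds at each point, plus one above all points, realize every achievable labeling
    candidate_ts = sorted(points) + [max(points) + 1.0]
    for labels in product((0, 1), repeat=len(points)):
        if not any(all(int(x >= t) == y for x, y in zip(points, labels))
                   for t in candidate_ts):
            return False
    return True

print(shattered_by_thresholds([3.0]))        # True: a single point is shattered, so VC >= 1
print(shattered_by_thresholds([1.0, 2.0]))   # False: the labeling (1, 0) is not realizable, so VC = 1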

There are classes with an infinite VC-dimension:

- Such classes are not learnable for a general distribution D.
- If a class is not learnable, there is no sample size that would guarantee (ε, δ) PAC-learning for all distributions D.
- All finite classes are learnable.
- Some infinite classes are learnable.
- If X is infinite, the class of all functions H = Y^X is not learnable.


PAC learning: Summary

PAC-learning addresses distribution-free learning with a given hypothesis class.

PAC analysis provides bounds on sample complexity and on overfitting.

The sample complexity of ERM is near-optimal among distribution-free algorithms.

But for many problems there is no efficient ERM algorithm.

Approaches to get efficient algorithms:
- Use a heuristic to find a hypothesis with a low error on the sample.
- Change the hypothesis class to one for which ERM can be done efficiently.
- Change the goal: try to minimize a different loss that matches the task.
