Machine Learning 10-601: PAC Learning
TRANSCRIPT
Machine Learning 10-601
Tom M. Mitchell Machine Learning Department
Carnegie Mellon University
Today:
• Computational Learning Theory
• Probably Approximately Correct (PAC) learning theorem
• Vapnik-Chervonenkis (VC) dimension
Recommended reading: • Mitchell: Ch. 7
Slides largely pilfered from:
Computational Learning Theory
• What general laws constrain learning? What can we derive by analysis?
* See annual Conference on Computational Learning Theory
Computational Learning Theory
• What general laws constrain learning? What can we derive by analysis?
• A successful analysis for computation: the time complexity of a program.
– Predict running time as a function of input size (actually, we upper bound it)
– Can we predict error rate as a function of sample size? (or upper bound it?) This is sample complexity.
– What sort of assumptions do we need to make to do this?
* See annual Conference on Computational Learning Theory
Sample Complexity
How many training examples suffice to learn a target concept (e.g., a linear classifier, or a decision tree)? What exactly do we mean by this? Usually:
• There is a space of possible hypotheses H (e.g., binary decision trees)
• In each learning task, there is a target concept c (usually c is in H).
• There is some defined protocol by which the learner gets information.
Sample Complexity
How many training examples suffice to learn target concept?
1. If learner proposes instances as queries to teacher? - learner proposes x, teacher provides c(x)
2. If teacher (who knows c) proposes training examples? - teacher proposes sequence {<x1, c(x1)>, …, <xn, c(xn)>}
3. If some random process (e.g., nature) proposes instances, and teacher labels them? - instances x drawn according to P(X), teacher provides label c(x)
A simple setting to analyze
Problem setting:
• Set of instances X
• Set of hypotheses H
• Set of possible target functions C
• Set of m training instances drawn from P(X); the teacher provides a noise-free label c(x) for each
The learner outputs any h in H consistent with the data. Question: how big should m be (as a function of H) for the learner to succeed? But first - how do we define “success”?
Find c?
Find an h that approximates c?
The true error of h is the probability that it will misclassify an example drawn at random from P(X).
Usually we can’t compute this, but we might be able to prove things about it.
A simple setting to analyze
Problem setting:
• Set of instances X
• Set of hypotheses H
• Set of possible target functions C
• Set of m training instances drawn from P(X); the teacher provides a noise-free label c(x) for each
The learner outputs any h in H consistent with the data. Question: how big should m be (as a function of H) to guarantee that
errortrue(h) < ε ?
What if we happen to get a really bad set of examples?
A simple setting to analyze
Problem setting:
• Set of instances X
• Set of hypotheses H
• Set of possible target functions C
• Sequence of m training instances drawn from P(X); the teacher provides a noise-free label c(x) for each
The learner outputs any h in H consistent with the data. Question: how big should m be to ensure that
ProbD[ errortrue(h) < ε ] > 1- δ
This criterion is called “probably approximately correct” (pac learnability)
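The PAC criterion can be illustrated with a tiny simulation (my own construction, not from the slides): learn a threshold concept on [0,1] from m uniform samples, output any consistent threshold, and watch the worst observed true error shrink with m.

```python
import random

def true_error(theta_hat, theta=0.3):
    # h(x)=1 iff x > theta_hat; c(x)=1 iff x > theta.
    # Under uniform P(X) on [0,1], the disagreement region is the
    # interval between the two thresholds, so its probability is:
    return abs(theta_hat - theta)

def learn_threshold(m, theta=0.3, rng=random):
    # Draw m labeled examples and output a consistent threshold:
    # any value between the largest negative x and smallest positive x.
    xs = [rng.random() for _ in range(m)]
    pos = [x for x in xs if x > theta]
    neg = [x for x in xs if x <= theta]
    lo = max(neg, default=0.0)
    hi = min(pos, default=1.0)
    return (lo + hi) / 2  # consistent with every training example

random.seed(0)
for m in [10, 100, 1000]:
    errs = [true_error(learn_threshold(m)) for _ in range(200)]
    print(m, max(errs))  # worst-case error over 200 runs shrinks with m
```

With high probability (over the draw of the sample) the learned h is approximately correct: exactly the "probably approximately correct" criterion.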
Valiant’s proof
• Assume H is finite, and let h* be the learner’s output
• Fact: for ε in [0,1], (1−ε) ≤ e^−ε
• Consider a “bad” h, where errortrue(h) > ε
• Prob[ h consistent with x1 ] < (1−ε) ≤ e^−ε
• Prob[ h consistent with all of x1, …, xm ] < (1−ε)^m ≤ e^−εm
• Prob[ h* is “bad” ] = Prob[ errortrue(h*) > ε ] < |H| e^−εm (union bound)
• How big does m have to be?
Valiant’s proof
• Assume H is finite, and let h* be the learner’s output
• How big does m have to be? Let’s solve:
• Prob[ errortrue(h*) > ε ] < |H| e^−εm < δ

|H| e^−εm < δ
e^−εm < δ/|H|
−εm < ln(δ/|H|)
εm > ln(|H|/δ)
m > (1/ε) ln(|H|/δ) = (1/ε) ( ln|H| + ln(1/δ) )
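The final bound is easy to evaluate numerically; a minimal sketch (function name and the example numbers are my own):

```python
import math

def pac_sample_bound(H_size, eps, delta):
    """Smallest integer m satisfying m > (1/eps)(ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# e.g. |H| = 2**20 hypotheses, eps = 0.1, delta = 0.05:
print(pac_sample_bound(2**20, 0.1, 0.05))  # → 169
```

Note how mild the dependence on |H| is: a million-fold larger hypothesis space only adds ln(10^6) ≈ 14 to the numerator.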
Pac-learnability result
Problem setting:
• Set of instances X
• Set of hypotheses H
• Set of possible target functions C
• Set of m training instances drawn from P(X), with noise-free labels
If the learner outputs any h from H consistent with the data, and

m > (1/ε) ( ln|H| + ln(1/δ) )

then ProbD[ errortrue(h) < ε ] > 1 − δ

The bound is logarithmic in |H| and linear in 1/ε.
Stopped Tues 10/6
E.g., X=< X1, X2, ... Xn >
Each h ∈ H constrains each Xi to be 1, 0, or “don’t care”
In other words, each h is a rule such as:
If X2=0 and X5=1
Then Y=1, else Y=0
Q: How do we find a consistent hypothesis? A: Conjoin all constraints that are true of all positive examples
Notation: VSH,D = { h | h consistent with D}
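The "conjoin all constraints" learner described above (a Find-S-style algorithm; function names and the example data are my own) can be sketched as:

```python
def learn_conjunction(examples):
    """examples: list of (x, y) with x a tuple of 0/1 and y in {0,1}.
    Returns per-feature constraints 0, 1, or None ('don't care'):
    the most specific conjunction consistent with the positives."""
    pos = [x for x, y in examples if y == 1]
    if not pos:
        return None  # no positive examples: predict 0 everywhere
    h = list(pos[0])
    for x in pos[1:]:
        for i, v in enumerate(x):
            if h[i] != v:
                h[i] = None  # feature i disagrees across positives: drop it
    return h

def predict(h, x):
    if h is None:
        return 0
    return int(all(c is None or c == v for c, v in zip(h, x)))

# target: Y=1 iff X2=0 and X5=1 (1-indexed, as in the slide's example rule)
data = [((1, 0, 1, 0, 1), 1), ((0, 0, 0, 1, 1), 1), ((1, 1, 1, 0, 1), 0)]
print(learn_conjunction(data))  # → [None, 0, None, None, 1]
```

If the target concept really is a conjunction in H, this most-specific h is automatically consistent with the negative examples as well.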
Pac-learnability result
Problem setting: • Let X = {0,1}^n, H = C = {conjunctions of features}
• This is pac-learnable in polynomial time, with a sample complexity of
m ≥ (1/ε) ( n ln 3 + ln(1/δ) )
(since |H| = 3^n: each feature appears positive, negated, or not at all)
Comments on pac-learnability
• This is a worst-case theory
– We assumed the training and test distribution were the same
– We didn’t assume anything else about P(X) at all, except that it doesn’t change from train to test
• The bounds are not very tight empirically – you usually don’t need as many examples as the bounds suggest
• We did assume clean training data – this can be relaxed
Another pac-learnability result
Problem setting: • Let X = {0,1}^n, H = C = k-CNF = {conjunctions of disjunctions of up to k features}
– Example (each disjunction has at most k literals): (x1∨¬x7) ∧ (x3∨x5∨¬x6) ∧ (¬x1∨¬x2)
• This is pac-learnable in polynomial time, with a sample complexity of
m ≥ (1/ε) ( n^k ln 3 + ln(1/δ) )
• Proof: it’s equivalent to learning conjunctions in a new space X’ with n^k features, each corresponding to a length-k disjunction
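The reduction in the proof can be sketched directly (a sketch with my own names; this variant also includes negated literals, so it builds O((2n)^k) rather than n^k new features). Each new feature is the truth value of one short disjunction, after which the conjunction learner from before applies unchanged:

```python
from itertools import combinations

def expand_kcnf(x, k):
    """Map a 0/1 instance x to the truth values of all disjunctions
    of up to k literals (a feature or its negation) over x."""
    n = len(x)
    lits = [(i, s) for i in range(n) for s in (0, 1)]  # (index, wanted value)
    feats = {}
    for r in range(1, k + 1):
        for clause in combinations(lits, r):
            # clause is satisfied if any literal matches the instance
            feats[clause] = int(any(x[i] == s for i, s in clause))
    return feats

f = expand_kcnf((1, 0, 1), 2)
print(len(f))  # → 21 new features for n=3, k=2
```

A k-CNF over the original features is exactly a conjunction over these new features, so the earlier conjunction-learning result carries over.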
Pac-learnability result
Problem setting: • Let X = {0,1}^n, H = C = k-term DNF = {disjunctions of up to k terms, each a conjunction of any number of features}
– Example (k = 2): (x1∧x3∧x5∧¬x7) ∨ (¬x2∧x3∧x5∧¬x6)
• The sample complexity is polynomial, but it’s NP-hard to find a consistent hypothesis!
• Proof of sample complexity: any h in this H can be rewritten in our last H (conjunctions of length-k disjunctions), so this H is no larger than k-CNF.
In fact, k-term DNF is a subset of k-CNF…
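The containment can be checked mechanically: distributing ∨ over ∧ turns a k-term DNF into a conjunction of clauses, each choosing one literal from each term, so every clause has at most k literals. A brute-force truth-table check on a small case (my own example):

```python
from itertools import product

# 2-term DNF: (x1 ∧ x2) ∨ (x3 ∧ x4)
dnf = lambda x: (x[0] and x[1]) or (x[2] and x[3])

# Distributed 2-CNF: one literal drawn from each term per clause
cnf = lambda x: ((x[0] or x[2]) and (x[0] or x[3])
                and (x[1] or x[2]) and (x[1] or x[3]))

# the two formulas agree on every assignment of the 4 variables
assert all(bool(dnf(x)) == bool(cnf(x)) for x in product((0, 1), repeat=4))
print("equivalent on all 16 assignments")
```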
Pac-learnability result
Problem setting: • Let X = {0,1}^n, H = k-CNF, C = k-term DNF
• This is pac-learnable in polynomial time, with a sample complexity of
m ≥ (1/ε) ( n^k ln 3 + ln(1/δ) )
• I.e. k-term DNF is pac-learnable by k-CNF
An increase in sample complexity might be worth suffering if it decreases the time complexity
Pac-learnability result
Problem setting: • Let X = {0,1}^n, H = {boolean functions that can be written down in a subset of Python}
– E.g.: (x[1] and not x[3]) or (x[7]==x[9])
– Hn = { h in H | h can be written in n bits or less }
• Learning algorithm: return the simplest consistent hypothesis h.
– I.e.: if h in Hn, then for all j < n, no h’ in Hj is consistent
• This is pac-learnable (though not in polytime), with a sample complexity of
m ≥ (1/ε) ( n ln 2 + ln(1/δ) )
Pac-learnability result (Occam’s razor)
Problem setting: • Let X = {0,1}^n, H = C = F = {boolean functions that can be written down in a subset of Python}
– Fn = { f in F | f can be written in n bits or less }
• Learning algorithm: return the simplest consistent hypothesis h.
• This is pac-learnable (though not in polytime), with a sample complexity of
m ≥ (1/ε) ( n ln 2 + ln(1/δ) )
(figure: nested hypothesis spaces H1 ⊂ H2 ⊂ H3 ⊂ H4)
VC DIMENSION AND INFINITE HYPOTHESIS SPACES
Pac-learnability result: what if H is not finite?
Problem setting:
• Set of instances X
• Set of hypotheses H
• Set of possible target functions C
• Set of m training instances drawn from P(X), with noise-free labels
If the learner outputs any h from H consistent with the data, and

m > (1/ε) ( ln|H| + ln(1/δ) )

then ProbD[ errortrue(h) < ε ] > 1 − δ

Can I find some property like ln|H| for perceptrons, neural nets, … ?
A dichotomy of a set S is a labeling of each member of S as positive or negative. S is shattered by H if every dichotomy of S can be expressed by some h in H.
VC(H) ≥ 3
Pac-learnability result: what if H is not finite?
Problem setting: • Set of instances X • Set of hypotheses/target functions H = C where VC(H) = d • Set of m training instances drawn from P(X), with noise-free labels
If the learner outputs any h from H consistent with the data, and m is large enough, then
ProbD[ errortrue(h) < ε ] > 1- δ
Compare:
VC dimension: An example
Consider H = { rectangles in 2-D space } and X = {x1,x2,x3} below. Is this set shattered?
x1
x2
x3
VC dimension: An example
Consider H = { rectangles in 2-D space } Is this set shattered?
x1
x2
x3
VC dimension: An example
Consider H = { rectangles in 2-D space } Is this set shattered?
x1
x2
x3
Yes: every dichotomy can be expressed by some rectangle in H
VC dimension: An example
Consider H = { rectangles in 2-D space } and X = {x1,x2,x3,x4} below. Is this set shattered?
x1
x2
x3
x4
No: I can’t include {x1,x2,x3} in a rectangle without including x4.
VC dimension: An example
Consider H = { rectangles in 2-D space } and X = {x1,x2,x3,x4} below. Is this set shattered?
x1
x2
x3
x4
VC dimension: An example
Consider H = { rectangles in 2-D space } and X = {x1,x2,x3,x4} below. Is this set shattered?
x1
x2
x3
x4
Yes: every dichotomy can be expressed by some rectangle in H
VC dimension: An example
Consider H = { rectangles in 2-D space }. Can you shatter any set of 5?
x1
x2
x3
x4
No: consider x1 = min(horizontal), x2 = max(horizontal), x3 = min(vertical), x4 = max(vertical), and let x5 be the remaining point. Any rectangle that includes x1…x4 must include x5.
x5
So VC(H) = 4
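Shattering by axis-aligned rectangles can be checked by brute force (a sketch; the helper names and point sets are my own). The key observation: the minimal bounding box of the positive points is the best candidate rectangle, so a dichotomy is realizable iff that box excludes every negative point.

```python
from itertools import product

def rect_consistent(points, labels):
    """Is there an axis-aligned rectangle containing exactly the positives?
    Test the minimal bounding box of the positive points."""
    pos = [p for p, y in zip(points, labels) if y]
    if not pos:
        return True  # an empty rectangle realizes the all-negative labeling
    x0, x1 = min(p[0] for p in pos), max(p[0] for p in pos)
    y0, y1 = min(p[1] for p in pos), max(p[1] for p in pos)
    return not any(x0 <= p[0] <= x1 and y0 <= p[1] <= y1
                   for p, y in zip(points, labels) if not y)

def shattered(points):
    # every dichotomy of the points must be realizable
    return all(rect_consistent(points, labels)
               for labels in product((0, 1), repeat=len(points)))

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # 4 extreme points
plus_center = diamond + [(0, 0)]               # add an interior point
print(shattered(diamond), shattered(plus_center))  # → True False
```

The first result witnesses VC(H) ≥ 4; the second illustrates the slide's argument that no 5-point set works (the fifth point is trapped inside the box of the other four).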
VC dimension: An example
Consider H = { rectangles in 3-D space }. What’s the VC dimension?
x1
x2
x3
x4
x5
Pac-learnability result: what if H is not finite?
Problem setting: • Set of instances X • Set of hypotheses/target functions H = C where VC(H) = d • Set of m training instances drawn from P(X), with noise-free labels
If the learner outputs any h from H consistent with the data, and

m ≥ O( (1/ε) ( d log(1/ε) + log(1/δ) ) )

then ProbD[ errortrue(h) < ε ] > 1 − δ

Compare: m ≥ O( (1/ε) ( log|H| + log(1/δ) ) )
Intuition: why large VC dimension hurts
Suppose
• you have labels of all but one of the instances in X,
• you know the labels are consistent with something in H, and
• X is shattered by H.
Can you guess label of the final element of X?
No: you need at least VC(H) examples before you can start learning
Intuition: why small VC dimension helps
(figure: a sample S of points x1 … x16 scattered in the plane)
How many rectangles are there that are different on the sample S, if |S| = m? Claim: choose(m,4). Every choice of 4 points defining the sides gives a distinct-on-the-data h. This can be generalized: if VC(H) = d, then the number of distinct-on-the-data hypotheses grows as O(m^d).
We don’t need to worry about h’s that classify the data exactly the same.
How tight is this bound?
How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1-δ) approximately (ε) correct?
Tightness of Bounds on Sample Complexity
Lower bound on sample complexity (Ehrenfeucht et al., 1989):
Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution and a target concept in C such that if L observes fewer examples than

m = max( (1/ε) log(1/δ), (VC(C) − 1) / (32ε) )

then with probability at least δ, L outputs a hypothesis h with errortrue(h) > ε
AGNOSTIC LEARNING AND NOISE
Here ε is the difference between the training error and true error of the output hypothesis (the one with lowest training error)
Additive Hoeffding Bounds – Agnostic Learning
• Given m independent flips of a coin with true Pr(heads) = θ, we can bound the error in the maximum likelihood estimate θ̂ (Hoeffding):
Pr[ θ > θ̂ + ε ] ≤ e^−2mε²
• Relevance to agnostic learning: for any single hypothesis h,
Pr[ errortrue(h) > errortrain(h) + ε ] ≤ e^−2mε²
• But we must consider all hypotheses in H (union bound again!):
Pr[ (∃h ∈ H) errortrue(h) > errortrain(h) + ε ] ≤ |H| e^−2mε²
• So, with probability at least (1−δ), every h satisfies
errortrue(h) ≤ errortrain(h) + sqrt( ( ln|H| + ln(1/δ) ) / (2m) )
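Solving the union-bounded Hoeffding inequality |H|·e^−2mε² ≤ δ for m gives the agnostic-case sample complexity m ≥ (1/(2ε²))(ln|H| + ln(1/δ)); a quick sketch (function name and numbers mine):

```python
import math

def agnostic_sample_bound(H_size, eps, delta):
    """m with |H| * exp(-2 m eps^2) <= delta (Hoeffding + union bound)."""
    return math.ceil((math.log(H_size) + math.log(1 / delta))
                     / (2 * eps**2))

# same |H|, eps, delta as the realizable case, but now 1/eps^2 not 1/eps:
print(agnostic_sample_bound(2**20, 0.1, 0.05))  # → 843
```

The quadratic dependence on 1/ε (versus linear in the noise-free case) is the price of not assuming the target is in H.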
Agnostic Learning: VC Bounds
With probability at least (1-δ) every h ∈ H satisfies
[Schölkopf and Smola, 2002]
Structural Risk Minimization
Which hypothesis space should we choose? • Bias / variance tradeoff
(figure: nested hypothesis spaces H1 ⊂ H2 ⊂ H3 ⊂ H4)
[Vapnik]
SRM: choose H to minimize bound on expected true error!
* this is a very loose bound...
MORE VC DIMENSION EXAMPLES
VC dimension: examples
Consider X = ℝ; we want to learn c: X → {0,1}. What is the VC dimension of
• Open intervals:
• Closed intervals:
VC dimension: examples
Consider X = ℝ; we want to learn c: X → {0,1}. What is the VC dimension of
• Open intervals:
• Closed intervals:
VC(H1)=1
VC(H2)=2
VC(H3)=2
VC(H4)=3
VC dimension: examples
What is VC dimension of lines in a plane? • H2 = { (w0 + w1x1 + w2x2 > 0) → y = 1 }
VC dimension: examples
What is VC dimension of • H2 = { (w0 + w1x1 + w2x2 > 0) → y = 1 }
– VC(H2)=3 • For Hn = linear separating hyperplanes in n dimensions,
VC(Hn)=n+1
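The lower-bound half of VC(H2) = 3 can be verified by brute force on three non-collinear points: every one of the 8 dichotomies is realized by some linear threshold. A sketch (my own point set; small integer weights suffice here):

```python
from itertools import product

points = [(0, 0), (1, 0), (0, 1)]  # non-collinear

def separable(labels):
    # search small integer weights for: w0 + w1*x1 + w2*x2 > 0 ⇔ label 1
    for w0, w1, w2 in product(range(-3, 4), repeat=3):
        if all((w0 + w1 * x + w2 * y > 0) == bool(l)
               for (x, y), l in zip(points, labels)):
            return True
    return False

print(all(separable(lab) for lab in product((0, 1), repeat=3)))  # → True
```

The matching upper bound (no 4 points in the plane are shattered by lines) follows from Radon's theorem, giving VC(H2) = 3 and, in general, VC(Hn) = n + 1.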
For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H| ?
(hint: yes)
More VC Dimension Examples to Think About
• Logistic regression over n continuous features – Over n boolean features?
• Linear SVM over n continuous features
• Decision trees defined over n boolean features F: <X1, ..., Xn> → Y
• Decision trees of depth 2 defined over n features • How about 1-nearest neighbor?