
Page 1:

PAC Learning

10-601 Introduction to Machine Learning

Matt Gormley
Lecture 28
May 1, 2016

Machine Learning Department
School of Computer Science
Carnegie Mellon University

Learning Theory Readings: Murphy --, Bishop --, HTF --, Mitchell 7

Page 2:

Reminders

• Homework 9: Applications of ML
  – Release: Mon, Apr. 24
  – Due: Wed, May 3 at 11:59pm

Page 3:

Outline
• Statistical Learning Theory
  – True Error vs. Train Error
  – Function Approximation View (aka. PAC/SLT Model)
  – Three Hypotheses of Interest
• Probably Approximately Correct (PAC) Learning
  – PAC Criterion
  – PAC Learnable
  – Consistent Learner
  – Sample Complexity
• Generalization and Overfitting
  – Realizable vs. Agnostic Cases
  – Finite vs. Infinite Hypothesis Spaces
  – VC Dimension
  – Sample Complexity Bounds
  – Empirical Risk Minimization
  – Structural Risk Minimization
• Excess Risk

Page 4:

LEARNING THEORY

Page 5:

Questions For Today
1. Given a classifier with zero training error, what can we say about generalization error?
   (Sample Complexity, Realizable Case)
2. Given a classifier with low training error, what can we say about generalization error?
   (Sample Complexity, Agnostic Case)
3. Is there a theoretical justification for regularization to avoid overfitting?
   (Structural Risk Minimization)

Page 6:

Statistical Learning Theory

Whiteboard:
– Function Approximation View (aka. PAC/SLT Model)
– True Error vs. Train Error
– Three Hypotheses of Interest

Page 7:

PAC / SLT Model

PAC/SLT models for Supervised Learning

[Figure: a data source generates instances according to a distribution D on X; an expert/oracle labels them with the target concept c*: X → Y, yielding labeled examples (x1, c*(x1)), …, (xm, c*(xm)); the learning algorithm takes these as input and outputs a hypothesis h: X → Y, drawn here as a small decision tree that splits on x1 > 5 and x6 > 2 to predict +1 or −1.]

Slide from Nina Balcan

Page 8:

PAC / SLT Model

Page 9:

Two Types of Error

Train Error (aka. empirical risk)

True Error (aka. expected risk)
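A sketch of the standard definitions behind these two quantities (0-1 loss, target concept c*, sample of size N drawn from the distribution D); the slide's own formulas are not in the transcript:

```latex
% Train error (empirical risk) of h on the sample S = {x^{(1)}, ..., x^{(N)}}:
\hat{R}(h) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ h(x^{(i)}) \neq c^*(x^{(i)}) \right]

% True error (expected risk) of h under the data distribution D:
R(h) = \Pr_{x \sim D}\!\left[ h(x) \neq c^*(x) \right]
```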

Page 10:

Three Hypotheses of Interest
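A sketch of the three hypotheses usually contrasted here, in the notation above (the slide body is not in the transcript):

```latex
% 1. The true target concept (zero true error by definition):
c^*

% 2. The best-in-class hypothesis (lowest true error within H):
h^* = \operatorname*{argmin}_{h \in \mathcal{H}} R(h)

% 3. The learned hypothesis (lowest train error within H):
\hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}} \hat{R}(h)
```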

Page 11:

PAC LEARNING

Page 12:

Probably Approximately Correct (PAC) Learning

Whiteboard:
– PAC Criterion
– Meaning of "Probably Approximately Correct"
– PAC Learnable
– Consistent Learner
– Sample Complexity
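A sketch of one standard formulation of these whiteboard definitions:

```latex
% PAC criterion: for any eps, delta in (0,1), with probability at least 1 - delta
% over the draw of the N training examples, the learned hypothesis is
% "approximately correct" (true error at most eps):
\Pr\!\left[ R(\hat{h}) \leq \epsilon \right] \;\geq\; 1 - \delta

% Sample complexity: the smallest N(\epsilon, \delta) for which the criterion holds;
% H is PAC learnable if such an N exists and is polynomial in 1/\epsilon and 1/\delta.
% A consistent learner returns a hypothesis with zero train error whenever one exists in H.
```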

Page 13:

PAC Learning

Page 14:

SAMPLE COMPLEXITY RESULTS

Page 15:

Sample Complexity Results

Four Cases we care about… (Realizable vs. Agnostic × Finite vs. Infinite |H|)
We'll start with the finite case…
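A sketch of the standard finite-|H| bounds (0-1 loss; see, e.g., Mitchell Ch. 7):

```latex
% Realizable case, finite H: with probability at least 1 - delta, every hypothesis
% with zero train error has true error at most eps whenever
N \;\geq\; \frac{1}{\epsilon}\left[ \ln|\mathcal{H}| + \ln\frac{1}{\delta} \right]

% Agnostic case, finite H: with probability at least 1 - delta,
% |R(h) - \hat{R}(h)| \leq \epsilon holds simultaneously for all h in H whenever
N \;\geq\; \frac{1}{2\epsilon^2}\left[ \ln|\mathcal{H}| + \ln\frac{2}{\delta} \right]
```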

Page 16:

Generalization and Overfitting

Whiteboard:
– Realizable vs. Agnostic Cases
– Finite vs. Infinite Hypothesis Spaces
– Sample Complexity Bounds (Finite Case)

Page 17:

Sample Complexity Results

Four Cases we care about… (Realizable vs. Agnostic × Finite vs. Infinite |H|)

Page 18:

Example: Conjunctions
In-Class Quiz: Suppose H = the class of conjunctions over x in {0,1}^M.
If M = 10, ε = 0.1, δ = 0.01, how many examples suffice?

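A sketch of the arithmetic for the realizable bound above, assuming |H| = 3^M (each of the M boolean variables appears in a conjunction positively, negated, or not at all):

```python
import math

# Realizable, finite-|H| sample complexity bound (sketch):
#   N >= (1/eps) * (ln|H| + ln(1/delta))
# Assumption: H = conjunctions over M boolean variables, so |H| = 3**M.
M, eps, delta = 10, 0.1, 0.01
H_size = 3 ** M                                    # 59049 conjunctions
N = (math.log(H_size) + math.log(1 / delta)) / eps
print(math.ceil(N))                                # -> 156 examples suffice
```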

Page 19:

Sample Complexity Results

Four Cases we care about… (Realizable vs. Agnostic × Finite vs. Infinite |H|)

Page 20:

Sample Complexity Results

Four Cases we care about… (Realizable vs. Agnostic × Finite vs. Infinite |H|)

We need a new definition of "complexity" for a hypothesis space for these results (see VC Dimension).

Page 21:

VC DIMENSION

Page 22:

What if H is infinite?

E.g., linear separators in R^d
E.g., intervals on the real line
E.g., thresholds on the real line

[Figure: example labelings (+/−) of points on the line or in the plane achievable by each hypothesis class.]

Page 23:

Shattering, VC-dimension

Definition: A set of points S is shattered by H if there are hypotheses in H that split S in all of the 2^|S| possible ways; i.e., all possible ways of classifying points in S are achievable using concepts in H.
(Notation: H[S] is the set of splittings of dataset S using concepts from H; H shatters S if |H[S]| = 2^|S|.)

Definition (VC-dimension, Vapnik-Chervonenkis dimension): The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞.

Page 24:

Shattering, VC-dimension

Definition (VC-dimension, Vapnik-Chervonenkis dimension): The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞.

To show that the VC-dimension is d:
– there exists a set of d points that can be shattered, and
– there is no set of d+1 points that can be shattered.

Fact: If H is finite, then VCdim(H) ≤ log2(|H|), since shattering d points requires at least 2^d distinct hypotheses.

Page 25:

Shattering, VC-dimension

If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered.

E.g., H = thresholds on the real line: VCdim(H) = 1 (a single point can be shattered, but for any two points one of the four labelings is unachievable).
E.g., H = intervals on the real line: VCdim(H) = 2 (two points can be shattered, but the labeling + − + of three points cannot be realized by a single interval).

Page 26:

Shattering, VC-dimension

If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered.

E.g., H = union of k intervals on the real line: VCdim(H) = 2k.
– VCdim(H) ≥ 2k: a sample of size 2k can be shattered (treat each pair of consecutive points as a separate case of intervals).
– VCdim(H) < 2k + 1: 2k+1 points labeled alternately + − + − … + contain k+1 separated positive points, which cannot be covered by only k intervals.

Page 27:

Shattering, VC-dimension

E.g., H = linear separators in R^2: VCdim(H) ≥ 3 (three non-collinear points can be labeled in all 2^3 = 8 ways by a linear separator).

Page 28:

Shattering, VC-dimension

E.g., H = linear separators in R^2: VCdim(H) < 4. For any set of four points:
– Case 1: one point lies inside the triangle formed by the others. Cannot label the inside point as positive and the outside points as negative.
– Case 2: all points are on the boundary (convex hull). Cannot label two diagonally opposite points as positive and the other two as negative.

Fact: the VCdim of linear separators in R^d is d+1.

Page 29:

SAMPLE COMPLEXITY RESULTS

Page 30:

Sample Complexity Results

Four Cases we care about… (Realizable vs. Agnostic × Finite vs. Infinite |H|)

We need a new definition of "complexity" for a hypothesis space for these results (see VC Dimension).
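A sketch of the standard infinite-|H| counterparts of the earlier bounds, stated in terms of VC dimension (constants omitted):

```latex
% Realizable case, infinite H (up to constants):
N = O\!\left( \frac{1}{\epsilon}\left[ \mathrm{VC}(\mathcal{H}) \log\frac{1}{\epsilon} + \log\frac{1}{\delta} \right] \right)

% Agnostic case, infinite H (up to constants):
N = O\!\left( \frac{1}{\epsilon^2}\left[ \mathrm{VC}(\mathcal{H}) + \log\frac{1}{\delta} \right] \right)
```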

Page 31:

Sample Complexity Results

Four Cases we care about… (Realizable vs. Agnostic × Finite vs. Infinite |H|)

Page 32:

Generalization and Overfitting

Whiteboard:
– Sample Complexity Bounds (Infinite Case)
– Empirical Risk Minimization
– Structural Risk Minimization
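A sketch of how the last two whiteboard items are usually written down: empirical risk minimization picks the hypothesis with lowest train error, while structural risk minimization adds a complexity penalty, which is the theoretical justification for regularization.

```latex
% Empirical risk minimization (ERM):
\hat{h}_{\mathrm{ERM}} = \operatorname*{argmin}_{h \in \mathcal{H}} \hat{R}(h)

% Structural risk minimization (SRM) over nested spaces H_1 \subseteq H_2 \subseteq \dots:
\hat{h}_{\mathrm{SRM}} = \operatorname*{argmin}_{k,\; h \in \mathcal{H}_k}
    \; \hat{R}(h) + \mathrm{penalty}(\mathcal{H}_k, N)
% where the penalty grows with the complexity (e.g., VC dimension) of H_k
% and shrinks as the sample size N grows.
```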

Page 33:

EXCESS RISK

Page 34:

Excess Risk
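A sketch of the standard definition: the excess risk of the learned hypothesis is measured against the best achievable risk and decomposes into an estimation term and an approximation term.

```latex
% Excess risk decomposition (h^* is the best hypothesis in H,
% R^* the minimum achievable risk over all predictors):
R(\hat{h}) - R^* \;=\;
  \underbrace{\bigl[\, R(\hat{h}) - R(h^*) \,\bigr]}_{\text{estimation error}}
  \;+\;
  \underbrace{\bigl[\, R(h^*) - R^* \,\bigr]}_{\text{approximation error}}
```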

Page 35:

Excess Risk Results

Page 36:

Questions For Today
1. Given a classifier with zero training error, what can we say about generalization error?
   (Sample Complexity, Realizable Case)
2. Given a classifier with low training error, what can we say about generalization error?
   (Sample Complexity, Agnostic Case)
3. Is there a theoretical justification for regularization to avoid overfitting?
   (Structural Risk Minimization)
