
Slide 1

Nov 30th, 2001
Copyright © 2001, Andrew W. Moore

PAC-learning

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University

www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Slide 2

Probably Approximately Correct (PAC) Learning

• Imagine we’re doing classification with categorical inputs.

• All inputs and outputs are binary.
• Data is noiseless.
• There's a machine f(x, h) which has H possible settings (a.k.a. hypotheses), called h1, h2 .. hH.

Slide 3

Example of a machine
• f(x, h) consists of all logical sentences about X1, X2 .. Xm that contain only logical ands.
• Example hypotheses:
    X1 ^ X3 ^ X19
    X3 ^ X18
    X7
    X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
• Question: if there are 3 attributes, what is the complete set of hypotheses in f?

Slide 4

Example of a machine
• f(x, h) consists of all logical sentences about X1, X2 .. Xm that contain only logical ands.
• Example hypotheses:
    X1 ^ X3 ^ X19
    X3 ^ X18
    X7
    X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
• Question: if there are 3 attributes, what is the complete set of hypotheses in f? (H = 8)

    True        X2           X3           X2 ^ X3
    X1          X1 ^ X2      X1 ^ X3      X1 ^ X2 ^ X3
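To make the enumeration concrete, here is a short Python sketch (mine, not from the slides) that lists every hypothesis for m = 3 by treating a hypothesis as the subset of attributes it ANDs together; the empty subset is the hypothesis "True".

```python
import itertools

m = 3
# a hypothesis is the AND of a subset of {X1..Xm}; the empty AND is "True"
hypotheses = [s for r in range(m + 1)
              for s in itertools.combinations(range(1, m + 1), r)]
for h in hypotheses:
    print(" ^ ".join(f"X{i}" for i in h) or "True")
print(len(hypotheses))   # 8: each attribute is either present or absent
```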

Slide 5

And-Positive-Literals Machine
• f(x, h) consists of all logical sentences about X1, X2 .. Xm that contain only logical ands.
• Example hypotheses:
    X1 ^ X3 ^ X19
    X3 ^ X18
    X7
    X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
• Question: if there are m attributes, how many hypotheses in f?

Slide 6

And-Positive-Literals Machine
• f(x, h) consists of all logical sentences about X1, X2 .. Xm that contain only logical ands.
• Example hypotheses:
    X1 ^ X3 ^ X19
    X3 ^ X18
    X7
    X1 ^ X2 ^ X3 ^ X4 ^ … ^ Xm
• Question: if there are m attributes, how many hypotheses in f? (H = 2^m)

Slide 7

And-Literals Machine
• f(x, h) consists of all logical sentences about X1, X2 .. Xm or their negations that contain only logical ands.
• Example hypotheses:
    X1 ^ ~X3 ^ X19
    X3 ^ ~X18
    ~X7
    X1 ^ X2 ^ ~X3 ^ … ^ Xm
• Question: if there are 2 attributes, what is the complete set of hypotheses in f?

Slide 8

And-Literals Machine
• f(x, h) consists of all logical sentences about X1, X2 .. Xm or their negations that contain only logical ands.
• Example hypotheses:
    X1 ^ ~X3 ^ X19
    X3 ^ ~X18
    ~X7
    X1 ^ X2 ^ ~X3 ^ … ^ Xm
• Question: if there are 2 attributes, what is the complete set of hypotheses in f? (H = 9)

    True        X2            ~X2
    X1          X1 ^ X2       X1 ^ ~X2
    ~X1         ~X1 ^ X2      ~X1 ^ ~X2

Slide 9

And-Literals Machine
• f(x, h) consists of all logical sentences about X1, X2 .. Xm or their negations that contain only logical ands.
• Example hypotheses:
    X1 ^ ~X3 ^ X19
    X3 ^ ~X18
    ~X7
    X1 ^ X2 ^ ~X3 ^ … ^ Xm
• Question: if there are m attributes, what is the size of the complete set of hypotheses in f?

    True        X2            ~X2
    X1          X1 ^ X2       X1 ^ ~X2
    ~X1         ~X1 ^ X2      ~X1 ^ ~X2

Slide 10

And-Literals Machine
• f(x, h) consists of all logical sentences about X1, X2 .. Xm or their negations that contain only logical ands.
• Example hypotheses:
    X1 ^ ~X3 ^ X19
    X3 ^ ~X18
    ~X7
    X1 ^ X2 ^ ~X3 ^ … ^ Xm
• Question: if there are m attributes, what is the size of the complete set of hypotheses in f? (H = 3^m)

    True        X2            ~X2
    X1          X1 ^ X2       X1 ^ ~X2
    ~X1         ~X1 ^ X2      ~X1 ^ ~X2
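The same enumeration idea works here and shows where 3^m comes from: each attribute is independently absent, positive, or negated. A small Python sketch (mine, not from the slides) for m = 2:

```python
import itertools

m = 2
hypotheses = []
# each attribute: absent, positive, or negated -> 3 choices per attribute
for combo in itertools.product(("absent", "pos", "neg"), repeat=m):
    literals = [("~" if s == "neg" else "") + f"X{i + 1}"
                for i, s in enumerate(combo) if s != "absent"]
    hypotheses.append(" ^ ".join(literals) or "True")
print(hypotheses)                 # the 9 hypotheses of Slide 8
assert len(hypotheses) == 3 ** m
```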

Slide 11

Lookup Table Machine
• f(x, h) consists of all truth tables mapping combinations of input attributes to true and false.
• Example hypothesis:

    X1 X2 X3 X4 | Y
     0  0  0  0 | 0
     0  0  0  1 | 1
     0  0  1  0 | 1
     0  0  1  1 | 0
     0  1  0  0 | 1
     0  1  0  1 | 0
     0  1  1  0 | 0
     0  1  1  1 | 1
     1  0  0  0 | 0
     1  0  0  1 | 0
     1  0  1  0 | 0
     1  0  1  1 | 1
     1  1  0  0 | 0
     1  1  0  1 | 0
     1  1  1  0 | 0
     1  1  1  1 | 0

• Question: if there are m attributes, what is the size of the complete set of hypotheses in f?

Slide 12

Lookup Table Machine
• f(x, h) consists of all truth tables mapping combinations of input attributes to true and false.
• Example hypothesis: (the truth table on Slide 11)
• Question: if there are m attributes, what is the size of the complete set of hypotheses in f?

    H = 2^(2^m)
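The count follows because a truth table must pick an output bit for each of the 2^m input rows, giving 2^(2^m) tables. A quick Python check (mine, not from the slides) for m = 2:

```python
import itertools

m = 2
rows = list(itertools.product([0, 1], repeat=m))   # the 2^m input combinations
# one hypothesis = one choice of output bit for every row
tables = list(itertools.product([0, 1], repeat=len(rows)))
print(len(tables))                                 # 16
assert len(tables) == 2 ** (2 ** m)
```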

Slide 13

A Game
• We specify f, the machine.
• Nature chooses a hidden random hypothesis h*.
• Nature randomly generates R datapoints.
  How is a datapoint generated?
  1. A vector of inputs xk = (xk1, xk2, …, xkm) is drawn from a fixed unknown distribution D.
  2. The corresponding output is yk = f(xk, h*).
• We learn an approximation of h* by choosing some hest for which the training set error is 0.
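One round of this game, sketched in Python under stated assumptions: the names play_game and draw_x are mine, draw_x stands in for the unknown distribution D, and the toy instance at the bottom is the m = 2 And-Positive-Literals machine with a uniform D.

```python
import random

def play_game(f, hypotheses, draw_x, R, rng):
    """Nature hides h*, draws R inputs from D, labels them noiselessly with
    f(x, h*); we return some h_est with zero training-set error."""
    h_star = rng.choice(hypotheses)              # Nature's hidden hypothesis
    xs = [draw_x(rng) for _ in range(R)]         # inputs drawn from D
    data = [(x, f(x, h_star)) for x in xs]       # noiseless labels
    for h in hypotheses:                         # any consistent h_est will do
        if all(f(x, h) == y for x, y in data):
            return h

# toy instance: And-Positive-Literals with m = 2, uniform D
def f(x, h):
    return int(all(x[i] for i in h))             # h = set of attribute indices

hypotheses = [set(), {0}, {1}, {0, 1}]           # True, X1, X2, X1 ^ X2
draw_x = lambda rng: (rng.randint(0, 1), rng.randint(0, 1))
print(play_game(f, hypotheses, draw_x, R=10, rng=random.Random(1)))
```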

Slide 14

Test Error Rate
• We specify f, the machine.
• Nature chooses a hidden random hypothesis h*.
• Nature randomly generates R datapoints.
  How is a datapoint generated?
  1. A vector of inputs xk = (xk1, xk2, …, xkm) is drawn from a fixed unknown distribution D.
  2. The corresponding output is yk = f(xk, h*).
• We learn an approximation of h* by choosing some hest for which the training set error is 0.
• For each hypothesis h, say h is Correctly Classified (CCd) if h has zero training set error.
• Define TESTERR(h)
  = fraction of test points that h will classify incorrectly
  = P(h misclassifies a random test point).
• Say h is BAD if TESTERR(h) > ε.

Slide 15

Test Error Rate
• We specify f, the machine.
• Nature chooses a hidden random hypothesis h*.
• Nature randomly generates R datapoints.
  How is a datapoint generated?
  1. A vector of inputs xk = (xk1, xk2, …, xkm) is drawn from a fixed unknown distribution D.
  2. The corresponding output is yk = f(xk, h*).
• We learn an approximation of h* by choosing some hest for which the training set error is 0.
• For each hypothesis h, say h is Correctly Classified (CCd) if h has zero training set error.
• Define TESTERR(h)
  = fraction of test points that h will classify incorrectly
  = P(h misclassifies a random test point).
• Say h is BAD if TESTERR(h) > ε.

    P(h is CCd | h is bad)
      = P(∀k ∈ Training Set, f(xk, h) = yk | h is bad)
      ≤ (1 − ε)^R
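The bound uses one step the slide leaves implicit: the R training points are i.i.d. draws from D, so given a bad h the R events "h gets point k right" are independent, each with probability at most 1 − ε. Spelled out (a LaTeX sketch of the reasoning, not a verbatim slide):

```latex
\begin{align*}
P(h \text{ is CCd} \mid h \text{ is bad})
  &= P\bigl(\forall k \in \text{Training Set},\ f(x_k, h) = y_k \mid h \text{ is bad}\bigr)\\
  &= \prod_{k=1}^{R} P\bigl(f(x_k, h) = y_k \mid h \text{ is bad}\bigr)
     && \text{(i.i.d.\ draws from $D$)}\\
  &\le (1 - \epsilon)^{R}
     && \text{($h$ bad: each point is misclassified w.p.\ more than $\epsilon$)}
\end{align*}
```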

Slide 16

Test Error Rate
• We specify f, the machine.
• Nature chooses a hidden random hypothesis h*.
• Nature randomly generates R datapoints.
  How is a datapoint generated?
  1. A vector of inputs xk = (xk1, xk2, …, xkm) is drawn from a fixed unknown distribution D.
  2. The corresponding output is yk = f(xk, h*).
• We learn an approximation of h* by choosing some hest for which the training set error is 0.
• For each hypothesis h, say h is Correctly Classified (CCd) if h has zero training set error.
• Define TESTERR(h)
  = fraction of test points that h will classify incorrectly
  = P(h misclassifies a random test point).
• Say h is BAD if TESTERR(h) > ε.

    P(h is CCd | h is bad)
      = P(∀k ∈ Training Set, f(xk, h) = yk | h is bad)
      ≤ (1 − ε)^R

    P(we learn a bad h)
      = P(the set of CCd h's contains a bad h)
      = P(∃h. h is CCd ∧ h is bad)
      = P((h1 is CCd ∧ h1 is bad) ∨ (h2 is CCd ∧ h2 is bad) ∨ … ∨ (hH is CCd ∧ hH is bad))
      ≤ Σ_{i=1}^{H} P(hi is CCd ∧ hi is bad)
      ≤ Σ_{i=1}^{H} P(hi is CCd | hi is bad)
      ≤ H (1 − ε)^R
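The chain can be sanity-checked numerically. Below is a small Monte Carlo sketch (my own construction, not from the slides): the m = 3 And-Positive-Literals machine (H = 8), a uniform input distribution D, and a worst-case learner that fails whenever any zero-training-error hypothesis is bad. The observed failure rate should sit below H(1 − ε)^R.

```python
import itertools, random

m = 3
inputs = list(itertools.product([0, 1], repeat=m))            # all 2^m inputs
hypotheses = [frozenset(s) for r in range(m + 1)
              for s in itertools.combinations(range(m), r)]   # H = 2^m = 8

def predict(h, x):
    # hypothesis h is the AND of the attributes whose indices it contains
    return int(all(x[i] for i in h))

def testerr(h, h_star):
    # exact error rate when test points are drawn uniformly from all inputs
    return sum(predict(h, x) != predict(h_star, x) for x in inputs) / len(inputs)

def trial(R, eps, rng):
    h_star = rng.choice(hypotheses)                           # Nature's hidden h*
    data = [(x, predict(h_star, x)) for x in rng.choices(inputs, k=R)]
    consistent = [h for h in hypotheses
                  if all(predict(h, x) == y for x, y in data)]
    # worst case: we fail if ANY consistent (CCd) hypothesis is bad
    return any(testerr(h, h_star) > eps for h in consistent)

rng = random.Random(0)
R, eps, n_trials = 20, 0.2, 10000
p_bad = sum(trial(R, eps, rng) for _ in range(n_trials)) / n_trials
print(f"empirical P(we learn a bad h) = {p_bad:.4f}")
print(f"bound H * (1 - eps)^R         = {len(hypotheses) * (1 - eps) ** R:.4f}")
```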

Slide 17

PAC Learning
• Choose R such that with probability less than δ we'll select a bad hest (i.e. an hest which makes mistakes more than a fraction ε of the time).
• Probably Approximately Correct.
• As we just saw, this can be achieved by choosing R such that

    P(we learn a bad h) ≤ H (1 − ε)^R ≤ δ

• i.e. R such that

    R ≥ (0.69/ε) (log2 H + log2(1/δ))
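That inequality is easy to turn into a sample-size calculator. A minimal sketch (the function name pac_sample_size is mine); it uses natural logs, which gives the same bound since 0.69 ≈ ln 2:

```python
from math import ceil, log

def pac_sample_size(H, eps, delta):
    """Smallest R guaranteeing H * (1 - eps)**R <= delta.

    Uses (1 - eps)**R <= exp(-eps * R), so it suffices that
    R >= (1/eps) * (ln H + ln(1/delta)),
    i.e. (0.69/eps) * (log2 H + log2(1/delta)) since ln 2 ~= 0.69."""
    return ceil((log(H) + log(1 / delta)) / eps)

# e.g. the And-Literals machine with m = 10 attributes (H = 3^10):
print(pac_sample_size(3 ** 10, eps=0.05, delta=0.01))   # -> 312
```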

Slide 18

PAC in action

Machine                  Example hypothesis             H                   R required to PAC-learn
And-positive-literals    X3 ^ X7 ^ X8                   2^m                 (0.69/ε)(m + log2(1/δ))
And-literals             X3 ^ ~X7                       3^m                 (0.69/ε)(m log2 3 + log2(1/δ))
Lookup Table             (a full truth table over       2^(2^m)             (0.69/ε)(2^m + log2(1/δ))
                         X1..X4, as on Slide 11)
And-lits or And-lits     (X1 ^ X5) v (X2 ^ ~X7 ^ X8)    (3^m)^2 = 3^(2m)    (0.69/ε)(2m log2 3 + log2(1/δ))
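The R column can be tabulated directly from log2 H. A sketch under assumed example values m = 10, ε = 0.05, δ = 0.01 (the helper name R_needed is mine):

```python
from math import ceil, log2

def R_needed(log2_H, eps, delta):
    # R >= (0.69/eps) * (log2 H + log2(1/delta)), from the previous slide
    return ceil(0.69 / eps * (log2_H + log2(1 / delta)))

m, eps, delta = 10, 0.05, 0.01
machines = {
    "And-positive-literals": m,                # log2 H = log2 2^m
    "And-literals":          m * log2(3),      # log2 H = log2 3^m
    "Lookup Table":          2 ** m,           # log2 H = log2 2^(2^m)
    "And-lits or And-lits":  2 * m * log2(3),  # log2 H = log2 3^(2m)
}
for name, log2_H in machines.items():
    print(f"{name:22s}  R >= {R_needed(log2_H, eps, delta)}")
```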

Slide 19

PAC for decision trees of depth k
• Assume m attributes.
• Hk = number of decision trees of depth k.
• H0 = 2
• Hk+1 = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees)
       = m · Hk · Hk
• Write Lk = log2 Hk.
• L0 = 1
• Lk+1 = log2 m + 2 Lk
• So Lk = (2^k − 1)(1 + log2 m) + 1
• So to PAC-learn, we need

    R ≥ (0.69/ε) ((2^k − 1)(1 + log2 m) + 1 + log2(1/δ))
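A quick numeric check of the recurrence against the closed form, plus the resulting sample size (a sketch; the helper names and the example values m = 10, k = 3, ε = 0.1, δ = 0.05 are mine):

```python
from math import ceil, log2

def L_rec(k, m):
    # L_0 = 1;  L_{k+1} = log2(m) + 2 * L_k
    L = 1.0
    for _ in range(k):
        L = log2(m) + 2 * L
    return L

def L_closed(k, m):
    # closed form from the slide: L_k = (2^k - 1)(1 + log2 m) + 1
    return (2 ** k - 1) * (1 + log2(m)) + 1

m = 10
assert all(abs(L_rec(k, m) - L_closed(k, m)) < 1e-9 for k in range(8))

# records to PAC-learn depth-3 trees over 10 attributes
eps, delta = 0.1, 0.05
print(ceil(0.69 / eps * (L_closed(3, m) + log2(1 / delta))))   # -> 246
```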

Slide 20

What you should know
• Be able to understand every step in the math that gets you to

    P(we learn a bad h) ≤ H (1 − ε)^R

• Understand that you thus need this many records to PAC-learn a machine with H hypotheses:

    R ≥ (0.69/ε) (log2 H + log2(1/δ))

• Understand examples of deducing H for various machines.