START OF DAY 1 Reading: Chap. 1 & 2


Page 1: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

START OF DAY 1
Reading: Chap. 1 & 2

Page 2: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Introduction

Page 3: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

How Do We Learn?

• By being told – Content transfer

• By analogy – Context transfer

• By induction – Knowledge construction

Page 4: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Learning by Induction

Induction is a process that "involves intellectual leaps from the particular to the general"

Orley’s Experience

Page 5: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Group These Objects

Write down your answer – Keep it to yourself

Page 6: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

UGLY PRETTY

Find the Rule(s)

Write down your answer – Keep it to yourself

Page 7: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Requirements for Induction

• A language, Lp, to represent the particular (i.e., specific instances)

• A language, Lg, to represent the general (i.e., generalizations)

• A matching predicate, match(G,i), that is true if G is correct about i

• A set, I, of particulars of some unknown general G*

Page 8: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

What Does Induction Do?

• Observes the particulars in I

• Produces a generalization, G, such that:

– G is consistent: for all i ∈ I, match(G,i)

– G generalizes beyond I: for many i ∉ I, match(G,i)

– G resembles G*: for most i, G(i) = G*(i)
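
A quick operational reading of the consistency requirement (a hedged sketch; the function and argument names are mine, not from the reading):

```python
# Hypothetical helper: G is consistent with the observed particulars I
# exactly when the matching predicate holds for every observed instance.
def consistent(G, I, match):
    return all(match(G, i) for i in I)
```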

Page 9: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Two Different Contexts

• Picture 1 is an example of Unsupervised Learning
– There is no predefined label for the instances

• Picture 2 is an example of Supervised Learning
– There is a predefined label (or class assignment) for the instances

Page 10: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Unsupervised Learning

• Consistency is (vacuously) guaranteed since you pick G (=G*) and then assign instances to groups based on G

• Generalization accuracy is also guaranteed, for the same reason

• Need a mechanism to choose among possible groupings (i.e., what makes one grouping better than another?)
– Internal metrics (e.g., compactness; see the sketch below)
– External metrics (e.g., labeled data – assumes G*)
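
One way to make the internal-metric idea concrete (a hedged sketch, not from the slides; compactness as the within-group sum of squared distances to each group's centroid is just one common choice):

```python
from statistics import mean

def compactness(groups):
    """groups: a list of groups, each a list of numeric feature vectors."""
    total = 0.0
    for group in groups:
        centroid = [mean(dim) for dim in zip(*group)]   # per-dimension average
        total += sum(sum((x - c) ** 2 for x, c in zip(point, centroid))
                     for point in group)
    return total   # lower = tighter grouping

# Usage: two candidate groupings of the same four 2-D points
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
print(compactness([points[:2], points[2:]]))                          # tight: small value
print(compactness([[points[0], points[2]], [points[1], points[3]]]))  # mixed: large value
```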

Page 11: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Supervised Learning

• Must try to achieve consistency
– Why is that desirable?
– Is it always possible?
• Noise (i.e., mislabeled instances)
• Limitation of Lg (i.e., cannot represent a consistent G)

• Generalization accuracy can be measured directly (e.g., on held-out labeled data; see the sketch below)

• Need a mechanism to choose among possible generalizations (i.e., what makes one generalization "better" than another? Remember, we do not know G*)
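
A hedged sketch (the names and the split are mine, not the author's) of measuring generalization accuracy directly: since G* is unknown, a generalization is scored on labeled data held out from training.

```python
def accuracy(G, labeled, match):
    """labeled: list of (instance, label) pairs; G predicts positive when it matches."""
    return sum(match(G, x) == y for x, y in labeled) / len(labeled)

def holdout_split(labeled, train_fraction=0.7):
    cut = int(len(labeled) * train_fraction)
    # induce G from the first part, estimate its generalization accuracy on the second
    return labeled[:cut], labeled[cut:]
```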

Page 12: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Revisiting Our Examples

• Picture 1
– What groupings did you come up with?
– How good are your groupings?

• Example 2
– What generalizations did you come up with?
– Are they consistent?
– How well do you think your generalizations will do beyond the observed instances?

Page 13: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Zooming in on Picture 2

• There are obviously several generalizations that are consistent:
– If red or green, then class 1; otherwise class 2
– If fewer than 3 edges, then class 1; otherwise class 2

• So: why did you choose the one you did?

Suppose that I now tell you that the PRETTY class is the set of complex polygons (i.e., with 4 sides or more). Would that change your preference? Why?

BIAS

Page 14: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

The Need for and Role of Bias

Page 15: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Recall: Concept Learning

Given:
– A language of observations/instances
– A language of concepts/generalizations
– A matching predicate
– A set of observations

Find generalizations that:
1. Are consistent with the observations, and
2. Classify instances beyond those observed

Page 16: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Working Example (I)

Observations are characterized by a fixed set of features (or attributes), and each instance corresponds to a specific assignment of values to the features

For example:

Features:
color [red, blue, green, yellow]
shape [square, triangle, circle]
size [large, medium, small]

Observations:
O1 <red,square,small>
O2 <blue,square,large>
O3 <yellow,circle,small>, etc.

Language of instances = set of attribute-value pairs

Attribute-value Language (AVL)

Page 17: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Working Example (II)

Generalizations are used to represent N ≥ 1 instances

For example:
G1 <red,*,*> represents all red objects (independent of shape and size)
G2 <blue,square∨triangle,small> represents small blue objects that are either squares or triangles

Each generalization can be viewed as a set, namely of the instances it represents (or matches). In the above example:
G1 = {<red,square,large>, <red,square,medium>, <red,square,small>, <red,triangle,large>, …, <red,circle,small>}
G2 = {<blue,square,small>, <blue,triangle,small>}

Language of generalizations = designer chooses
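
A hedged sketch (not from the slides) of what these generalizations look like operationally: instances are (color, shape, size) tuples, and each attribute of a generalization is either '*' or a set of allowed values (so a two-member set encodes a disjunction like square∨triangle).

```python
from itertools import product

COLORS = ["red", "blue", "green", "yellow"]
SHAPES = ["square", "triangle", "circle"]
SIZES  = ["large", "medium", "small"]

def match(G, i):
    """True if generalization G is correct about (covers) instance i."""
    return all(g == "*" or v in g for g, v in zip(G, i))

def extension(G):
    """The set of instances that G represents."""
    return {i for i in product(COLORS, SHAPES, SIZES) if match(G, i)}

G1 = ({"red"}, "*", "*")                             # <red, *, *>
G2 = ({"blue"}, {"square", "triangle"}, {"small"})   # <blue, square∨triangle, small>

print(len(extension(G1)))                        # 9: all red objects (3 shapes x 3 sizes)
print(sorted(extension(G2)))                     # the two small blue squares/triangles
print(match(G1, ("red", "square", "small")))     # True
```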

Page 18: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Our First Learning Algorithm

• Definitions:
– T = set of training instances
– S = set of maximally specific generalizations consistent with T
– G = set of maximally general generalizations consistent with T

• Version Space Algorithm:
– S keeps generalizing to accommodate new positive instances
– G keeps specializing to avoid new negative instances
– Each changes only to the smallest extent necessary to maintain consistency with T
– G remains as general as possible and S remains as specific as possible

Page 19: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Version Space Learning

• Initialize G to the most general concept in the space
• Initialize S to the first positive training instance

• For each new positive training instance p:
– Delete all members of G that do not cover p
– For each s in S:
• If s does not cover p, replace s with its most specific generalizations that cover p
– Remove from S any element more general than some other element in S
– Remove from S any element not more specific than some element in G

• For each new negative training instance n:
– Delete all members of S that cover n
– For each g in G:
• If g covers n, replace g with its most general specializations that do not cover n
– Remove from G any element more specific than some other element in G
– Remove from G any element more specific than some element in S

• If G = S and both are singletons:
– A single concept consistent with the training data has been found

• If G and S become empty:
– There is no concept consistent with the training data

Page 20: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Visualizing the Version Space

[Figure: the version space depicted as the region between the Boundary of S (inner) and the Boundary of G (outer); '+' instances lie inside S, '-' instances lie outside G, and '?' instances lie between the two boundaries]

Page 21: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (I)

• Task:
– Classification of two-object diagrams

• Instance Language:
– AVL: Size x Color x Shape

• Generalization Language:
– AVL U {*}

• Training Instances:
– P1: {(Large Red Triangle) (Small Blue Circle)}
– P2: {(Large Blue Circle) (Small Red Triangle)}
– N1: {(Large Blue Triangle) (Small Blue Triangle)}

Page 22: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (II)

• Initialization:
– G0 = [{(* * *) (* * *)}]   most general
– S0 = the most specific concept

Page 23: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (III)

[Figure: the lattice of the 36 possible two-object instances, with the boundaries of S and G after initialization]

Page 24: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (IV)

• After P1:
– G1 = [{(* * *) (* * *)}]   no change
– S1 = [{(L R T) (S B C)}]   minimum generalization to cover P1

Page 25: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (V)

[Figure: the lattice of the 36 possible two-object instances, with the boundaries of S and G after processing P1]

Page 26: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (VI)

• After P2:
– G2 = [{(* * *) (* * *)}]   no change
– S2 = [{(L * *) (S * *)} {(* R T) (* B C)}]   minimum generalization to cover P2

S extends its boundary to cover the new positive instance, but only as far as needed – No zeal!

Page 27: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (VII)

[Figure: the lattice of the 36 possible two-object instances, with the boundaries of S and G after processing P2]

Page 28: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (VIII)

• After N1:
– G3 = [{(* R *) (* * *)} {(* * C) (* * *)}]   minimum specialization to exclude N1
– S3 = [{(* R T) (* B C)}]   remove inconsistent generalization

G contracts its boundary to exclude the new negative instance, but only as far as needed – No zeal!

Page 29: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (IX)

[Figure: the lattice of the 36 possible two-object instances, with the boundaries of S and G after processing N1]

Page 30: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Predicting with the Version Space

• A new instance is classified as positive if and only if it is covered by every generalization in the version space

• A new instance is classified as negative if and only if no generalization in the version space covers it

• If some, but not all, of the generalizations in the version space cover the new instance, then the instance cannot be classified with certainty (an estimated classification, based on the proportions of generalizations within the version space that do and do not cover the new instance, could be given)
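
A hedged sketch (not from the slides) of this decision rule using only the boundary sets: an instance is covered by every hypothesis in the version space iff it is covered by every member of S, and by none iff it is covered by no member of G. The `covers` argument is a coverage test such as the one in the earlier candidate-elimination sketch.

```python
def predict(S, G, x, covers):
    if all(covers(s, x) for s in S):
        return "+"                      # covered by every hypothesis in the version space
    if not any(covers(g, x) for g in G):
        return "-"                      # covered by no hypothesis in the version space
    return "?"                          # covered by some, but not all, hypotheses

# Usage (single-object hypotheses from the earlier sketch):
# predict([("*", "Red", "*")], [("*", "Red", "*")], ("Small", "Red", "Circle"), covers)  -> "+"
```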

Page 31: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

VS Example (X)

• The Version Space is:
– G3 = [{(* R *) (* * *)} {(* * C) (* * *)}]
– S3 = [{(* R T) (* B C)}]

• Predicting:
– {(Small Red Triangle) (Small Blue Circle)}: +
– {(Small Blue Triangle) (Large Red Circle)}: -
– {(Large Red Triangle) (Large Blue Triangle)}: ?

Page 32: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Taking Stock

• Does VS solve the Concept Learning problem?
– It produces generalizations
– The generalizations are consistent with T
– The generalizations extend beyond T

• Why/how does it work?
– Let's make some assumptions and replay

Page 33: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Unbiased Generalization Language

• A language such that every possible subset of instances can be represented• AVL U {*} is not unbiased

• Every replacement by * causes the representation of ALL of the values of the corresponding attribute

• AVL U {∨} is better but still not unbiased• It cannot represent {<red,square,small>,<blue,square,large>}

• In UGL, all subsets must have a representation• I.e., UGL = power set of the given instance language
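
For a rough sense of scale (these counts are mine, computed for the working example with 4 colors, 3 shapes, and 3 sizes, hence 36 distinct instances):

```latex
|I| = 4 \cdot 3 \cdot 3 = 36, \qquad
|\mathrm{UGL}| = 2^{36} \approx 6.9 \times 10^{10}, \qquad
|\mathrm{AVL} \cup \{*\}| = 5 \cdot 4 \cdot 4 = 80
```

So the unbiased language contains roughly a billion times as many generalizations as the biased AVL U {*}.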

Page 34: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Unbiased Generalization Procedure

• Uses the Unbiased Generalization Language (UGL)

• Computes the Version Space (VS) relative to UGL
– VS = set of all expressible generalizations consistent with the training instances (in case that was not already clear)

Page 35: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Claim

VS with UGL cannot solve part 2 of the Concept Learning problem, i.e., learning is limited to rote learning

Page 36: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Lemma 1

Any new instance, NI, is classified as positive if and only if NI is identical to some observed positive instance

Proof:

(⇐) If NI is identical to some observed positive instance, then NI is classified as positive
– Follows directly from the definition of VS

(⇒) If NI is classified as positive, then NI is identical to some observed positive instance
– Let g = {p : p is an observed positive instance}
• UGL ⇒ g ∈ VS
• NI matches all of VS ⇒ NI matches g ⇒ NI is an observed positive instance

Page 37: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Lemma 2

Any new instance, NI, is classified as negative if and only if NI is identical to some observed negative instance

Proof:

(⇐) If NI is identical to some observed negative instance, then NI is classified as negative
– Follows directly from the definition of VS

(⇒) If NI is classified as negative, then NI is identical to some observed negative instance
– Let G = {all subsets containing observed negative instances}
• UGL ⇒ G ∪ VS = UGL
• NI matches none in VS ⇒ NI was observed (as a negative instance)

Page 38: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Lemma 3

If NI is any instance which was not observed, then NI matches exactly one half of VS and cannot be classified

Proof: If NI was not observed, then NI matches exactly one half of VS, and so cannot be classified
– Let g = {p : p is an observed positive instance}
– Let G' = {all subsets of unobserved instances}
• UGL ⇒ VS = {g ∪ g' : g' ∈ G'}
• NI was not observed ⇒ NI matches exactly ½ of G' ⇒ NI matches exactly ½ of VS
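
A tiny enumeration (an illustration of my own, not on the slides) of Lemmas 1-3 on a four-instance space with one observed positive ('a') and one observed negative ('b'), taking the unbiased language to be all subsets:

```python
from itertools import chain, combinations

instances = ["a", "b", "c", "d"]
positives, negatives = {"a"}, {"b"}

def powerset(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# VS under UGL: every subset that contains all observed positives and no observed negatives
VS = [set(g) for g in powerset(instances)
      if positives <= set(g) and not (set(g) & negatives)]

pos = [i for i in instances if all(i in g for g in VS)]        # Lemma 1: classified positive
neg = [i for i in instances if not any(i in g for g in VS)]    # Lemma 2: classified negative
frac_c = sum("c" in g for g in VS) / len(VS)                   # Lemma 3: an unobserved instance
print(pos, neg, frac_c)   # ['a'] ['b'] 0.5
```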

Page 39: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Theorem

It follows directly from Lemmas 1-3 that:

An unbiased generalization procedure can never make the inductive leap necessary to classify instances beyond those it has observed

Page 40: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Another Way to Look at It…

There are 2^(2^n) Boolean functions of n inputs

x1 x2 x3 | Class
 0  0  0 |   1
 0  0  1 |   1
 0  1  0 |   1
 0  1  1 |   1
 1  0  0 |   0
 1  0  1 |   0
 1  1  0 |   0
 1  1  1 |   ?

[The slide's remaining columns enumerate the Possible Consistent Function Hypotheses]

What do we predict for 1 1 1? There are as many consistent functions predicting 0 as there are consistent functions predicting 1!
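
A hedged brute-force check of the claim (mine, not from the slides), for n = 3:

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))            # the 8 input combinations

observed = {                                        # the first seven rows of the table
    (0, 0, 0): 1, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 1,
    (1, 0, 0): 0, (1, 0, 1): 0, (1, 1, 0): 0,
}

# Every Boolean function of 3 inputs, encoded as its vector of 8 outputs
functions = list(product([0, 1], repeat=len(inputs)))

consistent = [f for f in functions
              if all(f[inputs.index(x)] == y for x, y in observed.items())]

votes = [f[inputs.index((1, 1, 1))] for f in consistent]
print(len(functions), len(consistent), votes.count(0), votes.count(1))
# 256 2 1 1  -> the surviving functions split evenly on the unobserved input
```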

Page 41: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Yet Another Way to Look at it…

• If there is no bias, the outcome of the learner is highly dependent on the training data, and thus there is much variance among the models induced from different sets of observations
– Learner memorizes (overfits)

• If there is a strong bias, the outcome of the learner is much less dependent on the training data, and thus there is little variance among induced models
– Learner ignores observations

• Formalized as the bias-variance decomposition of error
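
For reference (the formula itself is not on the slide), the decomposition for the squared error of a hypothesis h_D trained on data set D, with target f and expectations taken over training sets:

```latex
\mathbb{E}_{D}\!\left[(h_D(x) - f(x))^2\right]
  = \underbrace{\left(\mathbb{E}_{D}[h_D(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_{D}\!\left[(h_D(x) - \mathbb{E}_{D}[h_D(x)])^2\right]}_{\text{variance}}
```

An additional irreducible noise term appears when the labels themselves are noisy.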

Semmelweis’ Experience

Page 42: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

What is Bias?

BIAS: Any basis for choosing one decision over another, other than strict consistency with past observations

Page 43: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Going Back to Our Question

• Our example worked BECAUSE the generalization language (AVL U {*}) was not unbiased!

• In fact, we have just shown that:

If a learning system is to be useful, it must have some form of bias

Page 44: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Humans as Learning Systems

• Do we have biases? – All kinds!!!

• Our/the representation language cannot express all possible classes of observations

• Our/the generalization procedure is biased
– Domain knowledge (e.g., double bonds rarely break)
– Intended use (e.g., ICU – relative cost)
– Shared assumptions (e.g., crown, bridge – dentistry)
– Simplicity and generality (e.g., white men can't jump)
– Analogy (e.g., heat vs. water flow, thin ice)
– Commonsense (e.g., social interactions, pain, etc.)

Survey Exercise

Page 45: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Our First Lesson

• The power of a generalization system follows directly from its biases
• Absence of bias = rote learning

• Progress towards understanding learning mechanisms depends upon understanding the sources of, and justification for, various biases

We will consider these issues for every algorithm we will study

Page 46: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Are There Better Biases?

Page 47: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

No Free Lunch Theorem

A.k.a. the Law of Conservation for Generalization Performance (LCG)

GP = Accuracy – 50%

When taken across all learning tasks, the generalization performance of any learner sums to 0
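
Stated symbolically (the notation here is mine, not from the slide), with accuracy measured off the training set and the sum ranging over all possible target functions f:

```latex
\mathrm{GP}(L, f) = \mathrm{Acc}_{\mathrm{OTS}}(L, f) - \tfrac{1}{2},
\qquad \sum_{f} \mathrm{GP}(L, f) = 0
```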

Page 48: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

NFL Intuition (I)

Page 49: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

NFL Intuition (II)

Page 50: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

NFL Intuition (III)

Page 51: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Second Lesson

Whenever a learning algorithm performs well on some function, as measured by OTS (off-training-set) generalization, it must perform poorly on some other(s)

In other words, there is no universal learner, or best bias!

Page 52: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Impact on Users

Page 53: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Towards a Solution

Page 54: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

Taking Stock

• We will study a number of learning algorithms

• We promise to:
– Discuss their language and procedural biases
– Always remember NFL

Page 55: START OF DAY 1 Reading: Chap. 1 & 2. Introduction

END OF DAY 1

Homework: Thought Questions