Bayesian models of inductive learning

Tom Griffiths, UC Berkeley
Josh Tenenbaum, MIT
Charles Kemp, MIT


Page 1: Bayesian models of  inductive learning

Bayesian models of inductive learning

Tom Griffiths, UC Berkeley

Josh Tenenbaum, MIT

Charles Kemp, MIT

Page 2: Bayesian models of  inductive learning

What to expect

• What you’ll get out of this tutorial:
  – Our view of what Bayesian models have to offer cognitive science.
  – In-depth examples of basic and advanced models: how the math works & what it buys you.
  – Some (not extensive) comparison to other approaches.
  – Opportunities to ask questions.

• What you won’t get:
  – Detailed, hands-on how-to.

• Where you can learn more:
  – http://bayesiancognition.com
  – Trends in Cognitive Sciences, July 2006, special issue on “Probabilistic Models of Cognition”.

Page 3: Bayesian models of  inductive learning

Outline

• Morning
  – Introduction: Why Bayes? (Josh)
  – Basics of Bayesian inference (Josh)
  – Graphical models, causal inference and learning (Tom)

• Afternoon
  – Hierarchical Bayesian models, property induction, and learning domain structures (Charles)
  – Methods of approximate learning and inference, probabilistic models of semantic memory (Tom)

Page 4: Bayesian models of  inductive learning

Why Bayes?

• The problem of induction
  – How does the mind form inferences, generalizations, models or theories about the world from impoverished data?

• Induction is ubiquitous in cognition
  – Vision (+ audition, touch, or other perceptual modalities)
  – Language (understanding, production)
  – Concepts (semantic knowledge, “common sense”)
  – Causal learning and reasoning
  – Decision-making and action (production, understanding)

• Bayes gives a general framework for explaining how induction can work in principle, and perhaps, how it does work in the mind…

Page 5: Bayesian models of  inductive learning

[Figure: phrase structure parse tree with nodes S, NP, VP, Det, Adj, Noun, RelClause, Rel, V over bracketed words]

Phrase structure S

Utterance U

Grammar G

P(S | G)

P(U | S)

P(S | U, G) ∝ P(U | S) × P(S | G)

Bottom-up Top-down


Page 6: Bayesian models of  inductive learning

[Figure: the same phrase structure parse tree, now embedded in a hierarchy from grammar down to speech signal]

Phrase structure

Utterance

Speech signal

Grammar

“Universal Grammar”: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)

P(phrase structure | grammar)

P(utterance | phrase structure)

P(speech | utterance)

(cf. Chater and Manning, 2006)

P(grammar | UG)

Page 7: Bayesian models of  inductive learning

The approach

• Key concepts
  – Inference in probabilistic generative models
  – Hierarchical probabilistic models, with inference at all levels of abstraction
  – Structured knowledge representations: graphs, grammars, predicate logic, schemas, theories
  – Flexible structures, with complexity constrained by Bayesian Occam’s razor
  – Approximate methods of learning and inference: Expectation-Maximization (EM), Markov chain Monte Carlo (MCMC)

• Much recent progress!
  – Computational resources to implement and test models that we could dream up but not realistically imagine working with
  – New theoretical tools let us develop models that we could not clearly conceive of before.

Page 8: Bayesian models of  inductive learning

(Han and Zhu, 2006)

Vision as probabilistic parsing

Page 9: Bayesian models of  inductive learning
Page 10: Bayesian models of  inductive learning

Word learning on planet Gazoob

“tufa”

“tufa”

“tufa”

Can you pick out the tufas?

Page 11: Bayesian models of  inductive learning

Principles

Structure

Data

Whole-object principle, Shape bias, Taxonomic principle, Contrast principle, Basic-level bias

Learning word meanings

Page 12: Bayesian models of  inductive learning

Causal learning and reasoning

Principles

Structure

Data

Page 13: Bayesian models of  inductive learning

Goal-directed action (production and comprehension)

(Wolpert et al., 2003)

Page 14: Bayesian models of  inductive learning

Marr’s Three Levels of Analysis

• Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?”

• Algorithm: Cognitive psychology

• Implementation: Neurobiology

Page 15: Bayesian models of  inductive learning

Alternative approaches to inductive learning and inference

• Associative learning
• Connectionist networks
• Similarity to examples
• Toolkit of simple heuristics
• Constraint satisfaction
• Analogical mapping

Page 16: Bayesian models of  inductive learning

Summary: Why Bayes?

• A unifying framework for explaining cognition.
  – How people can learn so much from such limited data.
  – Strong quantitative models with minimal ad hoc assumptions.
  – Why algorithmic-level models work the way they do.

• A framework for understanding how structured knowledge and statistical inference interact.
  – How structured knowledge guides statistical inference, and may itself be acquired through statistical means.
  – What forms knowledge takes, at multiple levels of abstraction.
  – What knowledge must be innate, and what can be learned.
  – How flexible knowledge structures may grow as required by the data, with complexity controlled by Occam’s razor.

Page 17: Bayesian models of  inductive learning

Outline

• Morning
  – Introduction: Why Bayes? (Josh)
  – Basics of Bayesian inference (Josh)
  – Graphical models, causal inference and learning (Tom)

• Afternoon
  – Hierarchical Bayesian models, property induction, and learning domain structures (Charles)
  – Methods of approximate learning and inference, probabilistic models of semantic memory (Tom)

Page 18: Bayesian models of  inductive learning

Bayes’ rule

For any hypothesis h and data d,

P(h | d) = P(d | h) P(h) / Σ_{h′ ∈ H} P(d | h′) P(h′)

posterior probability = likelihood × prior probability, divided by a sum over the space of alternative hypotheses

Page 19: Bayesian models of  inductive learning

Bayesian inference

• Bayes’ rule:

  P(h | d) = P(d | h) P(h) / Σ_i P(d | h_i) P(h_i)

• An example
  – Data: John is coughing
  – Some hypotheses:
    1. John has a cold
    2. John has emphysema
    3. John has a stomach flu
  – Prior P(h) favors 1 and 3 over 2
  – Likelihood P(d | h) favors 1 and 2 over 3
  – Posterior P(h | d) favors 1 over 2 and 3
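As a concrete illustration of the rule above, here is a minimal Python sketch of posterior computation over a discrete hypothesis space, applied to the coughing example. The prior and likelihood values are hypothetical illustrative numbers, not values given in the tutorial.

```python
# Bayes' rule over a discrete hypothesis space (illustrative numbers only).

def posterior(priors, likelihoods):
    """Return P(h | d) for each hypothesis h, given P(h) and P(d | h)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())          # sum over alternative hypotheses
    return {h: joint[h] / evidence for h in joint}

priors = {"cold": 0.5, "emphysema": 0.01, "stomach flu": 0.49}      # hypothetical P(h)
likelihoods = {"cold": 0.8, "emphysema": 0.9, "stomach flu": 0.1}   # hypothetical P(coughing | h)

print(posterior(priors, likelihoods))
# "cold" ends up with the highest posterior: it is favored by both prior and likelihood.
```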

Page 20: Bayesian models of  inductive learning

Coin flipping

• Basic Bayes
  – data = HHTHT or HHHHH
  – compare two simple hypotheses:
    P(H) = 0.5 vs. P(H) = 1.0

• Parameter estimation (model fitting)
  – compare many hypotheses in a parameterized family
    P(H) = θ: infer θ

• Model selection
  – compare qualitatively different hypotheses, often varying in complexity:
    P(H) = 0.5 vs. P(H) = θ

Page 21: Bayesian models of  inductive learning

Coin flipping

HHTHT
HHHHH

What process produced these sequences?

Page 22: Bayesian models of  inductive learning

Comparing two simple hypotheses

• Contrast simple hypotheses:
  – h1: “fair coin”, P(H) = 0.5
  – h2: “always heads”, P(H) = 1.0

• Bayes’ rule:

  P(h | d) = P(d | h) P(h) / Σ_i P(d | h_i) P(h_i)

• With two hypotheses, use the odds form

Page 23: Bayesian models of  inductive learning

Comparing two simple hypotheses

D: HHTHT
H1, H2: “fair coin”, “always heads”

P(D | H1) = 1/2^5     P(H1) = ?
P(D | H2) = 0         P(H2) = 1 − ?

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] × [P(H1) / P(H2)]

Page 24: Bayesian models of  inductive learning

Comparing two simple hypotheses

D: HHTHT
H1, H2: “fair coin”, “always heads”

P(D | H1) = 1/2^5     P(H1) = 999/1000
P(D | H2) = 0         P(H2) = 1/1000

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] × [P(H1) / P(H2)]
                      = [(1/32) / 0] × [999 / 1] = infinity

Page 25: Bayesian models of  inductive learning

Comparing two simple hypotheses

D: HHHHH
H1, H2: “fair coin”, “always heads”

P(D | H1) = 1/2^5     P(H1) = 999/1000
P(D | H2) = 1         P(H2) = 1/1000

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] × [P(H1) / P(H2)]
                      = [(1/32) / 1] × [999 / 1] ≈ 30

Page 26: Bayesian models of  inductive learning

Comparing two simple hypotheses

D: HHHHHHHHHH (ten heads)
H1, H2: “fair coin”, “always heads”

P(D | H1) = 1/2^10    P(H1) = 999/1000
P(D | H2) = 1         P(H2) = 1/1000

P(H1 | D) / P(H2 | D) = [P(D | H1) / P(D | H2)] × [P(H1) / P(H2)]
                      = [(1/1024) / 1] × [999 / 1] ≈ 1
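The three comparisons above can be reproduced with a few lines of Python; the likelihoods and priors are exactly the values from these slides.

```python
# Posterior odds for h1 = "fair coin" (P(H) = 0.5) vs. h2 = "always heads" (P(H) = 1.0),
# with priors P(h1) = 999/1000 and P(h2) = 1/1000, as on the slides.

def posterior_odds(data, prior_h1=999/1000, prior_h2=1/1000):
    n_heads = data.count("H")
    n_tails = data.count("T")
    like_h1 = 0.5 ** (n_heads + n_tails)        # fair coin
    like_h2 = 1.0 if n_tails == 0 else 0.0      # always heads
    if like_h2 == 0:
        return float("inf")
    return (like_h1 / like_h2) * (prior_h1 / prior_h2)

print(posterior_odds("HHTHT"))    # inf: a single tail rules out "always heads"
print(posterior_odds("HHHHH"))    # (1/32) * 999 ≈ 31: still favors the fair coin
print(posterior_odds("H" * 10))   # (1/1024) * 999 ≈ 1: the evidence is now roughly balanced
```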

Page 27: Bayesian models of  inductive learning

The role of intuitive theories

The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works.
  – Easy to imagine how a trick all-heads coin could work: high prior probability.
  – Hard to imagine how a trick “HHTHT” coin could work: low prior probability.

Page 28: Bayesian models of  inductive learning

Coin flipping

• Basic Bayes
  – data = HHTHT or HHHHH
  – compare two hypotheses:
    P(H) = 0.5 vs. P(H) = 1.0

• Parameter estimation (model fitting)
  – compare many hypotheses in a parameterized family
    P(H) = θ: infer θ

• Model selection
  – compare qualitatively different hypotheses, often varying in complexity:
    P(H) = 0.5 vs. P(H) = θ

Page 29: Bayesian models of  inductive learning

Parameter estimation

• Assume data are generated from a parameterized model:

  [Graphical model: parameter θ, with P(H) = θ, generating flips d1 d2 d3 d4]

• What is the value of θ?
  – each value of θ is a hypothesis H
  – requires inference over infinitely many hypotheses

Page 30: Bayesian models of  inductive learning

Model selection

• Assume a hypothesis space of possible models (sketched in code below):
  – Fair coin: P(H) = 0.5, generating flips d1 d2 d3 d4
  – Coin of unknown weight: P(H) = θ, generating flips d1 d2 d3 d4
  – Hidden Markov model: hidden states s1 s2 s3 s4, with si ∈ {Fair coin, Trick coin}, each generating a flip di

• Which model generated the data?
  – requires summing out hidden variables
  – requires some form of Occam’s razor to trade off complexity with fit to the data.
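Below is a small sketch of sampling from each of the three candidate generative models; the trick-coin bias and state-switching probability in the hidden Markov model are illustrative values of my own, not values from the tutorial.

```python
import random

def fair_coin(n):
    # Fair coin: P(H) = 0.5 on every flip
    return ["H" if random.random() < 0.5 else "T" for _ in range(n)]

def weighted_coin(n, theta):
    # Coin of unknown weight: P(H) = theta (theta would itself be inferred)
    return ["H" if random.random() < theta else "T" for _ in range(n)]

def hmm_coin(n, switch_prob=0.1, trick_bias=0.9):
    # Hidden Markov model: hidden state s_i in {"fair", "trick"}; illustrative parameters
    state, flips = "fair", []
    for _ in range(n):
        if random.random() < switch_prob:
            state = "trick" if state == "fair" else "fair"
        p_heads = 0.5 if state == "fair" else trick_bias
        flips.append("H" if random.random() < p_heads else "T")
    return flips

print(fair_coin(10))
print(weighted_coin(10, 0.8))
print(hmm_coin(10))
```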

Page 31: Bayesian models of  inductive learning

Parameter estimation vs. Model selection across learning and development

• Causality: learning the strength of a relation vs. learning the existence and form of a relation

• Language acquisition: learning a speaker's accent, or frequencies of different words vs. learning a new tense or syntactic rule (or learning a new language, or the existence of different languages)

• Concepts: learning what horses look like vs. learning that there is a new species (or learning that there are species)

• Intuitive physics: learning the mass of an object vs. learning about gravity or angular momentum

• Intuitive psychology: learning a person’s beliefs or goals vs. learning that there can be false beliefs, or that visual access is valuable for establishing true beliefs

Page 32: Bayesian models of  inductive learning

A hierarchical learning framework

[Hierarchy: model M → parameters w → data D]

Parameter estimation:

p(w | D, M) ∝ p(D | w, M) p(w | M)

Page 33: Bayesian models of  inductive learning

A hierarchical learning framework

[Hierarchy: model class C → model M → parameters w → data D]

Parameter estimation:

p(w | D, M) ∝ p(D | w, M) p(w | M)

Model selection:

p(M | D, C) ∝ p(D | M) p(M | C)

p(D | M) = Σ_w p(D | w, M) p(w | M)

Page 34: Bayesian models of  inductive learning

Bayesian parameter estimation

• Assume data are generated from a model:

  [Graphical model: parameter θ, with P(H) = θ, generating flips d1 d2 d3 d4]

• What is the value of θ?
  – each value of θ is a hypothesis H
  – requires inference over infinitely many hypotheses

Page 35: Bayesian models of  inductive learning

Some intuitions

• D = 10 flips, with 5 heads and 5 tails.
• θ = P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• Why? “The future will be like the past.”

• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.

Page 36: Bayesian models of  inductive learning

Integrating prior knowledge and data

• Posterior distribution P(θ | D) is a probability density over θ = P(H)

• Need to work out the likelihood P(D | θ) and specify the prior distribution P(θ)

P(θ | D) = P(D | θ) P(θ) / P(D)

Page 37: Bayesian models of  inductive learning

Likelihood and prior

• Likelihood: Bernoulli distribution
  P(D | θ) = θ^NH (1−θ)^NT
  – NH: number of heads
  – NT: number of tails

• Prior: P(θ) = ?

Page 38: Bayesian models of  inductive learning

Some intuitions

• D = 10 flips, with 5 heads and 5 tails.
• θ = P(H) on next flip? 50%
• Why? 50% = 5 / (5+5) = 5/10.
• Why? Maximum likelihood:

  θ̂ = argmax_θ P(D | θ)

• Suppose we had seen 4 heads and 6 tails.
• P(H) on next flip? Closer to 50% than to 40%.
• Why? Prior knowledge.

Page 39: Bayesian models of  inductive learning

A simple method of specifying priors

• Imagine some fictitious trials, reflecting a set of previous experiences
  – a strategy often used with neural networks or for building invariance into machine vision.

• e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair

• In fact, this is a sensible statistical idea...

Page 40: Bayesian models of  inductive learning

Likelihood and prior

• Likelihood: Bernoulli(θ) distribution
  P(D | θ) = θ^NH (1−θ)^NT
  – NH: number of heads
  – NT: number of tails

• Prior: Beta(FH, FT) distribution
  P(θ) ∝ θ^(FH−1) (1−θ)^(FT−1)
  – FH: fictitious observations of heads
  – FT: fictitious observations of tails

Page 41: Bayesian models of  inductive learning

Shape of the Beta prior

Page 42: Bayesian models of  inductive learning

Shape of the Beta prior

[Four panels: FH = 0.5, FT = 0.5; FH = 0.5, FT = 2; FH = 2, FT = 0.5; FH = 2, FT = 2]
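A quick way to see these shapes is to plot the four Beta densities directly; here is a sketch using scipy and matplotlib (my tooling choice, not the tutorial's).

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

theta = np.linspace(0.001, 0.999, 200)   # avoid the endpoints, where some densities diverge
for fh, ft in [(0.5, 0.5), (0.5, 2), (2, 0.5), (2, 2)]:
    plt.plot(theta, beta.pdf(theta, fh, ft), label=f"FH={fh}, FT={ft}")
plt.xlabel("theta = P(H)")
plt.ylabel("prior density")
plt.legend()
plt.show()
```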

Page 43: Bayesian models of  inductive learning

Bayesian parameter estimation

P(θ | D) ∝ P(D | θ) P(θ) = θ^(NH+FH−1) (1−θ)^(NT+FT−1)

• Posterior is Beta(NH+FH, NT+FT)
  – same form as the prior!
  – expected P(H) = (NH+FH) / (NH+FH+NT+FT)
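A minimal sketch of this posterior-predictive rule in Python, checked against the worked numbers on the following slides.

```python
# Posterior predictive P(H on next flip) under a Beta(FH, FT) prior:
# the mean of the Beta(NH + FH, NT + FT) posterior.

def predict_heads(n_heads, n_tails, fict_heads, fict_tails):
    return (n_heads + fict_heads) / (n_heads + fict_heads + n_tails + fict_tails)

# Numbers from the "Some examples" and thumbtack slides:
print(predict_heads(4, 6, 1000, 1000))   # 1004 / 2010 = 0.4995  (strong fair-coin prior)
print(predict_heads(4, 6, 3, 3))         # 7 / 16 = 0.4375       (weak fair-coin prior)
print(predict_heads(2, 0, 4, 3))         # 6 / 9 ≈ 0.67          (thumbtack, weak heads bias)
```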

Page 44: Bayesian models of  inductive learning

Conjugate priors

• A prior p(θ) is conjugate to a likelihood function p(D | θ) if the posterior has the same functional form as the prior.
  – Parameter values in the prior can be thought of as a summary of “fictitious observations”.
  – Different parameter values in the prior and posterior reflect the impact of observed data.
  – Conjugate priors exist for many standard models (e.g., all exponential family models).

Page 45: Bayesian models of  inductive learning

Some examples

• e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004+1006) = 49.95%

• e.g., F = {3 heads, 3 tails} ~ weak expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75%

Prior knowledge too weak

Page 46: Bayesian models of  inductive learning

But… flipping thumbtacks

• e.g., F = {4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads
• After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6+3) = 67%

• Some prior knowledge is always necessary to avoid jumping to hasty conclusions...

• Suppose F = { }: after seeing 1 head, 0 tails, P(H) on next flip = 1 / (1+0) = 100%

Page 47: Bayesian models of  inductive learning

Origin of prior knowledge

• Tempting answer: prior experience
• Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails

Page 48: Bayesian models of  inductive learning

Problems with simple empiricism

• Haven’t really seen 2000 coin flips, or any flips of a thumbtack
  – Prior knowledge is stronger than raw experience justifies

• Haven’t seen an exactly equal number of heads and tails
  – Prior knowledge is smoother than raw experience justifies

• Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins
  – Prior knowledge is more structured than raw experience justifies

Page 49: Bayesian models of  inductive learning

A simple theory

• “Coins are manufactured by a standardized procedure that is effective but not perfect, and symmetric with respect to heads and tails. Tacks are asymmetric, and manufactured to less exacting standards.”
  – Justifies generalizing from previous coins to the present coin.
  – Justifies smoother and stronger prior than raw experience alone.
  – Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin.

Page 50: Bayesian models of  inductive learning

A hierarchical Bayesian model

[Hierarchical model: physical knowledge → Coins (FH, FT) → per-coin weights θ1, θ2, …, θ200 ~ Beta(FH, FT) → flips d1 d2 d3 d4 for each of Coin 1, Coin 2, …, Coin 200]

• Qualitative physical knowledge (symmetry) can influence estimates of continuous parameters (FH, FT).

• Explains why 10 flips of 200 coins are better than 2000 flips of a single coin: more informative about FH, FT.


Page 51: Bayesian models of  inductive learning

Stability versus Flexibility

• Can all domain knowledge be represented with conjugate priors?

• Suppose you flip a coin 25 times and get all heads. Something funny is going on …

• But with F = {1000 heads, 1000 tails}, P(heads) on next flip = 1025 / (1025+1000) = 50.6%. Looks like nothing unusual.

• How do we balance stability and flexibility?
  – Stability: 6 heads, 4 tails ~ 0.5
  – Flexibility: 25 heads, 0 tails ~ 1

Page 52: Bayesian models of  inductive learning

A hierarchical Bayesian model

[Hierarchy: fair/unfair? → FH, FT → θ → flips d1 d2 d3 d4]

• Higher-order hypothesis: is this coin fair or unfair?

• Example probabilities:
  – P(fair) = 0.99
  – P(θ | fair) is Beta(1000, 1000)
  – P(θ | unfair) is Beta(1, 1)

• 25 heads in a row propagates up, affecting θ and then P(fair | D)

P(fair | 25 heads) / P(unfair | 25 heads) = [P(25 heads | fair) / P(25 heads | unfair)] × [P(fair) / P(unfair)] ≈ 9 × 10^−5

P(D | fair) = ∫₀¹ P(D | θ) p(θ | fair) dθ
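A sketch of this computation in Python: the marginal likelihood under each hypothesis integrates the Bernoulli likelihood against the corresponding Beta prior, which reduces to a ratio of Beta functions (evaluated here in log space with scipy).

```python
from math import exp
from scipy.special import betaln

def marginal_likelihood(n_heads, n_tails, fh, ft):
    # P(D | hypothesis) = B(NH + FH, NT + FT) / B(FH, FT) for a Beta(FH, FT) prior
    return exp(betaln(n_heads + fh, n_tails + ft) - betaln(fh, ft))

p_fair, p_unfair = 0.99, 0.01
like_fair = marginal_likelihood(25, 0, 1000, 1000)   # P(theta | fair) = Beta(1000, 1000)
like_unfair = marginal_likelihood(25, 0, 1, 1)       # P(theta | unfair) = Beta(1, 1)

odds = (like_fair * p_fair) / (like_unfair * p_unfair)
print(odds)   # ≈ 9e-5, as on the slide: 25 heads in a row makes "fair" very unlikely
```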

Page 53: Bayesian models of  inductive learning

Summary: Bayesian parameter estimation

• Learning the parameters of a generative model as Bayesian inference.

• Conjugate priors
  – an elegant way to represent simple kinds of prior knowledge.

• Hierarchical Bayesian models
  – integrate knowledge across instances of a system, or different systems within a domain.
  – can represent richer, more abstract knowledge.

Page 54: Bayesian models of  inductive learning

Some questions

• Learning isn’t just about parameter estimation
  – How do we learn the functional form of a variable’s distribution?
  – How do we learn model structure, or theories with the expressiveness of predicate logic?

• Can we “grow” levels of abstraction?

Page 55: Bayesian models of  inductive learning

A hierarchical learning framework

[Hierarchy: model class C → model M → parameters w → data D]

Model fitting:

p(w | D, M) ∝ p(D | w, M) p(w | M)

Model selection:

p(M | D, C) ∝ p(D | M) p(M | C)

p(D | M) = Σ_w p(D | w, M) p(w | M)

Page 56: Bayesian models of  inductive learning

Bayesian model selection

• Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = θ?

  Fair coin, P(H) = 0.5 (generating d1 d2 d3 d4)   vs.   P(H) = θ (generating d1 d2 d3 d4)

Page 57: Bayesian models of  inductive learning

Comparing simple and complex hypotheses

• P(H) = θ is more complex than P(H) = 0.5 in two ways:
  – P(H) = 0.5 is a special case of P(H) = θ
  – for any observed sequence D, we can choose θ such that D is more probable than if P(H) = 0.5

Page 58: Bayesian models of  inductive learning

Comparing simple and complex hypotheses

[Plot: probability of D = HHHHH as a function of θ, P(D | θ) = θ^n (1−θ)^(N−n), with θ = 0.5 marked]

Page 59: Bayesian models of  inductive learning

Comparing simple and complex hypotheses

[Plot: P(D | θ) = θ^n (1−θ)^(N−n) for D = HHHHH, with θ = 0.5 and θ = 1.0 marked]

Page 60: Bayesian models of  inductive learning

Comparing simple and complex hypotheses

[Plot: P(D | θ) = θ^n (1−θ)^(N−n) for D = HHTHT, with θ = 0.5 and θ = 0.6 marked]

Page 61: Bayesian models of  inductive learning

Comparing simple and complex hypotheses

• P(H) = θ is more complex than P(H) = 0.5 in two ways:
  – P(H) = 0.5 is a special case of P(H) = θ
  – for any observed sequence X, we can choose θ such that X is more probable than if P(H) = 0.5

• How can we deal with this?
  – Some version of Occam’s razor?
  – Bayes: an automatic version of Occam’s razor follows from the “law of conservation of belief”.

Page 62: Bayesian models of  inductive learning

Comparing simple and complex hypotheses

P(h1 | D) / P(h0 | D) = [P(D | h1) / P(D | h0)] × [P(h1) / P(h0)]

P(D | h1) = ∫₀¹ P(D | θ) p(θ | h1) dθ

The “evidence” or “marginal likelihood”: the probability that randomly selected parameters from the prior would generate the data.

P(D | h0) = (1/2)^n (1 − 1/2)^(N−n) = (1/2)^N
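The slides do not pin down the prior p(θ | h1) at this point; assuming a uniform Beta(1, 1) prior, the marginal likelihood has the closed form n!(N−n)!/(N+1)! for n heads in N flips, and the comparison can be sketched as follows.

```python
# Bayesian Occam's razor for the coin: evidence of the flexible model h1
# (uniform prior on theta, an assumption) vs. the fair-coin model h0.
from math import factorial

def evidence_h1(n_heads, n_flips):
    # P(D | h1) = integral of theta^n (1-theta)^(N-n) dtheta = n! (N-n)! / (N+1)!
    return factorial(n_heads) * factorial(n_flips - n_heads) / factorial(n_flips + 1)

def evidence_h0(n_flips):
    return 0.5 ** n_flips

for data in ["HHHHH", "HHTHT"]:
    n, N = data.count("H"), len(data)
    print(data, evidence_h1(n, N) / evidence_h0(N))
# HHHHH: (1/6) / (1/32) ≈ 5.3  -- the flexible model wins
# HHTHT: (1/60) / (1/32) ≈ 0.5 -- the simple fair-coin model wins
```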

Page 63: Bayesian models of  inductive learning

[Plot: log [ P(D | h1) / P(D | h0) ] as the data vary]

P(D | h1) = ∫₀¹ P(D | θ) p(θ | h1) dθ

P(D | h0) = (1/2)^N

Page 64: Bayesian models of  inductive learning

Bayesian Occam’s Razor

[Plot: p(D = d | M) over all possible data sets d, for models M1 and M2]

For any model M,

Σ_{d ∈ all data sets} p(D = d | M) = 1

Law of “conservation of belief”: A model that can predict many possible data sets must assign each of them low probability.
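This conservation law is easy to check numerically. The sketch below sums p(D = d | M) over every length-5 sequence, both for the fair-coin model and for the flexible model (again assuming a uniform prior on θ).

```python
from itertools import product
from math import factorial

def p_fair(seq):
    return 0.5 ** len(seq)

def p_flexible(seq):
    # marginal likelihood under a uniform prior on theta (assumed)
    n, N = seq.count("H"), len(seq)
    return factorial(n) * factorial(N - n) / factorial(N + 1)

N = 5
sequences = ["".join(s) for s in product("HT", repeat=N)]
print(sum(p_fair(s) for s in sequences))       # 1.0
print(sum(p_flexible(s) for s in sequences))   # 1.0: the flexible model spreads its belief thinner
```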

Page 65: Bayesian models of  inductive learning

Occam’s Razor in curve fitting

Page 66: Bayesian models of  inductive learning
Page 67: Bayesian models of  inductive learning

[Plot: p(D = d | M) over all possible data sets d for models M1, M2, M3, with the observed data marked; Σ_{d ∈ all data sets} p(D = d | M) = 1 for each model]

M1: A model that is too simple is unlikely to generate the data.
M3: A model that is too complex can generate many possible data sets, so it is unlikely to generate this particular data set at random.

Page 68: Bayesian models of  inductive learning

[Plot: p(D = d | M) over all possible data sets d for models M1, M2, M3, with the observed data marked; Σ_{d ∈ all data sets} p(D = d | M) = 1]

For curve fitting:

p(D | M) = p(y | x, M) = ∫ p(y | x, θ, M) p(θ | M) dθ

[assume Gaussian parameter priors, Gaussian likelihoods (noise)]
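Under exactly these Gaussian assumptions the marginal likelihood can be evaluated in closed form: the data vector is itself Gaussian, with covariance determined by the design matrix, the prior variance, and the noise variance. The sketch below compares polynomial models of different complexity; the data, noise level, and prior scale are illustrative choices of my own.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 15)
y = 1.5 * x + 0.3 + 0.1 * rng.standard_normal(x.size)   # roughly linear data (illustrative)

def log_evidence(x, y, degree, noise_var=0.01, prior_var=1.0):
    # y = Phi w + noise, w ~ N(0, prior_var I)  =>  y ~ N(0, prior_var Phi Phi^T + noise_var I)
    Phi = np.vander(x, degree + 1)                        # polynomial design matrix
    cov = prior_var * Phi @ Phi.T + noise_var * np.eye(x.size)
    return multivariate_normal(mean=np.zeros(x.size), cov=cov).logpdf(y)

for degree in [0, 1, 6]:   # too simple, about right, too complex
    print(degree, log_evidence(x, y, degree))
# The degree-1 model typically gets the highest evidence: the constant model cannot
# fit the trend, and the degree-6 model spreads its probability over many data sets.
```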

Page 69: Bayesian models of  inductive learning

For the best-fitting version of each model:

  M1: prior high,                  likelihood low
  M2: prior medium,                likelihood high
  M3: prior very, very, very low,  likelihood very high
Page 70: Bayesian models of  inductive learning

(assuming Gaussian noise, and Gaussian priors on parameters)

(Ghahramani)

Page 71: Bayesian models of  inductive learning

(Ghahramani)

Page 72: Bayesian models of  inductive learning

Hierarchical Bayesian learning with flexibly structured models

• Learning context-free grammars for natural language (Stolcke & Omohundro; Griffiths and Johnson; Perfors et al.).

• Learning complex concepts.

“fruit”

Navarro (2006): Nonparametric model

Goodman et al. (2006): Probabilistic context-free grammar for rule-based concepts

“fruit” <= or( and( color > 0.2, color < 0.4, size > 1, size < 4 ), and( color > 0.5, color < 0.65, size > 2, size < 7), …)

Page 73: Bayesian models of  inductive learning

The “blessing of abstraction”

• Often easier to learn at higher levels of abstraction
  – Easier to learn that you have a biased coin than to learn its bias.
  – Easier to learn causal structure than causal strength.
  – Easier to learn that you are hearing two languages (vs. one), or to learn that language has a hierarchical phrase structure, than to learn how any one language works.

• Why? Hypothesis space gets smaller as you go up.
  – But the total hypothesis space gets bigger when we add levels of abstraction (e.g., model selection).
  – Can make better (more confident, more accurate) predictions by increasing the size of the hypothesis space, if we introduce good inductive biases.

Page 74: Bayesian models of  inductive learning

Summary

• Three kinds of Bayesian inference
  – Comparing two simple hypotheses
  – Parameter estimation
    • The importance and subtlety of prior knowledge
  – Model selection
    • Bayesian Occam’s razor, the blessing of abstraction

• Key concepts
  – Probabilistic generative models
  – Hierarchies of abstraction, with statistical inference at all levels
  – Flexibly structured representations