April 3, 2014 slides, Mayo

DESCRIPTION

Phil 6334, D. Mayo: Duhem's Problem and more... (April 3, 2014 slides)

TRANSCRIPT

Page 1

April 3, 2014 Phil 6334

The Howson (1997) paper was invited as a discussion on the general topic of a paper of mine (and one other paper): “Duhem's Problem, the Bayesian Way, and Error Statistics, or ‘What’s Belief Got to Do With It?’”

The physicist can never submit an isolated hypothesis to the control of experiment, but only a whole group of hypotheses. When experiment is in disagreement with his predictions, it teaches him that one at least of the hypotheses that constitute this group is wrong and must be modified. But experiment does not show him the one that must be changed. (Duhem 1954, p. 185)

Page 2

“...is to point out that the Bayesian personalist approach to scientific inference provides a ...solution to this [Duhem] puzzle by telling us exactly when [disregarding unsuccessful predictions] can be reconstructed as rational and when it has to be deemed irrational. Rationality here, for the Bayesian, simply means conformity with [Bayes'] theorem” (Dorling 1979, p. 177).

Bayes' Theorem (one form):

P(H|e) = P(e|H)P(H) / [P(e|H)P(H) + P(e|not-H)P(not-H)]

The Bayesian “catchall factor”: P(e|not-H).

Page 3

1. Dorling’s Homework Problem

(I) The components:

Hypothesis H: Newton’s theory of motion and gravitation

e: the predicted secular acceleration of the moon

e’: the observed acceleration of the moon (the anomalous result)

Auxiliary hypothesis A: the effects of tidal friction are not of a sufficient order of magnitude to affect the acceleration of the moon.

H and A entail e, but e’ is observed.

Page 4

(II) An informal (and neutral) description of a situation where anomaly e’ indicates (or is best explained by) auxiliary A being in error:

(1) there is a great deal of evidence in favor of a theory or hypothesis H, whereas

(2) there is little evidence for the truth of auxiliary A, say hardly more evidence for its truth than for its falsity, and

(3) unless A is false, there is no other plausible way to explain e’.

A Bayesian rendering may be obtained by inserting "agent X believes that" prior to assertions (1), (2), and (3).

Page 5

(III) The Numerical Solution to the Homework Problem:

ASSUME:

(i) P(H) = 0.9 and P(A) = 0.6.

(ii) “The agent contemplates auxiliary A being true”: the probability of e’, given A and not-H, is very small. Let this very small value be ε. (Dorling takes ε to be 0.001.)

(iii) “The agent contemplates auxiliary A being false”:

(a) the probability of e’, given H and not-A, is 50ε;

(b) the probability of e’, given not-H and not-A, is 50ε.

(iv) H and A are probabilistically independent: P(H and A) = P(H)P(A).

Page 6

1.1 THE RESULTS

The Bayesian catchall factor: P(e’|not-H) = 20.6ε (= .0206, with ε = .001).

The posterior probabilities: P(H|e’) = 0.897 and P(A|e’) = 0.003.

H hasn’t gone down much; the blame for the anomaly is placed on auxiliary A.

Of course, the opposite assignment could have been given, thus putting the blame on the theory.
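These numbers can be checked directly. A minimal Python sketch of the computation, taking P(e’|H and A) = 0 (implicit in the setup, since H and A jointly entail e rather than e’):

```python
# Dorling's homework problem: check the posteriors and the catchall factor.
# Assumption made explicit: P(e' | H and A) = 0, since H and A entail e, not e'.

pH, pA = 0.9, 0.6                    # priors; H and A independent
eps = 0.001                          # Dorling's epsilon
likelihood = {                       # P(e' | H?, A?) for the four cells
    (True, True): 0.0,
    (False, True): eps,              # not-H and A
    (True, False): 50 * eps,         # H and not-A
    (False, False): 50 * eps,        # not-H and not-A
}

def prior(h, a):
    """Joint prior P(H=h, A=a) under independence."""
    return (pH if h else 1 - pH) * (pA if a else 1 - pA)

cells = [(h, a) for h in (True, False) for a in (True, False)]
p_e = sum(likelihood[c] * prior(*c) for c in cells)            # P(e')

post_H = sum(likelihood[(True, a)] * prior(True, a) for a in (True, False)) / p_e
post_A = sum(likelihood[(h, True)] * prior(h, True) for h in (True, False)) / p_e
catchall = eps * pA + 50 * eps * (1 - pA)                      # P(e' | not-H)

print(f"P(H|e') = {post_H:.3f}")          # 0.897
print(f"P(A|e') = {post_A:.3f}")          # 0.003
print(f"P(e'|not-H) = {catchall:.4f}")    # 0.0206 = 20.6 * eps
```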

Page 7

Two key features of the error statistical approach (of central relevance to Duhem's problem):

1. A Piecemeal Approach.

Two contrasts with the Bayesian Way may be noted:

I. Gets beyond a single probability pie.

II. Gets beyond a white-glove analysis.

Page 8

2. The Fundamental Use of Error Probabilities of Tests

▪ The question of whether data provide good evidence for a hypothesis is regarded as an objective (though empirical) one, not a subjective one.

▪ Data count as good evidence for H just to the extent that H passes a severe test.

Page 9

Statistical Significance Tests

The null (or “test”) hypothesis H0: there is no increased risk of R (in a given population).

[H0 says it is an error to suppose a genuine increased risk is responsible for any observed difference in risk rates.]

The alternative hypothesis J: there is an increased risk of R (in a given population).

Here e is anomalous for J.

Page 10

The significance question: how often would a (positive) difference in R rates as high as (or even higher than) the one observed (e) occur, if in fact H0 were true? The answer is called the statistical significance level of the data.
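A minimal sketch with made-up counts (purely illustrative; not from the slides): a one-sided z-test comparing risk rates in two groups.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data (illustrative only): 40/200 with R in the exposed group
# versus 25/200 in the unexposed group.
x1, n1, x0, n0 = 40, 200, 25, 200
p1, p0 = x1 / n1, x0 / n0
pooled = (x1 + x0) / (n1 + n0)                       # common rate under H0
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0)) # std. error of the difference
z = (p1 - p0) / se

# Significance level of the data: P(difference as high or higher; H0 true)
p_value = 1 - NormalDist().cdf(z)
print(f"z = {z:.2f}, p = {p_value:.3f}")             # roughly z = 2.03, p = .021
```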

Page 11

Error Probabilities Register Illegitimate Ways to “Save” Hypotheses From Anomalies

H0: there is no increased risk of R (in a given population)

J: there is an increased risk of R (in a given population)

e: a 0 (or a statistically insignificant) difference

Way #1: J + “compensation hypothesis” (there really is a risk, but something compensated for it in this data).

Way #2: J’: there is an increased risk of R’ (in a given population). (There’s some other risk, found through searching the data.)

Page 12

The hypotheses erected to accord with the evidence fail to pass severe tests.

The probability of erroneously finding some alleged compensating factor or other (Way #1), and the probability of erroneously finding one or another excess in risk (Way #2), are no longer at the low .01 level, as at the start, but are instead higher; in extreme cases, maximal (i.e., 1).
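Under a simplifying assumption (mine, not the slides'), the inflation is easy to quantify: with k independent chances to “find” something nominally significant at the .01 level, the probability of at least one spurious finding grows toward 1.

```python
# Probability of erroneously "saving" J somewhere, if each of k independent
# looks (compensating factors, alternative risks R') is tested at level .01.
alpha = 0.01
for k in (1, 10, 50, 100, 500):
    p_some_error = 1 - (1 - alpha) ** k
    print(f"k = {k:>3}: P(at least one nominal 'finding'; all nulls true) = {p_some_error:.3f}")
# k = 500 gives 0.993: close to maximal, as the slide says.
```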

Page 13

skip

Two Types of Strategies in the Error Statistical Approach to Duhem's Problem:

(1) “Blocker” strategies: we criticize attempts to explain away anomalies (e.g., as due to H-saving factors) on the grounds that they fail to pass severe tests, or that their denials pass severe tests.

(2) Show anomaly e’ may be blamed on an auxiliary hypothesis A by showing that A’ (the denial of A) passes a severe test.

Page 14

In Dorling's illustration, a result e’ that is anomalous for H is taken to provide positive grounds for discrediting A and confirming its denial A'. (The degree of belief in A' went from 0.4 to 0.99, by dint of anomaly e’.)

But the error statistician wants to know if the test is severe! This requires positive evidence that the alleged extraneous factor is responsible for the anomaly.

Strong belief in H, together with a low enough degree of belief in the Bayesian catchall factor, does not suffice to show that A' has passed a severe test.

The move from satisfying the Bayesian conditions to declaring strong evidence for A' is a very unreliable one (it makes it too easy to blame auxiliary hypothesis A even if A is true).

Such an appeal to A’ would thereby be blocked.

Page 15

Mini overview:

Severity Requirement: Data x provide good evidence for inferring H only if they result from a procedure which, taken as a whole, constitutes H having passed a severe test — that is, a procedure which would have (at least with very high probability) uncovered the falsity of, or errors in, H, and yet H emerged unscathed.

Inductive learning, in this view, proceeds by testing hypotheses and inferring those which pass probative or severe tests — tests which very probably would have unearthed some error in the hypothesis H, were such an error present.
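As a minimal illustration of the requirement (my example, with hypothetical numbers): for a one-sided z-test of H0: mu <= 0 with known sigma, the severity with which “mu > mu1” passes, given observed mean x-bar, is the probability of a result less impressive than x-bar, were mu equal to mu1.

```python
from math import sqrt
from statistics import NormalDist

def severity(x_bar, mu1, sigma=1.0, n=100):
    """SEV(mu > mu1): P(sample mean <= x_bar; mu = mu1)."""
    return NormalDist().cdf(sqrt(n) * (x_bar - mu1) / sigma)

# With x_bar = 0.2 (two standard errors above 0), 'mu > 0' passes severely,
# while the stronger claim 'mu > 0.15' does not:
print(f"SEV(mu > 0)    = {severity(0.2, 0.0):.3f}")   # ~0.977
print(f"SEV(mu > 0.15) = {severity(0.2, 0.15):.3f}")  # ~0.691
```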

Page 16

A methodology for induction, accordingly, is a methodology for arriving at severe tests, and for scrutinizing inferences by considering the severity with which they have passed tests.

Methodological rules and strategies are claims about how to avoid mistakes and learn from different types of errors; their appraisal turns on understanding how methods enable avoidance of specific errors.

Hence an inductive methodology of severe testing will focus on understanding the properties of tools for generating, modeling, and analyzing data so as to learn about some aspect of the data-generating mechanism.

These properties, while empirical, are objective.

Page 17

Highly Probable vs. Highly Probed Hypotheses

The criticism: H may pass with high severity (with data x) even though the (Bayesian) posterior probability for H (given x) is low.

All such “funny Bayesian examples” need to assume prior probability assignments to an exhaustive set of hypotheses, while for a frequentist error statistician, a hypothesis could only be given a probability assignment if its truth were the outcome of a random trial (but “events” do not also serve as statistical hypotheses).

Subjective degree-of-belief assignments will not ensure the error probability, and thus the severity, assessments we need.

Examples with frequentist priors, however, commit the fallacy of probabilistic instantiation.

Page 18

The Fallacy of Probabilistic Instantiation

Hypothesis H is true of p% of the populations (bags) in this urn of populations U.

1. P(H is true of a randomly selected bag from an urn of bags U) = p.

2. The bag randomly selected for use in test T1 is b1.

Therefore:

(*) P(H is true of b1) = p.

For the frequentist: either H is true of b1 or not — the probability in (*) is fallacious and results from an unsound instantiation.

Page 19

Students from the Wrong Side of Town

Isaac has passed comprehensive tests of mastery of high school subjects regarded as indicating college readiness…

Since such high scores s could rarely result among high school students who are not sufficiently prepared to be deemed ‘college ready’, we regard s as good evidence for

H(I): Isaac is college ready.

And let the denial be H’(I):

H’(I): Isaac is not college ready (i.e., he is deficient).

The probability of such good results, given a student is college ready, is extremely high:

P(s | H(I)) is practically 1,

Page 20

while it is very low assuming he is not college ready:

P(s | H’(I)) = .05.

But imagine Isaac was randomly selected from the population of students in, let us say, Fewready Town, where college readiness is extremely rare, say one out of one thousand. The critic infers that the prior probability of Isaac’s college readiness is therefore .001:

(*) P(H(I)) = .001.

If so, then the posterior probability that Isaac is college ready, given his high test results, would be very low:

P(H(I)|s) is very low,

even though the posterior probability has increased from the prior in (*).
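Granting the critic's numbers (and taking P(s|H(I)) = 1 for “practically 1”), the posterior is a one-line Bayes calculation:

```python
# Critic's numbers: prior readiness .001; P(s | ready) ~ 1; P(s | not ready) = .05.
prior = 0.001
p_s_ready, p_s_not = 1.0, 0.05

post = p_s_ready * prior / (p_s_ready * prior + p_s_not * (1 - prior))
print(f"P(H(I)|s) = {post:.4f}")   # ~0.0196: up from the .001 prior, yet still very low
```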

Page 21

This is supposedly problematic for testers because we’d say this was evidence for H(I) (readiness).

Actually I would want degrees of readiness to make my inference, but these are artificially excluded here.

But, even granting his numbers, the main fallacy here is fallacious probabilistic instantiation.

Although the probability that a randomly selected student from the high schoolers in Fewready Town is college ready is .001, it does not follow that Isaac, the one we happened to select, has a probability of .001 of being college ready.

Page 22

Achinstein says he will grant the fallacy… but only for frequentists:

“My response to the probabilistic fallacy charge is to say that it would be true if the probabilities in question were construed as relative frequencies. However, …I am concerned with epistemic probability.”

He is prepared to grant the following instantiations:

▪ p% of the hypotheses in a given pool of hypotheses are true (or a character holds for p%).

▪ The particular hypothesis Hi was randomly selected from this pool.

▪ Therefore, the objective epistemic probability P(Hi is true) = p.

Page 23

Of course, epistemic probabilists are free to endorse this road to posteriors — this just being a matter of analytic definition.

But the consequences speak loudly against the desirability of doing so.

No Severity. The example considers only two outcomes: reaching the high scores s, or reaching lower scores, ~s.

Clearly a lower grade gives even less evidence of readiness; that is, P(H’(I)|~s) > P(H’(I)|s). Therefore, whether Isaac scored as high as s or lower, ~s, the epistemic probabilist is justified in having high belief that Isaac is not ready.

The probability of finding evidence of Isaac’s readiness even if in fact he is ready (H(I) is true) is low, if not zero.

Page 24

Bayesian B-boosters might interpret things differently, noting that since the posterior for readiness has increased, the test scores provide at least some evidence for H(I) — but then the invocation of the example to demonstrate a conflict between a frequentist and a Bayesian assessment would seem to diminish or evaporate.

Reverse Discrimination? To push the problem further, suppose that the epistemic probabilist receives a report that Isaac was in fact selected randomly, not from Fewready Town, but from a population where college readiness is common, Fewdeficient Town.

The same scores s now warrant the assignment of a strong objective epistemic belief in Isaac’s readiness (i.e., H(I)).

A high school student from Fewready Town would need to have scored quite a bit higher on these same tests than a student selected from Fewdeficient Town for his scores to be considered evidence of his readiness.

Page 25

When we move from hypotheses like “Isaac is college ready” to scientific generalizations, the difficulties become even more serious.

We need not preclude that H(I) has a legitimate frequentist prior; the frequentist probability that Isaac is college ready might refer to genetic and environmental factors that determine the chance of his deficiency — although I do not have a clue how one might compute it.

The main thing is that this probability is not given by the probabilistic instantiation above.

These examples, repeatedly used in criticisms, invariably shift the meaning from one kind of experimental outcome — a randomly selected student has the property “college ready” — to another: a genetic and environmental “experiment” concerning Isaac in which the outcomes are ready or not ready.

This also points out the flaw in trying to glean reasons for epistemic belief with just any conception of “low frequency of error.”

Page 26

If we declared each student from Fewready to be “unready,” we would rarely be wrong, but in each case the “test” has failed to discriminate the particular student’s readiness from his unreadiness.

Moreover, were we really interested in the probability of the event that a student randomly selected from a town is college ready, and had the requisite probability model (e.g., Bernoulli), then there would be nothing to stop the frequentist error statistician from inferring the conditional probability.

However, there seems to be nothing “Bayesian” in this relative-frequency calculation. Bayesians scarcely have a monopoly on the use of conditional probability! But even here it strikes me as a very odd way to talk about evidence.

(Howson says it shows unsoundness, because he identifies a p-value with a posterior probability in a hypothesis.)

Page 27

A Common Variant on the Criticisms (p-values vs. posterior probabilities):

Certain choices for prior probabilities in the null and alternative hypotheses show that a small p-value is consistent with a much higher posterior probability in the null hypothesis.

The alternative hypothesis would, in such cases, pass severely, even though the null hypothesis has a high posterior (Bayesian) probability.

Page 28

A statistically significant difference from H0 can correspond to large posteriors in H0.

From the Bayesian perspective, it follows that p-values come up short as a measure of inductive evidence; the significance testers balk at the fact that the recommended priors result in highly significant results being construed as no evidence against the null — or even evidence for it!

Page 29

The conflict often considers the two-sided test T2:

H0: μ = μ0 versus H1: μ ≠ μ0.

(The difference between p-values and posteriors is far less marked with one-sided tests.)

“Assuming a prior of .5 to H0, with n = 50 one can classically ‘reject H0 at significance level p = .05,’ although P(H0|x) = .52 (which would actually indicate that the evidence favors H0).”

This is taken as a criticism of p-values only because it is assumed that the .52 posterior is the appropriate measure of the belief-worthiness.

As the sample size increases, the conflict becomes more noteworthy.

Page 30

If n = 1000, a result statistically significant at the .05 level leads to a posterior on the null of .82!

SEV(H1) = .95, while the corresponding posterior has gone from .5 to .82. What warrants such a prior?

Posterior P(H0|x) as a function of sample size n:

p       t       n=10    n=20    n=50    n=100   n=1000
.10     1.645   .47     .56     .65     .72     .89
.05     1.960   .37     .42     .52     .60     .82
.01     2.576   .14     .16     .22     .27     .53
.001    3.291   .024    .026    .034    .045    .124
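The slide does not say what prior under H1 generates these numbers; they match the familiar Berger and Sellke style calculation (an assumption here) with a .5 spike on H0: μ = μ0 and a N(μ0, σ²) prior on μ under H1, which gives Bayes factor B01 = sqrt(1+n) * exp(-(t²/2) * n/(n+1)). A minimal sketch under that assumption:

```python
import math

# Posterior P(H0 | data) assuming a 0.5 prior spike on H0 and a N(mu0, sigma^2)
# prior on mu under H1 (Berger/Sellke-style setup; an assumption, not stated
# on the slide).
def posterior_null(t, n):
    """P(H0 | data) from Bayes factor B01 = sqrt(1+n) * exp(-(t^2/2) * n/(n+1))."""
    b01 = math.sqrt(1 + n) * math.exp(-(t ** 2 / 2) * n / (n + 1))
    return b01 / (1 + b01)

for p, t in [(.10, 1.645), (.05, 1.960), (.01, 2.576), (.001, 3.291)]:
    row = "  ".join(f"{posterior_null(t, n):.3f}" for n in (10, 20, 50, 100, 1000))
    print(f"p = {p:<5}  t = {t:<5}  {row}")
# The p = .05 row reproduces the table: .37  .42  .52  .60  .82.
```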

Page 31

(1) Some claim the prior of .5 is a warranted frequentist assignment: H0 was randomly selected from an urn in which 50% are true.

(*) Therefore P(H0) = .5.

H0 may be 0 change in extinction rates, 0 lead concentration, etc. What should go in the urn of hypotheses?

For the frequentist: either H0 is true or false; the probability in (*) is fallacious and results from an unsound instantiation.

We are very interested in how false it might be, which is what we can learn by means of a severity assessment.

Page 32

(2) Subjective degree-of-belief assignments will not ensure the error probability, and thus the severity, assessments we need.

(3) Some suggest an “impartial” or “uninformative” Bayesian prior gives .5 to H0, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).

This “spiked concentration of belief in the null” is at odds with the prevailing view that “we know all nulls are false.”

Page 33

Upshot: However severely I might wish to say that a hypothesis H has passed a test, the Bayesian critic assigns a sufficiently low prior probability to H so as to yield a low posterior probability in H.

But this is no argument about why this counts in favor of, rather than against, their Bayesian computation as an appropriate assessment of the warrant to be accorded to hypothesis H.

To begin with, in order to use techniques for assigning frequentist probabilities to events, their examples invariably involve “hypotheses” that consist of asserting that a sample possesses a characteristic, such as “having a disease” or “being college ready” or, for that matter, “being true.”

This would not necessarily be problematic were it not for the fact that their criticism requires shifting the probability to the particular sample selected.

Page 34

Bayesians sometimes tell us they will cure the significance tester’s tendency to exaggerate the evidence against the null (in two-sided testing) by using some variant on a spiked prior.

But the result of their “cure” is that outcomes may too readily be taken as no evidence against, or even evidence for, the null hypothesis, even if it is false.

We actually don’t think we need a cure. Faced with conflicts between error probabilities and Bayesian posterior probabilities, the error statistician may well conclude that the flaw lies with the latter measure.