Uneducated Guesses: Three examples of how mistreating missing data yields misguided educational policy

Howard Wainer
National Board of Medical Examiners

An Invited Talk Given to the Institute of Education Science in the Graduate School of Education of the University of Pennsylvania
February 13, 2012




“In general we look for a new law by the following process. First we guess it. Then we compute the consequences of the guess to see what would be implied if this law that we guessed is right. Then we compare the result of the computation to nature, with experiment or experience, compare it directly with observation, to see if it works. If it disagrees with experiment it is wrong. In that simple statement is the key to science.

It does not make any difference how beautiful your guess is. It does not make any difference how smart you are, who made the guess, or what his name is - if it disagrees with experiment it is wrong. That is all there is to it.”

Richard P. Feynman (1964)


Outline

I. Introduction – Mistreating missing data can have a huge effect
   A. Lombard’s most dangerous profession
   B. Getting younger in Princeton’s cemetery
   C. Wald’s model for armoring planes

II. Case 1. What happens if the SAT is made Optional: Bowdoin College as an example.

III. Case 2. Allowing choice on exams
   A. Some history – especially 1921 English
   B. The mystery of 1968 AP Chemistry
   C. Women suffer in 1988 US History
   D. The only unambiguous solution to missing data
   E. Indiana Jones and a wonderful workaround
      1. 1989 Chemistry as proof of concept.

IV. Case 3. Using student test scores to evaluate teachers: Value-Added Models
   A. VAM and missing scores - Gaming the system by using missing data imputations.
   B. VAM and Counterfactuals – How would Freddy have done if he hadn’t had Ms. Jones?

V. Conclusions


I will illustrate my talk today with three principal examples:

1. A September 2008 report published by the National Association for College Admission Counseling in which one of the principal recommendations was for colleges and universities to reconsider requiring the SAT or the ACT for applicants.

2. Increasingly often ‘standardized’ exams provide a set of possible questions and allow the examinee to pick which ones to answer.

3. “Race to the Top” provides funds to states that amend their educational system in specific ways. But all must somehow use the change in student test scores to evaluate teachers.


In all three of these, the issue of missing data looms large.

The issue of missing data is too often assumed to be a small technical one that is not likely to have any serious effect, even by people who ought to know better.

How we understand and treat missing data can have an enormous effect on the conclusions we draw.


MD1. The most dangerous profession


MD2. The 20th Century was a dangerous time


MD3. Bullet holes and a model for missing data

From Abraham Wald


Example 1.

National Association for College Admission Counseling’s September 2008 report on admissions testing

On September 22, 2008, the New York Times carried the first of three articles about a report, commissioned by the National Association for College Admission Counseling, that was critical of the current, widely used, college admissions exams, the SAT and the ACT. The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard.

The report was reasonably wide-ranging and drew many conclusions while offering alternatives.

Although well-meaning, many of the suggestions only make sense if you say them very fast.


Among their conclusions were:

1. Schools should consider making their admissions “SAT optional,” that is allowing applicants to submit their SAT/ACT scores if they wish, but they should not be mandatory. The commission cites the success that pioneering schools with this policy have had in the past as proof of concept.

2. Schools should consider eliminating the SAT/ACT altogether and substituting achievement tests. They cite the unfair effect of coaching as the motivation for this. They were not naïve enough to suggest that, because there is no coaching for achievement tests now, none would be offered if those tests became more high stakes. Rather, they argue that such coaching would be directly related to schooling and hence more beneficial to education than coaching that focuses on test-taking skills.

3. That the use of the PSAT with a rigid qualification cut-score for such scholarship programs as the Merit Scholarships be immediately halted.


Recommendation 1. Make SAT optional:

It is useful to examine those schools that have instituted “SAT Optional” policies and see if the admissions process has been hampered in those schools.

The first reasonably competitive school to institute such a policy was Bowdoin College, in 1969.

Bowdoin is a small, highly competitive, liberal arts college in Brunswick, Maine. A shade under 400 students a year elect to matriculate at Bowdoin, and roughly a quarter of them choose to not submit SAT scores.

In the following table is a summary of the classes at Bowdoin and five other institutions whose entering freshman class had approximately the same average SAT score.

At the other five institutions the students who didn’t submit SAT scores used ACT scores instead.


Table 1: Six Colleges/Universities with similar observed mean SAT scores for the entering class of 1999.


To know how Bowdoin’s SAT policy is working we will need to know two things:

1. How did the students who didn’t submit SAT scores do at Bowdoin in comparison to those students that did submit them?

2. Would the non-submitters’ performance at Bowdoin have been better predicted by their SAT scores, had the admissions office had access to them?


The first question is easily answered by looking at their first year grades at Bowdoin.


But would their SAT scores have provided information missing from other submitted information?

This would depend on why these students chose to not submit their scores. Some possibilities are:

1. If I don’t need to submit them, why bother to take them?

2. I took them, and did really well, but so what?

3. I took them, but did worse than the typical student who was accepted by Bowdoin in the past. Submitting them wouldn’t help my cause.


Although we may have some opinions on the likelihood of each of these options, under typical circumstances we have no data to help us decide, for these students did not submit their SAT scores.


However, all of these students actually took the SAT, and through a special data-gathering effort at the Educational Testing Service, we found that the students who didn’t submit these scores behaved sensibly.

They realized that their lower-than-average scores would not help their cause at Bowdoin, and hence chose not to submit them.

Here is the distribution of SAT scores for those who submitted them as well as those who did not.


As it turns out, the SAT scores for the students who did not submit them would have accurately predicted their lower performance at Bowdoin.

In fact, the correlation between grades and SAT scores was higher for those who didn’t submit them (0.9) than for those who did (0.8).


So not having this information does not improve the academic performance of Bowdoin’s entering class – on the contrary it diminishes it.

Why would a school opt for such a policy? Why is less information preferred to more?


There are surely many answers to this, but one is seen in an augmented version of the earlier Table 1:

We see that if all of the students in Bowdoin’s entering class had their SAT scores included, the average SAT at Bowdoin would shrink from 1323 to 1288, and instead of being second among these six schools they would have been tied for next to last.
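The arithmetic behind this shift can be sketched in a few lines. The sketch assumes that exactly one quarter of the class withheld scores (the talk says only "roughly a quarter") and backs out the mean SAT of the non-submitters implied by the two reported averages, 1323 and 1288. The function name and the 0.25 figure are illustrative assumptions, not values taken from the ETS data.

```python
def implied_missing_mean(observed_mean, full_mean, missing_fraction):
    """Mean of the unreported scores implied by the reported and full-class means.

    Solves full_mean = (1 - f) * observed_mean + f * missing_mean for missing_mean.
    """
    if not 0 < missing_fraction < 1:
        raise ValueError("missing_fraction must be strictly between 0 and 1")
    submitted_part = (1 - missing_fraction) * observed_mean
    return (full_mean - submitted_part) / missing_fraction


# Reported averages from the talk; 0.25 is an assumed stand-in for "roughly a quarter".
non_submitter_mean = implied_missing_mean(1323, 1288, 0.25)
print(round(non_submitter_mean))
```

Under that assumption the non-submitters would average about 1183, some 140 points below the submitters. A different assumed fraction gives a different implied mean, so the figure is only indicative.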


Since mean SAT scores are a key component in school rankings, a school can game those rankings by allowing its lowest-scoring students to be left out of the average.

I believe that Bowdoin’s adoption of this policy pre-dates US News & World Report’s rankings, so that was unlikely to have been their motivation, but I cannot say the same thing for schools that have chosen such a policy more recently.


Is inferring such a nefarious goal just the paranoid ravings of an aging cynic?

Or are colleges actively engaged in trying to game college rankings?


Some evidence:

1. The January 31, 2012 NY Times reported that Richard C. Vos, VP and dean of admissions of Claremont McKenna College, has, for the past six years, been adding points to the mean SAT scores that the school reported to USN&WR.

2. The February 1, 2012 NY Times reported that Iona College “has lied for years about test scores, graduation rates, freshman retention, student-faculty ratio, acceptance rates and alumni giving.”

3. “Baylor University paid admitted students to retake the SATs in hopes of increasing scores.” This seems like an inefficient approach -- easier, cheaper and more sure to use Claremont’s approach and just falsify them.


Case 2. Allowing choice on exams

If you allow choice, you will regret it; if you don't allow choice, you will regret it; whether you allow choice or not, you will regret both.

(Søren Kierkegaard, 1986, p. 24)


It is common practice to allow choice on exams

Why? If a test is made up of multiple choice questions, answering any one of them takes very little time, and so there can be lots of them.

If we ask essay questions, or other kinds of big problems, it is impractical to ask more than a few of them, and so some students may be disadvantaged by the specific topic selected.


So we offer a choice, “Answer 2 of the following 6”

Is this a good idea?

Historically, such an approach was most common almost a century ago, but its popularity rapidly declined. It is currently enjoying a resurgence.


Number of possible test forms generated by examinee choice patterns in College Entrance Exams

Year   Chemistry   Physics     English   German
1905          54        81          64        1
1909          18       108          60        1
1913           8       144       7,260        1
1917         252     1,620   1,587,600        1
1921         252       216   2,960,100        1
1925         126        56          48        6
1929          20        56          90        1
1933          20        10          24        1
1937          15         2           1        1
1941           1         1           1        1


How did they arrive at the unlikely number of test forms for the 1921 English exam?

Section I - Answer 1 of 3 questions; 3 forms.
Section II - Answer 5 of 26 questions; 65,780 forms.
Section III - Answer 1 of 15 questions; 15 forms.

3 × 65,780 × 15 = 2,960,100

Voila!
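The count is a product of binomial coefficients, one per section, and can be checked directly. A minimal sketch using Python's standard library follows; the variable names are illustrative.

```python
from math import comb

# Ways to pick the required questions in each section of the 1921 English exam.
section_1 = comb(3, 1)    # answer 1 of 3
section_2 = comb(26, 5)   # answer 5 of 26
section_3 = comb(15, 1)   # answer 1 of 15

# Each combination of section choices is a distinct test form.
total_forms = section_1 * section_2 * section_3
print(section_1, section_2, section_3, total_forms)
```

This reproduces the three section counts and the table entry of 2,960,100.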


Are choice items of equal difficulty?

Average Scores on AP Chemistry 1968

While the examinees who chose Problem 4 and those who chose Problem 5 scored about the same on the common multiple-choice (MC) section (11.7 vs. 11.2 out of a possible 25), their scores on the choice problem were very different (8.2 vs. 2.7 on a 10-point scale).


There are several possible conclusions to be drawn from this; four among them are:

1. Problem 5 is a good deal more difficult than problem 4.

2. Small differences in performance on the multiple-choice section translate into much larger differences on the free response questions.

3. The proficiency required to do the two problems is not strongly related to that required to do well on the multiple-choice section.

4. Item 5 is selected by those who are less likely to do well on it.


1988 AP United States History Exam


The only unambiguous data on choice and difficulty

Xiang-bo Wang and his colleagues repeatedly presented examinees with a choice of two items, but then required them to answer both.


The proportion of students getting each item correct, shown conditional on which item they preferred to answer


The conclusion drawn from many results like this is that:

As examinees’ ability increases they tend to choose more wisely – they know enough to be able to determine which choices are likely to be the least difficult.

As ability declines choice becomes closer and closer to random.

On average, lower-ability students, when given a choice, are more likely to choose more difficult items than their competitors at the higher end of the proficiency scale.

Thus allowing choice will tend to exacerbate group differences.


How can we allow choice?

• Adjust for differential difficulty after administering items to random samples of examinees - equate (but that makes the examinee’s job more difficult).

And, if we are successful, it renders choice unnecessary.

OR


What if we make the choice part of the test?

“But choose wisely, for while the true Grail will bring you life, the false Grail will take it from you.”

– Grail Knight in Indiana Jones and the Last Crusade, 1989


The alternative to trying to make all examinee-selected choices within a choice question of equal difficulty is to consider the entire set of questions with choices as a single item.

Thus the choice is part of the item.

If you make a poor choice and select an especially difficult option to respond to, that is considered in exactly the same way as if you wrote a poor answer.


Under what circumstances is this a plausible and fair approach?

1. We must believe that choosing wisely uses the same knowledge and skills that are required for answering the question.

2. That the choice is being made by the examinees and not by their teachers.


If we agree to adopt this strategy a remarkable result ensues!

Let us consider data from Section D of the 1989 Advanced Placement Examination in Chemistry.

Section D has five problems (Problems 1, 2, 3, 4 and 5) of which the examinee must answer just three.

ETS calculates the reliability of Section D as 0.60.


Scores of examinees as a function of the problems they chose


Suppose we think of Section D as a single ‘item’ with an examinee falling into one of ten possible categories (one for each way of choosing three problems from five), and the estimated score of each examinee is the mean score of everyone in their category.

How reliable is this one-item test?


After doing the appropriate calculation we discover that the reliability of this ‘choice item’ is .15.

While .15 is less than .60, it is also larger than zero, and it is easier to obtain.

We don’t have to score the examinees’ answers, we just note which problems they chose.

In fact, they don’t even have to answer them -- just indicate which three they would answer, if they were forced to.


Of course with a reliability of only .15, this is not much of a test.

But suppose we had two such items, each with a reliability of .15? This new test would have a reliability of .26.

And, to get to the end of the story, if we had eight such ‘items’ it would have a reliability of .60, the same as the current form.

Such a test would be easier on examinees and much cheaper for the testing company.

A win-win.
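The talk does not name the formula behind these projections. They are consistent with the Spearman-Brown prophecy formula, which is what the sketch below assumes; the function name is illustrative, and .15 is the rounded single-item reliability quoted above.

```python
from math import comb


def spearman_brown(reliability, k):
    """Predicted reliability of a test made of k parallel parts (Spearman-Brown)."""
    return k * reliability / (1 + (k - 1) * reliability)


# Choosing 3 of 5 problems yields the ten choice categories mentioned in the talk.
n_categories = comb(5, 3)
print("choice categories:", n_categories)

single = 0.15  # rounded reliability of one 'choice item', as quoted
for k in (1, 2, 8):
    print(k, round(spearman_brown(single, k), 3))
```

With the rounded input the formula gives about .26 for two items and about .59 for eight. The small gap from the quoted .60 is presumably due to rounding of the single-item value, which is an inference and not something the talk states.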


This is what I like best about science: with only a small investment in fact, we can garner such huge dividends in conjecture.


Case 3. Using student test scores to evaluate teachers

“Some professors are justly renowned for their bravura performances as Grand Expositor on the podium, Agent Provocateur in the preceptorial, or Kindly Old Mentor in the corridors. These familiar roles in the standard faculty repertoire, however, should not be mistaken for teaching, except as they are validated by the transformation of the minds and persons of the intended audience.”

“Good teachers evaluate themselves with a pitiless gaze and measure their successes not by their virtuosity as performers but by their contribution to the transformation of students.”

(Marvin Bressler, 1991)


Value Added Models (VAMs)

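\[ y_{i1} = \mu_1 + \theta_1 + \varepsilon_{i1} \tag{1} \]

\[ y_{i2} = \mu_2 + \theta_1 + \theta_2 + \varepsilon_{i2} \tag{2} \]

Hence the change, the value-added, is simply the difference between the scores from year 1 to year 2, or

\[ y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \theta_2 + (\varepsilon_{i2} - \varepsilon_{i1}) \tag{3} \]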
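Equations (1) to (3) can be illustrated with a small simulation. The slide does not define its symbols, so the reading used here is an assumption: the first term in each equation is an overall mean for that year, the middle terms are the year 1 and year 2 teacher effects, and the last term is a student-level error. All numeric values and names below are hypothetical.

```python
import random


def simulate_mean_gain(mu1, mu2, theta1, theta2, noise_sd, n_students, seed=0):
    """Simulate equations (1) and (2) for one class; return the mean gain of equation (3)."""
    rng = random.Random(seed)
    gains = []
    for _ in range(n_students):
        y1 = mu1 + theta1 + rng.gauss(0, noise_sd)           # equation (1)
        y2 = mu2 + theta1 + theta2 + rng.gauss(0, noise_sd)  # equation (2)
        gains.append(y2 - y1)                                # equation (3)
    return sum(gains) / len(gains)


# Hypothetical values chosen only for illustration.
mean_gain = simulate_mean_gain(mu1=50, mu2=55, theta1=2, theta2=3,
                               noise_sd=5, n_students=10_000)
print(round(mean_gain, 2))
```

The year 1 teacher effect cancels in the difference, so the average gain settles near (55 - 50) + 3 = 8. This holds only when every student has both scores, which is exactly what the missing-data discussion calls into question.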


“The child in me was delighted. The adult was skeptical.”

Saul Bellow, 1977

“I was impressed, not because it did it well, but that it could do it at all.”

Samuel Johnson after watching a dog walk on its hind legs


There are many challenges to be overcome before such models are ready for widespread use.

Principal among them are:

(i) psychometric issues in both the construction and scoring of tests that allow comparisons over large ranges and across different subjects;

(ii) statistical issues dealing with stability of estimates and biases introduced by missing data;

(iii) epistemological issues associated with drawing causal conclusions without the need for heroic assumptions.


Today I will focus on only two missing data issues:

(i) When, in the ordinary course of school examinations, a student’s score is missing for either the pre-test, the post-test, or both.

(ii) The counterfactual data that are always missing; how the student would have performed had she had a different teacher.


There are two approaches to the first missing data problem currently being used with VAMs.

The first is to use only those students with complete data and then to assume that the estimates of the teacher effects thus computed are OK (i.e., assume the missing data are missing-at-random). If Abraham Wald had used this assumption he would’ve arrived at exactly the opposite conclusion -- add more armor where there were holes.


The more common method is to impute a score for those missing one based on the mean of those who have them (conditioned on some covariates).

This does not change the marginal means (but higher moments are wrong), and it can be successfully gamed.

Field trips!
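A minimal sketch of both points, using a hypothetical class of eight students. For simplicity it imputes the plain mean of the observed scores and ignores the covariates mentioned above, so it illustrates the idea and not any particular VAM implementation.

```python
from statistics import mean, pvariance


def impute_with_mean(scores):
    """Replace missing scores (None) with the mean of the observed scores."""
    observed = [s for s in scores if s is not None]
    fill = mean(observed)
    return [fill if s is None else s for s in scores]


# Hypothetical true scores, and the same class when the two weakest
# students are away on a field trip on test day.
true_scores = [40, 45, 60, 65, 70, 75, 80, 85]
after_trip = [None, None, 60, 65, 70, 75, 80, 85]

observed = [s for s in after_trip if s is not None]
imputed = impute_with_mean(after_trip)

print("true mean:    ", mean(true_scores))
print("observed mean:", mean(observed))
print("imputed mean: ", mean(imputed))
print("variance shrinks:", pvariance(imputed) < pvariance(observed))
```

Imputation leaves the observed mean untouched and understates the spread. Because the absent students were the weakest, the class mean after imputation sits well above the true mean.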


A sub-problem of missing scores is when students choose not to answer some items.

This happens frequently when a student’s performance on a test has no direct impact on the student (e.g. NAEP or new teacher evaluation exams like those just adopted in NY).

If the test has no immediate impact on students (and for HS seniors, even if it does) they tend not to try very hard.


For evidence look at non-response rates in NAEP:

Non-response increases with student age (younger students try harder).

Non-response varies with item type (multiple choice items are answered much more frequently than constructed response/essay type questions).

Non-response varies with ethnicity (Asian and Jewish students are less likely to omit).

Non-response varies with location (students in South Dakota answer more often than those in California and Hawaii).

Non-response rates can run as high as 70% for 11th grade essay items in Hawaii.


Imagine the scenario -- students are told that they may leave when they are done, and that the test doesn't count toward their grade.

Then they are asked to write an essay or two.

You can see that if the surf’s up they are unlikely to hang around long.


This issue came up in California some years ago in the “Cash for CAP” program.

One HS senior class (I believe in Ojai) asked the principal for a share of the money that would come to the school if they did well, to subsidize their prom.

The principal rejected this and indicated that the money had been slated for improved computing resources.

Most of the senior class handed in blank essays coupled with randomly selected options to the multiple choice items, and left early.


And finally, the biggest challenge, causal inference.

VAM is not interested in descriptive statements like:

“Freddy gained 10 points when he was in Ms. Smith’s class.”

No, the goal is to make causal statements like:

“Freddy gained 10 points because he was in Ms. Smith’s class.”


To understand the challenge of causal inference we first need some epistemology:

“Counterfactual conditional” is a term that refers to any expression of the general form:

“If A were the case, then B would be the case.”

This is the conditional part.

The counterfactual part is that A must be false or untrue in the world.


Some examples:

1. “If kangaroos had no tails, they would topple over.”

2. “If an hour ago I had taken two aspirins instead of just a glass of water, my headache would now be gone.”


And, perhaps the most obnoxious counterfactuals are those of the form:

3. “If I were you, I would. . . .”


Hume’s famous discussion of causation:

“we may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second, and, where, if the first object had not been, the second would never have existed.”


Let us return to student testing.

Suppose that we find that a student’s test performance changes from a score of X to a score of Y after some educational intervention.

We might then be tempted to attribute the pretest-posttest change, Y – X, to the intervening educational experience, i.e., to use the gain score as a measure of the improvement due to the intervention.

This is the essence of VAM.


There are many other possible explanations of the gain, Y – X.

Some of the more obvious are:

i. simple maturation (e.g. Freddy grew 5 inches when he was in Ms. Smith’s class)

ii. other educational experiences occurring during the relevant time period, and

iii. differences in either the tests or the testing conditions at pre- and post-tests.


From Hume we see that what is important is what the value of Y would have been if the student had not had the educational experiences that the intervention entailed.

Call this score value, Y*.

Thus enter counterfactuals.


Y* is not directly observed for the student: she did have the educational intervention of interest, so asking what her post-test score would have been had she not had it is asking for information collected under conditions that are contrary to fact.

Hence, it is not the difference Y – X that is of causal interest, but the difference Y – Y*, and the gain score has a causal significance only if X can serve as a substitute for the counterfactual Y*.
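
The distinction between the gain score and the causal effect can be sketched with a few lines of arithmetic. The numbers below are hypothetical and chosen only for illustration; in particular, Y* is never observed in practice, so the value assigned to `y_star` is an assumption made purely to show how the two differences can diverge.

```python
def gain_score(pre: float, post: float) -> float:
    """Observed pretest-posttest change, Y - X."""
    return post - pre


def causal_effect(post: float, post_counterfactual: float) -> float:
    """Effect of the intervention, Y - Y*, where Y* is the counterfactual post-test score."""
    return post - post_counterfactual


# Hypothetical scores (not from the talk).
x = 50.0       # pretest score X
y = 60.0       # observed post-test score Y
y_star = 57.0  # assumed counterfactual score Y* (unobservable in real data)

gain = gain_score(x, y)
effect = causal_effect(y, y_star)
background_change = y_star - x  # maturation, other schooling, test differences

print(f"gain score Y - X      = {gain}")
print(f"causal effect Y - Y*  = {effect}")
print(f"change without intervention Y* - X = {background_change}")
```

In this sketch the gain score splits into the causal effect plus the change that would have happened anyway, so the gain equals the effect only when Y* equals X.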


Conclusions

1. Missing data are an unavoidable complication.

2. Ignoring them (assuming missing-at-random) doesn’t often lead to a happy outcome.

3. The best solution, if possible, involves a special data gathering effort (e.g. Bowdoin’s SAT scores or examinees’ performance on the items they did not choose to answer). This may not be practical on a large scale, so we must pay careful attention to those complete data sets when we have them.


Conclusions (2)

4. When gathering the missing data is not possible (e.g. counterfactual performance of students had they had a different teacher) we must use all the tools at our disposal (randomization if we can, the various techniques of good observational studies when we can’t).

5. And ALWAYS remember that we must be modest in our claims when the uncertainty induced by what we did not observe is of the same order of magnitude as the phenomena suggested by what we did observe.


We must remember the wisdom of Sir Josiah Charles Stamp (1880-1941)

“The government [is] extremely fond of amassing great quantities of statistics. These are raised to the nth degree, the cube roots are extracted, and the results are arranged into elaborate and impressive displays. What must be kept ever in mind, however, is that in every case, the figures are first put down by a village watchman, and he puts down anything he damn well pleases.”


"For a successful technology, reality must take precedence over public relations,

for nature cannot be fooled."

Richard P. Feynman


In their search for the Holy Grail, both Walter Donovan and Indiana Jones arrived at the Canyon of the Crescent Moon with great anticipation.

But after all of the other challenges had been met, the last test involved choice.

The unfortunate Mr. Donovan chose first, and in the words of the Grail Knight,

“He chose poorly”

The consequences were severe.


Recommendation 2. Using Achievement Tests Instead

Driving the Commission’s recommendations was the notion that the differential availability of commercial coaching made admissions testing unfair.

They recognized that the 100 point gain (on the 1200 point SAT scale) that coaching schools often tout as a typical outcome was hype, and agreed that the estimate from more neutral sources of about 20 points was more likely.

But, they deemed even 20 points too many.

The Commission pointed out that there was no widespread coaching for achievement tests, but agreed that, should the admissions option shift to achievement tests, the coaching would likely follow.

This would be no fairer to those applicants who could not afford extra coaching, but at least the coaching would be of material more germane to the subject matter and less related to test-taking strategies.


One can argue with the logic of this – that a test that is less subject oriented and related more to the estimation of a general aptitude might have greater generality.

And that a test that is less related to specific subject matter might be fairer to those students whose schools have more limited resources for teaching a broad range of courses.

I find these arguments persuasive, but I have no data at hand to support them.

So instead I will take a different, albeit more technical, tack: the psychometric reality is that replacing general aptitude tests with achievement tests makes the kinds of comparisons that schools need among different candidates impossible.


When all students take the same tests we can compare their scores on the same basis.

The SAT and ACT were constructed specifically to be suitable for a wide range of curricula.

SAT–Math is based on mathematics no more advanced than 8th grade.

Contrast this with what would be the case with achievement tests.

There would need to be a range of tests, and students would choose a subset of them that best displayed both the coursework they have had and the areas they felt they were best in.

Some might take chemistry, others physics; some French, others music.

The current system has students typically taking three achievement tests (SAT-II).

How can such very different tests be scored so that the outcome on different tests can be compared?


Do you know more French than I know physics?

Was Mozart a better composer than Einstein was a physicist?

How can admissions officers make sensible decisions through incomparable scores?


How are SAT-II exams scored currently?

Or, more specifically, how had they been scored for decades when I left the employ of ETS nine years ago? I don’t know if they have changed anything in the interim.

They were all scored on the familiar 200-800 scales, but similar scores on two different tests are only vaguely comparable.

How could they be comparable?

What is currently done is that tests in mathematics and science are roughly equated using the SAT-Math, the aptitude test that everyone takes, as an equating link.

In the same way tests in the humanities and social sciences are equated using the SAT-Verbal.

This is not a great solution, but is the best that can be done in a very difficult situation.

Comparing history with physics is not worth doing for even moderately close comparisons.


One obvious approach would be to norm reference each test, so that someone who scores average for all those who take a particular test gets a 500 and someone a standard deviation higher gets a 600, etc.

This would work if the people who take each test were, in some sense, of equal ability.

But that is not only unlikely, it is empirically false.
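
A minimal sketch of why norm referencing each test separately fails when the groups taking the tests differ in ability. The raw scores are hypothetical, and they are assumed (for illustration only) to sit on a common proficiency yardstick, which is exactly what real achievement tests in different subjects lack. The function name `norm_reference` is also an assumption of this sketch.

```python
from statistics import mean, pstdev


def norm_reference(raw_scores):
    """Rescale scores so the group mean maps to 500 and one SD maps to 100 points."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)
    return [500 + 100 * (r - m) / s for r in raw_scores]


# Hypothetical raw proficiency for the examinees who chose each test.
weaker_group = [20, 30, 40]    # examinees choosing test A
stronger_group = [60, 70, 80]  # examinees choosing test B

scaled_a = norm_reference(weaker_group)
scaled_b = norm_reference(stronger_group)

# The average examinee in each group gets a 500, even though the
# groups differ by 40 raw points on the assumed common yardstick.
print([round(v) for v in scaled_a])
print([round(v) for v in scaled_b])
```

Both groups end up with the same scaled scores, so a 500 on one test and a 500 on the other say nothing about which examinee is more proficient.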

The average student taking the French achievement test could starve to death on the Boulevard Raspail, whereas the average person who takes the Hebrew achievement test, if dropped onto the streets of Tel Aviv in the middle of the night would do fine.

Happily the latter students also do much better on the SAT-Verbal test and so the equating helps.

This is not true for the Spanish test, where a substantial portion of those taking it come from Spanish speaking homes.


Substituting achievement tests is not a feasible option unless admissions officers are prepared to have subject matter quotas.

Too inflexible for the modern world I reckon.


Recommendation 3. Halt the use of a cut-score on the PSAT to qualify for Merit Scholarships

One of the principal goals of the Merit Scholarship program is to distribute a limited amount of money to highly deserving students without regard to their sex, ethnicity, or geographic location.

This is done by first using a very cheap and wide ranging screening test.

The PSAT is dirt-cheap and is taken by about 1.5 million students annually.

The Commission objected to a rigid cut-off on the screening test.

They believed that if the cut-off was, say, at a score of 50, we could not say that someone who scored 49 was different enough to warrant excluding them from further consideration.

They suggested replacing the PSAT with a more thorough and accurate set of measures for initial screening.


The problem with a hard and fast cut score is one that has plagued testing for more than a century.

The Indian Civil Service system, on which the American Civil Service system is based, found a clever way around it.

The passing mark to qualify for a civil service position was 20.

But if you received a 19 you were given one ‘honor point’ and qualified.

If you scored 18 you were given two honor points, and again qualified.

If you scored 17, you were given three honor points, and you qualified.

But if you scored 16 you did not qualify, for you were four points away.
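
The honor-point rule just described can be written out as a short sketch. The constant and function names are illustrative assumptions; only the passing mark of 20 and the limit of three honor points come from the description above.

```python
PASS_MARK = 20        # nominal passing mark
MAX_HONOR_POINTS = 3  # most honor points that could be awarded


def honor_points_needed(score: int) -> int:
    """Honor points required to lift a score up to the passing mark."""
    return max(0, PASS_MARK - score)


def qualifies(score: int) -> bool:
    """A candidate qualifies if the shortfall can be covered by honor points."""
    return honor_points_needed(score) <= MAX_HONOR_POINTS


for score in range(15, 22):
    print(score, honor_points_needed(score), qualifies(score))
```

Written this way, the rule is equivalent to a hard cut score of 17: the honor points soften the appearance of the cut-off without removing it.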


I don’t know exactly what the logic was behind this system, but I might guess that experience had shown that anyone scoring below 17 was sufficiently unlikely to be successful in obtaining a position that it was foolish to include them in the competition.

But having a sharp break at 16 might have been thought too abrupt and so the method of honor points was concocted.


How does this compare with the Merit Scholarship program?

The initial screening selects 15,000 (top 1%) from the original pool.

These 15,000 are then screened much more carefully using both the SAT and ancillary information to select down to the 1,500 winners (the top 10% of the 15,000 semi-finalists).


Once this process is viewed as a whole several things become obvious:

1. Since the winners are in the top 0.1% of the population, it is virtually certain that they are all enormously talented individuals.

2. There will surely be many worthy individuals that were missed, but that is inevitable if there is only money for 1,500 winners.

3. Expanding the initial semifinal pool by even a few points will expand the pool of semi-finalists enormously (the normal curve grows exponentially), and those given the equivalent of some PSAT “honor points” are extraordinarily unlikely to win anyway, given the strength of the competition.
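
The arithmetic behind these three points can be checked with a rough sketch. It uses the counts quoted in the talk (about 1.5 million test takers, 15,000 semi-finalists, 1,500 winners) and assumes, purely for illustration, that screening scores are normally distributed; cut-off shifts are expressed in standard deviation units rather than PSAT points, since the talk gives no score scale for the real cut.

```python
from statistics import NormalDist

test_takers = 1_500_000
semifinalists = 15_000
winners = 1_500

semifinal_rate = semifinalists / test_takers  # top 1%
winner_rate = winners / test_takers           # top 0.1%

std_normal = NormalDist()  # assumed score distribution, in SD units
cutoff = std_normal.inv_cdf(1 - semifinal_rate)


def pool_size(cutoff_z: float) -> float:
    """Expected number of test takers scoring above a cut-off given in SD units."""
    return test_takers * (1 - std_normal.cdf(cutoff_z))


print(f"semi-final rate: {semifinal_rate:.1%}, winner rate: {winner_rate:.1%}")
for drop in (0.0, 0.1, 0.2, 0.3):
    size = pool_size(cutoff - drop)
    print(f"cut-off lowered by {drop:.1f} SD -> about {size:,.0f} semi-finalists")
```

Under these assumptions, lowering the cut-off by only a few tenths of a standard deviation more than doubles the semi-final pool, while the number of scholarships stays fixed at 1,500.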


What about making the screening a more rigorous process – rather than just using the PSAT scores?

Such a screening must be more expensive, and to employ it as widely would, I suspect, use up much more of the available resources, leaving little or nothing for the actual scholarships.

The irony is that a system like that proposed by the Commission would either have to be much more limited in its initial reach, or have to content itself with giving out many fewer scholarships.

Of course, one could argue that more money should be raised to do a better job in initial screening.

I would argue that if more money were available, the same method of allocating should be continued and used to either give out more scholarships or bigger ones.


This completes more of the reasoning behind my initial conclusion that some of the recommendations of the Commission only made sense if you said them fast.

I tried to slow things down a bit.