
RR-00-15

GRADES AND TEST SCORES:
ACCOUNTING FOR OBSERVED DIFFERENCES

Warren W. Willingham
Judith M. Pollack
Charles Lewis

September 2000

RESEARCH REPORT

Princeton, New Jersey 08541


Grades and Test Scores: Accounting for Observed Differences

Warren W. Willingham, Judith M. Pollack, and Charles Lewis

September 2000


Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from the

Research Publications Office
Mail Stop 07-R
Educational Testing Service
Princeton, NJ 08541


Abstract

Why do grades and test scores often differ? A framework of possible differences was

proposed. An approximation of the framework was tested with data on 8454 high school

students. Individual and group differences in grade versus test performance were substantially

reduced by focusing the two measures on similar academic subjects, correcting for grading

variations and unreliability, and adding teacher ratings and other information about students.

Concurrent prediction of high school average was thus increased from .62 to .90; differential

prediction was reduced to .02 letter grades. Grading variation was a major source of discrepancy

between grades and test scores. The analysis suggested Scholastic Engagement as a promising

organizing principle in understanding student achievement. It was defined by three types of

observable behavior: employing school skills, demonstrating initiative, and avoiding competing

activities. Groups differed in average achievement, but group performance was generally similar

on grades and tests. If artifactual differences between the two measures are not corrected,

common statistical estimates of test validity and fairness are unduly conservative. Different

characteristics give grades and test scores complementary strengths in high-stakes assessment.

(Key words: validity, school achievement, scholastic engagement, group differences, grading,

differential prediction)


Contents

Introduction

Differences in Grades and Test Scores
    A framework
    A five-factor approximation

Previous Research
    Grading practices
    Student characteristics
    Teacher ratings
    Implications for this study

Study Design
    The sample
    Tests and grade averages
    Other variables in the analysis

Statistical Analysis
    Adjusting for grading variations
    Estimating reliability

Results of the Analyses
    The effects of factors 1 to 5
    Differential prediction
    A condensed analysis of major factors
    Gender, ethnicity, and school program

On Four Noteworthy Findings
    Accounting for grade-test score differences
    The problematic variation in school grading
    Scholastic engagement as an organizing principle
    Group performance: Similar dynamics, different levels

The Merits of Grades and Tests
    Validity and fairness
    Differential strengths

Summary

References
Author Note
Figures and Tables
Appendices:
    A. Descriptive statistics: Tables A-1 to A-8
    B. Student variables: Acronyms and specifications
    C. Notes


Introduction

Many of the most important educational decisions we make about young people

concern those summative, often irreversible, judgments regarding student entry or exit from

programs or institutions. Who will be placed in a slow or fast track in grade school, or earn a

high school diploma, or be accepted in a selective college, or advance to the upper division, or

flunk out, or be admitted to a demanding graduate or professional program? Grade averages

and test scores are the two types of evidence most commonly used in supporting these high-

stakes judgments.

Tests are routinely evaluated for such educational purposes, grades less systematically.

When a cumulative grade record is used in reaching important educational decisions, it

becomes, in effect, a predictor or criterion. In this capacity grades take on an assessment

function both broader than and different from the teachers’ original evaluations of their students’

acquired proficiency in a particular subject in a given class. In serving the broader function,

grade averages have virtues as well as limitations. Understanding such characteristics of

grades is important to the valid use of test scores as well as grade averages because, in

practice, the two measures are often intimately connected.

The uses of grades and tests are interdependent in a number of ways. Teachers use

classroom tests in assigning grades. Administrators use standardized tests in monitoring

grading standards and in evaluating grade differences between students and among groups of

students. On the other hand, we use grades to validate tests, and we use grades to judge the

fairness of tests. In many situations test sponsors urge the use of the grade record and the test

score together in order to enhance the validity and fairness of important educational decisions.


Despite these obvious interdependencies, there are odd contradictions in the ways we

view tests and grades. Note the paradox: In some educational contexts we use tests to keep

grade scales honest or because we do not fully understand or trust grades to be an accurate

indicator of educational outcomes. But we also reverse those sentiments and use grades to

demonstrate the usefulness of tests and to justify their use. One likely source of the

contradiction is the tendency, among educators and measurement specialists alike, to assume that

a grade average and a test score are, in some sense, mutual surrogates; i.e., measuring much

the same thing, even in the face of obvious differences.1

One manifestation of that implicit assumption is the common inclination to treat an

improvement in grade prediction as the dominant, if not the sole, basis for validating a high-

stakes admissions test and justifying its use. For example, it may be hard to sell the

substitution of a new test predictor that has important educational advantages unless it is

clearly equal to or preferably stronger than a current measure in predicting GPA (i.e.,

normally, the surrogate of primary interest). Similarly, debate over the added value of an

admissions test often focuses only on its incremental predictive validity over the already

available prior grade record, overlooking other important educational considerations that may

hinge on intrinsic differences between the grade record and test scores.

A more telling instance of the implicit assumption that grade criteria and test scores

are mutual surrogates lies in the formal professional definition of test bias. With few

qualifications, a test is considered biased for a group of examinees if it predicts a mean

criterion score that differs from the actual criterion mean (American Educational Research

Association, American Psychological Association, National Council on Measurement in

Education, 1985; American Psychological Association, American Educational Research


Association, National Council on Measurement in Education, 1999). This interpretation

leaves little room for differential prediction due to the test predictor and the grade criterion

incorporating somewhat different construct-relevant components or due to technical artifacts

such as unreliable predictors (Linn & Werts, 1971). In their defense, measurement specialists

may read "predictive bias" simply as “different result.” The more common interpretation is

"something wrong with the test,” regardless of why the results are different.

Presumably, both grades and test scores have characteristic strengths. It would be

useful to have a better understanding of those strengths, when one measure might be superior

to the other, and in what ways their joint use might be advantageous. Insuring the validity and

fairness of one requires an appreciation of issues concerning the validity and fairness of the

other. Despite the widespread use of both grades and tests as indicators of educational

achievement, we have quite different habits and expectations regarding the standards to which

we hold these two measures. National agencies and special commissions give careful

attention to the technical quality and proper use of tests in high-stakes decisions, but seldom

grades (Gifford & O’Connor, 1992; Heubert & Hauser, 1999; Office of Civil Rights, 1999;

Wigdor & Garner, 1982).

A substantial body of professionals, with varied interests and agendas, devotes most of

its time and attention to tests. We study what tests measure, how to evaluate and improve

their quality, and how to insure the validity and fairness of test scores. We have extensive

debates, scholarly literature, and textbooks on test theory, standards, and practice. All of this

is as it should be. Testing is a public enterprise. To be sure, researchers have carried out

useful studies of grades and grading, but nothing to compare with the systematic attention

devoted to tests. This is not to say that tests are fine and grades are a mess.


Testing has a long history of public controversy (Cronbach, 1975; Linn, 1982b).

Many critics—both within and without the profession—have discussed the technical

shortcomings and the social concerns that testing engages (Crouse & Trusheim, 1988;

Frederiksen, 1984; Jencks, 1998; Lemann, 1999; Madaus, 1994; Shepard, 1992b). These

issues notwithstanding, objective measures of school achievement have obvious benefit—

especially for high-stakes selection (Beatty, Greenwood, & Linn, 1999, pp. 20-22) and as

policy instruments to foster educational accountability (Heubert & Hauser, 1999, pp. 33-40).

Heubert and Hauser (p. 1) described the current interest in testing to promote accountability:

The use of large-scale achievement tests as instruments of educational policy is

growing. In particular, states and school districts are using such tests in making high-

stakes decisions with important consequences for individual students. Three such

high-stakes decisions involve tracking (assigning students to schools, programs, or

classes based on their achievement levels), whether a student will be promoted to the

next grade, and whether a student will receive a high school diploma. These policies

receive widespread public support and are increasingly seen as a means of raising

academic standards, holding educators and students accountable for meeting those

standards, and boosting public confidence in the schools.

This is not a new development. Linn (2000) described the use of tests as key elements

in five waves of educational reform during the past fifty years. These included tracking and

selection in the 1950s, program accountability in the 1960s, minimum competency programs

in the 1970s, school and district accountability in the 1980s, and standards-based

accountability in the 1990s. The most recent reform effort is accompanied by a strong

emphasis on improving teaching and learning through improved assessment.


One goal in assessment reform is to establish more direct linkages between

instructional objectives and the content and process of testing (J. Frederiksen & Collins,

1989; N. Frederiksen, Glaser, Lesgold, & Shafto, 1990; Resnick & Resnick, 1992).

Another goal is to focus assessment on established educational standards—in both the

educational system and the individual classroom (Baker & Linn, 1997; Shepard, 2000). A

third goal is to broaden the range of assessment formats and the skills thereby engaged

(Bennett & Ward, 1993; Wiggins, 1989).

Assessment reform seeks measures that will better inform teaching and learning and

provide more useful feedback regarding the outcomes. Current initiatives imply

dissatisfaction with both grading and testing—and the need, one might say, to better realize

the strengths of each. To that end, this study endeavors to enhance our understanding of some

of the main ways in which grades and tests differ.

Our premise was that it should be possible to account for much of the difference

observed between grades and test scores. This study had several purposes: to suggest a

framework that might help to explain major differences between grades and scores, to

evaluate an approximation of the framework in a national database, to test its generality

among different groups of students, and to examine possible implications of the findings as to

the respective merits of grades and scores in high-stakes decisions.


Differences in Grades and Test Scores

Explaining why students often perform somewhat differently on classroom grades and

standardized tests calls for some form of framework. Developing a framework poses two

challenges. One task is to describe how such a framework might look in theory. Another,

somewhat different, task is to devise an approximation that can be evaluated with real data.

We address those tasks in turn.

A Framework

In considering a framework to compare two generic measures like grades and test

scores, it is useful to consider two basic aspects of educational measurement. First, any

measure that is used in making high-stakes decisions about individual students serves two

overlapping but distinguishable functions: selective and educative. Either of these functions

may be the primary purpose of the measure. A school-leaving test may be one basis for

deciding who graduates from high school, but its primary purpose may be to further certain

educational objectives. A college admissions test may influence high school instruction, but

its primary purpose is normally to facilitate selection.

Since our presenting question is why individual students or groups of students often

perform somewhat differently on grades and test scores, we are concerned with factors that

bear on the selective function. That is, what factors cause the selective function of grades and

test scores to identify somewhat different high or low scorers? Such factors may affect the

educational quality of grades or tests as well, but it is important to bear in mind that additional

features of grades and tests, not considered here, will also affect the quality of the measures

and the effects of their use.


Both a grade record and a test may have good or poor educative qualities, independent

of whether they rank students similarly or differently. Other features may primarily

determine what knowledge and skills are being acquired and how grading and testing affect

teaching and learning. This study focuses on why students often rank differently on the two

measures. Thus, the analysis is based on patterns of individual and group differences in

assessment outcomes rather than content differences in the measures themselves.

The corollary measurement consideration is how the specific differences between

grades and tests that result in different ranking of students may also influence the validity, and

necessarily the fairness, of each measure. Two critical aspects of the validity of a high-stakes

measure can be usefully contrasted as content relevance and fidelity—does the measure assess

desirable aspects of achievement and does it do so accurately? These two qualities play an

important role in evaluating results in any framework proposed, because content relevance

and fidelity provide the basis for drawing validity and fairness implications from specific

sources of grade-test score differences.2

The relevance of a measure refers to how well it represents the domain of pertinent

knowledge and skills while excluding knowledge and skills that are irrelevant to the

measurement objective (Messick, 1989; 1995). Relevance determines not only the short-term

usefulness of a measure for high-stakes decisions, but also its long-term importance in

developing human resources and its antecedent effects on the priorities of teachers and

learners who know that the test is coming. A measure's fidelity includes its reliability from

one testing to another, its comparability from one situation to another, and its security from

cheating and compromise—all of these being socially demanded aspects of accuracy and

dependability in the actual use of a high-stakes measure. With these measurement


distinctions in mind, we turn now to the problem of identifying those factors that are most

likely to cause the selective function of grades and test scores to operate differently.

For many years researchers have sought to understand what factors are associated with

favorable educational outcomes. Research on this question readily lends itself to causal

modes of thinking; viz., what family circumstances and values promote achievement in

school, what student characteristics, habits, and attitudes lead to good grades? It is common

to include test scores in a longitudinal analysis of student development or prediction of future

grades, because the purpose is to "explain," in a pragmatic statistical sense, what accounts for

the educational gain or the achievement above or below expectation.

But this reasonable concern and line of inquiry does not mean that causal logic is the

most useful way to view the relationship between test scores and grades. Test scores do not

cause individual differences in grades, nor vice versa. If a test and a grade are intended to

represent much the same achievement, then individual differences in both measures

presumably result from much the same learning processes, influenced by much the same

environmental and genetic factors, and channeled by much the same cognitive differences and

personal interests. From that perspective, attaining a better understanding of the wellsprings

of achievement will not necessarily help in understanding differences in performance on

grades and test scores.

For the purposes of this study, we pose a somewhat different question, "How does the

composition of grades and test scores differ and what are the implications of those

differences?" The two measures are presumably somewhat similar composites of skill and

knowledge that are generally relevant to the achievement construct of interest plus some other

sources of construct-irrelevant variance. From this view, grades and test scores are correlated


only moderately because the elements of the two composites overlap only partly. It is

important to bear in mind that, for our purposes here, a different composition is important

only because the different components affect individuals and groups differently. That is,

components interact with other individual characteristics such as behavior and background.
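
One way to make this view concrete is a simple composite sketch (illustrative only, not a model fit in this study). Suppose grades and test scores share a construct-relevant core \(A\) but carry their own specific components and errors:

\[
G = A + S_G + e_G, \qquad T = A + S_T + e_T .
\]

If the specific components and errors are mutually uncorrelated, then \(\mathrm{Cov}(G,T)=\sigma_A^2\) and \(\rho_{GT} = \sigma_A^2/(\sigma_G\,\sigma_T)\): the observed correlation is capped by the share of each composite that the common core occupies.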

Figure 1 suggests a framework of possible sources of difference between grades and

test scores. Categories A and C refer to the composition of grades and test scores—the former

to content differences between the two measures, the latter to different types of error in both.

Category B refers to related individual differences that play an important role in converting

the content differences of Category A into score differences. Category D refers to situational

differences that increase the possibility of content differences and divergent patterns of

individual and group differences.

[Insert Figure 1 about here]

Figure 1 is not intended to be comprehensive or to represent fully the goals or the

outcomes of education. The sources of difference overlap, and the framework focuses on

major areas of difference, not details. Furthermore, we are concerned here with possible

sources of the observed differences between the two measures as we normally encounter

them, not with noticeably improved grades or tests that we might sometimes encounter or

hope to develop. From a measurement perspective and on the basis of what is widely known

about grades and test scores, most of the sources here suggested are commonsensical. In

some cases, a substantial body of pertinent research literature can be consulted for clues as to

how the source of difference actually works. It is reasonable to assume, however, that these


various types of grade-test score differences may work differently for different groups of

students or academic programs.

If it is expected that a grade and a test score should rank students similarly, perhaps

the most obvious implicit assumption is that the two measures encompass knowledge and

skills based on a generally similar academic domain as implied by Category A.1 in Figure 1.

Similar broad areas of competence might be sufficient to place students in much the same

order if the domain were broad, but a closer correspondence of specific subject matter and

skills would be required if the domain were narrow. For example, consider comparing

student performance on a grade average and a test battery, each based on a number of

academic subjects. For that particular comparison, having free-response word problems on

one and multiple-choice equation solving on the other might constitute a sufficiently similar

representation of mathematics. Those elements would not be sufficiently similar if one were

only comparing performance in mathematics.

Differences between internal and external tests must also be considered. The

construct-relevant knowledge and skills typically represented on external standardized tests

are surely not identical to those typically found in the local classroom tests on which teachers

base their grading. It is reasonable to expect more difference between grades and

performance on external tests than on local tests that more closely reflect the local syllabus as

well as the teacher’s particular view of the subject and how it should be assessed. Some

teachers may be performance oriented in their testing and grading and, for that reason, tend to

stress written or oral presentation. Skills involved in such assessment are not frequently

represented on external tests. To be sure, some teachers will lean toward assessing knowledge and

problem solving skills more similar to those typically represented on an objective test.


Nonetheless, assessment format is, like subject matter, a likely source of relevant or irrelevant

skills that differ somewhat between grades and test scores.

Category A.2 suggests an inherent distinction between a teacher’s grade and an

external test score that stems from different purposes of the two measures. To aid learning

and instruction, grades reflect specific knowledge and skills stressed in the particular classes

that a given student takes. To foster accountability, standardized tests provide outcome

assessments that are comparable across schools (Shepard, 2000). This distinction will result

in different performance on grades and test scores among students with different schooling

because those students experience somewhat different curricula and learning

situations, and also because students respond to school differently. Normally, teachers assign

grades largely on the basis of quizzes and examinations on the lessons that they assign in

class. A student may know a good deal about the subject, but if poorly motivated to study the

material assigned, he or she is less likely to correctly answer the teacher's specific questions

about that particular material and will be graded accordingly.

An external test in a given subject area does not represent the specific knowledge and

skills that characterize a particular learning experience, but tends to represent content typical

of a generic course in the subject, wherever it is taught. Some students might make a

reasonably good score on such a generic course-based test due to having learned applicable

knowledge and skills outside of school or some years earlier. Other students may have a

mediocre command of the course but earn a good grade in the class by working hard on the

particular material and exercises presented by their teacher.

Different patterns of individual and group differences will depend, of course, on the

interaction of content differences and other sources of individual differences. Thus, the extent


to which students do relatively better on classroom-based grades as compared to course-based

tests will tend to depend on their total learning experiences (Category B.1), their dedication to

schoolwork (B.2), and their teacher’s judgment as to how well they have performed in class

(B.3).

Category A.2.c in Figure 1 represents a related fundamental difference between grades

and test scores. To some degree, all students individualize their academic pursuits, but most

educators subscribe to common learning goals and standards within an educational jurisdiction.

As a matter of principle, non-traditional education places high value on the individual

assuming responsibility for his or her learning. From this perspective, personal development

for each student is a more important goal of education than is mastery of prescribed

knowledge and skills by all students (Keeton and Associates, 1976). In theory, grades or

some parallel form of individualized assessment can readily recognize special learning and

accomplishment. To the extent that education is individualized, a standardized test will tend

to yield results somewhat different from an individualized grade assessment (Whitaker, 1989).

Category A.3 refers to other elements that may be represented in the grade or the test

score but are not formally part of the knowledge and skills that define the subject domain. A

social objective of education such as enhancing good citizenship would be one example

(A.3.a), though it seems doubtful that either grades or tests are much influenced by such

outcomes. In the case of grades, students may receive credits or deductions for particular

behavior that is more directly connected with schoolwork. Examples include attendance,

class participation (or disruption), turning in homework assignments, other evidence of

diligence and progress, contributions to the learning environment, and so on (A.3.b). These

elements may or may not influence subject mastery. Their construct relevance in grading


stems from their pertinence to a broader definition of education that includes personal

development and conative aspects of learning like volition and effort.

When teachers assess learning outcomes for individual students or consider awarding

credits or making deductions from grades, the teacher’s judgment is an additional source of

variation (Category B.3). Positive halo or negative bias may play an unconscious role—

clearly a potential construct-irrelevant component of grading. Other construct-irrelevant

components may be connected with the test format or the assessment process, whether it is an

external test or a classroom test or graded exercise (Category A.3.c). Examples include test

wiseness and anxiety in testing or performing situations. Such elements can influence grades

or scores either positively or negatively, and the effects are likely to vary considerably with

the specific situation.

In Category A of Figure 1, one could also include other types of knowledge and skill

that may be quite relevant to a specific syllabus but are not routinely considered academic

subject matter; for example, cooperative learning, physical development, or religious

teaching. In a particular program, such educational outcomes may well be represented in

either grades or tests.

Error in grades and scores. Two types of error cause discrepancies between observed

grades and test scores: systematic and unsystematic. Noncomparability is the important form

of systematic error (C.1.a). An extensive literature over several decades has documented

substantial differences in grading standards from school to school and college to college. The

corresponding problem with test scores (C.1.b) can apply if, for example, a practitioner

compares scores from different tests that have the same names, or uses percentile scores based

on the wrong norm group, or makes some consequential equating error (Hartocollis, 1999).


Another serious, and evidently more prevalent, type of noncomparability is the

possibility of a change in test difficulty over time due to increased familiarity with a particular

test form following its repeated use (Cannell, 1988). Cheating (C.1.c) can be another form of

noncomparability—evidently a common student practice in school, but often notorious when

it involves a high-stakes test. When stakes go up, schools can also engage in questionable

practices, whether tests or grades are involved (Saslow, 1989; Wilgoren, 2000).

Unreliability is, by definition, unsystematic measurement error (C.2). Like all

educational measures, both grades and test scores are subject to such error. While there are

many sources of measurement error (Thorndike, 1951), variation in the likelihood that a

student will know the answer to a particular question is probably the most important in the

context of this study. Unreliability is independent of any noncomparability among grades or

test scores, though both attenuate the observed relationship between grades and test scores.

Finally, if grades and tests are expected to reflect a similar level of performance, they

need to be based on concurrent learning in a similar context (Category D). Otherwise,

performance differences may be due to situational variation in the student’s motivation or

other differences associated with the particular learning experience.

A particular characteristic that distinguishes grades and test scores may look different

in different situations. For example, assume that one is considering whether the use of an

essay versus a multiple-choice test has any differential effect on grades and test scores. That

choice of assessment format might look like a difference in test-taking skills (A.3.c) in a

course such as physics where writing is often incidental to knowing the correct answer. On

the other hand, writing would be a construct-relevant cognitive skill in English (A.1.b), where


compositional expertise is likely to be a critical element among the intended learning

outcomes.

In a given assessment such details may loom large. But for the purposes of this study,

the immediate question is how to delineate and quantify the major sources of discrepancy so

as to study their effects, preferably all at the same time. More specifically, how might the

most important factors in Figure 1 be represented with sufficient validity in real data available

in a large database?

A Five-Factor Approximation

Any proposal to study the effects of the various sources of grade-test score

discrepancy poses some nontrivial problems. Of the sources listed in Figure 1, some can only

be partly estimated, some can only be estimated indirectly, some are not included in the

database we proposed to use, and for some there are simply no data available. Fortunately,

the possible sources overlap and the ones that can be approximated are likely to be the more

important ones. Consequently, the analysis reported here is an approximation based upon the

following five factors:

• Factor 1. Subjects Covered

• Factor 2. Grading Variations

• Factor 3. Reliability

• Factor 4. Student Characteristics

• Factor 5. Teacher Ratings

These five factors are described below. It is first necessary to comment briefly on the

analytic model and the database. Principal aims of this study were to determine to what

extent the proposed factors can account for differences in the rank order of students on


corresponding grades and test scores and to evaluate what role each factor plays in that

regard. This is why the study examines differences between grades and test scores by

analyzing patterns of individual and group differences rather than analyzing content or

structural differences between the two measures.

The five-factor approximation derives from the assumption that grades and test scores

have different constituent parts, which, in turn, have different effects on the observed

performance of individual students. The grades and test scores of students should correspond

more closely—that is, the measures should be more highly correlated—if one could alter the

components or statistically adjust the two composite measures so that they are similarly

constituted. Prediction is a useful analytic framework with which to initiate the inquiry

because it asks, simply, "What must one add to one measure in order to account for variation

in the other?"
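
In regression terms, that question amounts to watching the multiple correlation with the grade average rise as blocks of predictors are added to the test composite. The following sketch uses synthetic data and invented variable names (nothing here comes from NELS); it shows only the logic of the incremental question:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
tests = rng.normal(size=(n, 4))    # stand-ins for four test scores
habits = rng.normal(size=(n, 2))   # stand-ins for student-characteristic indices
gpa = tests.sum(axis=1) + habits.sum(axis=1) + rng.normal(scale=2.0, size=n)

def multiple_r(X: np.ndarray, y: np.ndarray) -> float:
    """Multiple correlation of y with the columns of X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return float(np.corrcoef(X1 @ beta, y)[0, 1])

# each added predictor block can only raise the multiple R
print(round(multiple_r(tests, gpa), 2))                             # tests alone
print(round(multiple_r(np.column_stack([tests, habits]), gpa), 2))  # tests + habits
```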

Consequently, there is considerable emphasis throughout this report on the size of the

grade-test correlation, the magnitude of group differences, and other such statistical

characteristics of the two measures. It is important to remember the limited objectives of this

analysis. Any attempt to assess the overall validity of grades and tests would also stress the

character of the measures, their educational and psychometric qualities, and the effect of their

use in high-stakes decisions. Toward the end of this account, we return to that thought.

The five-factor approximation, the statistical procedure, and the aims of the study all

call for an unusual database. No available database can fully fill the bill, but the National

Education Longitudinal Study of 1988 (NELS) affords remarkably rich information about a

national sample of students who graduated from high school in 1992. NELS provides critical

information of several types: student background and personal characteristics, test scores in


four basic academic areas, full course-grade transcripts throughout high school, teacher

ratings of individual students, plus other useful data from school records. NELS has some

shortcomings, which will be explained, but compared to any other database, it is particularly

well suited to examining differences between grades and test scores.

While NELS provides useful information that bears upon each of the five factors, the

data available do not map exactly onto the factors postulated. In order to evaluate the effect of

the factors, the task is to approximate the role of each in a manner that is compatible with

multiple regression analysis. There are three means of doing so. The most obvious

approach—not always possible—is to index a factor; that is, represent it as a score for each

student that can be included as a variable in the analysis. Another method is to define or

correct the grade or test score so as to reduce apparent differences in their constitution.

Finally, some factors can be handled as statistical adjustments. Thus, representing each factor

in the analysis as accurately as possible involves somewhat different steps and assumptions.

The following paragraphs are intended to indicate only briefly the general manner in

which the five factors are here approximated. Specific procedures are described in more

detail in two subsequent sections concerning Study Design and Statistical Analysis. Factors

1, 2, and 3 are somewhat different from Factors 4 and 5. The former three concern

mismatched material and error components that make grades and test scores less comparable.

The latter two concern student behavior and other characteristics that were postulated to have a

heavier bearing on grade performance than on test performance.

As previously observed, the mutual validity of grades and test scores as equivalent

measures can be compromised either by lapses in fidelity or in content relevance. Factors 2

and 3 bear upon the fidelity of the measures, because they represent some degree of inevitable


error in both grades and test scores. Factors 1, 2, and 5 bear upon the relevance of the two

measures, because they represent potential content differences between grade and tests.

Factor 1. Subjects Covered. The NELS survey provides a complete transcript for

each student including traditional academic subjects, vocational subjects, and service courses

like driver training and physical education. The NELS tests cover a more restricted range of

academic subjects: reading, mathematics, science, and social studies. A reasonably good

match is attained between an overall grade and a test composite for each student by defining

the measures as follows: a) restrict the grade average to the four “New Basics” subject areas

(Ingels et al., 1995) that correspond most closely to the four NELS tests—courses in English,

mathematics, science, and social studies, and b) weight the four test components so as to

optimally reproduce the students’ rank order on the grade average.
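
A minimal sketch of step (b), on synthetic data (the scores and weights are invented, not the NELS values): regressing the grade average on the four test components yields the weights that make the composite reproduce students' standing on grades as closely as any linear combination can.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
tests = rng.normal(size=(n, 4))   # hypothetical reading, math, science, social studies scores
gpa = tests @ np.array([0.2, 0.5, 0.2, 0.1]) + rng.normal(scale=0.7, size=n)

# least-squares weights for the four components (intercept in column 0)
X = np.column_stack([np.ones(n), tests])
w, *_ = np.linalg.lstsq(X, gpa, rcond=None)
composite = X @ w
print("component weights:", np.round(w[1:], 2))
print("grade-composite correlation:", round(np.corrcoef(composite, gpa)[0, 1], 2))
```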

Factor 2. Grading Variations. In cross-sectional data grading standards can vary

across situations associated with different instructors, sections, courses, programs, schools,

and several possible interactions among those. The NELS transcript database permits analysis

of grading variations across schools and courses within schools, which likely account for the

more consequential errors due to differences in grading. Variations in school grading can be

corrected by carrying out regression analyses within schools and adjusting the pooled results

for restriction in range. In this manner school differences in either grades or test scores do not

come into play. The effect of course-grading differences can be indexed for each student

according to the leniency or strictness of grading (i.e., average grades in relation to average

test scores) in the courses that he or she took. The resulting index is then used to correct the

student’s grade average. Variation in test score scales is not likely to be significant in the

NELS data because all students took parallel and equated forms of the same test.
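
The course-grading index lends itself to a short sketch. Everything below is hypothetical scaffolding (column names, toy values, the assumption that grades and test scores are already standardized to a common scale); it simply mirrors the verbal definition: a course's leniency is its students' average grade relative to their average test score, and a student's correction averages the leniency of the courses he or she took.

```python
import pandas as pd

def leniency_by_student(records: pd.DataFrame) -> pd.Series:
    """records: one row per (student, course), with 'grade' and 'test'
    columns assumed already standardized to a common scale."""
    by_course = records.groupby("course")
    # a course's leniency: mean grade minus mean test score of its students
    leniency = by_course["grade"].mean() - by_course["test"].mean()
    tagged = records.assign(leniency=records["course"].map(leniency))
    # a student's correction: average leniency over his or her courses
    return tagged.groupby("student")["leniency"].mean()

# toy usage; the result would be subtracted from each student's grade average
df = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2"],
    "course":  ["alg", "eng", "alg", "eng"],
    "grade":   [0.5, 1.2, -0.3, 0.8],
    "test":    [0.6, 0.4, -0.2, 0.1],
})
print(leniency_by_student(df))
```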


Factor 3. Reliability. Measurement error in both grades and test scores is

independent of any systematic scale differences due to grading variations. This source of

error cannot be indexed by student but can be taken into account by traditional corrections for

attenuation. Test reliabilities are available from NELS. Grade reliabilities can be estimated

through appropriate analysis of course-grade records.
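
The traditional correction in question is the classical disattenuation formula: with reliability \(r_{GG'}\) for grades and \(r_{TT'}\) for the test, the correlation corrected for measurement error in both measures is

\[
r_{GT}^{\ast} \;=\; \frac{r_{GT}}{\sqrt{r_{GG'}\,r_{TT'}}} .
\]

For instance, with illustrative reliabilities of .85 and .90 (not the NELS figures), an observed correlation of .62 would correct to \(.62/\sqrt{.85 \times .90} \approx .71\).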

Factor 4. Student Characteristics. It is assumed that both grades and test scores

reflect, in varying degree, the learning that students acquire in their particular school

programs. There is no practical way to estimate the difference in specific knowledge and

skills that are represented in the NELS Test and the grade averages (A.2 in Figure 1).

Detailed comparison of the actual subject matter in the test and each student’s courses is not a

realistic alternative. It is possible, however, to index these differences indirectly by focusing

on student characteristics (B.2 in Figure 1) that help to predict the academic performance of

individual students. Adding this factor advances the goal of accounting for grades earned

because such characteristics are likely to be more highly related to grades than to test scores.

Research on why students often make grades higher or lower than expected on the

basis of test performance has a long history, which is briefly reviewed in the following

section. Student characteristics that appear to be most promising include family background,

personal attitudes, academic history, activities in school, and a variety of behaviors often

found to be related to school achievement. A number of such measures can be constructed

from the NELS database. The common objective is to include measures that may signify

higher or lower performance in school due to greater or lesser commitment to school. Some

of these self-reported characteristics are also related to grading pluses or minuses that students

often receive for certain behavior in school (A.3.b in Figure 1).


Factor 5. Teacher Ratings. Teachers are a primary source of information about the

academic achievement of students as well as their behavior in school. Information concerning

such matters as class participation and completing homework was also obtained from student

questionnaires, but teachers observe the behavior directly and may be more objective than the

students themselves. More important, teachers decide what counts, and teachers assign the

grades. Teacher ratings may also provide some indication of performance skills more likely

reflected in grades than in standardized test scores. Lastly, teachers’ ratings of students may

reflect two other types of performance more likely to be represented in grades than in standard

tests of subject knowledge and skill: educational objectives in the individual classroom and

individualized learning outcomes (A.2.b and A.2.c, respectively in Figure 1).


Previous Research

In reviewing previous research, our main objective was to inform and improve the

design of the proposed study. The most critical design questions concerned the selection and

definition of useful measures and the best modes of analysis in order to examine the effects of

the five factors proposed. Two of the factors posed limited options. Matching the subjects

covered by test and grades (Factor 1) was simply a matter of using measures already available

from NELS. Correcting for unreliability (Factor 3) required application of mostly standard

procedures. On the other hand, Factors 2, 4, and 5 presented many options and uncertainties.

A substantial research literature proved quite helpful in addressing design issues in the three

pertinent areas: grading practices, student characteristics, and teacher ratings.

Grading Practices

Grading has a rich and sometimes quirky past. In an entertaining account of that

history, Cureton (1971) cites instances to illustrate that current grading problems are not all

that new. There was a time, for example, when enforcing standards called for corporal

punishment. Cureton cites the Stuttgart Schulordnung of 1505 as specifying that, for this

purpose, school children should be instructed to bring in fresh rods from the forest each week.

Similarly, grade inflation has manifested itself in a manner fitting to the times. In the

19th century, a Virginia academy graded its students with these clear categories: optimus,

melior, bonus, malus, pejor, and pessimus. Cureton (1971, p. 2) quotes the president, Henry

Ruffner, regarding the continual tendency of teachers to mark inferior students too high.

“While optimus ought to have been reserved for students of the highest merit, [it] was

commonly bestowed on all who rose above mediocrity.” To counter this problem, the

president modernized the grading system to three categories—disapproved, approved, and


distinguished. Nevertheless, he reportedly mourned that “within two or three years, some bad

scholars were approved, and good scholars were nearly all distinguished” (p. 2).

Troublesome and often controversial issues swirl about grading because grades have

consequences, and diverse grading practices are often not regulated by any clear consensus

regarding principles and values. On some occasions, as in the student protest movement of

the 1960s, grading was a lightning-rod social issue. But in every decade, basic assumptions

about grading are regularly debated: the purposes of grading, the basis on which grades are

assigned, the standards imposed, the effects on teaching and learning, and the social

consequences. Such issues have regularly spawned scholarly and popular articles and books

bemoaning what’s wrong with grading and offering advice on what to do about it (Hills,

1981; Kirschenbaum, Simon, & Napier, 1971; Loyd, B. & Loyd, D., 1997; Milton, Pollio, &

Eison, 1986; Terwilliger, 1989; Vickers, 2000; and Warren, 1971).

Meanwhile, a variety of technical and research topics are periodically reviewed:

evaluation methods, criterion vs. normative standards, marking systems, as well as favored

reforms like pass-fail or contract grading (Geisinger, 1982; Natriello, 1992; and Thorndike,

1969). Similarly, regional or national surveys tally the latest evidence as to what schools and

colleges are actually doing regarding such matters (College Board, 1998; National School

Public Relations Association, 1972; Robinson & Craver, 1989). It is useful to distinguish two

aspects of grading: components and standards.

Grading components. For the purposes of this study, “What counts?” in the eyes of

teachers is an especially germane aspect of grading practice. More specifically, what

additional components are represented in grades other than demonstrated mastery of

knowledge and skills that are pertinent to the objectives of the course? What extra factors


might be expected to result in discrepancies between a grade and a test primarily concerned

only with subject knowledge and skills? The other factors that count and when they come

into play are likely influenced by the multiple purposes that grades are often intended to

serve.

Geisinger (1982) listed the diverse purposes of grading based on his review of

research and writing on the topic. His list agrees rather well with the opinions of college

faculty collected a decade later by Ekstrom and Villegas (1994). The following objectives

appear to be most prominent in educators’ minds: provide feedback useful to students in their

studies, help colleges make decisions about students and maintain standards, provide

information about students’ performance to other institutions and employers, motivate

students academically, and help students learn discipline for later work and adult life.

Several studies provide information concerning teachers’ opinions as to what

influences grade assignment. Frary, Cross, and Weber (1993) surveyed a random sample of

high school teachers in five academic areas in the state of Virginia. Tests and quizzes easily

had the most influence on grades, but several additional elements were considered important

or worth taking into account in determining final grades. Teachers endorsed

these factors with the following frequency: projects/papers—71%; daily homework—71%;

class participation—51%; exceptionally high or low effort—66%; laudatory or disruptive

classroom behavior—31%.

In varying degree, each of these added considerations implies behavior for which the

student may win or lose grade points irrespective of demonstrated knowledge and skill in the

subject. The practice of rewarding appropriate behavior applies particularly to special

projects and homework because such assignments help to serve broader pedagogical ends like


encouraging effort and initiative, or learning skills critical to the management of complex

tasks. The validity of such grades as a measure of the individual’s competence may be

compromised because work out of class often benefits from the cooperation and talents of

fellow students—either by instructional design or unsanctioned collusion. In any event, the

points thus won or lost represent one source of difference between grades and test scores.

A national survey by Robinson and Craver (1989) solicited information from 832 high

school districts on the influence of behavioral factors in grading. Regarding the role of

student effort, the districts reported as follows: Effort did not enter into grading—34%, was

included in the course grade—33%, or there was no uniform district policy—33%. Presumably

the absence of a district policy left the matter up to the teachers. In a more recent survey,

approximately half of the teachers reported that colleagues in their school pass a student who

has tried hard even if he or she has not learned what was expected (Public Agenda, 2000).

Some data suggest that, in higher education, formal policy is rarer than the practice of individual

teachers taking extenuating factors into account when they assign grades. Table 1 (Ekstrom

& Villegas, 1994) shows the frequency with which faculty reported that certain student

behaviors influence grades. On average, 30% of respondents said that it was informally

expected that faculty would take into account such factors as effort and timeliness, but only

7% reported any official policies regarding such practice.

[Insert Table 1 about here]

How important are different factors in assigning grades? Ekstrom and Villegas

summarized their data as follows. Among 15 types of evidence, tests and papers were

considered of great or very great importance in introductory courses by most of the seven


departments. Four other factors—subject-specific skills, meeting due dates, creativity,

attitude/effort, and class participation—were considered of moderate to great importance in at

least three of the seven departments. Two differences were distinguished with respect to

grading in advanced courses: faculty placed more importance on the four factors just

mentioned, and they considered the quality of students’ papers more critical than their

classroom test performance.

Another study illustrates that grading can be a complex value-laden process when

different purposes of grading come into play. Brookhart (1993) posed to teachers such

grading questions as the following, “What do you do when a student with an A test average

fails to turn in a report that counts for 25% of the grade?” The answer varied among teachers

and circumstances and often turned on the importance attached to different objectives that

might be served.

In summary, grading procedures depend upon local policy and practice, and they have

varied over time. Surveys of schools and teachers clearly indicate that graders often consider

factors in addition to the student’s acquired knowledge and skill in order to serve other

educational objectives. To what degree such considerations directly affect grades is hard to

say. The factors that seem especially worth attention in the present analysis include

attendance, class participation, disruptive behavior, and completing work assignments.

Grading standards. The preceding section cited evidence of variation in the

components of grades, or the factors that teachers consider in grading. Researchers have also

accumulated considerable evidence of variation in grading standards; that is, the level of the

grades typically assigned for presumably comparable work by individual teachers or teachers

in a given subject, school, etc. Different types of evidence indicate that grading standards


vary: variation in the grades that different instructors assign to the same papers, variation

over time in the average grade assigned to a population of students that has not ostensibly

changed, variation in the average grade for groups of students with comparable scores on a

relevant test, variation in the average grade earned by the same group of students in different

courses.

Early in the last century, Starch and Elliott (1912, 1913) published perhaps the most

widely quoted studies of “alarming” variation in the grades that different instructors assign to

the same papers—in subjectively graded subjects, and even in mathematics. Much of such

disagreement is no doubt just a matter of inconsistent judgment, but some variation comes

from different instructors assigning grades that are systematically higher or lower than the

norm for their particular subject (Kelly, 1914). Instructors’ grading habits are common fodder

for student lore.

Shifts in the apparent level of grades in schools and colleges across the country are

also frequently reported. Such shifts are typically characterized as grade inflation, probably

because they always seem to go up. In fact, grading practices are also periodically redefined

or differently enforced so as to cut down on the number of students receiving an A, 95, 4.0, or

“optimus.” There are years of inflation, deflation, and stagnation (see Willingham & Cole,

1997, p. 305 for periods and references).

Periods of grade inflation and deflation provide a sign of the ease with which teachers

can shift their grading standards. Contrary to what one might assume, such broad shifts

within the range we have experienced do not appear likely to affect either the correlation

between test scores and grades (Bejar & Blew, 1981) or the observed discrepancies between


grades and test scores. What does have such effects is the comparability of grades; i.e.,

variation in grading standards from course to course and from school to school.

A number of studies have demonstrated that grades are not comparable from course to

course (Elliott & Strenta, 1988; Goldman & Hewitt, 1975; Goldman, Schmidt, Hewitt, &

Fisher, 1974; Goldman & Slaughter, 1976; Goldman & Widawski, 1976; Juola, 1968; Strenta

& Elliott, 1987; Willingham, 1963c, 1985). These analyses are typically based on a

comparison of average grades across groups after average test differences are taken into

account. Similar results are obtained without any reference to test scores by simply

comparing the grades earned by the same group of students in different courses (Goldman &

Widawski, 1976).

Grading standards tend to be stricter in courses like mathematics and science that often

attract stronger students; grading tends to be more lenient in courses like education and

sociology that often attract weaker students (Bridgeman, McCamley-Jenkins, & Ervin, 2000;

Goldman & Hewitt, 1975; Ramist, Lewis, & McCamley, 1990). That particular pattern of

noncomparability has been replicated across studies and appears to be similar from one

institution to another (Elliott & Strenta, 1988). Goldman and Hewitt (1975) proposed a

theory of grading variations based on “adaptation level.” They suggested that observed

patterns of apparently discrepant grading standards result from the tendency of faculty to

adapt their grading level to the ability level of the students that they typically encounter. In

this view, teachers have a tendency to assign a generally similar spread of As, Bs, Cs, etc.

despite significant differences in the average level of competence of students in different

courses and institutions.


Numerous studies have demonstrated that noncomparable course grades will lower the

correlation between test scores and grades within an institution and that adjusting the grades

to make them more comparable will improve the correlation (Elliott & Strenta, 1988;

Goldman & Slaughter, 1976; Pennock-Roman, 1994; Ramist et al., 1990; Strenta & Elliott,

1987; Stricker, Rock, Burton, Muraki, & Jirele, 1994; Willingham, 1985; Young, 1990).
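
To make the adjustment idea concrete, the sketch below removes course-level leniency by residualizing grades against a pooled grade-on-test regression and subtracting each course’s mean residual. This is a minimal illustration of the general approach in the studies cited, not the specific procedure any of them used; all names are hypothetical.

```python
import numpy as np

def adjust_course_grades(grades, tests, course_ids):
    """Put courses on a common grading standard by removing each
    course's mean departure from a pooled grade-on-test regression."""
    grades = np.asarray(grades, dtype=float)
    tests = np.asarray(tests, dtype=float)
    course_ids = np.asarray(course_ids)
    # Pooled regression of grade on test score: slope b, intercept a.
    b, a = np.polyfit(tests, grades, 1)
    residuals = grades - (a + b * tests)
    adjusted = grades.copy()
    for course in np.unique(course_ids):
        mask = course_ids == course
        leniency = residuals[mask].mean()  # >0 lenient, <0 strict course
        adjusted[mask] -= leniency
    return adjusted
```

Variants in the literature differ mainly in how the expected grade is estimated (test scores, grades earned elsewhere, or both), but all end by re-expressing each course’s grades relative to a common standard.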

Raising the correlation between test scores and grades by using a more comparable

grade criterion is simply the result of identifying and removing one source of grade-test score

discrepancy. The same principle applies to groups of students. A number of studies have

demonstrated that underprediction of women’s grades is due partly to somewhat different

grading standards in courses that are typically taken by males and females. Furthermore, it

has been shown that using a grade criterion that is more comparable for women and men will

reduce that prediction error (Clark & Grandy, 1984; Elliott & Strenta, 1988; Gamache &

Novick, 1985; Hewitt & Goldman, 1975; Leonard & Jiang, 1995; McCornack & McLeod,

1988; Pennock-Roman, 1994; Ramist, Lewis, & McCamley-Jenkins, 1994; Young, 1991; an

exception is Stricker, Rock, & Burton, 1993).

A similar principle applies to variation in grading standards from school to school.

College admission officers have long known that a B from one high school is not necessarily

equivalent to a B from another school. Students with equal test scores tend to make higher

grades in schools with low average test scores than in schools with high average scores. Early

on, individual colleges were interested in adjusting the grades of applicants from different

high schools in order to improve the correlation between school grades and grades that

students earn after enrollment in college (Burnham, 1954; Dressel, 1939). Bloom and Peters

(1961) stimulated great interest in this topic. They purported to show that college grade


prediction and college selection decisions could be improved substantially by rescaling the

grades of individual high schools.

A flurry of research on grade adjustment ensued; see Linn (1966) for a thorough

review. The main result of this research was to show that the book by Bloom and Peters had

raised a false hope. Subsequent work soon showed that adjusting school grades did not

improve prediction of college grades when admissions tests were included and regression

equations were cross-validated (Lindquist, 1963; Willingham, 1963a). More elaborate

statistical models have not shown promise of changing that result (Tucker, 1960).

Numerous studies in this period did document in some detail, however, that grading

standards do vary from school to school. Analyses of college grades yielded similar results

(Astin, 1971; Braun & Szatrowski, 1984; Rock & Evans, 1982; Willingham, 1961). An

important implication of this work is that institutional grading variations can be an important

source of discrepancies between grade averages and test scores if the grade averages come

from different schools or different colleges.

Student Characteristics

Understanding the characteristics of students and their environment that are most

associated with achievement in school is a challenge that has fascinated researchers for many

years. It is a complex topic with major sub-themes concerning academic work habits, the role

of activities outside of class, attitudes, influence of family and peers, and so on. The purpose

of this section is to review briefly the results of previous research on the major types of

variables that may hold some promise for understanding differences in grade performance and

test performance.


One can think broadly of such variables as either behavioral or contextual; the latter

referring to the backdrop of attitudes, family situation, and peer relationships that often

influence a student’s behavior. Two other important sources of variation—subgroup

membership and school differences—are beyond the scope of this review, partly because they

are not used in this study as explanatory variables. Ethnic and gender identification are not

treated as independent variables in order to see how and to what extent other explanatory

factors account for observed results in differential prediction among such groups. In our

analytic model, school origin is associated with the possibility of grading variations, holding

test performance constant. The data will allow a test of that proposition. Historically,

analysis of school effects on achievement has focused on different issues, especially the

effects of school organization, financing, and educational practices on test performance

(Coleman et al., 1966; Jencks et al., 1972; Wenglinsky, 1997).

This review includes studies that use both grades and test scores as performance

criteria in order to locate any variables that might provide a better understanding of student

achievement. Behavior or conditions that foster achievement will likely enhance both test and

grade performance. Note, however, that the ultimate interest is differential grade

performance; that is, high or low grades in relation to test scores covering similar subject

matter. Shedding light on that question requires a demanding combination of grades, scores,

and other useful measures from the same sample in the same situation. For the moment, we

attend to a broad range of potentially useful studies. They fall into several distinct topics,

only generally related.

College prediction. The practical interests of college admissions officers have, for

many years, encouraged researchers to search for characteristics that might account for


students making higher or lower college grades than their admissions tests would lead one to

expect (Fishman & Pasanella, 1960). Such characteristics include a variety of information

often referred to as biographical factors (background, activities, attitudes, academic

behaviors, etc.). For the purposes of the present study, prediction research is hampered by

several limitations. First, self-reported biographical data collected in connection with

admissions may not be entirely trustworthy. Second, such data are limited by what a college

considers appropriate to ask of applicants. Third, variables may be limited in their usefulness

in predicting college achievement because student attitudes and habits do not necessarily

transfer to a new situation. Finally, the useful information that a variable might offer about

positive or negative achievement tendencies may already be represented in the high school

average that is routinely used as a predictor along with admissions tests. Nonetheless, such

research can provide useful clues for the present investigation.

Reviewing studies undertaken during the height of prediction research, Freeberg

(1967) remarked on the somewhat conflicting results but noted a frequent finding that higher

grades are associated with positive attitudes about education and good study habits.

Willingham (1963b, 1965) coded a number of items in a college application blank and found

15 that were significantly related to college grades. Some were especially notable: a strong

record in school (but not community) activities, expressions of confidence, a willingness to

undertake demanding academic work, and (negatively) any sign of hedging in the school’s

recommendation.

Astin (1971) obtained a variety of information, including self-ratings, from a large

national sample of college matriculants. Among the 13 personal characteristics that were

useful in predicting college grades, confidence was also a prominent factor in this study.


Several types of behavior reported by students were associated with poorer grades than

expected on the basis of test scores and the prior academic record: turning in work late,

coming late to class, making wisecracks in class, and going to movies frequently.

To what extent is such information helpful in accounting for errors in predicting

grades in college? In his review of biographical inventories and their usefulness in

admissions, Breland (1981) cautioned that many of the apparently positive results are not

promising because studies are often based on small samples or were not cross-validated on a

new class. In Astin’s large study of matriculants, personal characteristics raised the multiple

correlation with college GPA from .56 to .60 (Astin, 1971, p. 12). In Willingham’s study of

applicants, an application composite based on promising personal characteristics had a cross-

validated correlation of .48 with freshman GPA; corresponding validity coefficients were .46

for the SAT and .49 for High School Average (Willingham, 1963b). The application

composite added .10 to the correlation between the SAT and freshman grade average, but

only .02 when both SAT and High School Average (HSA) were used as predictors of

freshman GPA. This pattern suggests that the HSA already included much of the useful

information that might be gleaned from evidence of attitudes and behavior in high school.

Another college prediction study is interesting because it included information from

high school as well as concurrent information about academic behavior in college. Stricker et

al. (1991) identified several types of behavior that correlated with college GPA from .11 to

.32 with SAT partialled out. In descending order of correlation, they were: attendance, completing

assignments, taking tests on schedule, a study skills scale, taking notes in class, and average

years that key high school subjects were studied. The overall picture is that of the serious,

conscientious student.


High school performance. Several analyses of large databases in the past decade have

demonstrated the variety of variables that might help to account for differences in grade and

test performance among high school students. In Ekstrom’s (1994) analysis of High School

and Beyond (HSB) data, school attitudes and behavior contributed significantly to test scores

in accounting for English grades. School behaviors showing significant relationships with

grades included hours of homework, attendance, discipline problems, and coming to class

unprepared. A study of 2500 ninth-grade students in Indiana also demonstrated positive

relationships between school attitudes and grades, but was more notable for arguing for the

influence of parent education and expectations on student aspiration and attainment (Hossler

& Stage, 1992).

Ekstrom, Goertz, and Rock (1988) reported a more detailed analysis of the HSB data.

This study identified six student characteristics that showed some independent contribution in

accounting for differential grade performance. Among these, two aspects of student behavior

showed the strongest effects: behavior problems and time on homework. Other significant

contributors were school activities, parent aspirations, parent involvement in program

planning, and locus of control. Hanson and Ginsburg (1988) reported similar results from

their analysis of the sophomore HSB cohort. These authors examined student performance

from a value perspective and stressed the notion of “responsibility” in explaining the

predictive value of this nest of variables. Finn (1993) showed that a pattern of participation in

school was related to test performance among eighth grade NELS students. Positive school

behavior has also been referred to as effective “studenting” (Cole, 1997).

Homework is a perennially hot topic in parents’ minds and the public press. Keith

(1982) examined the role of time on homework in explaining grade performance in the HSB


sample. He showed that, within ability strata, time on homework had a quite linear additive

effect on grades earned, and was not subject to diminishing returns as other writers had

suggested (Frederick & Walberg, 1980). Some evidence indicates that the portion of

homework completed may have a stronger effect on grades earned than does time on

homework. Cooper, Lindsay, Nye, and Greathouse (1998) provided data suggesting that this

effect may operate through teachers’ evaluation of homework assignments rather than

homework influencing acquisition of knowledge and skill, which in turn affect grades.

One theme is particularly evident in the results of these studies. The students who

make good grades in relation to test performance are those who behave like serious students.

They come to class, they participate positively rather than disrupt, and they do their

homework. Three other factors that deserve attention are the activities in which students

engage, the attitudes they bring to school, and the influence of family and peers on those

attitudes. We consider each of those topics in turn.

Activities. The relationship of so-called extracurricular activities to academic

performance has long been a topic of great interest to many educators. Some have studied

student activities with a view to broadening the public view of talent, admissions criteria, and

useful outcomes of education (Richards, Holland, & Lutz, 1967; Taber & Hackman, 1976;

Willingham, 1974, 1985; Wing & Wallach, 1971). Richards et al. (1967) concluded that

nonacademic accomplishments in high school can be assessed with moderate reliability, are

related to similar achievements in college, but are largely independent of academic

achievement.

Werts (1967) challenged that interpretation, and among other arguments, pointed to a

correlation of .37 between high school average and a composite of 18 extracurricular


achievement items. Hanks and Eckland (1976) also reported strong connections between

academic and nonacademic achievement but distinguished sharply between social and athletic

activities. Achievement in these two domains correlated .38 and .05, respectively, with high

school grades. Spady (1970) argued that while athletics can be a major source of peer status,

it is only through other service and leadership activities that students achieve success in

academic and later life.

Such views were part of a continuing debate as to whether extracurricular activities

have a causal relationship to beneficial outcomes of education (Brown, 1988; Holland &

Andre, 1988; Steinberg, Brown, Cider, Kaczmarek, & Lazzaro, 1988). A narrower

question, more germane to this investigation, is whether engaging in activities outside

the classroom is associated with the more strictly academic outcomes and with grades

specifically. To see extracurricular achievement as a potentially useful variable in identifying

possible sources of grade-test score discrepancies, it is only necessary to assume, along with

Marsh (1992b), that such achievement represents “commitment-to-school.”

Marsh found that a composite measure of extracurricular achievement had positive

relationships with a number of secondary and postsecondary outcomes (e.g., correlated .23

with high school grades). Since this composite included athletic as well as community

activities, Marsh (1992b, p. 560) interpreted his results to be inconsistent with Coleman’s

(1961) zero-sum model wherein different domains of activity compete for the student’s time.

On the other hand, in a second study Marsh (1991) found that a number of school outcomes

were negatively related to total hours of employment—evidently a competing activity.

Marsh noted a positive relationship between academic achievement and employment

that students undertook in order to earn money for college. He explained the apparent


contradiction in the two studies by arguing that the values represented in activities are

important, not the hours. Nevertheless, it does seem likely that activities—even ostensibly

positive activities—can have negative effects on academic performance if they significantly

reduce time or commitment from schoolwork. Some investigators have found that

employment has adverse effects on schoolwork only when students work 20 or more hours a

week (Steinberg et al., 1988). We do not yet know how consequential the zero-sum reality

might be if a number of competing activities are considered together (e.g., employment,

childcare, community work, athletics, socializing, gangs, TV and video games).

A more recent study further supports the proposition that after-school activities have a

positive or negative effect on differential grade performance, depending upon whether they

contribute to or compete with a student’s school responsibilities. Cooper, Valentine, Nye, and

Lindsay (1999) found that residual grade performance (i.e., test performance controlled) was

positively correlated with extracurricular activities and amount of homework finished, but

negatively correlated with watching TV and (marginally) number of hours employed.
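
The “residual grade performance” criterion used by Cooper et al. can be expressed compactly. The following is a minimal sketch under the usual linear-regression reading of “test performance controlled”; the function name is ours, not theirs.

```python
import numpy as np

def residual_grade_performance(gpa, test):
    """Part of GPA not linearly predictable from the test score:
    positive values mean higher grades than the test would predict."""
    gpa = np.asarray(gpa, dtype=float)
    test = np.asarray(test, dtype=float)
    slope, intercept = np.polyfit(test, gpa, 1)
    return gpa - (intercept + slope * test)
```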

Another technical issue suggests that activity measures cannot always be taken at face

value. One might suppose that students have greater opportunity for a wider range of

activities in a large high school but face more competition due to limited spaces available,

especially regarding leadership positions. There is some evidence to support the latter

assumption. Students in small schools evidently perceive ample extracurricular opportunity

(National Center for Education Statistics, 1995) and participate at a higher frequency than do

students at large schools. In a national sample of seniors, Lindsay (1982, 1984) reported that

participation correlated –.22 with school size.


Student attitudes. Teachers probably need no hard evidence to know that students’

attitudes about themselves and their education can have a marked effect on their behavior and

performance in school. In recent years the “academic self-concept” has provided an active

research focus for this common-sense observation. Self-concept instruments typically

correlate positively with performance in school. But in reviewing the nature of the academic

self-concept and its relationship to academic performance, Byrne (1996) observed that such

correlations are widely discrepant. Research on several technical issues in recent years has

helped to clarify the nature of academic self-concept and its relationship to performance (see

Byrne, 1996; Marsh, Byrne, & Yeung, 1999; and Marsh & Yeung, 1997 for useful reviews).

This work is applicable to the present investigation for its implications regarding design of the

proposed analysis.

Through the work of several investigators in recent years, it has become progressively

clearer that self-concept is best viewed as a complex hierarchical structure. Researchers have

distinguished general self-concept, academic self-concept, and subareas of self-concept based

on broad skill areas (especially quantitative and verbal) as well as particular academic

subjects (Byrne, 1986; Marsh, 1990a; Marsh, Byrne, & Shavelson, 1988; Marsh & Shavelson,

1985). Marsh (1992a) has emphasized that the more specifically the self-concept refers to a

particular academic subject, the stronger its effects on performance in that area.

This empirical observation raises a caution in studying the relative strength of self-

concept measures to grade vs. test performance. Self-ratings of confidence in very specific

academic areas may exaggerate the role of self-concept as a source of grade-test score

discrepancy. In fact, self-ratings in a particular subject may be largely dependent on grades in

that academic area. The issue of domain specificity exacerbates what Byrne (1996, p. 302)


called “the most perplexing and illusory” issue in studying the relationship of self-concept to

performance in school. Does self-concept cause grades, or do grades cause self-concept? In

the former case, the argument that attitudes play a functional role in helping to distinguish

grades and test scores is more convincing. If attitudes do influence effort, then a relatively

strong correlation between attitudes and grades reflects a greater sensitivity of grades to the

level of student effort than would normally be the case with an external test score.

In a pioneering study, Byrne (1986) concluded that neither academic achievement nor

self-concept had causal dominance. Using a more rigorous multi-wave design than had

previous investigators, Marsh (1990b) concluded that grade averages in Grades 11 and 12

were affected by academic self-concept measured the previous year, whereas prior grades did

not affect subsequent measures of self-concept. However, in a subsequent study within three

academic subjects, Marsh and Yeung (1997) found that the effects of achievement on self-

concept were somewhat larger and more systematic than were the effects of self-concept on

achievement. The authors saw the results as a contribution to “the growing body of

research—particularly at the high school level—in support of the reciprocal effects model”

(Marsh & Yeung, 1997, p. 50; see also Marsh et al., 1999). As Byrne (1996) lamented, the

evidence leans both ways.

Another technical problem has received considerable attention. It is called the big-

fish-little-pond effect and derives from the fact that from school to school, average ability is

negatively associated with average academic self-concept (Marsh, 1987, 1994; Marsh &

Parker, 1984). As a result, students of similar academic ability tend to have higher academic

self-concepts in low ability schools than in high ability schools. Soares and Soares (1969)

first reported this phenomenon in their work with disadvantaged children. It is a frame-of-


reference issue parallel to variation in school grading standards that was previously discussed,

and both phenomena can be seen in the same data (Marsh, 1994). The implication for the

present investigation is to illustrate a further advantage of analyzing data in a pooled within-

school sample in order to avoid distortions due to spurious school differences.

Despite these various difficulties, academic self-concept has proven to be a useful

measure in research. More pertinent to the present investigation was a meta-analysis reporting

that, on average, self-concept correlated .34 with grades and .22 with several types of

achievement tests (Hansford & Hattie, 1982, Table VIII). That pattern suggests that academic

self-concept is a variable that may be useful in accounting for grade-test score differences.

Locus-of-control is a somewhat related attitude measure that has been included in the HSB

and NELS surveys. It refers to a tendency of people to feel responsible for things that happen

to them (i.e., internal control) or to feel that forces beyond their control determine outcomes

in life (i.e., external control). This measure showed some promise in HSB data (Ekstrom et

al., 1988) but does not appear consistently to have the same favorable pattern of correlations

just cited for academic self-concept (Findley & Cooper, 1983).

Eccles (1983) has proposed a framework of “expectancies, values, and academic

behaviors” that suggests a much broader range of beneficial student attitudes than does a

positive academic self-concept alone. Her theoretical orientation assumes a very practical

network of manifest behaviors and interpersonal relationships—all necessary elements in a

student’s pursuit of effective personal development through educational achievement. Thus,

positive academic attitudes are reflected in aspiration and planning, in recognizing the value

of taking certain courses and working hard, in positive relationships with parents and teachers,

and in choosing peers who help to define and reinforce academic commitments and beneficial


habits. Many such behaviors were noted in studies previously discussed. Such results

suggest the need for additional comments on the family—a final domain of potential influence

on student grades.

Family. Sociologists have established in many studies that socioeconomic status

(SES) of the family is positively related to educational attainment (Jencks et al., 1972). It has

also been widely assumed that parental encouragement has an independent positive effect on

the educational aspirations of students (Sewell & Shah, 1968). Harris (1995) has challenged

that conventional view, arguing that peer culture, not family, plays the dominant role in

shaping the behavior, personality, language, motivation, and values of young people. Most

recently, it has been argued that parents do influence children, but mainly through indirect

effects and interactions with other variables (Collins, Maccoby, Steinberg, Hetherington, &

Bornstein, 2000).

With regard to the more specific question addressed in this investigation, it is not clear

whether the family has more effect on cognitive skills developed over time and probably

better represented in test scores, or on school achievement better represented in a more

proximal measure like grades. The analysis of HSB data by Rock, Ekstrom, Goertz, and

Pollack (1986) did indicate that the family had an influence on differential grade performance

(i.e., controlling for test performance). Other studies have indicated that the educational

aspirations of students are influenced by the parents’ aspirations on their behalf (Eccles, 1983;

Hossler & Stage, 1992) and that parents may directly influence grade performance of students

only through specific types of parenting behavior (DiMaggio, 1982; Lamborn, Mounts,

Steinberg, & Dornbusch, 1991).


Researchers have had a special interest in the influence of the single-parent family on

students and their academic performance. There is a small relationship between family

structure and test performance (higher average scores for students in intact families), but those

differences are evidently explained by demographic differences (Milne, Myers, Rosenthal,

& Ginsburg, 1986). In another study, such differences associated with family structure were

larger on grades than on test scores (Mulkey, Crain, & Harrington, 1992). In the latter

analysis test differences connected with family structure also vanished when background was

controlled. Residual grade differences were associated with student behaviors; for example,

absenteeism, not doing homework, frequent dating, not talking to parents. Marsh (1990a)

found similar effects with HSB senior data but not when sophomore outcomes were

controlled.

Teacher Ratings

Several considerations suggest that teacher ratings may be useful in understanding

better the observed discrepancies between grade performance and test performance. First,

teachers’ ratings should give an indication of whether a student tends to perform well on local

instructional goals—outcomes likely to be reflected more accurately in grades than in external

test scores. Second, teachers should be able to provide good evidence not otherwise

attainable regarding student behavior that can result in a direct debit or credit when grades are

assigned (e.g., work habits, class behavior). Also, teachers may be able to provide

independent evidence regarding complex skills that are relevant to course objectives and

reflected in course grades but not ordinarily reflected in test scores (e.g., performing skills,

ability to explain the subject to others).


In considering whether teacher ratings can provide useful evidence regarding observed

discrepancies in grade-test performance, it is useful first to examine the obverse. Do teacher

evaluations of students provide a reasonably accurate estimate of a student’s subject

knowledge and skill as represented in test performance? In a review of 16 studies, Hoge and

Coladarci (1989) found average correlations in the mid-.60s between teacher judgments and

standardized achievement tests. On this basis they concluded that the teacher-judgment

measures had “high validity.” Another reasonable conclusion is that there are substantial differences

between the two measures. What other information do the teacher judgments offer?

Some years ago, Davis (1965) carried out a series of studies to identify what

components make up the college teacher’s perception of the valued student. Based on faculty

ratings of students on a number of traits, Davis identified 16 factors, five of which had a

consequential relationship to college grades. The relationship with grades was high for

ratings of “academic performance,” moderate for “intellectual curiosity” and “orientation to

tasks,” and low for “creativity” and “achievement motivation.” There is limited evidence in

these particular studies as to whether faculty grading is influenced by student behavior other

than the specific knowledge and skills pertinent to the course. What happens when the

research gives such factors more direct attention?

Pedulla, Airasian, and Madaus (1980) offered interesting data based on 170 teachers

and 2617 fifth-grade students in Ireland. Factoring three tests and 15 teacher ratings, they

found three factors: one based on behavior in school, another on academic work habits (also

behavior), and a third loading on tests and ratings of academic achievement. Teacher ratings

of both behavioral factors were more highly related to grades than to test performance,


suggesting that the teachers’ judgments were reflecting a noncognitive grade component to

some degree.

Ekstrom (1994) factored teacher comments on students that were collected in the High

School and Beyond longitudinal survey of high school students. The first factor was defined

as a teacher comments composite. The composite looked like student motivation, loading

mainly on “self-discipline,” “seems to dislike school,” and “will probably go to college.”

This teacher comment composite was correlated—typically in the .20s—with several items

from the questionnaire completed by students; in order of magnitude, they were discipline

problems, attitudes about school, anxiety about school, educational aspirations, attendance,

and came to class unprepared. This teacher comment composite added to tests and other

student characteristics in predicting grade performance. The multiple correlation with English

grade average as of the sophomore year was .68 with teacher comments included, and .63

without.

Teacher ratings of behavior can be unduly influenced by the grades a student earns,

especially if the ratings are collected at about the same time as the grades are assigned. A

study by Farkas, Grobe, Sheehan, and Shuan (1990) appears to illustrate the point. These

authors reported a concurrent correlation of .77 (N = 486) between teachers’ ratings of

students’ work habits and their grades in social studies—quite high for a single course grade.

It is also possible that teacher ratings can be a source of discrepancy between test

scores and grades either because teacher judgments influence grades (Caldwell & Hartnett,

1967) or because their judgments may become self-fulfilling prophecies in student learning

and grade performance (Rosenthal & Jacobson, 1968). The specter of the self-fulfilling bias

has been an active and controversial research topic (Brophy, 1983). Researchers continue to


interpret results in different situations either as evidence of only minor effects (Jussim, 1989)

or as substantial effects (Babad, Inbar, & Rosenthal, 1982).

Implications for This Study

The foregoing review of previous research has focused on grading practices, student

characteristics, and teachers’ ratings of students. The NELS Study includes extensive

information on these three topics and each holds promise for understanding performance in

school, especially differential grade performance in relation to test performance. Results of

the previous work suggest a number of points that bear especially upon the choice of variables

and analyses that are likely to be most appropriate to the current study.

• A long history of research indicates that grading patterns vary from time to time and

situation to situation. Grading standards often vary among instructors, courses, schools,

academic majors, and colleges. Correction for such differences has typically enhanced the

relationship between grades and test scores and reduced any differential prediction for

gender and ethnic groups. Differences in standards across courses and across schools

seem the most likely sources for grading variations that can result in observed

discrepancies between grades and test scores.

• Because of the multiple purposes served by grades, teachers and schools often report

taking into account student behavior and other factors in addition to knowledge and skills

relevant to a course when they assign grades. Thus, various aspects of effective school

skills such as attendance, class participation, discipline, and timely completion of assigned

work may be important components of grades that are not represented in test scores.

• In addition to using effective school skills, initiative and involvement with school also

tend to be associated with higher academic achievement—higher test performance and


especially higher grades. Evidence of student initiative such as taking a strong program of

demanding courses, participating in school activities, and avoiding activities that might

compete with academic work all tend to be more strongly associated with grades than with

test performance. Evidence to date suggests studying student activities in several separate

spheres: school activities (athletic distinct from non-athletic), community activities, and

other ways that students may spend their time in competition with schoolwork.

• Effective school skills and initiative tend to predict future grades as well as concurrent

grades. That parallel pattern suggests that such characteristics of students represent a

somewhat stable orientation to schooling. Thus, some student behaviors, like turning in

homework, may result in differences between grades and test scores in several ways:

quite directly if the teacher gives extra grade credit for completing assignments, also

directly if doing the assigned work enhances the specific knowledge and skill that results

in a higher grade in that particular course, and indirectly if such behavior reinforces a

habitual pattern of commitment to academic work and commensurate payoff in better

grades.

• Various aspects of the student’s family life and socioeconomic status have shown promise

in helping to account for academic performance. There is wide agreement among

researchers that positive attitudes foster achievement in school—especially attitudes

reflecting confidence in specific subjects and aspirations regarding educational goals.

Since it is also likely that good grades foster good attitudes, it is important to guard

against spurious attitude “effects” that could result from using attitude variables that are

obviously dependent upon past achievement. This is a form of confounding or


experimental dependence—a design hazard to avoid in trying to understand the dynamics

of school achievement.

• Another methodological consideration concerns the context of data analysis. Different

lines of evidence suggest the importance of studying academic achievement within

schools. One is the well-documented difference in grading standards across schools.

Another is the noncomparability of some student characteristics—especially attitudes and

extracurricular activities—in larger and smaller schools. These problems recommend a

within-school analysis; that is, an analysis based upon deviation scores after school means

have been subtracted from all scores for all variables (see the sketch after this list). Much

evidence also underscores the importance of correcting for range restriction in such analyses.

• Studies of teacher ratings indicate that teachers can be a useful source of information as to

what student characteristics are associated with achievement in school. Such ratings are

typically more highly related to grades than to test scores, though teacher judgment can be

unduly influenced by the grades that students have earned—another form of confounding.

For that reason, it may be important to use teacher ratings that are not collected at the

same time that grades are assigned and that are focused on specific behaviors of students

rather than their academic reputation.
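
As a concrete illustration of the within-school analysis recommended above, the sketch below subtracts school means from every variable before pooling. Column names are hypothetical, and this is only one way to implement the idea.

```python
import pandas as pd

def pool_within_schools(df: pd.DataFrame, school_col: str, cols: list) -> pd.DataFrame:
    """Replace each variable with its deviation from the school mean, so
    pooled correlations reflect only within-school variation and are not
    distorted by school-level grading standards or frame-of-reference
    effects such as the big-fish-little-pond phenomenon."""
    out = df.copy()
    out[cols] = df[cols] - df.groupby(school_col)[cols].transform("mean")
    return out

# e.g., pool_within_schools(nels, "school_id", ["hsa", "test", "rating"])
```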


Study Design

The National Education Longitudinal Study of 1988 (NELS) is part of a major long-

term program of the National Center for Education Statistics (NCES) to study representative

samples of students as they progress through elementary school, high school, and beyond.

“The general aim of the NELS program is to study the educational, vocational, and personal

development of students at various grade levels, and the personal, familial, social,

institutional, and cultural factors that may affect that development” (Ingels et al., 1994, p. 1).

The NELS database comprises an unusually broad and detailed record of students’

characteristics, attitudes, experiences, and achievements. The analyses here reported are

based largely on information from the NELS second follow-up. Most of these data were

collected in January through March of 1992, when the survey students were in their senior

year, though some other data were gathered as late as the summer of 1992. Five types of data

were employed: tests administered in the senior year, a senior student questionnaire, a full

course-grade transcript for grades 9 through 12, teacher ratings collected in the sophomore

year, and some additional information from school records.

The general approach in this analysis is to “correct” for the five factors discussed in

the previous section by introducing each, in turn, as adjustments or predictors in a multiple

regression analysis that starts with the concurrent correlation between average grade

performance and NELS Test performance. To what extent can the five factors account for

differential grade performance, the common tendency for students to make grades somewhat

higher or lower than expected on the basis of a test covering generally similar academic

material? To what extent can the multiple R be raised and differential prediction be lowered?

The analytic approach requires close attention to the sample of students, the selection and


definition of variables, and the statistical procedures. Appendix B describes the student

variables and lists acronyms used here.
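
A minimal sketch of this cumulative-regression logic, run on synthetic data with hypothetical variable names (the actual NELS variables, corrections, and ordering are described in the sections that follow):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins, for illustration only.
test = rng.normal(size=n)                         # NELS test composite
ratings = 0.5 * test + rng.normal(size=n)         # a later predictor block
hsa = 0.6 * test + 0.3 * ratings + rng.normal(scale=0.7, size=n)

def multiple_R(y, *predictors):
    """Multiple correlation of y with the predictors (least squares)."""
    X = np.column_stack([np.ones(len(y))] + [np.asarray(p) for p in predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.corrcoef(X @ beta, y)[0, 1]

# Each added block can only raise (or leave unchanged) the multiple R;
# the study tracks this gain as each of the five factors is introduced.
print(multiple_R(hsa, test))           # baseline: test alone
print(multiple_R(hsa, test, ratings))  # after adding a predictor block
```

Differential prediction for a subgroup can then be read off the same machinery as the subgroup’s mean residual from the pooled equation, expressed in letter-grade units.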

The Sample

The 12th grade NELS cohort numbered 17,153 students. We restricted this study

sample to students with essential data; i.e., those who took the NELS Test, filled out a

questionnaire, and had transcript data with no missing records. Students in special education

and bilingual programs were not included since the comparability of their grades and test

scores was uncertain. With these constraints, the original sample with requisite data

numbered 10,849.

Ten subgroups were available for analysis: two genders, four ethnic groups (African-

American, Asian-American, Hispanic, and White), and four school programs (Rigorous

Academic, Academic, Academic-Vocational, Vocational). For research purposes, NCES

assigned students to school programs insofar as possible on the basis of the pattern of

coursework in each student’s record. In the sample employed for this analysis, some 90%

were assignable to the four programs indicated. Each of these 10 groups included sufficient

students for separate analysis, but an additional constraint on the sample had to be taken into

account.

Assessing grading variations required at least a minimal number of students in each

high school—the more the better in order to stabilize any grade-score differences among

schools or courses. From this perspective, the database posed a problem because some of the

original cohort in the 8th grade moved to other cities or districts. NELS followed those

students to their new school even if there were no other survey participants in that school.

This meant adding many schools that had only one or two students in the NELS study—too


few for our purposes. Deciding on a minimum school sample size involved conflicting

considerations. Setting a higher minimum number of survey participants per school promised

to yield more stable data in individual schools, but that constraint would simultaneously

reduce the total sample as well as subgroup representation.

As it happened, setting the minimum school sample size above 10 quickly reduced

subgroup samples to a level too small for analysis but offered very little improvement in the

typical size of course or school samples. It was not necessary to go below a minimum school

sample of 10 to obtain sufficient data (i.e., a minimum of 400 in all subgroups of interest).

There were 581 schools having 10 or more students with the requisite data. Together, these

schools yielded a reduced student sample of 8454 (78% of the original sample meeting all

data constraints). Subgroup sample sizes are shown in Table 2.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Table 2 about here

_ _ _ _ _ _ _ _ _ _ _ _ _ _
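
The school-size restriction amounts to a simple filter; a sketch with hypothetical column names:

```python
import pandas as pd

def restrict_to_stable_schools(students: pd.DataFrame, minimum: int = 10) -> pd.DataFrame:
    """Keep students whose school contributes at least `minimum` NELS
    participants with requisite data (here, 581 schools and 8454 students)."""
    sizes = students.groupby("school_id")["school_id"].transform("size")
    return students[sizes >= minimum]
```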

The most obvious difference between the reduced sample and the original sample with

requisite data would be a smaller number of students in the reduced sample who had changed

high schools. A substantial proportion of students who moved would necessarily transfer into

a school with few other NELS students because the new school was often not in the original

NELS sample. This final, reduced sample appeared similar to the sample with full data in

most respects. For example, the average test score differed by .035 SD; the standard

deviation differed by .3%. Ethnic minority representation was 26% in the original sample,

23% in the reduced sample. Also, in analysis of NELS data in grade eight, little association

was found between family moves and school performance level (Finn, 1993, p. 66).


Nevertheless, it is impossible to say what self-selection effects might be

involved. Because this selective influence rendered the sample not necessarily representative

of high school students nationally, NELS sample weights were not employed in the analysis.

Another reason for not using weighted data was the possibly distorting effects that wide

variations in sample weights might have on the estimation of local grading parameters in

small school samples. Due to these sampling limitations, the analysis is best regarded as

exploratory.

Due to these considerations, statistical tests of hypotheses were not carried out, and

standard errors of statistics were not estimated. Results should be interpreted as descriptive of

characteristics observed in a database of 8454 students attending 581 schools throughout the

country, who earned some 187,000 credits in 21,000 high school courses. It can be expected

that the results reported here are similar to what would be found with another similar group of

schools, courses, and students. Needless to say, results would more likely be similar in a

group of 4125 males than 402 vocational students, the latter being the smallest group in which

correlational analyses were undertaken.

Tests and Grade Averages

The NELS database provides four test scores and four grade averages covering

generally similar academic subject matter over the same time period. Students were

administered “a series of curriculum-sensitive cognitive tests to measure educational

achievement and cognitive growth between the eighth and twelfth grades in four subject

areas—reading, mathematics, science, and social studies (history, geography, civics)”

(Ingels et al., 1994, p. 7). The complete battery comprised 116 multiple-choice items. Ingels

et al. (1994) described the four tests as follows:


• Reading Comprehension. (21 questions, 21 minutes) This subtest contained five short

reading passages or parts of passages, with three to five questions about the content of

each. Questions encompassed understanding the meaning of words in context, identifying

figures of speech, interpreting the author’s perspective, and evaluating the passage as a

whole.

• Mathematics. (40 questions, 30 minutes) Test items included word problems, graphs,

equations, quantitative comparisons, and geometric figures. Some questions could be

answered by simple application of skills or knowledge; others required the student to

demonstrate a more advanced level of comprehension and/or problem solving.

• Science. (25 questions, 20 minutes) The science test contained questions drawn from the

fields of life science, earth science, and physical science/chemistry. Emphasis was placed

on understanding of underlying concepts rather than retention of isolated facts.

• Social Studies. (History/Citizenship/Geography—30 questions, 14 minutes) American

history questions addressed important issues and events in political and economic history

from colonial times through the recent past. Citizenship items included questions on the

workings of the federal government and the rights and obligations of citizens. The

geography questions touched on patterns of settlement and food production shared by

other societies as well as our own.

Student grades were obtained from the NELS Transcript Component Data File (Ingels et

al., 1995). The file contains an enormous amount of detailed information regarding individual

term grades for all courses attempted by each student. Working with individual student

transcripts and curriculum specifications from the schools, NELS slotted each course as

accurately as possible into a single comprehensive framework of 1540 courses with different


titles. The total number of courses in all schools was only 21,000 because many courses were

available in only a few schools. It is impossible to say to what extent courses with the same

title (e.g., Precalculus, European History, or Introduction to Computers) may have actually

covered somewhat different subject matter from school to school. In addition to different

types of courses within mathematics, history, science, etc., the framework included placement

levels (remedial, honors, etc.) and the year (academic and calendar) in which the course was

taken. NELS also transformed different school grading systems to a common scale of 1 to 13

(i.e., A+ to F).

Our intended analyses required grades for individual courses as well as several

averages and summary indices based on the total transcript. The unusual size of the NELS

transcript tape and the noncomparability of course information from school to school

complicated the compilation of these various measures. For our purposes, it was necessary to

create a course-grade file on which numerous instances of multiple grades for the same

student for the same course title (repeats and multiple terms of different lengths in different

schools) were appropriately weighted and averaged in order to provide a single grade and a

comparable term length for each course. This dedicated file provided for each student a

measure of total course hours and total course credits expressed on a common Carnegie Unit

(CU) base.
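
A minimal sketch of the collapsing step, with hypothetical column names: repeats and multiple terms for the same student and course title are reduced to one CU-weighted grade, with total hours carried along.

```python
import pandas as pd

def collapse_course_grades(records: pd.DataFrame) -> pd.DataFrame:
    """One grade per student per course: term grades weighted by their
    Carnegie Unit hours, so short and repeated terms count proportionally."""
    records = records.assign(gw=records["grade"] * records["cu_hours"])
    agg = records.groupby(["student_id", "course_id"], as_index=False).agg(
        cu_hours=("cu_hours", "sum"), gw=("gw", "sum"))
    agg["grade"] = agg["gw"] / agg["cu_hours"]
    return agg.drop(columns="gw")
```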

The file contained 311,607 individual course grades based on all 1540 courses in the

NELS framework, not counting a few service courses like physical education and driver

training. In this analysis grades were converted to a “4.0” scale—actually 4.3 because the

original scale provided for an A+ grade. These course grades, weighted by CU hours, yielded

the data necessary for the analysis of grading variations from course to course. The course


grades also served as the basis for a High School Average, HSA(T), based on each student’s

total transcript. The correlation between this HSA(T) and the unweighted total score on the

NELS Test provided the initial baseline indication of the relationship between grade

performance and test performance for the NELS graduating seniors.
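
The report does not reproduce the exact 13-point-to-4.3 conversion, but a conventional letter-grade mapping consistent with the description (A+ = 4.3 down to F = 0) would look like the following; treat the specific values as an illustrative assumption.

```python
# Illustrative mapping: NELS common scale 1..13 (A+ .. F) to a 4.3-point
# scale. The exact values used by NELS are not given in this report.
NELS_TO_43 = {
    1: 4.3, 2: 4.0, 3: 3.7,     # A+, A, A-
    4: 3.3, 5: 3.0, 6: 2.7,     # B+, B, B-
    7: 2.3, 8: 2.0, 9: 1.7,     # C+, C, C-
    10: 1.3, 11: 1.0, 12: 0.7,  # D+, D, D-
    13: 0.0,                    # F
}

def to_43_scale(nels_grade: int) -> float:
    return NELS_TO_43[nels_grade]
```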

In order to take account of Factor 1 (Subjects Covered), it was also necessary to

compute a grade average based on subject matter generally comparable to that represented on

the NELS Test. As it happened, NELS had already computed for each student a set of grade

averages corresponding to the so-called “New Basics” subject areas (Ingels et al., 1995, p. 56

and Appendix H). The NELS Tests correspond reasonably well to four of these six subject

areas (excepting Foreign Languages and Computer Science):

• English (113 courses)

• Mathematics (47 courses)

• Science (74 courses)

• Social Studies (256 courses)

For the present study, the mean of the students’ grade averages in these four subject

areas was defined as HSA, the academic average. Unless otherwise specified, all of the

analyses reported here are based upon this set of 490 academic courses, though as will be

described, a correction for grading variations was applied in some analyses. HSA was based

on a great variety of courses in these four academic areas, including advanced as well as

remedial work. Courses that were represented in HSA(T) but not in HSA included all work in

foreign languages and computer science, a wide variety of other special-interest courses of an

academic nature, and all vocational courses.


Other Variables in the Analysis

The selection and definition of variables used in this analysis were determined by our

proposed approximation as discussed earlier, by what information was available in the NELS

database, and by information gleaned from our review of previous research. The overriding

interest was to identify a set of student characteristics and qualities that might help to account

for differential grade performance—as here defined, the tendency of students to make

somewhat higher or lower grades than would be expected on the basis of the NELS

curriculum-based test scores. As prior research indicates, many such characteristics and

qualities show some promise in this regard.

All available measures were included if they appeared to be potentially useful and did

not involve excessive missing data or raise technical problems such as those discussed below.

Needless to say, even the very rich NELS database did not include all information that might

be interesting. This is especially the case for information concerning the knowledge and skills

expected of students in individual courses, the meaning of grades assigned, and student skills

that might be particularly pertinent to test performance but not to grade performance.

As Table 3 indicates, 26 student variables were included under these five headings:

School Skills, Initiative, Competing Activities, Family Background, and Student Attitudes.

The measures reflect both behavior and context. The first two categories are clearly

behavioral; these measures refer to things students do. Competing Activities are also

behavioral but are likely to be more influenced by context and are therefore somewhat less

under the student’s control. Student Attitudes and Family Background are more contextual.

These latter variables refer less to behavior than to conditions and circumstances that can

influence behavior. Appendix B provides a description of each variable. See Green, Dugoni,


Ingels, and Camburn (1995) for a “profile” that gives extensive information about the NELS

seniors and how they vary by subgroup and background characteristics.

_ _ _ _ _ _ _ _ _ _ _ _ _ _
Insert Table 3 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

In the main, student variables were developed and retained on the basis of their

rationale and substance, not their interrelationships or validity in accounting for differential

grade performance. Thus, the 26 student variables originally selected were retained

throughout the analysis. Analytical relationships among possible measures were sometimes

critical, however, in the initial choice and definition of variables to be used in the analysis.

This was because, in actual practice, some pairs of variables proved to be mutually

constrained or confounded. While we guarded against collinearity among the student

characteristics, there was no avoiding the mirror-image "zero-sum" problem described by

Coleman (1961). That problem is evident in Competing Activities. A student cannot easily

score high on killing time, child care, employment, etc., all at the same time. Each of the

variables in that category may have an important idiosyncratic bearing on school

achievement, but statistically, they hardly form a coherent construct.

Ambiguous causality is also a special problem. For example, has involvement in

honor societies contributed to a student having earned high grades relative to test scores or has

such involvement merely resulted from superior classroom performance? Thurstone (1947, p.

441) referred to confounding of this sort as experimental dependence and argued that it

deserves close attention in the correlational analysis of behavioral data, especially data

collected concurrently. It is useful to distinguish three levels of potential confounding in the

relationship between individual variables and grade performance.


a) A statistical constraint can consistently increase or decrease the association between

grades and another measure regardless of the underlying relationship. For example, a

failing grade automatically lowers both the student’s grade average and the number of

credit hours earned. This type of spurious dependence of one variable on another can

often be identified and even statistically isolated.

b) An explicit dependence can influence the relationship of a variable to a student’s grade

average even though the effect is not necessarily identifiable or consistent. Examples

include the use of variables such as the following: attitudes like “I don’t do well in

math,” enrollment in remedial English or in the Vocational curriculum (assignment to

which may be directly influenced by the grade record), or an activities measure based

partly on “member of an honor society.”

c) An implicit dependence may be subtle and unmeasurable but real, nonetheless.

Examples include the following: partial dependence of self-esteem on grades earned,

the likely tendency of students to form educational aspirations on the basis of

academic performance, the tendency of students to like or dislike school partly on the

basis of how they do there, the likelihood that decisions to enroll in demanding

courses are influenced by a student’s grade history in that subject area.

We attempted to avoid the more clearly artifactual and potentially misleading forms of

confounding, especially those in the first two categories above. Note, however, that a

student’s level of academic motivation or degree of scholastic orientation is clearly influenced

by a history of doing well or poorly in school. That is one of the key phenomena under study

here—the student’s personal orientation to schooling. It is not a momentary state or a

technical problem that is pertinent only to a concurrent analysis. Student attitudes about


school do change over time and circumstance, but a strong or weak commitment to school

will likely be reflected in longitudinal as well as in concurrent analyses. Much of the national

effort that goes into encouraging excellence and selecting good students for demanding

educational programs is based on that assumption. In describing results of the analyses, we

will come back to the possibility of spurious findings due to confounding.

In most cases, student variables were constructed as composites of several items in the

Student Questionnaire or other types of information (see Table 3 and Appendix B).

Developing composite variables served two purposes. One was to enhance generalizability by

including different aspects and evidence of a characteristic. Another objective was to

minimize missing data; at least partial information was available for almost all students on

most variables.

NELS collected from classroom teachers a variety of ratings regarding the academic

behavior and work habits of the participating students. These judgments provide a valuable

supplement to the student variables because they come from a person who is presumably

more objective than the student but is also experienced and informed. Five Teacher Rating

variables were used, either based on individual ratings or based on composites of two or three

ratings. The five variables, shown in Table 3, represent observable classroom behavior that is

often cited as a legitimate consideration in grading or might be expected to influence teachers’

evaluations of students and the grades they assign.

Such ratings were collected in both the first and the second follow-up survey (in the

middle of the sophomore and the senior year). Several considerations led to a decision to use

the sophomore ratings. In the senior year, only one rating was obtained from either a science

or a mathematics teacher who had a NELS student in her or his class. In the sophomore year,


ratings were obtained from two teachers balanced among English, mathematics, science, and

social studies—one from a verbal and one from a quantitative discipline for each student.

Thus, the two sophomore ratings are more reliable and more representative than is the single

senior rating.

Another consideration was the possibility of the teachers’ ratings being biased by their

knowledge of the students’ grade records. Such influence is not unlikely, even though care

was taken to avoid using any ratings with wording that suggested any direct dependence on

the grade record (e.g., “I have talked to this student’s parents about his/her grades”). Concern

about such possible confounding was also a major reason for using teacher ratings that

pertained to student behavior, not academic achievement. Focusing on behavior undoubtedly

reduced the likelihood that the Teacher Ratings would reflect some types of learning

outcomes that may be reflected more in grades than in test scores; for example, social goals of

education and individualized learning outcomes (Categories A.3.a and A.2.c, respectively, in

Figure 1).

Since more than half of the grade record came after the first follow-up teacher ratings,

these ratings are, in part, predictors rather than concurrent evaluations. Using the sophomore

ratings seemed a more conservative choice for valid and useful teacher ratings. Finally, this

choice resulted in less missing data: 90% of our sample had at least one sophomore rating,

while teacher ratings in the fourth year of high school were available for only about seven in

ten participating seniors (Ingels et al., 1994).


Statistical Analysis

The statistical analysis proceeded in the following manner. It was proposed earlier

that five major factors could help to explain discrepancies between grades and test scores; that

is, improve the concurrent prediction of grade average. These corrections, Factors 1 through

5, were introduced successively. The presenting issue in this study concerns this question:

What effect do these corrections have on grade prediction and on observed patterns of

differential prediction for subgroups? Other questions concern how each factor operates; for

example, what role is played by different components, or what happens if the factors are

defined through different methods or in different sequence? In this connection, the nature of

grading variations and of students' engagement in schooling deserves some special attention.

Finally, we examined the generality of results across gender and ethnic groups and among

high school programs.

Most of these analyses involve familiar applications of regression analysis, which are

best described as results are reported. Missing data were not extensive and were handled

through pairwise deletion. Two methodological issues require some initial explanation.

These involve analysis of grading variations and estimation of the reliability of grade

averages.

Adjusting for Grading Variations

Our review of the research literature identified various types of grading variations and

a number of statistical methods that have been used for correcting such differences. The

NELS transcript database provided sufficient information to analyze grading variations from

school to school and from course to course, but not variations among instructors or sections.

We employed relatively simple methods to correct only for average differences in grading


level (i.e., the overall strictness or leniency in relation to test score level). Intercept

differences appear to largely account for observed variations in grading standards (Linn,

1966). Limiting the model in this manner makes the school and course-grading variations

additive; the objective is to correct each in turn.

The same methods can often apply to the correction of grading variations across

different types of groups; e.g., students enrolled either in different courses or in different

schools. Figure 2 illustrates two methods of correcting grading variations—in the case

depicted, variations across schools. High School Average is here regressed on a NELS

composite test score. In each of the three panels, the regression line is based on all students in

the total sample. The top panel represents scatterplots for four schools as they might appear

in the original data. The middle panel shows how those plots would look after applying the

“within-school” method. Similarly, the bottom panel illustrates the “residual method.”

_ _ _ _ _ _ _ _ _ _ _ _ _ _
Insert Figure 2 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

In the within-school method, school means are subtracted from the observed scores for

all variables (i.e., all grade averages, test scores, student characteristics, and teacher ratings).

This creates a pooled within-school matrix of deviation scores where all variables have a

mean of zero. The effect is to superimpose the school scatterplots so that all have their center

at scale values of 0,0. The pooled within-school correlation matrix based on these deviation

scores is then corrected for range restriction. This multivariate correction employs an

extension of the Pearson-Lawley method (Gulliksen, 1950, p. 165).
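The centering step can be sketched as follows; the frame is assumed to hold one row per student, a hypothetical `school_id` column, and numeric columns for all the grade, test, student, and rating variables.

```python
import pandas as pd

# A minimal sketch of the within-school step: subtract each school's means
# from all variables, yielding the pooled deviation-score matrix. Column
# names are illustrative, not those of the NELS file.
def within_school_deviations(df: pd.DataFrame) -> pd.DataFrame:
    means = df.groupby("school_id").transform("mean")  # each school's means,
                                                       # broadcast to its rows
    return df.drop(columns="school_id") - means        # deviation scores

# The pooled within-school correlations are then simply
# within_school_deviations(df).corr(), prior to the range-restriction step.
```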

In all corrections for range restriction, we used a composite of the four NELS tests and

the SES composite as explicit selection variables. These two were the only variables in the


matrix that could be assumed to be reasonably comparable across schools. The variances of

both showed substantial reduction after school means were removed, but correction for range

restriction brought the variances for both measures back to their original value across all

schools. With this correction, the correlations represent more faithfully what one might

expect to obtain from a full range of scores on all variables for a national sample of high

school seniors.
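For a single explicit selection variable, the logic of this correction reduces to the familiar univariate formula below (the multivariate Pearson-Lawley procedure generalizes it to the two selection variables used here); $r$ is the correlation in the restricted data, and $s$ and $S$ are the restricted and full-range standard deviations of the selection variable:

$$\hat{R} \;=\; \frac{r\,(S/s)}{\sqrt{1 - r^2 + r^2\,(S^2/s^2)}}\,.$$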

As illustrated in the bottom panel of Figure 2, the residual method makes use of the

average difference between the observed grades in a particular school and grades that would

be predicted based on the regression line for the overall sample. For example, if the High

School Averages (HSA) in a given school run low compared to those predicted on the basis

of the NELS Test in the total sample, this mean residual is treated as the School Grading

Factor (SGF) and all HSAs in the school are adjusted by that amount. As illustrated in the

figure, when this correction for grading strictness is applied to each school grade scale, the

effect is to move all school scatterplots to the overall regression line. Recognize, however,

that in this method HSA is the only variable so corrected.
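The residual method can be sketched in the same illustrative terms; `pred_hsa` stands for the prediction from the total-sample regression of HSA on the NELS test.

```python
import pandas as pd

# A sketch of the residual method, with hypothetical column names.
def residual_corrected_hsa(df: pd.DataFrame) -> pd.Series:
    resid = df["pred_hsa"] - df["hsa"]   # positive where grading runs strict
    # School Grading Factor: the mean residual in each student's school.
    sgf = resid.groupby(df["school_id"]).transform("mean")
    return df["hsa"] + sgf               # moves each school to the overall line
```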

Ostensibly, the within-school and the residual methods are based on different

assumptions. The residual method takes any difference between actual and predicted grade

performance to be error that is best removed. The within-school method ignores school

differences altogether and relies on individual differences (within schools) to best understand

the nature of achievement. Earlier analysis of these two methods (Willingham, 1962b)

suggests that they are actually quite similar in derivation. In the present situation, there

appear to be two main analytical distinctions between the two methods. First, linear

corrections generally akin to the residual method will tend to overfit, and to some degree,


inflate the relationship between the two measures in question.3 Second, scale distortions in

other pertinent variables have been well-documented. The within-school method corrects for

such scale differences in other variables, while the across-school method does not.

Previous investigators have demonstrated school contextual effects on other variables

that are both psychological and statistical in character. One example is the “big-fish-little-

pond” effect already noted; viz., students’ attitudes about themselves and their education can

be inversely influenced by the average ability of students in the school that they attend

(Marsh, 1987; Marsh & Parker, 1984). The size of a school has contradictory effects on two

other variables of interest. Larger school enrollment is directly related to the

comprehensiveness of the curriculum and therefore expands a student’s course-taking

possibilities (Monk & Haller, 1993). On the other hand, smaller school enrollment expands a

student’s possibilities for extracurricular achievements, because fewer people are competing

for limited positions of honor (Lindsay, 1982, 1984). Holland and Andre (1987, p. 437)

argued that, as a consequence, “Low-ability and lower SES students are more involved in

school life in smaller schools.” The within-school analysis removes these sources of ‘noise.’

There is an implicit assumption; namely, that the resulting gain in accuracy outweighs any

unappreciated signal associated with schools that may be lost in the process.

Both the overfitting problem and the evidence of scale distortion in other variables

recommend a within-group analysis in the case of school differences. It is not clear that this

method is appropriate for course-grading differences. In any event, samples were entirely too

small for this latter purpose. Consequently, the within-school method was used first to correct

school differences for all variables. Then the residual method was applied to the pooled

within-school data matrix in order to correct grading differences among courses. In


subsequent discussion, this joint correction of school differences and course-grading

differences is referred to as the "within-school" analysis. This usage highlights the method of correcting for school grading variations, which was the grading variation of consequence in the analysis.

Following Ramist et al. (1994), the course-grading correction was handled in the

within-school analysis by adding an adjustment for course-grading strictness, K, to each

student’s within-school grade average. Thus, the doubly corrected grade average

becomes $HSA_w + K$, where $HSA_w = HSA - \overline{HSA}$ denotes the deviation score obtained by

subtracting the mean HSA at a student’s school from that student’s average grade. The K

correction for course-grading strictness was the average Course Grading Residual (CGR) for

the particular courses in which the student was enrolled. In the within-school analysis CGR

was defined for each course (j) as

$$CGR_j = \sum_{k=1}^{N_j} \Big[ \mathrm{Pred}\big(HSA_k - \overline{HSA}_{(k)}\big) - \big(CG_{jk} - \overline{HSA}_{(k)}\big) \Big] \Big/ N_j \,.$$

In this equation, $CG_{jk}$ represents the grade obtained by the kth student in course j, $\overline{HSA}_{(k)}$ is the mean HSA in this student's school, $\mathrm{Pred}(HSA_k - \overline{HSA}_{(k)})$ is the predicted value of this student's $HSA_w$ based on a pooled within-school regression analysis,4 using the NELS Composite deviation score for the student as the predictor, and $N_j$ denotes the total number of students with grades in course j. In other words, the predicted $HSA_w$ is used as a baseline for determining how strict the grading is in a particular course. The Course Grading Residual (CGR) for a course is the average discrepancy between the predicted HSAs and the grades obtained by students in that course—all computed within schools on the basis of deviations


from each school mean, and then pooled across schools. Again, each student’s K score was

simply the mean CGR for the courses taken by that student.
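As a concrete illustration of these computations, the sketch below assumes a long-format enrollment table with one row per student-course pair; the column names (`student_id`, `course_id`, `grade`, `school_mean_hsa`, `pred_hsa_w`) are hypothetical stand-ins for the quantities defined above.

```python
import pandas as pd

def course_grading_residuals(enroll: pd.DataFrame) -> pd.Series:
    # One residual per enrollment: predicted HSA_w minus within-school grade.
    within_grade = enroll["grade"] - enroll["school_mean_hsa"]
    resid = enroll["pred_hsa_w"] - within_grade
    return resid.groupby(enroll["course_id"]).mean()   # CGR_j for each course

def k_scores(enroll: pd.DataFrame) -> pd.Series:
    cgr = course_grading_residuals(enroll)
    # K for a student: the mean CGR over the courses that student took.
    return enroll["course_id"].map(cgr).groupby(enroll["student_id"]).mean()
```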

A complication in computing CGRs was the large number of empty or near-empty

school-course cells due to the wide variations in courses elected and the diverse names that

schools attach to courses.5 As a result, CGRs ultimately had to be based on all students taking

a course with a given title; i.e., all students in Algebra I, all students in American History,

etc., irrespective of school. Ignoring schools assumes that grading variations from course to

course follow the same pattern in all schools. As will be reported among the results

concerning grading variations, a special analysis was undertaken in order to determine what

proportion of course-grading variations was lost by collapsing each course across schools in

this manner.

An important practical consideration had a bearing on the methodology of analyzing

grading variations. One research objective was to understand the effect of grading variations

on observed grades of individual students and on groups of students. Another useful

objective, recognized in the course of the study, would be to examine the relationship between

school grading variations and course-grading variations. These objectives required indexing

the effect of grading variations for each student. In the within-school method, course-grading

variations could be indexed for individual students through the K score, but school-grading

variations could not be indexed in that method because the pooled within-school data set

simply removed school differences altogether.

An alternate correction procedure, based solely on the across-school matrix (the

original data set), was developed in order to examine independently the effects of school and

course-grading variations and how each factor was related to other variables and to group


differences. In this way each grading factor becomes a separate predictor variable. Two

grading factors were defined: a School Grading Factor (SGF) and a Course Grading Factor

(CGF), which are given by

$$SGF_i = \sum_{k=1}^{N_i} \big[ \mathrm{Pred}(HSA_k) - HSA_k \big] \big/ N_i$$

$$CGF_j = \sum_{k=1}^{N_j} \big[ \mathrm{Pred}(HSA_k) - CG_{jk} \big] \big/ N_j$$

Both were derived in a manner similar to the computation of K, described above. The

equations are comparable to the one for CGR, but simpler because we are not here working

with within-school deviation scores. Course-grade residuals for individual courses and HSA

residuals for individual schools were determined from predictions based on the regression of

HSA on the four NELS tests with original data in the total sample. When the grading factors

are associated with individual students, the SGF for a given school is assigned to all students

in that school. However, the index for course-grading variations for each student is the

MCGF, the mean CGF for the particular courses that student took.

The two grading factors, SGF and MCGF, were expressed on the same scale as HSA,

so they could be used either as predictors or as criterion corrections. Since each factor

represented a grading strictness “handicap,” the criterion correction simply entailed adding

the two factors to the high school grade average; i.e., HSA + SGF + MCGF (labeled

“HSA+2G”). In effect, this correction adds to or subtracts from each student’s HSA an

amount appropriate to the strictness or leniency of grading in the student’s high school and the

particular courses that the student elected to take.6 Applying the SGF correction alone gives

the lower panel of Figure 2. Incorporating SGF and MCGF in HSA+2G made it possible to

examine the relationship of variables to an original across-school HSA in which grading


variations were corrected. Using SGF and MCGF with other variables in their original form

is hereafter referred to as an “across-school” analysis (in contrast to a within-school analysis).
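The across-school correction can be sketched with the same hypothetical naming; `students` is assumed to be indexed by student id, with `enroll` the long-format enrollment table used above.

```python
import pandas as pd

def hsa_plus_2g(students: pd.DataFrame, enroll: pd.DataFrame) -> pd.Series:
    # SGF: mean residual HSA in each student's school, with `pred_hsa` from
    # the total-sample regression of HSA on the four NELS tests.
    resid = students["pred_hsa"] - students["hsa"]
    sgf = resid.groupby(students["school_id"]).transform("mean")
    # CGF per course; the study computed it only for courses with at least
    # 25 students, a filter omitted here for brevity.
    cgf = (enroll["pred_hsa"] - enroll["grade"]) \
              .groupby(enroll["course_id"]).mean()
    mcgf = enroll["course_id"].map(cgf).groupby(enroll["student_id"]).mean()
    return students["hsa"] + sgf + mcgf.reindex(students.index)   # HSA+2G
```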

Because an across-school analysis makes it possible to examine how the effects of

grading might vary among groups, this method was used in lieu of the within-school method

in most of the analyses based on students in different school programs, gender, and ethnic

groups. As we have noted, the across-school analysis has two shortcomings that influence the

grade-test correlation in opposite directions. First, the method is likely to overfit. Second, it

does not correct known scale distortions in other variables relevant to the analysis.

Nevertheless, the across-school analysis serves as a partial check on the within-school method

and provides additional evidence regarding the assumptions on which the methods are based.

Estimating Reliability

The proposed analyses required estimates of the reliability of high school grade

averages and the four NELS tests for two purposes. Reliability estimates were needed in

order to correct for measurement error as one factor in explaining discrepancies in grade and

test performance. Reliability estimates were also needed in order to adjust for any differential

effects of HSA reliability on analyses within different subgroups. It was assumed that HSA

reliability would vary to some degree because subgroups took varying numbers of the

academic courses on which HSA was based. The following reliabilities for the NELS tests

were reported by NCES (Rock, Pollack, & Quinn, 1995):

.85 Reading

.94 Mathematics

.82 Science

.85 Social Studies


Standard equations for the reliability of a composite (see Note 7) were used to estimate the

reliability of two other test variables:

.96 NELS-T The total NELS test score

.95 NELS-C The best weighted test composite for predicting HSA

Since HSA was based on the equally weighted average of four New Basics subject

area grade averages computed by NELS, its reliability was estimated as the reliability of an

equally weighted composite of the split-half reliabilities for the four area average grades.

Reliability estimates of HSA for the 10 subgroups (2 gender, 4 ethnic, 4 school programs)

were similarly derived. Reliability of HSA(T) was based on a simple split-half estimate for

the total grade record. Finally, reliability estimates were corrected for range restriction as

necessary using procedures described in Ramist et al. (1994, p. 10).
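The composite calculation referenced in Note 7 presumably follows the standard formula for the reliability of a weighted sum; a minimal sketch under that assumption:

```python
import numpy as np

# Reliability of a weighted composite, assuming uncorrelated errors: one
# minus the weighted error variance over the composite variance.
def composite_reliability(w, cov, reliabilities):
    w = np.asarray(w, dtype=float)
    cov = np.asarray(cov, dtype=float)
    rel = np.asarray(reliabilities, dtype=float)
    var_composite = w @ cov @ w
    error_var = np.sum(w**2 * np.diag(cov) * (1.0 - rel))
    return 1.0 - error_var / var_composite

# e.g., equal weights with the published NELS test reliabilities (the
# covariance matrix `cov` would come from the data):
# composite_reliability([1, 1, 1, 1], cov, [.85, .94, .82, .85])
```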

One further complication arose because grade averages sometimes included course-

grade corrections (e.g., HSA+K). This model corrects only for the unsystematic

measurement error inherent in grades and in test scores. Such measurement error is unrelated

to systematic grading variation, another factor in the approximation of the original

framework. Since neither grading variations nor any other factors are being corrected for

attenuation, the criterion component represented by course-grade corrections was assumed to

have no error; i.e., perfect reliability.7 This assumption is conservative with regard to

corrections for attenuation (i.e., the effect of Factor 3).


Results of the Analyses

The analyses revolved around the five factors assumed to affect the relationship

between the NELS tests and high school grade average: 1. Subjects Covered, 2. Grading

Variations, 3. Reliability, 4. Student Characteristics, and 5. Teacher Ratings. The first issue

to examine is how taking each of these factors into account alters the pattern of individual

differences on test scores and grade averages; that is, the multiple correlation between the

two. Following an overview of this analysis, each factor is examined more closely. Next is

an examination of group effects. This entails an analysis of differential prediction, and

finally, a condensed analysis of how the most important variables work for students in each of

the gender and ethnic groups and the four school programs. Descriptive statistics for all

variables are shown in Appendix A.

The Effects of Factors 1 to 5

Figure 3 shows the accumulating effect of Factors 1 through 5 in predicting the grade

performance of individual students; that is, the extent to which one can account for

differential grade performance. The right column describes the steps involved in making the

five adjustments; the left column characterizes the status of the grade-score relationship

before and after each adjustment. The correlation with grade average at each of those points

appears in the middle column. In considering the successive correlations in Figure 3, it is

good to remember that the correlations, and therefore the steps between them, are not

comparable in the sense of adding variables to a multiple regression. It is one thing to add a

variable, another thing to remove school differences, yet another to correct for unreliability.

Each step must be interpreted on its merits.


_ _ _ _ _ _ _ _ _ _ _ _ _ _
Insert Figure 3 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

The initial correlation, .62, is based on the simplest representation of grade average

and test score: essentially all course grades on the transcript and the total score for all four

tests. From that point, each of the five adjustments resulted in a tangible increment in the

correlation. Matching the subject matter of grade average and test score increases the

correlation by .06. Taking account of grading variations and measurement errors raises the

multiple correlation by .13. Taking account of student characteristics and teachers’ judgments

of students' behavior increased the multiple R another .09. The eventual multiple correlation

of .90 was based on the four NELS tests, 31 additional variables, corrections for school and

course-grading variations, and corrections for unreliability of grades and test scores. This

analysis increased the variance accounted for from 38%, based on the total NELS Test score alone, to 81%.

There is logic to the order of the factors. It is necessary, first, to move from the grade

and score in hand to the grade and test measures that are appropriate to the analysis (Factor 1),

to then correct errors in those measures (Factors 2 & 3), and lastly, to consider why students

perform differently on the “true” scores and grades (Factors 4 & 5). The last two factors are

of a different character because they add new predictors in order to help account for grade and

test score differences.

It is possible to entertain somewhat different orders, in which case the end result is the

same but the increments in R that are associated with each factor may vary. In general, a

factor will make less apparent contribution if taken into account after, rather than before,

another overlapping factor that makes a contribution. Reliability is a special case here. In


third position, the correction for unreliability increases the grade-score correlation by .05.

Were it placed last, as Factor 5, it would increase the multiple correlation by only .02.

Factor 1. Subject Match. The total NELS Test score and the average of all grades on

the school transcript are both summative measures of performance in high school, but they

differ even in the general skills and subject matter that they comprise. Bringing the two

measures into rough concordance involved two steps. One is to restrict the grade average to

courses in the four New Basics subject areas that best correspond to the four tests. Another is

to weight the tests so that a composite of the four (NELS-C) best represents (predicts or

reproduces) individual differences in the grade average. Together, both steps raise the

correlation between grade and test score from .62 to .68.
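The weighting step amounts to an ordinary least-squares fit of the grade average on the four tests; a minimal sketch, with illustrative array names:

```python
import numpy as np

# Factor 1, step two: weight the four tests so that their composite
# (NELS-C) best reproduces the grade average. `tests` is an
# (n_students, 4) array and `hsa` an (n_students,) vector; both are
# illustrative stand-ins for the NELS data.
def nels_c_weights(tests: np.ndarray, hsa: np.ndarray) -> np.ndarray:
    X = np.column_stack([np.ones(len(hsa)), tests])  # intercept + four tests
    beta, *_ = np.linalg.lstsq(X, hsa, rcond=None)
    return beta[1:]                                  # the four test weights
```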

Since the effects of the two steps will overlap to some degree, how much each

contributes to the correlation depends upon which comes first. As Table 4 indicates,

weighting the tests to predict the total transcript average HSA(T) increased the multiple

correlation by .03. If shifting to the more strictly academic grade average (HSA) had come

first, the correlation would have been increased by .02. Two other aspects of the data in Table

4 are noteworthy. The correlations and standard regression weights associated with the two

grade criteria are remarkably similar. The traditional academic courses represented in HSA

constitute about two-thirds of the courses in HSA(T), the remainder often being quite

different in surface character. The heavy weight on mathematics in concurrent prediction of

grade performance is also notable. Part of that heavier weight is due to the mathematics test

being longer and more reliable than the other tests. The small negative weight for the science

test does not appear to be a consequential finding but rather results from the collinearity of

that variable with the other tests.


_ _ _ _ _ _ _ _ _ _ _ _ _ _
Insert Table 4 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

Factor 2. Grading Variations. Correcting the grade criterion for variations in grading

standards was also a two step process. As previously described, school grading variations

were removed by subtracting the school means from all variables, thereby creating a pooled

within-school covariance matrix, which was then corrected for range restriction. In this data

set the multiple correlation between the NELS tests and HSA was .75, an increment of .07

attributable only to removing the variation in grading from school to school.

Step two was to remove variations in course grading. This involved adding to each

student’s grade average the constant, K, the average grading strictness of the courses taken by

that student. The multiple correlation of the NELS tests with this doubly adjusted HSAw+K

was .76, an additional increment of only .01 associated with differences in course-grading

standards. Using K as an independent predictor had no effect on the multiple correlation.

This result is in sharp contrast to the findings of Ramist et al. (1994). Their similarly derived

Z, an index of grading strictness in college courses, substantially improved Freshman GPA

predictions (to .58 from a multiple correlation of .48 based on the SAT and high school

average). The Z index of Ramist et al. (1994) correlated negatively with GPA, indicating that

students who take strictly graded courses earn somewhat lower grades. But in the present

analysis of high school grades, the comparable K index correlated positively with HSA.

At this point in the analysis it was not clear why K behaved unexpectedly. The

discrepancy between these results and those of Ramist et al. (1994) could be related to different

grading habits between school and college instructors or different habits of college freshmen

and high school students in selecting tough versus easy courses. The likelihood of such


school-college differences is considered in later discussion of possible educational

implications of our findings.

At this point it is desirable to examine two empirical questions that may be helpful in

interpreting the grading data. One question is whether school and course-grading variations

are highly correlated. If that were the case, the initial removal of school-grading variations in

a within-school analysis would also have removed much of the course-grading variations.

Thus, the impression in our analysis of a large school-grading effect and a small course-

grading effect would be due simply to having lumped most of the course-grading variations in

with the correction for school-grading variations.

Another empirical question concerns the consistency of the pattern of course-grading

strictness in high school. If the pattern of course-grading standards varies noticeably from

school to school, the sparseness of NELS data within individual schools would take on added

significance. In that case, the lack of sufficient data to represent course by school interactions

in grading variations would limit the effectiveness of K and, as a result, underestimate the

overall effect of grading variations. Two additional analyses were undertaken in order to

explore these questions further.

One analysis required indexing both school-grading variations and course-grading variations

for each student so that the two factors could be examined as separate predictor variables. As

described earlier, a School Grading Factor (SGF) and a Course Grading Factor (CGF) were

developed based on the original across-school data. Both SGF and CGF represented grading

strictness; that is, a positive score on CGF or SGF implies that the student would have

received, respectively, a higher course grade or grade average were it not for variation in


grading standards. Each was based on residuals from the regression of HSA on the four

NELS tests in the total sample.

The SGF for a given school—and for each of its students—was simply the average

residual HSA (i.e., predicted minus actual HSA) in that school. The index of course-grading

variations assigned to each student was the Mean Course Grading Factor (MCGF) for the

courses taken by that student. The development of MCGF was parallel to that of the Course

Grading Residual leading to K in the within-school analysis (described above). CGF was computed

for each course with at least 25 students enrolled in the total sample. Since MCGF and SGF

were on the same scale as HSA they could be used as criterion corrections (instead of

predictors) by simply adding the two factors to HSA.

As was the case in the within-school analysis, correcting for course-grading variations

by adding MCGF to the HSA criterion resulted in a relatively small addition (.015) to the

multiple R based on tests alone. MCGF had essentially no effect on the multiple correlation

when used as a predictor. As will be evident, SGF was quite useful as a criterion correction,

either when it alone was added to the NELS test or when it was used as one of 37 predictors.

These results were consistent with the previous within-school analysis, but puzzling

nonetheless. At least, this result established that the failure of course grading to play much of

a role in the within-school analysis was not because course-grading variations are highly

correlated with school grading variations and thereby removed from the picture when school

means were removed from the analysis. As here defined, the two factors are additive and

interdependent but do not appear to be strongly related in high school grading patterns. SGF

and MCGF correlated only .18.


The second possible reason why correcting course-grading variations did not work as expected remains to be considered; namely, the inability to take into account

differences in course-grading patterns from school to school. It was possible to determine

whether that was the case through an analysis of grading variations attributable to schools,

courses, and the interaction of schools and courses. This ANOVA was performed on residual

course grades in an across-school analysis, using HSA predictions based on the four NELS

tests. There were 225 courses in the four subject areas on which HSA was based. These 225

courses in 574 schools created a matrix with 129,150 cells, of which only about one in six

contained data. Table 5 shows the results of an analysis of variation among those cells,

weighted to account for unequal N’s and credits.8
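A sketch of this weighted decomposition follows, assuming a frame of non-empty school-course cells with hypothetical column names; the R-squared of the additive fit corresponds to the share reported in Table 5.

```python
import pandas as pd
import statsmodels.formula.api as smf

def grading_variance_shares(cells: pd.DataFrame):
    """`cells`: one row per non-empty school-course cell, with columns
    `mean_residual`, `school_id`, `course_id`, and `weight` (all hypothetical;
    `weight` reflects the unequal N's and credits)."""
    fit = smf.wls("mean_residual ~ C(school_id) + C(course_id)",
                  data=cells, weights=cells["weight"]).fit()
    return fit.rsquared, 1.0 - fit.rsquared  # additive share, interaction share
```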

_ _ _ _ _ _ _ _ _ _ _ _ _ _
Insert Table 5 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

Two main findings are evident in Table 5. First, 54% of the variation between cells is

attributable to the additive model, which assumes a course effect (represented by MCGF) and

a school effect (represented by SGF). The course-school interaction accounted for the

remaining 46% of the cell variation. Thus, the two variables that have been used to correct

for grading variations actually accounted for only approximately half of the systematic

variation in grading that is jointly associated with schools and courses. This result indicates

that our analysis substantially underestimates the extent to which discrepancies between

grades and test scores are due to grading variations. If the pattern of course-grading

variations could be determined within individual schools, the final multiple correlation in

Figure 3 would in all likelihood have been larger, though how much larger cannot be

determined with the data at hand.


Second, the ANOVA results indicate that the relative effect of course grading versus

school grading on discrepancies between grades and test scores is underestimated in our

analysis. The course main effect, which is the sole basis of MCGF in our analysis, accounted

for 14% of the variation in cell means (i.e., all course-grade residuals). The interaction,

representing 46%, is also most reasonably treated as variation in course grading. Thus,

differences in the pattern of course grading from school to school accounted for three times as

much grading variation as did the overall differences from course to course (e.g., 3rd year

Chemistry versus 4th year English). Had it been possible to reliably identify grading patterns

within schools, whatever enhanced grade prediction the interaction might provide would have

gone to MCGF. Thus, the ANOVA results clarify the misleading impression of the regression

analysis—that course-grading variations are not consequential. Such variations played a

minor role in this analysis only because limitations in the database precluded corrections for

the major source of inconsistency in course grading.

Factor 3. Reliability. In our five-factor model, variation in grading standards

represents a systematic error in HSA that can be associated with particular students and

situations. Unreliability, on the other hand, represents unsystematic measurement error. As

here defined, the two types of error are independent and additive as correction factors.

Unreliability in both grades and test scores contributes to observed differences between the

two. In meta-analyses of validity studies, it is appropriate to correct for unreliability of the

criterion alone (Hunter & Schmidt, 1982, p. 88). But in the present study the objective was

rather to remove the effects of observed discrepancies between grades and test scores due to

errors of measurement in either of the two.
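Because the aim is to remove the effects of measurement error in both measures, the familiar double correction for attenuation applies,

$$r^{*} \;=\; \frac{r_{GT}}{\sqrt{r_{GG}\;r_{TT}}}\,,$$

where $r_{GT}$ is the observed grade-test correlation and $r_{GG}$ and $r_{TT}$ are the reliabilities of the grade average and the test, respectively.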


The reliabilities for the NELS tests (.95 for total score) were reported above. No data

were readily available to determine whether the reliabilities of the NELS tests vary among

subgroups. They were, however, the same tests for all students or parallel forms that were

scaled together, and there is no indication that they would vary in reliability except due to

differences in range, which are correctable and not reflected in the error of measurement.

Other evidence suggests that the reliability of standardized tests of this type is not

vary consequentially across groups (Rock & Werts, 1979).

On the other hand, the reliability of a grade average could vary if there is variation in

the nature of the courses included in the average or in the amount of coursework on which the

average is based. Table 6 illustrates such an association between reliability and amount of

coursework. Among subject area averages for different groups, reliabilities varied from .63 to

.92. The lower reliabilities tend to be found in mathematics and science. In these areas

weaker students tend to take fewer courses, so reliability is lower due to fewer observations

and a somewhat restricted range. In Table 6, reliability estimates of overall average were

corrected for range restriction but those for subject averages were not.

_ _ _ _ _ _ _ _ _ _ _ _ _ _
Insert Table 6 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

Reliabilities for overall grade averages are quite high. The reliability of HSA for

vocational students is the apparent exception. Previously reported estimates of the reliability

of grade averages have tended to be substantially lower—from the low .60s to the low .80s

(Etaugh, Etaugh, & Hurd, 1972; Ramist et al., 1990; Werts, Linn, & Joreskog, 1978). Several

possible explanations come to mind. Estimates of the reliability of grade averages are

typically based upon samples with a restricted range, but here the range is quite broad since


all high school seniors are included. Furthermore, estimates of grade reliability are normally

based upon a more limited amount of coursework than one finds in a four-year high school

transcript.

Another consideration is that the reliability estimates were based on a conventional

odd-even split-half method that, in this context, may yield overly high reliabilities.

The odd-even method treats any variation in performance across years as reliable covariance.

An alternate interpretation of HSA reliability would treat yearly variation in performance as

unsystematic error. This latter definition would presumably yield a lower and arguably more

appropriate estimate of reliability because that would be more consistent with our use of a

concurrent analysis to control grade-test score differences. Figure 1 treats variations over

time as one source of such differences. In an analysis of college data, Humphreys (1968, p.

375) reported that, “A substantial amount of instability of intellectual performance over this

four-year time span is revealed.” His data also showed systematic differences in the pattern

of grade-test correlations from year to year.9 These data suggest that our corrections for

unreliability are likely to be conservative.

Range restriction does not appear to place much of a ceiling on the reliability of HSA.

Correcting for range restriction typically had relatively little effect. It is also noteworthy that

the reliabilities of HSA and HSA(T) are very similar for most groups and programs, despite

there being only two-thirds as much coursework represented in HSA as in HSA(T),

which is based on the full transcript. Reliability is lower for HSA than for HSA(T) in the

Vocational Program because those students take a smaller proportion of academic courses.

For other programs and groups, the reliability enhancing effects of more coursework in


HSA(T) is possibly offset by more homogeneous grading standards among the more strictly

academic New Basics courses included in HSA.

In any event, the reliability of the grade average (or the test score) alone was not one

of the larger factors in this accounting for observed differences in grades and test scores. The

lack of comparability of grades due to variation in grading standards was a more

consequential source of grade-test score discrepancy. Correcting for unreliable grades

increased the multiple correlation between grades and test scores by about .02,10 but correcting

for grading variations increased the correlation by .08. As was experienced in the

experimental Vermont state assessment a few years ago, however, low reliability can be a

significant problem in a high-stakes situation with a less traditional performance test (Koretz,

Stecher, Klein, & McCaffrey, 1994).

Factor 4. Student Characteristics. Table 7 shows the original correlations (across

schools) of 26 student characteristics with the total test score, NELS-T, and the transcript

grade average HSA(T). These variables were selected on their promise of showing some

relationship to performance in school. Each of the five categories, A to E, included variables

with at least a moderate correlation with performance. As expected, some of the relationships

were negative, as in the case of behavior involving Competing Activities. Also, as expected,

a number of the characteristics were more strongly correlated with grades than with test

scores. For seven of the 26 characteristics, the absolute value of the correlation was at least

.10 higher with HSA(T) than with NELS-T.

_ _ _ _ _ _ _ _ _ _ _ _ _ _
Insert Table 7 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _


In two instances, the absolute value of the correlation was .10+ lower with HSA(T)

than with NELS-T. That is not surprising in the case of SES, because the social advantages

implied by a higher SES would presumably act over a lifetime on the development of general

cognitive skill in and out of school. The test is more likely to reflect such general skills than

is the school average that focuses on specific learning objectives and behavior in the

classroom. Furthermore, the correlation between SES and grade average is likely to be

depressed by grading variations (an assumption supported by the corrected correlation in

Table 8, below).

The correlations with “Leisure Reading” on material unrelated to school (Variable 16)

were somewhat more of a surprise. Students who read more tended to have higher test scores

but not higher grades. The underlying reasons are unclear but may be similar to the case with

SES. A history of outside reading could raise general cognitive skills as do other advantages

of a high SES, but also take time away from schoolwork. The result would be less

opportunity for the achievement on specific course objectives that is required for good grades.

In either sense, leisure reading would be a competing activity with regard to relative

performance on grades and tests.

Table 7 also shows that the various student characteristics have a highly similar

pattern of correlations with HSA(T) and HSA. Conventional wisdom might suggest that

some significant reordering of students would occur with a shift from the full transcript

HSA(T) to the academic emphasis of HSA. The more important consideration is that the two

averages are based on substantially overlapping coursework and correlate .97. In particular

subgroups, the average HSA(T) tended to be .01 to .03 higher than the average HSA (see

Appendix Tables A-3 and A-4). The focus of the analysis now moves from HSA(T) to HSA.


This shift helps to satisfy the first specification of our five-factor model, an improved fit of

HSA with the NELS Test with respect to subject matter.

Table 8 traces the relationship of each student characteristic with HSA as several

corrections and controls are taken into account. The first data column shows the across-

school correlation of each characteristic with HSA—a repeat of the third data column in Table

7. All other data columns in Table 8 are based on within-school analyses corrected for range

restriction. Going from the first to the second column, one might expect some increase in the

relationship of these variables to HSA after grading errors in HSA are corrected (within-

school method). The correlations in the second column do tend to be somewhat higher than

the original correlations; only one is smaller in absolute value.

_ _ _ _ _ _ _ _ _ _ _ _ _ _
Insert Table 8 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

The third column of Table 8 shows the partial correlation of each student characteristic

with HSA when grading variations are corrected and scores on the four NELS tests are

controlled. This partial correlation gives the best indication of the extent to which each

characteristic is related to differential grade performance. Again, all five categories, A to E,

include variables that make such an explanatory contribution. All partials for competing

activities are negative, as are “Discipline problems” and “Stress at home.”
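These partial correlations can be sketched as the correlation between the two variables after each has been residualized on the four test scores; array names below are illustrative.

```python
import numpy as np

def partial_corr(x: np.ndarray, y: np.ndarray, controls: np.ndarray) -> float:
    """Correlation of x (a student characteristic) with y (corrected HSA),
    holding the columns of `controls` (the four NELS tests) constant."""
    Z = np.column_stack([np.ones(len(x)), controls])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # residualize x on tests
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # residualize y on tests
    return float(np.corrcoef(rx, ry)[0, 1])
```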

As would be expected, the partials are typically lower than the correlations in the

preceding column, but the partial for “Work completed” (#4) increased substantially. This

self-rating of homework habits was related to grades earned, despite a slightly negative

relationship to test scores. Thus, getting one’s schoolwork done presents a near mirror-image


of the situation—behaviorally and statistically—with “Leisure Reading,” where a small

positive relationship with HSA swung negative when the test score was held constant.

The beta weights in column 4 indicate which variables made an independent

contribution in accounting for grade performance. As would be expected, the weight for

many student characteristics was near zero. The interesting aspect of this stage of the analysis

was which types of variables remained in the picture. Variables 17 through 26, concerning Family Background and Student Attitudes, largely dropped out. It was mostly the behavioral

variables—especially those directly involved with school—that made an independent

contribution in accounting for grades earned.

Scholastic Engagement. As mentioned earlier in our review of pertinent literature,

researchers have looked hard for variables that might help in understanding why some

students work hard in school and others do not. Many studies have focused on measures or

circumstances that influence achievement, but several writers have taken a more holistic view, seeking to understand how students do or do not become involved or engaged in school. These and similar ideas have been variously applied to students’ behavior and attitudes (Finn, 1993; Hanson & Ginsburg, 1988; Lamborn, Brown, Mounts, & Steinberg, 1992; NCES, 1995; Newmann, 1992).

In seeking to understand the relationship of student characteristics to school

achievement, it seems helpful to distinguish overt student behavior from contextual factors

like family background and student attitudes. The distinction is especially useful in

accounting for differences in the two types of school outcomes: grade performance and test

performance. As Table 7 shows, the measures of Family Background and Student Attitudes

(categories D & E) are, for the most part, similarly related to grade average and test scores. It


is mostly the student’s behaviors (categories A, B, & C) that are more strongly related to grades and that seem, therefore, largely to determine whether a student’s differential grade performance (i.e., HSA relative to test scores) is high or low.

The measures in categories A through C all involve behaviors that represent different

aspects of being engaged in school: taking a demanding courseload, doing the work assigned,

and not being involved in competing activities. A number of those behavioral measures

showed a consequential partial correlation with HSA when grading variations and test scores

were held constant. As indicated in the third data column of Table 8, nine of the 16 behavioral

measures had a partial of ±.10 or larger.

Table 9 shows comparable partial correlations computed by subgroup, with the nine

measures listed in order of their partial in the total sample. The most obvious thing about the

table is the generally similar pattern of partial correlations across school programs as well as

gender and ethnic groups. Two differences are noticeable: a) taking advanced electives,

participating in class, and being involved in school activities played little role in defining

engagement for Vocational students; b) discipline problems and killing time played no such

role for Asian-American students.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Table 9 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

Table 10 shows the level of these behaviors in each group.

For most of the measures, a trend is evident across the four school programs. Moving

progressively from the Rigorous Academic program to the Vocational program, students are

likely to take substantially fewer advanced electives and have many more disciplinary


infractions. Similar but lesser trends are visible in the overall number of courses completed,

in attendance, in class participation, and in number of school activities.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Table 10 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

Differences in the incidence of these behaviors were typically small among the gender

and ethnic groups, though three exceptions are notable. Females and males showed a twofold difference in the incidence of disciplinary problems (26% vs. 51%). Also, females were generally more engaged in school, as here defined; in fact, females scored more positively than males on all nine measures. Lastly, there was a substantial difference in the number of

advanced electives taken by Asian-American students compared to African-American and

Hispanic students. On this measure, the standard mean differences, D, between the Asian-

American group and the latter two groups were .88 and .85, respectively.11
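As a point of reference, the standardized mean difference reported here is presumably of the conventional pooled-standard-deviation form (footnote 11 gives the report's exact definition):

$$D = \frac{\bar{X}_1 - \bar{X}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$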

In subsequent analyses Scholastic Engagement—or more simply, Engagement—was

defined as a composite based on the nine characteristics listed in Table 9. The nine

components were weighted in proportion to the partials for the total group. Engagement

correlated .56 with HSA and had a moderately strong relationship to Family Background and

Attitude measures. Indeed, the family and attitude measures were more closely associated

with Engagement (Multiple R = .59) than with High School Average (Multiple R = .51). This

correlational pattern involving Engagement, along with the pattern of partial correlations and

regression weights in Table 8, supports a commonsense view that it is largely the student’s

behavior that directly influences differential grade performance.
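A minimal sketch of how such a composite might be constructed, assuming (as the text states) that each standardized component is weighted in proportion to its total-group partial correlation; the variable names are placeholders, and the actual weights are those reported in Table 8:

```python
import numpy as np

def engagement_composite(X, partials):
    """Weighted composite of the nine engagement measures.
    X: (n_students, 9) array of the nine behavioral measures.
    partials: their nine total-group partial correlations."""
    z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each component
    w = np.asarray(partials)
    w = w / np.abs(w).sum()   # weights proportional to the partials
    return z @ w              # negative partials (e.g., discipline problems) enter with negative weight
```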

As the correlations in Table 11 suggest, Scholastic Engagement is more related to the

students’ attitudes than to their backgrounds. When these two sets of variables were used to


predict Engagement in the total sample, the multiple correlations were .57 and .41,

respectively. This finding is consistent with some writers’ supposition (Eccles, 1983;

Newmann, 1992) that involvement with schoolwork is intrinsically a matter of attitude.

Variable 26, the students’ judgment regarding their closest peers’ attitudes about education,

had a somewhat stronger correlation with Engagement than did Variable 20, the parents’

educational aspiration for their child (r = .44 versus .30 in the total group). That relationship

would appear to be compatible with the controversial argument that peers count more than

parents in the development of personality (Harris, 1995). Of course, parents do influence the

selection of friends. It is also notable that the results in Table 11 are very similar for males

and females.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Table 11 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

Factor 5. Teacher Ratings. Teacher ratings showed strong relationships with grade

performance and with differential grade performance. Four of the five ratings correlated from

.37 to .63 with HSA. A single 5-point rating on whether the student regularly turned in

assignments had a partial correlation of .56 with HSA, with grading variations controlled and

holding the test score constant—the highest partial for any variable that was examined. Three

of the five teacher ratings had significant beta weights in the overall regression analysis (see

Table 8). Teacher Ratings raised the multiple correlation by .04 (from .86 to .90), even after all other

factors had been taken into account.

Several aspects of these ratings make these results all the more striking. The ratings

do not represent a consensus judgment of all the teachers who knew each student. Ratings

were available from an average of only 1.6 teachers per student. The evaluations were not


very consistent from teacher to teacher. The reliability of an average rating by two teachers

ranged from .27 to .66 for Variables 27 through 31. Finally, the ratings were collected in the

middle of the sophomore year, not near the end of high school when the teachers would have

had a more extensive track record on which to judge.

By design, only those NELS Teacher Ratings that had a behavioral emphasis were

included in the analysis. In Variables 27 to 31, teachers largely described what students did in

school. The teacher ratings obviously overlapped with several of the student self-ratings

among Variables 1 to 26. The last two columns of Table 8 indicate that the Teacher Ratings

accounted for some, but not all, of the predictive variance associated with the student

characteristics. In the regression analysis based on 26 student characteristics, the larger beta

weights dropped somewhat when the Teacher Ratings were added, but not nearly to zero.

Both the teacher’s and the student’s judgments contributed in accounting for

differences in grade and test performance. The student is privy to information that the teacher

is not, but the teacher is probably more objective than most students would be in judging their

own behavior. The student and the teacher may have slightly different views of attendance

and completing assignments (Variables 1 & 27 and 4 & 31, respectively). Observe that there

are two ways in which such behaviors can directly influence grades earned. Effective

academic behavior can lead to added knowledge and skill that gets reflected in higher grades.

The student’s behavior can also result in points being added to or subtracted from course

grades, irrespective of knowledge and skill actually acquired. Teachers not only provide an

independent view of the student’s behavior; they also assign the grades in ways that reflect

their pedagogical values.


A Teacher Rating Composite (TRC) was developed for use in some subsequent

analyses. TRC was based on the best-weighted average of the five ratings for predicting

HSA. All five ratings contributed to that prediction; the heaviest weights went to Work

completed (.33), Educational motivation (.28), and Class behavior (.11). The TRC composite

correlated .68 with HSA. Engagement, based largely on information supplied by the students,

correlated .56 with HSA.
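Since a "best-weighted average" for predicting HSA is, in effect, a least-squares regression composite, the TRC could be formed along the following lines; this is a sketch under that assumption, with hypothetical array names:

```python
import numpy as np

def teacher_rating_composite(ratings, hsa):
    """Regress HSA on the five teacher ratings; the fitted values
    serve as the composite, and beta[1:] holds the five weights.
    ratings: (n_students, 5); hsa: (n_students,)."""
    Z = np.column_stack([np.ones(len(hsa)), ratings])
    beta = np.linalg.lstsq(Z, hsa, rcond=None)[0]
    return Z @ beta, beta[1:]
```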

It is possible that Teacher Ratings help to account for grade-test score differences for

yet another reason. A teacher’s positive or negative bias regarding some students may be

reflected in the teachers’ ratings as well as in the grades they assign. To the extent that occurs, it is, for our purposes, another real difference between grades and test scores. But if

these Teacher Ratings were heavily based on such “halo effects,” it seems unlikely that three

of the five teacher ratings would each make an independent contribution to the multiple

correlation.

Finally, do Teacher Ratings contribute to grade prediction mainly because of the

information that ratings add or because of confounding? That is, do individual teachers base

high or low student ratings on the grades students earn in their particular class or on

knowledge of the student’s past grade record? Two considerations argue against confounding

being a big factor. First, the one or two teachers who rated each NELS student constitute only

a small fraction of the teachers who assigned grades to that student over four years. Second,

the ratings were collected relatively early in high school and therefore act more as predictors

than as concurrent correlates.

Nevertheless, the possibility of spurious effects due to confounding of grades with the

predictors is a legitimate concern. Each of the 31 predictor variables in Table 8 was


examined for possible susceptibility to confounding; that is, the likelihood that the measure

could reflect grade performance rather than account for grade performance. Considering the

nature of the measures, three variables appeared to have the greatest likelihood of a

relationship to HSA due to confounding: #7 Advanced electives, which attract students with

good grades; #23 Educational plans, which are likely to be optimistic if the grade record is good; and #30 Educational motivation, which teachers may be prone to rate high simply

because the grades are high. When all three of these variables were removed from the final

regression analysis shown in Table 8, the multiple correlation was reduced by .007. Since

other variables largely accounted for the predictive variance contributed by these three

seemingly most suspect measures, the role of confounding does not appear to be

large.

Differential Prediction

As already noted, an increment in the multiple R is one index for evaluating the effects

of identifying and adjusting for sources of discrepancy between grades and test scores.

Another index is change in the extent of differential prediction. The former concerns

individual differences in grades and test scores; the latter concerns group differences. Figure

4 shows the extent to which the HSA of various groups of students was over- or

underpredicted on the basis of the NELS Tests and the accumulating effects of adjusting for

grading variations, student characteristics, and teacher ratings. Results are shown for students

in four subgroups and four school programs. Predictions were based on the regression line for

the total sample and did not involve any correction for unreliability.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Figure 4 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _


Among the eight groups, initial differential predictions based on the NELS tests alone

ranged from +.13 to –.14; the average was .09, disregarding sign.12 When grade predictions

were based on additional information (moving to the right in Figure 4), differential prediction

diminished in all cases where there was originally any consequential differential prediction.

The lines converge toward zero as Grading Variations, Student Characteristics, and Teacher

Ratings are taken into account. With corrections for all three factors, the absolute level of

differential prediction (in column 4) averaged about two hundredths of a letter-grade. Table

12 shows predicted and actual grades of each group at each stage of the analysis. When

grading variations are corrected (moving from column 1 to column 2 in Table 12), both the

actual and predicted mean grades for subgroups were adjusted somewhat because all school

means were removed. Moving from column 2 to 4, only the predicted mean grades were

adjusted.
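Computationally, differential prediction as used here is just the mean residual within each group from a regression equation fit to the total sample. A minimal sketch (group labels and array names hypothetical):

```python
import numpy as np

def differential_prediction(predictors, hsa, groups):
    """Mean of (actual - predicted) HSA within each group, with the
    prediction line estimated on the total sample."""
    Z = np.column_stack([np.ones(len(hsa)), predictors])
    beta = np.linalg.lstsq(Z, hsa, rcond=None)[0]
    resid = hsa - Z @ beta
    return {g: resid[groups == g].mean() for g in np.unique(groups)}

# Positive values indicate underprediction (actual grades above
# predicted); negative values indicate overprediction.
```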

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Table 12 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

The three corrections affected the groups somewhat differently. Grading variations

had a small effect on differential prediction for most groups but a fairly large effect for

African-American students. As reported above, it was mainly school grading variations rather

than course-grading variations that affected the accuracy of concurrent grade prediction.

African American students were slightly overrepresented in schools that graded more

strictly.13

Otherwise, there appears to be little if any systematic relationship in these data between

grading standards and subgroup representation from school to school (see SGF means for

subgroups in Appendix Table A-4). These somewhat surprising results are inconsistent with


an assumption that African American and Hispanic students are more likely to benefit from

easy grading due to being overrepresented in poor schools.14

Taking student characteristics into account reduced grade underprediction (by –.05 to

–.07) for three groups: women, Asian-Americans, and students in Rigorous Academic

programs. These groups tended to achieve higher grades than one might expect on the basis

of the NELS Test. Correcting for Student Characteristics had a more substantial effect (+.11)

in accounting for the underprediction of the grades of Vocational students. These effects were

notably associated with different levels of Scholastic Engagement. The standard mean

difference (D) in Engagement between males and females was .48. The difference in

Engagement between students in Rigorous Academic versus Vocational programs was

substantially larger (D = 1.41). Teacher Ratings tended to reduce differential prediction

slightly for all groups.

A frequently noted gender difference in differential prediction (Bridgeman et al.,

2000; Willingham & Cole, 1997) was consistently observed in these data as well. The total

differential prediction by gender (absolute difference for males and females) ranged from .18

to .25 for all ethnic-racial groups: African-American, Asian-American, Hispanic, and White.

When predicted and actual grades were computed for male and female students within these

four groups, there was a wider spread of results but the same convergence to low levels of

differential prediction when additional information was taken into account. Actual minus

predicted grade ranged from +.21 (Asian-American females) to –.23 (African-American

males). For all groups, the mean absolute value of differential prediction was .12 based on

tests alone, and .03 based on all information. Table 13 shows all predicted and actual grades

by gender within ethnic group. These results closely match those of Bridgeman et al. (2000,


Table 6), which were based on a comparison of college grades and admissions test scores.

The main exception is greater underprediction of the grades of Asian women in the high

school data.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Table 13 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

Differential validity. In these analyses of differential prediction, the objective has

been to identify and evaluate the effects of different sources of discrepancy in the grade

performance and test performance of groups of students. Differential validity is a related

issue. A following section—Gender, ethnicity, and school program—takes up the question of

whether the several major factors under study here show similar correlational patterns for

different groups. But it is first useful to look for possible patterns of differential validity

based on the NELS Test alone. Table 14 shows correlations, multiple correlations, and mean

score levels for six subgroups, four school programs, and the total sample.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Table 14 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

The correlations in Table 14 are corrected for range restriction, grading variations, and

unreliability. Consequently, the correlations are more comparable across groups and tests

than is normally the case in inspecting such data. First-order correlations show a quite similar

pattern across groups. Except for the vocational students, NELS Mathematics consistently has the highest correlation with HSA. Because of that similar pattern from group to

group, the multiple correlation with HSA is typically very close to the correlation between

HSA and the NELS Composite, which is based on weights derived from the total sample.16


Based on either the multiple R or the composite, the corrected correlation between the test and

the grade average ranges fairly consistently from the low .70s to the low .80s.
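The corrections mentioned here presumably follow standard psychometric practice. For unreliability, for instance, the disattenuated correlation between test and grade average takes the familiar form (with $r_{xx'}$ and $r_{yy'}$ the reliabilities of the two measures):

$$\hat{r} = \frac{r_{xy}}{\sqrt{r_{xx'}\, r_{yy'}}}$$

The range-restriction adjustment would likewise use the standard formulas; the report's methods section governs the exact procedures applied.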

The correlational patterns indicate that the four NELS Tests function quite similarly

from group to group with respect to their relationship to grade performance. There were two

main differences. As has been frequently reported, tests tended to be somewhat better

predictors of women’s grades than of men’s grades. Also, the grades of vocational students

were not predicted as well as those of students in academic programs. That was true despite

HSA being based only on academic courses.

Furthermore, performance of the individual groups is, on the whole, similar on each of

the four tests. On the other hand, there is a range of approximately one standard deviation in

mean test performance among the ethnic groups and among the school programs. In the

following paragraphs, those differences are compared with differences on other major

variables.

A Condensed Analysis of Major Factors

To this point, our analysis of differential grade performance has involved a large

number of variables, which makes interpretation somewhat unwieldy. The results suggest

that each of the major factors contributing to differences between grades and test scores might

be represented with little loss by a single variable or composite. If that is the case, a much-

condensed analysis of the major variables could help to clarify relationships among the

several factors. Furthermore, one of the purposes of the study was to test the generality of the

results within subgroups of students. A simplified basis for describing how grades and test

scores work for key groups of students would likely offer analytic and heuristic benefits.


It is clear that school grading variation is one major variable that would need to be

included in such a condensed analysis. Since there are zero school differences in a within-

school data matrix, it was necessary to use an across-school analysis in order to index school

grading variations as an independent variable affecting each student’s grade average. This

analytic approach raises first the question of how the results of the within-school and the

across-school analyses compare (see pp. 61-68 for a description of the two methods). Table

15 shows the outcome for the two approaches using all 37 variables plus corrections for

unreliability of grades and test scores. In the across-school analysis, school and course-

grading variations were treated as predictors (column 2) and as criterion corrections (column

3). In both cases the across-school analysis uses the residual method of correcting grading

variations.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Table 15 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _
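To make the two approaches concrete, here is a minimal sketch assuming, per the description above, that the within-school method subtracts school means from every variable and that the residual method indexes school grading strictness as the part of a school's mean grade not explained by its mean test score; the column names are placeholders:

```python
import numpy as np
import pandas as pd

def within_school(df, cols, school="school_id"):
    """Within-school analysis: remove each school's mean from every variable."""
    return df[cols] - df.groupby(school)[cols].transform("mean")

def school_grading_factor(df, school="school_id"):
    """Residual method: regress school mean HSA on school mean test
    score; the residual indexes each school's grading strictness."""
    m = df.groupby(school)[["hsa", "test"]].mean()
    Z = np.column_stack([np.ones(len(m)), m["test"].to_numpy()])
    beta = np.linalg.lstsq(Z, m["hsa"].to_numpy(), rcond=None)[0]
    sgf = m["hsa"] - Z @ beta
    return df[school].map(sgf)  # assign each student the SGF of his or her school
```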

Overall, the within-school and the across-school (residual) methods of correcting

grading variations gave comparable results. The pattern of beta weights for the various

Student Characteristics and Teacher Ratings is highly similar in the two types of analysis.

The negative weight for the NELS Science Test is apparently due to collinearity and should

not be taken seriously. As Table 14 indicates, the Science test had a strong positive

correlation with HSA in all groups save the Vocational students.

As expected, the within-school analysis was somewhat more successful in accounting

for grade performance than was the across-school analysis; the multiple R was .90 in the

former and .88 in the latter. The within-school analysis removes school-related scale

differences for all variables; the across-school analysis adjusts only for grading differences.


Thus, the across-school approach limits the tendency of scale corrections to enhance the

multiple correlation. Apparently this limitation outweighed the tendency of the across-school

method to inflate the correlation somewhat through overfitting. Subtracting the school means

from all variables in the within-school analysis could also remove real school differences that

affect both grade and test performance. The results suggest, however, that subtracting the

means removed more noise than signal.15

Table 15 also shows that quite similar results were obtained when SGF and MCGF

were employed either as criterion corrections or as predictors. The two methods of handling

grading variations both gave an overall multiple of .88 and a pattern of regression weights that

was highly similar to that of the within-school analysis. The main difference was where the

grading variations showed up. They appear as somewhat higher beta weights for the test in the

within-school analysis and when SGF and MCGF are used to correct the criterion (columns 1

& 3). They appear as SGF and MCGF beta weights when those grading factors are used as

predictors (column 2). The two grading factors, separately indexed, offer an additional

perspective on grading variations in a solution that is almost as effective as the within-school

analysis.

The following four composite variables gave essentially the same level of predictive

accuracy as the analysis based on 37 variables:

• NELS Test Composite (NELS-C). The NELS total test score was the initial predictive

baseline. This best-weighted composite of the four tests as predictors of HSA (in lieu of HSA(T)) incorporates Factor 1 in the model.

• School Grading Factor (SGF). The Course Grading Factor (MCGF) correlated .31 with

HSA and .47 with NELS-C. As Table 15 shows, the course-grading correction (MCGF)


added little if anything to the multiple correlation of the NELS tests with HSA. That was

true for every group (as later shown in Tables 19 & 20). Thus, SGF comprises the useful

part of Factor 2 and represents the second major variable in the condensed analysis.

• Engagement Composite. Typically, the student characteristics most directly associated

with differential grade performance were the behavioral variables. The Scholastic

Engagement Composite includes the nine behavioral variables that were most effective in

that regard, each weighted in proportion to its contribution. Thus, Engagement is included

as a third major variable on the assumption that it incorporates most of the useful variance

among the 26 variables that originally defined Factor 4.

• Teacher Rating Composite (TRC). This best-weighted composite of the five Teacher

Ratings in predicting HSA incorporates Factor 5.

Table 16 shows correlations among these variables and HSA. Since the correlations

are corrected for unreliability of the test and the HSA, Factor 3 is incorporated in addition to

the four factors to which the four variables above refer. A noteworthy aspect of Table 16 is

the multiple correlation of .876, which is within rounding error of the multiple R of .884

based on 37 predictors in the comparable across-school analysis just reported in Table 15.

Condensing the analysis to four major predictors lost little, if any, information in accounting

for grade average.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Table 16 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

Of these four major predictors of grade performance, the NELS Test had the largest

correlation with HSA, though the Teacher Rating was a close second. Each of the three non-

test variables correlated substantially higher with HSA than with NELS-C, and each had a


consequential beta weight in the multiple regression. The weight for the NELS Test was

clearly larger than that of Engagement or Teacher Rating, in part because the latter two

variables overlap as previously discussed. School Grading (SGF) had a more modest

correlation with HSA, but it made a substantial contribution in accounting for differential

grade performance because it was largely independent of the other variables.

Gender, Ethnicity, and School Program

Table 17 shows correlations and multiple regression results in a condensed, four-

variable analysis for the two gender and four ethnic groups. As in the

condensed analysis for the total group, little information was lost in the subgroup analyses in

going from the 37-variable to the 4-variable analysis. In the latter case, the multiple Rs are all

at nearly the same level. The striking thing about these regression analyses is the similarity of

results across groups. Almost without exception, the critical results that were noted above for

the total group also characterized each gender and ethnic group.

Had we carried out this and the following analysis by gender within ethnic groups,

some additional distinctions would no doubt have emerged. Considering that the gender by

ethnicity breakdown produced comparable overall results in Table 12 and Table 13 and the

fact that gender differences were quite similar from group to group, we used the less detailed

analysis here for the sake of a simpler presentation.

In all groups the NELS Test and the Teacher Rating had the strongest relationships

with grade average. In each group, each of the non-test predictors—Engagement, Teacher

Rating, and School Grading—had a substantially higher correlation with HSA than with the

NELS Test. Consistently, the Teacher Rating had a moderately high correlation with

Engagement. School Grading also showed a fairly consistent, moderately negative


relationship with HSA from group to group, though that was likely due in large part to the

happenstance of each group being more or less equally represented among schools that graded

more and less strictly.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Tables 17 & 18 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

As a result of this consistent pattern of correlations, the standard regression weights

were also quite similar across groups. The NELS Test and the Teacher Rating, in particular,

carried almost exactly the same weight in all gender and ethnic groups. Indeed, the only

differences of any note concerned the Asian-American students. In this group, compared to

students generally, Engagement was somewhat more important, and School Grading was

somewhat less important in accounting for differential grade performance.

Table 18 shows results for the condensed multiple regression analysis for each of the

four school programs. Results of this analysis showed important parallels to the analysis by

gender and ethnic subgroups—as well as some important differences. Here again, the four-

variable analysis yielded a multiple correlation almost as high as the 37-variable analysis in

all groups. Furthermore, Engagement, Teacher Rating, and School Grading were consistently

more highly related to grade average than to test score. There are, however, progressive

changes in the predictive pattern as one moves, left to right, across the table from the more

academic to the less academic school programs.

In general, the correlations tended to be lower in the less academic programs. Grade

performance was somewhat less predictable; the multiple R dropped progressively from .88 in

the Rigorous Academic program to .73 in the Vocational program. The correlation between

the NELS Test and HSA dropped from .71 to .40 even though both measures were based on


performance in traditional academic subjects for the students in each of the programs. On the

other hand, the Teacher Rating was a stronger predictor in the less academic programs.

Among Vocational students, Teacher Rating had a larger regression weight than did the test.

Two factors may be at work in producing lower correlations between test scores and

grade averages in the more vocationally oriented programs. A somewhat broader range of

competences in the more vocational programs may influence teachers’ judgments and their

grading. Also, the widely reported practice of social promotion among academically weaker students may play a role. If many teachers are inclined to pass students partly on the basis of effort, as has been reported (Public Agenda, 2000), the practice might well produce, in vocational programs, a correlation with grades that is lower for the test and higher for the teacher rating.

Table 19 shows results of multiple regression analyses for gender and ethnic groups

based on the full set of 37 variables with reliability corrections as before. These regression

weights present a very consistent picture from group to group. The parallel Table 20 for

school programs shows some differences in the dynamics of educational achievement in the

more academic versus the more vocational programs. As one moves along the academic

continuum from Rigorous Academic to Vocational programs of study, competence in reading

(NELS Reading) becomes a more important predictor of differential grade performance, and

the Science test evidently becomes less relevant. Some unsystematic fluctuations in weights

for the four NELS tests from group to group likely reflect instability due to a high degree of

collinearity among the tests after scores and grade averages were corrected for unreliability.

_ _ _ _ _ _ _ _ _ _ _ _ _ _

Insert Tables 19 & 20 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _


Class behavior and educational motivation evidently have more influence on teachers’

ratings of vocational students than of academic students. For vocational students, that judgment

of motivation was the best predictor of HSA (highest correlation and highest regression

weight) among all of the student characteristics and teacher ratings. For reasons that are

unclear, some variables took on negative weights for the vocational students in the full

regression analysis. For example, turning in assignments appears to be just as important for

Vocational students as it is for academic students, but spending a lot of time on homework

was associated with poor grades. Since the Vocational group is not large, chance fluctuations

are likely responsible for some of these aberrant results. Effects that appear in both the

Academic-Vocational and the Vocational groups, or that change progressively across the academic-vocational continuum, are likely to be the most dependable.

The previous discussion was concerned with group differences in the role of several

major variables in explaining grade performance; that is, whether the factors that influence

grade achievement are similar or different from group to group. A separate issue is whether

subgroups tend to score at a similar or different level on such measures. Figure 5 shows

profiles of average scores for HSA and the three major composite measures that reflect

individual differences: the NELS Test, Scholastic Engagement, and the overall Teacher

Rating. School Grading is not included here because it reflects school, not individual,

differences. Each of these measures is expressed on a standard scale with a mean of 50 and

standard deviation of 10. The horizontal lines at 48 and 52 are shown in order to aid

interpretation (± .20 standard deviations being, by convention, the lower boundary of a so-

called “small” difference). Actual means and standard deviations are shown in Appendix

Tables A-7 and A-8.
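The standard scale in Figure 5 is a simple linear rescaling of standardized scores, assuming the usual convention implied by the stated mean of 50 and standard deviation of 10:

```python
import numpy as np

def to_standard_scale(x):
    """Rescale a measure to mean 50, standard deviation 10."""
    x = np.asarray(x, dtype=float)
    return 50 + 10 * (x - x.mean()) / x.std()
```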


_ _ _ _ _ _ _ _ _ _ _ _ _

Insert Figure 5 about here
_ _ _ _ _ _ _ _ _ _ _ _ _ _

The four measures in Figure 5 can be viewed as somewhat different indicators of

student performance in school. In that sense, the profile of average scores for each group

gives some indication of how the groups are similar and different with respect to their

scholastic achievement. Looking first at the top panel of Figure 5, score levels differed

substantially and in a rather consistent pattern from program to program. Students in the

Rigorous Academic program had almost the same consistently high mean scores on HSA,

NELS-C, Engagement, and Teacher Rating. Students in the Academic, Academic-

Vocational, and Vocational programs also had generally similar mean scores on those four

measures, but at progressively lower levels. The range of mean scores across the four school

programs was quite similar for each measure—a difference of 1.2 to 1.5 standard deviations

from Rigorous Academic to Vocational. More important, the pattern was much the same for

each measure.

The group means in the lower panel of Figure 5 tell a generally similar story, but with

important differences. Each of the six groups tends to show a characteristic pattern of

performance, above or below average to some degree. Mean scores for the four ethnic

groups—Asian American, White, Hispanic, and African American—reflect a pattern of scores

almost as divergent as that of the four school programs. Mean scores on the four measures

were not always as consistent for a given gender or ethnic subgroup as was typically true for

students in one of the four school programs. Females had slightly higher means than males,

but gender differences on these broad measures were typically less than “small”; that is,

between the lines designating ± .20 standard deviations. Both the Hispanic and the African


American students, particularly the latter, tended to score lower on NELS Test and HSA,

compared to Engagement and Teacher Rating. This finding could be due partly to

noncomparability of ratings and student-reported information, since these groups are more likely to cluster in particular schools.

Results here show the familiar tendency of women to score somewhat better on the

grade average than on the test (.16 standard deviations). Recall that the superior grades of women compared to men were consistently observed in all ethnic groups (see column 1 in

Table 13). To a considerable extent, that gender difference appears to be attributable to

women being more involved in school. Women scored higher than men on all nine of the

measures that defined Scholastic Engagement. The results here are notable for the generally

similar performance of each ethnic group on mean test score and mean grade average. This

finding in high school stands in contrast to the relatively low average grades that have been

reported for some minority students in selective colleges. Bowen and Bok reported that,

among students with comparable SAT scores, 50% of White students versus 25% of Black

students ranked at or above the class median in grade performance (1998, Figure 3.10).


On Four Noteworthy Findings

The previous section described in some detail a number of results from the various

analyses that were undertaken. What were the principal outcomes? Broadly speaking, four

findings seem especially significant and warrant further discussion.

First, our premise that it should be possible to account for most of the observed

differences between grades and test scores proved to be largely accurate. Several other

factors possibly explain the remaining differences. Second, grading variation is clearly a

major source of discrepancies between observed grades and test scores. Taking those

variations into account may prove problematic in practice, however, because patterns of

course grading vary from school to school and are not likely to be reliably identifiable. Third,

Scholastic Engagement defines a logical pattern of successful academic behavior—an

organizing principle that holds promise for studying and improving achievement in school.

Fourth, subgroups often differed in achievement level but were mostly quite similar in

achievement dynamics—including generally consistent average performance on grades and

tests.

Each of these findings is addressed in the following pages. The discussion closes with

some reflection on two questions. First, what general implications might the findings have

regarding our evaluation of the validity and fairness of grades and test scores? Second, what

do the findings imply regarding the particular strengths of these measures when they are used

in high-stakes decisions?

In considering the four findings, it is useful to call attention again to limitations in the

scope of this study and to caution against overgeneralizing the results. The scope of the study

is defined by the nature of the data and the context from which they are derived. Clearly, the


results are most directly applicable to high school performance. At lower grade levels or in

advanced education, the relationship between grades and test scores will certainly engage

additional issues not considered here. In principle, however, it is likely that the major reasons

for differences in grades and test scores will, in varying degree, transcend a particular

educational situation.

The scope of the study is also defined by the specific objectives of the analysis. The

presenting issue in this study was why students score differently on grades and tests. The

statistical connection is obviously important because high-stakes decisions depend on score

level. Using one measure or the other results in some difference in the group of students

selected. Therefore, understanding the main reasons why grades and tests yield somewhat

different results should help to inform questions concerning their validity and fairness.

Nevertheless, this study does focus on how current high school grades are related to a set of

pertinent standardized test scores, not on the specific content of each measure or what they

ought to represent from an educational or social perspective.

The generality of the results also depends upon the stability and representativeness of

the sample. This national sample is large and includes good representation of important

subgroups. Nevertheless, there were constraints on the sample such as availability of requisite

data and loss of those students in schools with fewer than 10 NELS participants. An effect of

the latter constraint was to exclude many students who moved during high school. The effect

of such sample losses is impossible to evaluate with any accuracy. For example,

underrepresentation of students who move during high school may influence these results,

even though this constraint appeared to have limited effect on either the mean test scores or


the ethnic representation. In interpreting the results of this exploratory analysis, the sample

constraints should be kept in mind.

Accounting for Grade-Test Score Differences

The premise of this study was that an approximation based on five factors could

account for most of the observed differences in grade performance and test performance. The

premise was largely accurate. Adding corrections and supplemental information that bear on

the five proposed factors did account, to a substantial degree, for individual and group

differences in grade and test performance. One index of that result was an augmentation in

the correlation between the NELS Test and high school grade average from .62 to a multiple

correlation of .90 based on the test plus 31 additional variables and corrections for

unreliability and grading variations (see p. 72). A corollary finding was a similar effect on

group differences. Taking all information into account reduced average differential prediction

to two hundredths of a letter-grade for four subgroups and four school programs—about one-

quarter of the differential prediction based on test scores alone.

An important related finding was that grade performance could be explained to a large

extent with only four composite variables: a test covering a similar academic domain, school

grading variations, student engagement, and an overall teacher rating. These four “major”

variables yielded a multiple correlation only .008 less than the multiple based on otherwise

comparable analyses including all 37 variables (Table 15 versus Table 16). The significance

of this result lies in the simple accounting that it permits. Individual differences in grade and

test performance need not be conceived as an amorphous array of numerous conditions and

student characteristics. Rather, accounting for school performance can be usefully viewed as


mainly dependent upon a few recognizable variables that are readily studied and evaluated in

relation to educational practice and social policy concerns.

To what extent does the five-factor approximation actually succeed in accounting for

grade-test score differences? An average differential prediction of .02 letter-grades suggests,

in absolute terms, little room for improvement in the accuracy of accounting for group

performance. On the other hand, a correlation of .90 leaves 19% of the grade variance of

individuals unaccounted for. In part, this is because the effect of each of the five factors was

probably underestimated; viz., Factor 1 (subject match) because curricula vary, Factor 2

(grading variations) because variations are known to be larger than those here corrected,

Factor 3 (reliability) because our correction for HSA reliability was conservative, Factors 4

(student characteristics) and Factor 5 (teacher ratings) because of technical problems in

measuring those variables accurately.

A useful perspective on the results is to consider, in more general terms, what types of

differences between grades and test scores we are likely to be missing in this analysis. As

Thorndike (1963) once argued, identifying all causes of difference between grade

performance and test performance would require intensive study of individual students. It is

possible, however, to identify some additional factors that likely account for most of the

remaining variance. Five come to mind. They concern limitations just cited in the

effectiveness of the five-factor analysis as well as other factors that could not be taken into

account. In considering these several types of grade-test score difference that are at least

partly missing, recall that the test is assumed to be external in some sense, i.e., not a

classroom or school-based test.


Curriculum variation. By intention, educational objectives vary somewhat among

districts and schools, teachers vary somewhat in the material they teach and assess, and

students take different courses and focus on somewhat different learning goals. Accordingly,

grades are based on a syllabus that varies to some degree across schools, classrooms, and

individuals. Tests are designed to avoid content that is unique to particular learners or

learning situations. A good standardized test is intended to sample content from the common

ground of the curriculum in order to provide a common measure that is fair to all students.

Grades, on the other hand, are intended to reflect the diverse learning of different students in

different situations. Thus the test content is constant, but the substance of the grading

standard necessarily varies from student to student. There is no way to adequately “correct

for” that fundamental difference between grades and tests.

Construct differences. Aside from variations in curriculum, it is in the nature of

teachers’ grades and external tests to focus on somewhat different learning outcomes. Grades

will necessarily reflect a broader range of knowledge and skills than can be represented in a

test of limited length and more restrictive modes of assessment. Grading may also be

influenced to some degree by broad educational objectives like leadership and citizenship, but

tests are normally limited to more traditional cognitive skills and subject knowledge. Abstract

reasoning is more likely to be stressed on a standardized test; performance skills are more

likely to be reflected in grading. Class grades may be somewhat more influenced by writing

or expositional skills, tests by test-taking skills. Some learning opportunities may be

unique to graded class assignments; other learning may stem mainly from out-of-school

experiences.


Scholastic Engagement and Teacher Ratings presumably account for some of the

curriculum and construct differences just outlined, but the database for this analysis contained

little information concerning student differences that would be directly pertinent to curriculum

or construct features intrinsic to grades versus test scores. Considering the many possibilities

for curriculum and construct differences, we might not expect the relationship between grades

and test scores to approach 1.00. From this perspective, one might even argue that with

careful evaluation of each student’s performance a multiple correlation with grade average of

.90 sounds higher than it ought to be.

Temporal variations. The NELS Test and the NELS Questionnaire were administered

in the senior year, but the HSA was based on grades earned in four different years. Students

change over time, and they do not necessarily perform the same in school from one year to the

next. Therefore, the time factor was not fully controlled, and the apparent concurrent analysis

was only partly that. Earlier research has shown that grade-test correlations can vary with the

length of time separating the two measures, as can intercorrelations among term-to-term grade

averages. Such temporal variations were not taken into account in either the correlational

analyses or in the corrections for unreliability of HSA.

Technical shortcomings. Several shortcomings have been cited regarding the

measures and statistical methods used in the analysis. The accuracy of questionnaire data is

always open to question. That is probably especially true of data from graduating seniors.

The effects of unreliability in grades and test scores were likely underestimated. There were

significant missing data in this analysis, particularly among the Teacher Ratings that came

from only one or two teachers. The reliability of those ratings was quite modest, and the data

were collected relatively early in the students’ high school program. Both of the statistical


methods that were used to adjust for school grading variations had known deficiencies. All

such factors would likely limit an accounting of grade-test score differences.

Grading standards. While variation in grading standards from school to school was

found to be an important source of differences between grades and test scores, other grading

variations such as differences among instructors and sections could not be identified in the

database. It is more consequential that schools do not follow the same pattern of grading

strictness from course to course. A major shortcoming of this analysis was the lack of

adequate grade data in order to correct for those different patterns. Aside from leaving an

important source of grade-test score difference unaccounted for in this study, this data

limitation has more general implications regarding the negative effects of grading variations.

The Problematic Variation in School Grading

Our analysis attempted to take separately into account variation in grading standards

in both schools and courses. It is important to recall that here, as in most research on this

topic, variation in grading standards means average grading level in relation to average test

score level. As shorthand, we typically refer to variation in grading standards as grading

variations.

As expected, grading variation among schools was a major factor in diminishing the

observed relationship between grades and test scores. The result is consistent with extensive

previous research. This means, as a corollary, that accurate interpretation of grades is

problematic and may often be an important source of unfairness when school grades are used

for high-stakes decisions. An unexpected finding was the rather unpredictable pattern of

school grading variations. It is often assumed that students from families at a higher

socioeconomic level tend to come from schools that grade strictly, and that students with a


disadvantaged or minority background are likely to attend schools with lenient grading.

Tables 7 and 12 suggest no such relationship. School grading standards were mostly

unrelated to any of the personal or background characteristics of students that we examined.

Several studies have demonstrated substantial negative effects of course-grading

variations on predictive accuracy (Elliott & Strenta, 1988; Ramist et al., 1994; Young, 1990).

The results of our analysis of course-grading variations were both unexpected and instructive.

Unlike the findings of previous work on college grade prediction, grading variations among

high school courses had little such effect in this analysis. The different result is evidently due

to our analysis being based on course grade data that were pooled across schools, while earlier

studies have analyzed course-grading variations within each school. Limitations in school

sample sizes in the NELS database prevented our measuring course-grading variations within

individual schools. ANOVA results indicated substantial differences in course-grading

patterns from school to school. Because of extreme data fragmentation within schools, our

analysis could not capture most of the course-grading differences.

Earlier investigators had suggested a consistent pattern in course-grading strictness

from college to college; e.g., more strict in the natural sciences, less strict in the social

sciences (Elliott & Strenta, 1988). On that basis, we hoped that within-school data pooled

across schools would capture most of the course-grading variations. Why course-grading

patterns are evidently more variable among secondary schools than among colleges is unclear.

Teachers in different schools may have different grading cultures; that is, different values,

theories, and habits as to the fair and proper way to grade honors courses, remedial courses,

service courses, and so on. That may be less true of colleges. College instructors tend to

congregate in different discipline areas where students of somewhat different average ability


tend to concentrate their course-taking. These conditions may well promote similar course-

grading patterns from college to college.

In any event, our inability to represent much of the course-grading variation in this

database undoubtedly degraded the quality of any course-grading index so derived.

Correlational patterns provide clues as to the effect of this shortcoming in the data. SGF, our

index for strictness of school grading, correlated –.29 with HSA and +.09 with the NELS Test

composite. Thus, being subjected to strict school grading is somewhat associated with lower

grades but not with lower test scores. In the analysis of Ramist et al. (1994), where data were

available for most students in each college rather than a small sample, the index Z of course-

grading strictness showed a similar pattern: a correlation of –.22 with college grade average

and +.18 with SAT score.

In our data, the corresponding pattern of correlations with the course-grading index

MCGF was markedly different from that of Z. In the present analysis, MCGF showed a

substantial positive correlation with both HSA (+.31) and NELS test composite (+.46). With

this pattern of correlations, MCGF adds nothing beyond NELS-C in predicting HSA.

Table 7 shows how different SGF and MCGF actually are. SGF was largely unrelated to any

student characteristic or teacher rating. On the other hand, MCGF was related to a number of

student variables—most strongly with the teacher’s rating of educational motivation, taking

advanced electives, and parents’ educational aspirations for the student.17

MCGF identifies students who take courses that tend to be strictly graded throughout

the country, though as we know, there is much variation from school to school. Judging from

its pattern of correlations with HSA, NELS-C, and other student variables, the MCGF index is

apparently a weak representation of educational motivation. If MCGF reflects motivation to


some extent, why does it add nothing to the NELS Test in predicting HSA? Presumably, this

is because its correlation with HSA is attenuated by the second component of MCGF. The

grades of students with a high MCGF suffer somewhat from strict course grading.

All in all, the irretrievable school variations in course-grading patterns resulted in a

poor MCGF for our purposes. It is neither a good measure of motivation nor a good measure

of course-grading variations. Notice, however, that MCGF might perform quite differently if

it were being used to predict future grades rather than current grades. In that case, both

components of MCGF might be helpful in forecasting academic performance; i.e., stronger

motivation and an undervalued HSA.
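
The suppression logic just described can be checked with a small simulation. The sketch below is purely illustrative: the variable construction is our assumption, not the NELS data, and the index is named mcgf_like to flag it as hypothetical. An index that blends motivation with strict course grading correlates positively with both grades and test scores, yet adds essentially nothing to the test in predicting grades.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    ability = rng.normal(size=n)
    motivation = rng.normal(size=n)
    strictness = rng.normal(size=n)   # strict grading in the courses chosen

    test = ability + 0.5 * motivation + rng.normal(scale=0.8, size=n)
    hsa = (ability + 0.5 * motivation - 0.5 * strictness
           + rng.normal(scale=0.8, size=n))
    mcgf_like = motivation + 0.5 * strictness   # hypothetical MCGF-like index

    def rsq(y, *preds):
        # R-squared of an ordinary least-squares fit with an intercept.
        X = np.column_stack([np.ones(len(y)), *preds])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1.0 - (y - X @ beta).var() / y.var()

    print(round(rsq(hsa, test), 3))             # test alone
    print(round(rsq(hsa, test, mcgf_like), 3))  # nearly identical R-squared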

A problem in using coursework information in high-stakes admissions decisions is

signaled by the large variation we found in course-grading patterns from one school to

another. That result needs further confirmation, but it is apparently a mistake to assume that

one can rely on conventional wisdom in evaluating a student’s course grades in high school.

Our data indicate no sound basis for knowing whether a course with a particular title will be

strictly or leniently graded in a given secondary school. This ambiguity makes interpretation

of course grades problematic.

The same caution apparently applies to fair interpretation of high school grade

average. Even if a school is known to grade strictly or leniently, one cannot safely say that

the grade average of a particular student is artificially high or low. The student’s particular

selection of courses may not reflect the average grading strictness of the school. Overall, it is

probably hazardous to make any assumption regarding school grading strictness or the worth

of a specific course grade without analysis of the particular situation. College admissions

officers frequently face such questions, but seldom have sufficient data to examine them.


Further research on these issues is badly needed. State testing programs with course-grade

data on all students may provide the opportunity.

Scholastic Engagement as an Organizing Principle

The composite of several student characteristics, which we have termed Scholastic

Engagement, represents one of the major factors in accounting for observed differences in

grades and test scores. As an added benefit of this analysis, the Engagement composite was

found to represent a quite sensible pattern of student behavior. The measure may therefore

prove to be more broadly useful. It is worthwhile to consider further the nature of Scholastic

Engagement and how it might provide a useful organizing principle in understanding student

achievement.

Several ideas are prominent in the notion of Scholastic Engagement. Foremost is the

idea that learning is not a purely intellectual activity. Achievement is much dependent upon

personal motivation and positive attitudes about school and the value of learning. Certain

behaviors apparently play a critical role in translating those attitudes into relevant academic

accomplishment. Finally, the connection between engagement and achievement suggests that

influencing such behaviors positively is likely to be a worthwhile institutional goal.

That learning does not depend upon cognition alone is hardly a new idea. A century

ago, Dewey (1900) articulated an influential philosophy of education that rejected the image

of the learner as an empty cognitive vessel to be filled through instruction. For Dewey,

effective schooling depended critically upon student involvement; that is, experiencing

directly the substance and process of learning. Fifty years later, work on the psychology of

behavior enlisted individual motivation as another aid in understanding school learning.


By mid-century, it was commonly accepted that the challenge in explaining variations

in academic achievement was to understand the influence of the non-intellective factors

(Fishman, 1958). At that time, theory regarding individual need for achievement was much

influenced by two interpretations of motivation: fear-of-failure and hope-for-success

(McClelland, Atkinson, Clark, & Lowell, 1953). Subsequent work revealed that motivation

has many dimensions and is often highly dependent upon the circumstances of particular

situations. A much broadened domain of research and theory has incorporated such

additional notions as peer status, school culture, competitiveness, intrinsic motivation, and the

value orientations that such distinctions carry (Jackson, Ahmed, & Heapy, 1976; Pintrich &

Schunk, 1996; Weiner, 1992).

In recent years a further effort to understand individual differences in school

achievement has focused on the role of non-cognitive factors in the ongoing learning process.

Snow (1989), in particular, articulated the conative aspects of learning—notably such factors

as interest, volition, and self-regulation, which lie at the intersection of the affective and

cognitive domains of behavior. Snow and Jackson (1993, 1997) have provided helpful

reviews of research and theory concerning the assessment of conative constructs in learning.

Clearly, concern with the effects of students’ involvement or engagement with their learning

and schooling has a very long history. As the data and analysis here show, grade achievement

is more sensitive than is test achievement to individual differences in motivation,

involvement, and effort of high school students. Scholastic Engagement represents our

attempt to capture that pattern of behavior with respect to school generally.

Several researchers have used the term “engagement,” in somewhat different ways, to

characterize students’ attitudes and behavior in school. Finn’s (1993) conception of student


engagement was based on two components: “participation” (primarily behavior) and

“identification” (primarily attitude and value). Disengagement from school has been

characterized as poor attendance, disruptive behavior, and giving up (Kleese & D’Onofrio,

1994). Engagement has also been defined on the basis of student self-ratings of effort,

concentration, and attention to academic work (Lamborn et al., 1991; Newmann, 1992). For

these researchers, engagement was one manifestation of involvement in school, the others

being misconduct, doing homework, and academic expectations—all of which were seen to

be affected by family, peers, extracurricular activities, and part-time work.

In the broader context of personal development, a number of writers have advanced

the idea that student involvement is essential to an effective college education. Books by

Astin (1985) and Chickering (1981) illustrate that work. Student development in secondary

education is a somewhat separate literature, well summarized by Eccles, Wigfield, and

Schiefele (1998).

In the course of our analysis, Scholastic Engagement was defined on the basis of

several considerations, both inductive and deductive. The analysis started with 26 student

characteristics that one or more previous studies had shown to be related to achievement in

school. Among those characteristics, the 16 that involved overt behavior got early attention in

our work, because it was mostly those measures that made independent contributions in

predicting grade performance (see Table 8). These 16 measures were sorted into three types

of behavior18 that had some logical relationship to school achievement. For working

purposes, it seems reasonable to view the three types of behavior as three components of

Scholastic Engagement. Finally, for our purposes, an overall Engagement composite was

based on the nine specific behaviors that had the largest effect on differential grade


performance (HSA, with test scores controlled). Each of the three components was

represented by at least two behaviors. In this schema:

• The engaged student employs appropriate School Skills. Engagement means coming to

school regularly, participating in class, refraining from misbehavior, and doing the work

assigned. A number of studies reviewed earlier showed one or more of these behaviors to

be related to school achievement. A particularly interesting result in the present analysis

is the contrast between time on homework and homework completed. Cooper et al.

(1998) pointed out that studies on the effects of homework have been based almost

routinely on reports of the number of hours that a student spends doing homework. From

their analysis of a limited sample, those authors suggested that homework completed

might be the more significant variable. Our results for student self-reports on Homework

Hours and Work Completed strongly support that proposition. Furthermore, the teachers’

rating of Work Completed proved to have the highest partial correlation with HSA (test score controlled) and the highest beta weight in predicting HSA among all of the 31 student variables in the present analysis (a sketch of the partial-correlation computation follows this list). Notice that teachers see the outcome of the student’s homework efforts, not how much time the student has devoted to it.

• The engaged student takes Initiative in school. Engagement means taking a full and

demanding program of coursework and participating in other scholastic activities. The

student’s track record of course enrollment was a strong indicator of differential grade

performance. A robust transcript appears to be the most telling aspect of a student’s

academic paper trail. A recent analysis by Adelman (1999) provides a form of

corroborative evidence. His findings indicate that the quantity and quality of high school

coursework are also a strong indicator of the likelihood of a student graduating from


college. Like earlier studies, our analysis suggests that participation in school activities

also reflects scholastic initiative. There is a long history of research and interest in the

contribution of sports to the personal development and school life of young people

(Steinberg et al., 1988). Results here support Hanks and Eckland’s (1976) early data and

well-argued contention that only school activities relevant to academic work are likely to

have a bearing on grade performance. The same principle apparently applies to

community activities.

• The engaged student avoids unnecessary Competing Activities. Engagement means

abstaining, where possible, from pursuits that take undue time and commitment away

from schoolwork. All six of the competing activities included here showed some negative

effect, especially killing time and involvement with drugs and gangs. Some earlier

research has associated differential grade performance with movie-goers (Astin, 1971) and

“druggies” (Lamborn et al., 1992). Research has typically shown the effects of after-

school employment to be marginal (see Marsh, 1991, and Steinberg et al., 1988), as do the

findings here. Leisure Reading is an interesting item on the list of activities that may

compete with schoolwork. It remains to be seen whether the significant negative beta

weight of this variable in predicting HSA does actually represent a behavioral retreat from

homework, or simply means that avid readers develop higher test scores over time. None

of the effects for competing activities is large, but this may be partly due to the behaviors

not being very accurately reported.
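
As referenced in the first item above, the working statistic for differential grade performance throughout this discussion is a partial correlation with HSA, controlling for test score. A minimal sketch of that computation follows, using the textbook residual-based definition and hypothetical variable names; the report's own routine may differ in detail.

    import numpy as np

    def partial_corr(x, y, control):
        # Correlation between x and y after the linear effect of `control`
        # has been removed from both.
        x, y, c = (np.asarray(v, dtype=float) for v in (x, y, control))
        def residual(v):
            slope, intercept = np.polyfit(c, v, 1)
            return v - (intercept + slope * c)
        return float(np.corrcoef(residual(x), residual(y))[0, 1])

    # e.g., partial_corr(work_completed_rating, hsa, nels_test) would estimate
    # the rating's relation to grades with test score held constant.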

The heuristic attraction and potential benefit of this Engagement proposal lie in the

overall configuration of the findings, not in the results for particular measures. The three

components describe a recognizable pattern of behavior: employ appropriate school skills,


take initiative in school, and avoid competing activities. Together, they provide a logical

framework that extends what we know about successful student performance and makes clear the primacy of scholastic behavior. The framework appears, in the most

important respects, to be consistent with previous research and to accommodate the most

salient variables. Finally, the Engagement composite incorporates in one measure essentially

all of the variance in the 26 student characteristics that was useful in accounting for students

doing better or poorer than their test performance would suggest.

As a single variable, an Engagement composite has the potential for conceptual

complexity as well as analytical parsimony. Its behavioral focus is a distinct benefit. To be

sure, the behavior of students is constrained by background and conditioned by peer culture

and personal attitudes. Nevertheless, it is reasonable to assume, and the data do suggest, that

it is behavior that most directly affects achievement. Furthermore, behavior is more readily

described, observed, and assessed. Behavior can, in principle, be modified to the benefit of

teachers and learners. All this is to suggest that the idea of behavioral engagement in school

can be a useful tool in studying and improving the educational process.

If students do well because they are engaged, is it not just as likely that they are

engaged because they do well? That is certainly true as has been argued with respect to the

reciprocal effects of self-concept and achievement (Marsh & Yeung, 1997). The

methodological concern is the possibility of spurious causal effects due to direct dependence

of grade predictors on grade performance. As previously discussed, we took pains to avoid

such confounding, apparently with some success (see p. 89). The reciprocity of behavior and

performance is not a technical problem pertinent only to the analysis of concurrent data. To

be sure, a student’s attitude about school will change with time and circumstance, but a strong


or weak commitment to school will likely be reflected in longitudinal as well as in concurrent

analyses.

Is the engagement argument circular? To a degree, yes. But, from an educational

perspective, the challenge is to influence student behavior in ways that create positive self-

fulfilling achievement prophecies. Much of the national effort that goes into encouraging

excellence and selecting good students for demanding educational programs is based on that

assumption. When interest and commitment lead to achievement, it is a demonstration of the

affective side of learning. Students do better on those topics and lines of endeavor that match

their prior experience, interests, and value orientation (Dwyer & Johnson, 1997; Stricker &

Emmerich, 1999). Some students like school more than others. Those who work in school

learn schoolwork. There is a critical assumption: students who are engaged in school not only learn their lessons; they are also more likely to develop the habits of mind and broadly

applicable cognitive skills that serve the unpredictable demands of adult life.

Recall that Engagement is substantially related to both test performance and grade

performance, but especially to the latter (see Tables 7 & 16). Reason asserts and the data

confirm: Students who are more motivated toward academic pursuits are likely to perform

relatively better on the specific knowledge and skills that grades are intended to recognize.

Thus, a stronger motivational component in grade performance is an important difference

between grades and test scores.

A special merit of Scholastic Engagement, as defined, is the focus on behavior. The

data suggest that if we want to know who is motivated or what motivates them, we should look at students’

behavior. From a practical standpoint, what does that mean? In designing a research project

or considering alternate educational practices, evidence in the students’ track record may


prove useful. High school extracurricular precursors of success in college are a good

example (Willingham, 1985). Clearly, teacher ratings are also a potentially quite useful

source of information.

The Teacher Ratings included in our analysis focused largely on some of the same

behaviors as represented in the Engagement composite—School Skills and Initiative. The

Teacher Ratings were actually more strongly related to achievement and better indicators of

differential grade performance than was student-supplied information. The strength of the

results from the Teacher Ratings offers further validation of the Engagement composite,

especially considering that the ratings came from the middle of the sophomore year. The

Teacher Ratings may provide a somewhat more objective and valid view of the students’

behavior19 than do the Student Characteristics (Variables 1-26) that are mostly based on

student self-reports. On the other hand, the Student Characteristics and the Teacher Ratings

made independent contributions in predicting HSA. Ratings by the teachers may show

stronger relationships partly because teachers add and subtract grade points depending upon

their own observations of students. Also, teachers likely base their grades to some degree on

competencies and considerations that are important in their classroom, but not normally

covered by tests.

Different issues arise if evidence of engagement is intended for actual use in high-

stakes decisions such as promotion, graduation, or admission to a selective institution or

demanding program. Teacher ratings, extracurricular activities, or self-ratings can pose

problems because of ethical issues or the obvious pressures and fudging problems that usually

arise when consequential use is made of a measure. The student’s transcript is another matter.


The academic record has long been recognized as a proper basis for high-stakes

decisions. Historically, successful completion of prescribed courses has been a major factor

in deciding who earns a diploma or is admissible to a selective advanced program of study.

The student’s transcript is public, highly sanctioned, and not readily manipulated. Our data

indicate, as does Adelman’s (1999) recent analysis, that a strong track record of academic

coursework is a robust indicator of academic success.

Group Performance: Similar Dynamics, Different Levels

Much of the attention in this analysis was directed to understanding individual

differences in grade performance and test performance. Parallel questions regarding group

differences on these high-stakes measures take on special significance because group

differences are particularly associated with notions of fairness in assessment. Fairness in

high-stakes tests has often been cast as differential validity and differential prediction (Linn,

1982a). In the present context, it is useful to think of differential validity as pertaining, more

generally, to whether the dynamics of achievement are comparable across groups.

Comparable achievement dynamics imply that the major variables that influence achievement

function in a similar manner from group to group; that is, they are similarly constituted and similarly interrelated. In contrast, differential prediction pertains to comparability

of performance level across groups.

The data and the analyses offer several advantages for examining the ways in which

groups of students are similar and different on grades and test scores with respect to the

dynamics of school achievement and performance level. We have here a sizable—if not

assuredly representative—national sample of students, four composite measures known to

largely account for grade performance, additional detailed information underlying those


composites, six gender and ethnic groups, and four program groups distributed along a

continuum of academic to vocational emphasis. Furthermore, the analyses incorporate

corrections, where appropriate, for distortions that often cloud comparative group data: range

restriction, unreliability, and grading variations.
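
For reference, the two purely statistical corrections named here have standard closed forms. The sketch below gives the classical disattenuation formula and the Thorndike Case II correction for direct range restriction; it is a minimal illustration, not the report's exact procedure, and the example values in the comment are invented.

    import math

    def disattenuate(r_xy, rel_x, rel_y):
        # Correct an observed correlation for unreliability in both measures.
        return r_xy / math.sqrt(rel_x * rel_y)

    def correct_range_restriction(r, sd_restricted, sd_unrestricted):
        # Thorndike Case II correction for direct restriction of range.
        u = sd_unrestricted / sd_restricted
        return r * u / math.sqrt(1.0 + r * r * (u * u - 1.0))

    # e.g., disattenuate(0.62, 0.92, 0.95) or
    # correct_range_restriction(0.62, 8.0, 10.0)  (all values invented)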

Dynamics of achievement. Do grades, test scores, and related measures function

similarly or dissimilarly as indicators of school achievement from one group of students to

another? Our results indicate that the main factors that represent or influence school

achievement are quite similar across six gender/ethnic subgroups. At different points in the

analysis, several types of evidence point to this conclusion.

• Prediction of HSA on the basis of the four NELS tests yielded similar multiple

correlations for the six gender/ethnic subgroups, all within the range of the low .70s to the

low .80s (variables corrected as appropriate for range restriction, grading variations, and

unreliability). [Table 14]

• Prediction of HSA in a full analysis based on 37 variables yielded quite similar multiple

correlations for the six gender/ethnic subgroups, all within the range of .84 to .89. [Table

19]

• A corresponding condensed analysis based on four composite predictors—NELS Test,

Scholastic Engagement, Teacher Rating, and School Grading—yielded a multiple R

within .02 of that for the full analysis for each of the six subgroups. [Table 17]

• In all six gender/ethnic subgroups, Scholastic Engagement was more highly correlated

with HSA than with the NELS Test. As a result, Engagement was a significant

contributor in accounting for differential grade performance in all six groups. A similar


relationship obtained for both the Teacher Rating and School Grading in each of the six

subgroups. [Table 17]

• In the condensed analyses, the magnitude of the standard regression weights for the four

variables followed an almost identical rank order for each gender/ethnic subgroup: NELS

Test, Teacher Rating, School Grading, and Scholastic Engagement. NELS Test had by far

the highest weight—in the tight range of .49 to .53—for each of the six groups.

Regression weights for the other three composite variables were similarly consistent.

[Table 17]

• Each of the student-based composite measures was constituted in a consistent manner

from one gender/ethnic subgroup to another:

• In each subgroup, the Mathematics test had the highest correlation with HSA among

the four NELS Tests. [Table 14]

• In each subgroup, the same two ratings—Does Homework and Educational

Motivation—had the highest and second highest weight, respectively, in predicting

HSA among the five constituents of the composite Teacher Rating. [Table 19]

• In each subgroup, with minor exceptions, the same three behaviors—taking advanced

electives, completing work assignments, and coming to school—had the largest partial

correlations with HSA among the nine student characteristics constituting Scholastic

Engagement. [Table 9]

• In each subgroup, the internal consistency of the high school average was uniformly

high—ranging from .94 to .97 for HSA and .96 to .97 for the full-transcript HSA-T.

[Table 6]
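
The internal-consistency figures in the last item are presumably of the coefficient-alpha type; the report's exact method may differ. A minimal sketch over a students-by-courses grade matrix, with a hypothetical data layout:

    import numpy as np

    def cronbach_alpha(grades):
        # `grades`: 2-D array, rows = students, columns = course grades.
        # Classical coefficient alpha for the internal consistency of the
        # average formed from those columns.
        grades = np.asarray(grades, dtype=float)
        k = grades.shape[1]
        item_var = grades.var(axis=0, ddof=1).sum()
        total_var = grades.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1.0 - item_var / total_var)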


These results indicate that the dynamics of achievement are strikingly similar across

gender and ethnic groups. Few differences are apparent from group to group in the behaviors

that contribute to school achievement or in the abilities that come into play. The major

variables that here account for differences in grades and test scores are similarly constituted

and function in a very similar manner across groups.
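
A sketch of the subgroup comparison underlying these statements: standardize the criterion and the four composites within a subgroup, fit the regression, and compare the multiple R and the rank order of the beta weights across subgroups. The routine is our illustration, not the report's code, and the array names are hypothetical.

    import numpy as np

    def standardized_fit(y, X):
        # Multiple R and standardized (beta) weights of y on the columns of X.
        z = lambda a: (a - a.mean(axis=0)) / a.std(axis=0)
        zy = z(np.asarray(y, dtype=float))
        zX = z(np.asarray(X, dtype=float))
        beta, *_ = np.linalg.lstsq(zX, zy, rcond=None)
        return float(np.corrcoef(zX @ beta, zy)[0, 1]), beta

    # For each subgroup g, with columns ordered as [NELS Test, Engagement,
    # Teacher Rating, School Grading]:
    #     r_g, betas_g = standardized_fit(hsa[g], composites[g])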

Parallel analyses based on students in the four school programs provide an additional

perspective on the dynamics of school achievement. In large measure, the consistent results

just described for the six gender/ethnic groups hold for the four program groups as well.

There were, however, some significant variations associated with school programs. NELS

classified students into programs on the basis of coursework. Most were placed in either the

Rigorous Academic (25%) or the Academic (61%) program. Differences in the dynamics of

school achievement mainly pertained to the two relatively small groups of students who took

either some vocational coursework or a strictly vocational program. These were the principal

differences observed:

• The correlations between the NELS Test and HSA were .80 and .79 in the two

academic programs but declined to .73 and .56 in the more vocationally oriented

programs—even though HSA was based only on courses classified as academic.

[Table 14]

• The multiple correlation between the four major composite variables and HSA was .88

and .87 in the two academic programs but declined to .83 and .73 in the more

vocationally oriented programs. In that analysis, the role of two predictors changed

quite noticeably from academic to vocational programs. The standard regression


weight for the NELS Test decreased progressively from .57 to .35, but the weight for

Teacher Rating increased from .31 to .44. [Table 18]

• The reliability of HSA declined from .97 in the Rigorous Academic program to .89 in

the Vocational program (based on academic coursework alone and corrected for

restriction in range). [Table 6]

While the dynamics of school achievement were quite similar across gender and

ethnic groups, it is clear that there were some differences in the way that the major variables

influence school achievement in academic versus vocational programs. In this respect, the

findings here appear to echo Snow’s observation that there is much evidence for interaction

between ability (here, factors influencing achievement) and treatments (here, programs), but

little evidence of interaction between ability and group membership (Snow, 1998, p. 100).

What is the implication of similar dynamics in school achievement across gender and

ethnic groups? Consistency in the way that the variables are related from group to group

indicates little differential validity—an important mark of fair assessment. Whereas groups

obviously differ in interests, background, and culture, the findings here give no indication that

such differences impart a different meaning to grades, test scores, and the other major

variables that influence school achievement. The level at which students achieve in school

may be another matter.

Achievement level. As just described, achievement dynamics largely reflected group

similarities. Achievement level presents a more complex picture of group differences as well

as similarities. The performance of students in different school programs provides a useful

frame of reference for considering gender and ethnic group differences. We focus here on the

four major student-based variables: High School Average, NELS Test, Engagement, and


Teacher Rating. As was illustrated in Figure 5, mean scores on these four variables presented

a very consistent picture. Students in each school program tended to score at much the same

level on each variable—the Rigorous Academic group scoring moderately high on all four,

the less academically oriented groups scoring progressively and consistently lower on all four

measures.

The High School Average and the NELS Test show a pattern of mean differences

among the four ethnic groups much like that found among the four school programs—

approximately one standard deviation spread from the highest to the lowest scoring group on

each of these measures. These ethnic group mean differences are also reflected with some

consistency in the four grade averages that make up HSA (Table A-4) and the four subtests

that constitute the NELS Test (Table 14). The gender mean differences on grades, and

especially tests, were considerably smaller.
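
Group spreads "on the standard scale," as used in this discussion, are simply group means expressed in standard-deviation units of the full sample. A sketch, with hypothetical names:

    import numpy as np

    def standardized_group_means(x, groups):
        # Each group's mean, expressed in standard-deviation units of x.
        x = np.asarray(x, dtype=float)
        groups = np.asarray(groups)
        return {g: float((x[groups == g].mean() - x.mean()) / x.std())
                for g in np.unique(groups)}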

The question of greatest interest to the present study is, of course, how grade

performance and test performance compare, group by group. Relative to the size of the

differences in academic achievement from group to group, the mean difference between these

two measures was typically quite small. Among the 10 groups that we studied, in only one

group (the Academic-Vocational program) was the mean difference in HSA and NELS Test

as large as one-fifth of a standard deviation. The analysis of differential prediction provides a

somewhat different view of group differences, but does suggest that even these small mean

differences on the standard scale can be largely accounted for. When school grading, student

engagement, and teacher ratings were controlled, the mean difference between actual grade

average and grade average expected on the basis of test scores was about .02 grade points

(Figure 4).
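
The differential-prediction computation behind that figure can be sketched as follows, with hypothetical names (our illustration): fit one regression pooled over all students, then average the residuals within each group. With test scores alone as predictors, the group means of the residuals show raw over- or underprediction; adding the grading, engagement, and teacher-rating composites is what shrinks them toward the .02 figure cited.

    import numpy as np

    def differential_prediction(y, X, groups):
        # Mean (actual - predicted) criterion score within each group, from
        # a single regression fitted to the pooled sample.
        y = np.asarray(y, dtype=float)
        groups = np.asarray(groups)
        X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return {g: float(resid[groups == g].mean()) for g in np.unique(groups)}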


Most of the gender and ethnic groups tended to score on Scholastic Engagement and

the Teacher Rating Composite at a mean level generally commensurate with their mean grade

performance. For reasons that are unclear, that finding was less true of the African American

and Hispanic students. As might be expected, a number of student groups showed different

mean performance levels on the various components of Scholastic Engagement and Teacher

Rating. Also there were instances where these two composite measures indicated different

mean performance levels for apparently the same behavior (i.e., discrepant student and

teacher reports). The nature of these various measures and their relationship to school

achievement seems a promising ground for further research.

These data appear to be at some variance with college data regarding the mean

performance level of some groups of students on grades versus test scores. The results of

differential prediction in these data are generally similar to those of corresponding college

groups (Bridgeman et al., 2000; Ramist et al., 1994). On the other hand, recent college data

suggest that, on average, African American students tend to perform less well on grades than

would be expected from test scores. Results for Hispanic students are not clear (see Bowen & Bok, 1998; Ramist et al., 1994; and the discussion on pp. 90-94 and Note 12). The major

variables identified here—engagement, teacher ratings, grading variations—may or may not

be involved in these somewhat different patterns observed in school and college data.

Nonetheless, these factors appear sufficiently promising to warrant careful study at the college

level.


The Merits of Grades and Tests

The somewhat uncertain relationship between grades and test scores was cast as the

presenting issue in this study. Both measures are used in high-stakes educational decisions

and, in some respects, each measure is dependent upon the other for corroborative evidence as

to its quality. Yet it is often not clear why observed performance on grades and tests differs

for individual students, and sometimes groups of students. Results of the present analysis

indicate several reasons for such differences. What, then, do these findings suggest regarding

the merits of grades and tests for high-stakes decisions? That question is usefully considered

from two perspectives: how we evaluate validity and fairness, and the differential strengths of

grades and tests in high-stakes decisions.

In discussing these topics, we do so on the basis of the findings here described.

Examining possible sources of score differences has often indicated differences in the nature

of grades and test scores and suggested broader implications regarding their use in high-stakes

decisions. But it is important to recall again the limited scope of this analysis. Needless to

say, the overall value of grades and tests in a given situation will depend on many additional

factors, especially their content and how they are used.

Validity and Fairness

Traditional statistical indices of validity and fairness are especially appropriate to

high-stakes measures, because rank order is a critical consideration in such decisions. To

account for differences in rank order between the measure and its surrogate or criterion is to

provide evidence relevant to validity and fairness. Given a grade average and a NELS Test

based on a comparable representation of the high school curriculum, the application of several


corrective factors resulted in a quite high correlation between the two measures and a quite

low level of differential prediction for subgroups. Thus, taking into account reasonable

sources of difference, we found grades and test scores to be strongly related. Furthermore, the

several corrective factors appeared to work in largely similar fashion from group to group.

The results thus provide, in principle, evidence of the validity and fairness of grades

and test scores based on generally similar subject material, assuming that extraneous sources

of error like grading variations are taken into account. As here illustrated, the two measures

can be mutually validating because they are differently derived indicators of student

performance. Grades represent the teacher’s summative judgment of performance based on

evidence collected in class. An external test represents a performance sample of knowledge

and skills devised by external subject matter specialists.

Only moderate correlations between grades and test scores and some degree of

differential prediction for subgroups are routine results in real-life assessment. In the public’s

perception, poor test performance in relation to grades earned often means that the student

does not “test well.” In some cases that is undoubtedly true, and groups can certainly tend to

do less well on a particular type of test. For example, other things being equal, men tend to

do less well on a test that calls for writing, and women tend to do less well on a test that calls

for spatial visualization, though such relationships are complex and not always predictable

(Bridgeman & Lewis, 1994; Willingham & Cole, 1997, p. 244f). But simply ascribing

observed discrepancies between grades and test scores to “testing well” would be misleading

at best. In these data the various subgroups typically earned mean grades generally similar to

their mean test scores. In the analysis of differential prediction, non-test factors largely


accounted for the small differences that were observed between these two measures for a

given group.

Anything less than a strong correspondence between test results and grade results is

usually taken to be evidence of invalidity and unfairness in the test scores—seldom the

grades. That interpretation is not surprising given the formal connection of these statistical

indices to professional standards of test validity and fairness (American Psychological

Association et al., 1999) and the emphasis that such empirical markers receive in test

interpretation materials and in the habits of researchers. On the other hand, the interpretation

seems oddly inconsistent with the results reported here. Given a grade average and a test

score based on generally similar subject matter, discrepancies between the two appear to have

less to do with mysterious sources of invalidity or defects in the test than with errors in the

grades and incomplete information about the students and their approach to schooling.

Do the statistical findings suggest that, except for a few known differences, grades and

tests based on a corresponding domain are likely to be comparably valid and fair? In a limited

sense that is true, because accepted statistical indicators of validity and fairness look very

robust when we take into account factors that should logically cause the two measures to

differ in practice. But this view of the topic leaves quite a lot unsaid—even about these

statistical criteria of validity and fairness.

A broader view. Paradoxically, finding that it is possible to account for much of the

apparent discrepancy between grade performance and test performance should caution against

excessive reliance on the statistical indicators that we seek to explain and improve.

Researchers and policy analysts give close scrutiny to whether a particular test correlates with

grades .50 or .55 and to whether differential prediction for a particular group is .10 or .05 of a


letter-grade. We know from earlier work that statistical indicators of validity and fairness can

be strongly influenced by various technical artifacts as well as social and educational values.

These include, for example, range restriction and other aspects of selective sampling (Lewis

& Willingham, 1995; Linn, 1983), unreliability of predictors and criteria (Humphreys, 1968;

Linn & Werts, 1971), the nature of the criterion that is available or preferred (Elliott &

Strenta, 1988; Ramist et al., 1994; Willingham, 1985), institutional differences and changes

over time (Willingham & Lewis, 1990), whether other variables typically used in decision-

making happen to be included or not included in an analysis (Linn & Werts, 1971), and

finally, what particular ethical values are emphasized in defining validity and fairness (Hunter

& Schmidt, 1976; Messick, 1989; Petersen & Novick, 1976).

The results here further demonstrate the hazards in uncritical interpretation of

statistical indicators of validity and fairness. We see that it is possible to substantially

increase grade-test correlations and decrease differential prediction with corrections for

factors that may have little if any actual connection with the quality of the test being

evaluated. The statistics will vary if grades and tests are based on somewhat different

subjects, if grades deliberately include factors other than knowledge and skill in the subject, if

grading standards vary, and so on. Without corrections for such potential artifacts, these

common statistics will be unduly conservative; that is, they will indicate less validity and less fairness

than is warranted. In practice, the information necessary to make such corrections is hard to

obtain. Furthermore, the statistical indices are insufficient.

With a broader view of the topic, it is also obvious that the substance of the measures

is critical. Grades, as currently assigned in typical schools, may be based on much the same

knowledge and skills as represented in the NELS Test. Assessing different constructs or


using different assessment formats for grades or test scores might affect their validity and

fairness in a variety of ways, statistical results notwithstanding. Ultimately, the validity and

fairness of any measure depend upon the consequences of its use for individuals, groups, and

the public interest. A broader view of the consequences of using a test includes not only the

particular high-stakes decision at issue. Evaluation of validity and fairness must also include

the backward effects on instruction and learning and the forward effects on the eventual social

outcomes of the educational process (Frederiksen, 1984; Haertel, 1999; Messick, 1989;

Resnick & Resnick, 1992; Willingham, 1999).

Validity coefficients and differential prediction provide useful information regarding

validity and fairness, but such evidence is bounded by what we can learn from a correlational

model that has inherent limitations. A low correlation between a grade and a test score may

say little about fairness and validity of the test if the grade criterion is poor. That is well

understood. Similarly, a high correlation does not necessarily inform us about the quality of

either measure except that they are mutually supportive. Depending upon the situation and

the nature of the high-stakes decision, other grades and other tests might better serve

educational objectives. Also, alternate tests might have quite similar correlations with a grade

average, yet have quite different learning implications, social significance, or practical

ramifications (Willingham & Cole, 1997, pp. 234-244).

As we stated at the outset, the type of analysis presented here can tell little about the

relevance or sufficiency of the construct that is normally represented by grades and tests.

Indeed, the analysis helps to illustrate the importance of understanding better the particular

knowledge, skills, and cognitive processes that are and are not included in a given

representation of educational outcomes.


Finally, it is important to distinguish what conclusions the results do and do not

suggest regarding the comparability of grades and test scores. Being able to adjust away a

major part of the observed differences between the two measures does not mean that they are

the same. To the contrary, the several factors that were employed in “correcting” the

relationship between the two measures show that grades and test scores are dissimilar in

important ways. These factors influence grades and test scores differently with respect to

content relevance and fidelity.

What do these differences suggest regarding the use of grades and tests in making

high-stakes decisions? In the main, the findings imply that grades and test scores have

different strengths. The different strengths are likely to have different practical effects in the

actual use of one measure or the other. For that reason, the findings reinforce the importance

of considering all aspects of validity and fairness when deciding what measures to use in high-

stakes situations.

Differential Strengths

In considering what these results may imply regarding differential strengths, we start

with the factors that differentiate the two measures. Of the five factors originally included in

the analysis, three represent characteristic differences between grades and test scores: Factor

2—grading variations, Factor 4—scholastic engagement, and Factor 5—the teacher’s

judgment of the student’s performance in school. Each of these three can be seen as a

component of grades that is not normally represented in tests. Clearly, an analysis of

individual differences—on which we have focused here—is not unrelated to an analysis of

construct differences. Indeed, critical evidence of construct differences lies in systematic

patterns of individual differences.


From that perspective, the additional components found mainly in grades can be

expected to generate different patterns of individual and group differences in grades and test

scores. As the analyses indicate, taking the three components into account brings grades and

test scores more closely into line. The following discussion focuses on possible strengths of

grades and tests that may be implied by those three components. Factors 1 and 3 (subject

match and reliability) are not considered because neither necessarily represents a strength or

advantage that is especially associated with the use of either grades or test scores.20

The practical implications of Factors 2, 4, and 5 can easily vary with the nature of the

high-stakes decision. There are many types of decisions where the stakes are high for

individual students. It is clearly beyond the scope of this report to examine possible

implications of using one measure or the other for specific purposes in particular situations.

Nevertheless, the findings do suggest that some inherent differences between grades and tests

tend to be associated with particular strengths. In evaluating possible strengths, it is useful to

consider typical objectives of high-stakes assessment and the context in which assessment is

carried out. Figure 6 proposes a schema for that purpose.

[Insert Figure 6 about here]

As this figure suggests, each of Factors 2, 4, and 5 is associated with a component of

grades that is especially pertinent to a particular assessment objective. They are, respectively,

that high-stakes assessment should be fair to the student, that assessment should recognize

and foster the development of critical skills, and finally, that assessment should help to

motivate effective teaching and learning. The relationships are not exact. Also, these three

objectives do not apply equally to all types of high-stakes assessment, nor do they exhaust or


fully state the purposes of such assessment. Nevertheless, these important goals provide a

useful context for considering possible implications of the findings. In comparing grades and

test scores, Figure 6 helps to connect substantive issues (the goals and content of education)

with outcome issues (the statistics of individual differences).

It is also necessary to take account of the particular administrative context in which

the two types of assessment are carried out. Grades represent each teacher’s judgment in each

class as to how well the student has fulfilled the implicit local contract between teacher and

student. The contract may be poorly stated, but in broad outline it is likely to be reasonably

clear to both parties. For example, the implicit understanding with a given teacher may be, “If

you master the knowledge and skills pertinent to my course reasonably well, you will

probably get at least a B—maybe higher if you do all the assignments and contribute to the

class, but maybe lower if you forget your homework or disrupt class.”

A test, on the other hand, provides an external standard that is intended to compare

performance across educational units. For that reason, the test is designed to include

knowledge and skills generally representative of relevant coursework that, in detail, will differ

somewhat from unit to unit. Naturally, there is considerable overlap between the local

contract and the external standard, but as Figure 6 indicates, they do have distinguishing

characteristics that tend to be associated with different strengths. In this schema, Factors 2, 4,

and 5 suggest differential strengths of grades and test scores as follows.

Grading variations (Factor 2). One important objective in high-stakes assessment is

to ensure that all students are evaluated on the same scale. In the case of grades, same scale

usually means the “local standard” that is normally applied in a given situation. For some

purposes that may be viewed as sufficient for fair assessment. The strictness of grading in a


particular program or institution can clearly affect the likelihood that a student will pass, but

local standards are institutionalized and may not be perceived as a fairness issue if applied in

an evenhanded manner. Passing standards that vary from one educational unit to another are

not necessarily dysfunctional. Schools or educational programs with students who are

academically weak or unusually talented may not be well served by the same grade scale.

On the other hand, many high-stakes decisions call for an assessment that is

comparable across educational units—either to be fair to all students involved or, for

educational purposes, to base decisions on some objective standard of competence. Grading

variations become an important issue if a system or state wishes to enforce accountability by

imposing comparable standards across schools. Admission to a selective college or graduate

program is, of course, the most familiar and telling instance of the fairness hurdle imposed by

noncomparable grade records. In all such situations, grading variations represent a failure in

the fidelity of grades as a basis for high-stakes decisions. In this regard, the consistent scale

meaning of a “common yardstick” has long been seen as a distinctive strength of a

standardized external test (Cameron, 1989; Dyer & King, 1955).

It is no surprise that grading variations should be a major factor in accounting for

observed differences between grades and test scores. The earlier review of pertinent literature

indicates a long history and ample documentation of fluctuating standards and differences in

grading patterns from one time to another, one situation to another. Conversely, the promise

of having a common yardstick that avoids such problems is a principal reason why tests are

used in the first place. Tests lack the error component that is represented in grading

variations.


To be sure, test score scales are not always dependable. Some K-12 tests have been

known to deliver suspicious scores,21 but tests used in high-stakes decisions involving

individual students are typically normed and scaled with considerable care. In most high-

stakes situations, consistent scale meaning is likely to be a major strength of tests and a

notable weakness of grades. It has been frequently documented that basing high-stakes

selection decisions on grades and test scores together routinely improves predictive validity

(Ramist et al., 1994). But fairness is an equally compelling argument for using the two

measures together, because a test will tend to compensate for a grade average that is either

inflated or too stringent.

Scholastic engagement (Factor 4). Another common objective of high-stakes

assessment is to evaluate skills generally considered to be critical outcomes of education.

When grades and tests are based on the same subject matter, both presumably cover relevant

knowledge and skills, but there are important distinctions in the content of grades and tests.

The findings here indicate that the grade average reflects in part the degree to which a student

is effectively engaged in school, while the test focuses on academic content. In Figure 6 this

difference is contrasted as conative versus cognitive skills—an important distinction in the

construct relevance of grades and test scores as high stakes measures.

Some students are more engaged than others, and they work harder. The effort pays

off with higher grades and higher test scores, but the payoff is more direct and surer in the

case of grades, which include the conative component. Furthermore, students often receive

grade credit quite directly for taking school seriously and doing the work assigned. This

undeniable relevance and responsiveness of grades to student effort and learning are clearly

strengths of grades as a fair, though not necessarily sufficient, basis for many high-stakes


decisions. Educators recognize that conative skills like volition, habits of inquiry, effort, and

self-regulation are, in themselves, important goals of schooling in a free and effective

society—a consideration that adds to the validity of grades as a graduation requirement. As

Bandura (2000) notes, the educational enterprise has multifaceted aims, though currently there

is much sentiment throughout the country to hold students to a test-based, largely cognitive

graduation requirement (Baker & Linn, 1997).

Conative and cognitive skills serve complementary roles in facilitating achievement in

school as well as adult life (Bloom, 1956; Krathwohl, Bloom, & Masia, 1964; Snow, 1989;

Sternberg, 1985). Colleges have long searched for evidence of student motivation (a close

relative of Factor 4) in the hope of enhancing predictive validity in high-stakes selection

decisions. Interest in developing high-stakes tests of conative skills has raised numerous

possibilities but many hurdles (Fishman, 1958; Messick, 1967; Snow & Jackson, 1993;

Willingham & Breland, 1982). Our findings suggest that a likely reason that the search

typically bears only limited predictive fruit is that scholastic engagement is represented to

some degree in the student’s previous grade record—already used as a predictor in many

high-stakes situations.

The dependence of grades on student engagement is, of course, not always a strength.

Whether a student is engaged with school is not independent of influences beyond the

student’s control. As the data indicate, engagement can be influenced negatively by family

circumstances, dysfunctional peer associations, and so on. And for a student who has

experienced a disastrous period of indifference or disconnect in high school, grades can be an

unforgiving indicator of academic incompetence. A test can be a safety net for such students

because the more general skills typically emphasized on the test are more dependent on a


lifetime of learning, both in and out of school. On the whole, however, the conative

component is a distinctive strength of grades. This feature of grades is especially valuable

because of the continuing difficulties likely to be encountered in developing tests of such

characteristics as scholastic engagement that would be valid and acceptable in high-stakes

situations.

For similar reasons, the focus on cognitive skills is a strength of tests. Research in

recent years has added to our understanding of a broader range of cognitive skills that are

known to have value in academic work and adult life (Shepard, 1992a). Several

circumstances now make it possible to focus the content of a test more sharply on important cognitive

skills. Briefly, these advances include the emphasis on educational standards, the developing

technology of test design, the standardization procedures that yield comparable measures for

all students, and the research potential for determining which designs work best. In recent

years considerable effort has gone into improved design and delivery in the assessment of

more complex cognitive and performing skills (Bennett & Ward, 1993; Frederiksen et al.,

1990; Linn, 2000; Mislevy, Steinberg, Breyer, Almond, & Johnson, 1999; Tatsuoka &

Tatsuoka, 1992).

In comparison, harnessing the judgment of innumerable teachers to the goal of

correctly recognizing and consistently assessing specific cognitive skills is a daunting task.

There is some evidence to support the conventional wisdom that the grades and classroom

tests of teachers tend to emphasize facts, terms, and rules (Fleming & Chambers, 1983;

Terwilliger, 1989). Recently, however, serious efforts have been undertaken to improve

classroom assessment in order to better serve both instruction and accountability22 (Shepard,

2000; Snow & Mandinach, 1999).


Teacher judgment (Factor 5). That teacher ratings are more closely related to grades

than to test scores suggests a connection to a time-honored purpose of outcome assessment;

namely, to motivate academic achievement by providing feedback. “Motivate” implies the

need to shape learning as well as encourage effort. Grades and tests perform this function

differently. Grades are based more directly on what is going on in the individual classroom.

Through the teacher’s assessment, grades provide an immediate reward or punishment to the

student for good or poor work on the specific material assigned.

External tests are not usually tied to specific courses. Tests are more concerned with

how students are doing at the end of the year on the knowledge and skills that are common to

and most relevant to the curriculum—in the system, the state, or the nation, depending on the

test. Thus, the inclusion of the teacher’s judgment in grading represents another distinction in

the content relevance of grades and test scores. Grades more clearly reflect performance on

the wide range of material that students are actually studying in a given class, day by day. In

discussing this content distinction, Shepard (2000) refers to the formative and summative

roles of classroom assessment and external tests, respectively.

It is no surprise that teacher ratings are more highly correlated with grades than with

test scores. Evidently, teachers are about evenly split on whether it is acceptable to pass

students simply because they have tried hard (Public Agenda, 2000). Parents and teachers

certainly look to grades to motivate students, ideally on a month-to-month basis. On the

other hand, administrators and politicians look to annual test results to motivate teachers and

schools. In theory, grades motivate students because they reflect the student’s behavior, and

tests motivate the educational process because they encourage accountability and provide a

more effective means of designating what knowledge and skills are most important.


Developing good work habits and attaining long-term curriculum objectives are surely

complementary and worthy goals. How effectively either grades or tests actually motivate or

can be designed to motivate is, of course, a matter of regular study and debate (Covington,

2000; Linn, 2000; Shepard et al., 1995). In practice, do these different characteristics of

grades and tests constitute differential strengths in making valid and fair high-stakes

decisions?

That the teacher’s judgment of students’ performance in school has special weight in

accounting for grades is very likely a net gain for the validity and fairness of grades in many

high-stakes decisions. The teacher is in the best position to know who is working hard and

achieving learning objectives in school—in all ways that students do achieve. Teachers are in

a position to recognize many different forms of achievement on many occasions. For that

reason, grades inevitably cover a broader range of skills than do tests. In time, advances in

technology may narrow the gap between teacher judgment and standardized tests. Better

models of performance along with computer simulation and natural language processing of

student responses hold promise for assessment of much more complex skills than is possible

with current tests (Braun, Bennett, Frye, & Soloway, 1990; Bennett, 1999; Gitomer, 1993).

To be sure, the breadth that teacher judgment adds to assessment likely comes

at the expense of consistent meaning as to what the grade represents. On the other hand, tests

can be off the mark altogether. Heubert and Hauser (1999) describe the public policy

dilemmas and the legal issues that arise if an external test does not adequately match the

school’s curriculum or is otherwise considered unfair to its students. Thus the more direct

instructional relevance of grades and the comparability of test scores are key complementary

strengths of the two measures.


Additional evidence of validity and fairness in grades lies in the observation that

teacher ratings reduced differential prediction. This analysis of high school data may actually

sell short the importance of the teacher’s judgment as a component of grades. In elementary

school and in graduate school, we normally assume that the teacher has the best feel for how

the student is doing. Traditionally, teacher recommendations have been a critical element in

selective admissions and probably a useful predictor as well (see Willingham, 1985, Table

5.2). Factor 5 appears to work much like Factor 4 in arguing the strengths of grades and tests

for high-stakes decisions.

In principle, engagement in school and the teacher’s judgment of school achievement

should overlap considerably as components of grades or sources of variation in grades. Both

surely reflect the student’s conative skills. Both surely indicate a student’s level of attainment

on local learning objectives. But in this particular analysis, the Teacher Rating is probably a

weak measure of the latter—partly because the ratings focused on student behavior rather than

learning outcomes, and partly because they were based on the judgment of only two teachers

in the sophomore year.

The influence of teacher judgment may represent both strength and weakness in

grades, because the teachers’ judgment may be as subject to varying standards as are the

grades. Presumably, teachers are the primary source of positive and negative adjustments in

grading to give explicit credit for effort, attendance, completing assignments, and other

factors that are not necessarily related to acquired knowledge and skill. To what extent such

considerations are a strength or weakness of an assessment depends partly on the nature of the

particular high-stakes decision and partly on the educational philosophy that governs it.


Theory aside, it would be useful to have better facts about the role of teacher

judgment. In Table 16, the Teacher Rating Composite was correlated .69 with High School

Average. Is all of that strong correlation good news? Considering that the ratings (an average

of 1.6 per student) came from the middle of the sophomore year, this substantial relationship is not

likely to be explained simply as contamination between the two measures. Is the behavior of

students (the main focus of the ratings) really that stable through high school? Or could the

.69 partly reflect a tendency for teachers’ grades to be influenced by mindsets as to who

deserves good grades? The possibility of self-fulfilling prophecies in teacher judgment is an

old and controversial issue (Brophy, 1983; Rosenthal & Jacobson, 1968). This strong

relationship between grade average and teacher ratings of student behavior further illustrates

the need for more research on the validity and fairness of grades as the basis for high-stakes

decisions. It would also be desirable to see a more balanced concern regarding fair use of

grades as compared to the concern typically advanced regarding legal obligations in using test

scores (Office of Civil Rights, 1999).

This discussion of differential strengths of grades and test scores was not intended to

be exhaustive. Mainly, we have cited differential strengths suggested by the analysis reported

here. While grades and tests clearly overlap, differences between the two measures are

thereby all the more apparent. It is also clearer that the strengths of grades and tests are

often complementary. For example, in the last few pages we have noted these contrasts:

Grades can represent broader content and reflect distinct accomplishments, but tests are more

amenable to research and focused design on the most important content. Tests can more

readily focus on priority cognitive skills, but grades can more readily focus on motivational

components of academic achievement. Grades can readily reflect performance on what


students are studying, but tests can focus on more significant long-term educational

objectives. Test scores can be compared from one school to another, but grade scales can be

accommodated to local situations and programs.

The implications of these various characteristics will naturally depend upon the nature

of the educational decision. High-stakes decisions can be quite different as to purpose,

context, and values. Certain assessment strengths can be major considerations in some high-

stakes decisions. In selective admissions, for instance, evidence of scholastic engagement is a

major strength of grades, while the common yardstick is a major strength of test scores. The

two strengths advance validity and fairness in complementary ways.

Furthermore, it is useful to recall again that this comparison of grades and test scores

has been based on an analysis of individual differences as opposed to a construct analysis.

These two perspectives are not the same, though they are parallel and complementary. Each

is useful because it suggests different value issues and different possible courses of action.

Construct analysis leads to assessment and curriculum design; analysis of individual

differences leads more naturally to questions concerning differential educational practice and

performance of particular students.

Finally, it is worth observing that this analysis has been based on grades that we

commonly know and tests that we commonly use. In making consequential decisions about

individual students, it will often prove desirable to use the two measures together. Both are

useful, and both have unique and complementary strengths. But there is no reason to assume

that either grades or tests are as good as they could be or need to be. Measurement specialists

and educators might do well to worry less about the statistical indices of the grade-test

relationship and more about what we assess and how that improves teaching and learning.


Summary

Both grades and tests are used in high-stakes educational decisions. Who will be

placed in a slow or fast track in grade school, or earn a high school diploma, or be accepted in

a selective college or a prestigious graduate program? Grades and test scores play a curious,

interdependent role in such decisions. We use tests to keep grade scales honest or because we

do not fully understand or trust grades as an accurate indicator of educational outcomes. On

the other hand, we use grades to judge the validity and fairness of tests.

Grades and tests serve this mutually supportive role, in part, because it is commonly

assumed that they measure much the same thing. Yet we wonder why observed grades and

test scores frequently differ. There is much evidence but little systematic attention to why the

two measures often yield somewhat different results. The premise of this study was that it

should be possible to account for the differences in large part, and that such an analysis would

be helpful in understanding better the distinctive strengths of grades and tests as high-stakes

measures.

A Framework of Differences. To that end, a framework of possible sources of

discrepancy between grades and test scores was developed. The framework emphasized four

broad domains: content differences between grades and test scores, individual differences

related to content differences, errors in grades and in test scores, and situational differences

across contexts and over time. In the large sample necessary for this type of study, it was not

feasible to examine all possible sources of difference, but that was not essential because the

sources overlap.

Furthermore, results of prior studies suggest that some sources of potential difference

between grades and test scores are much more promising than others. A long history of


research on grading standards indicates that variations across courses and across schools are

the most likely types of grading errors to produce differences between observed grades and

test scores. Because of the multiple purposes served by classroom assessment, grading is

influenced by educationally relevant student behavior in addition to subject knowledge and

skills. Evidence suggests that teacher ratings of students are a useful source of information

regarding such influence. Various aspects of students’ attitudes and family life have also

shown promise in helping to account for school achievement.

Rather than attempt to cover all possible sources of difference between grades and test

scores, this study employed an approximation amenable to analysis and based upon the most

promising sources of difference that were suggested by relevant prior research. Accordingly,

the analysis took into account these five factors: 1) the particular subjects covered by the

grades and the tests, 2) grading variations, 3) reliability of grades and test scores,

4) characteristics of students that are likely to influence performance in school, and 5) teacher

ratings of behaviors that can bear on grades assigned.

Design of the study. The study focused on patterns of individual and group differences

on grades and test scores, rather than content differences in the two measures. The analysis

posed this question: “How can one correct errors or otherwise add to a test score to account

for differential grade performance?” Grades were regressed on test scores, taking each of the

five factors into account in turn. Achievement was analyzed both across and within schools

because studies have indicated that attitudes and other critical variables in addition to grades

are often not comparable from school to school. Since it is also likely that good grades foster

good attitudes, variables that are obviously dependent upon past grades were avoided in order

to minimize spurious effects.
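To make the logic of this design concrete, the following fragment sketches the basic step under stated assumptions. It is not the authors' code; the variable names (tests, hsa, groups) are hypothetical, and the data are simulated stand-ins for the NELS variables. It regresses grade average on the test scores and reports the multiple correlation together with the mean residual, that is, the average over- or underprediction, for each subgroup.

# A minimal sketch of the regression step described above (illustrative only).
import numpy as np

def differential_prediction(tests, hsa, groups):
    """Regress grades on test scores; report multiple R and the mean
    residual (over- or underprediction, in grade points) per subgroup."""
    X = np.column_stack([np.ones(len(hsa)), tests])   # add an intercept column
    beta, *_ = np.linalg.lstsq(X, hsa, rcond=None)    # ordinary least squares
    predicted = X @ beta
    residuals = hsa - predicted                       # observed minus predicted
    multiple_r = np.corrcoef(predicted, hsa)[0, 1]
    by_group = {g: residuals[groups == g].mean() for g in np.unique(groups)}
    return multiple_r, by_group

# Simulated data standing in for four NELS test scores and a grade average.
rng = np.random.default_rng(0)
tests = rng.normal(size=(1000, 4))
hsa = tests.mean(axis=1) + rng.normal(scale=0.8, size=1000)
groups = rng.choice(["A", "B"], size=1000)
print(differential_prediction(tests, hsa, groups))

In the study itself, the five factors were then taken into account in turn: some as additional predictor columns (student characteristics, teacher ratings), others as corrections applied to the grades or to the resulting correlations (grading variations, unreliability), as itemized below.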


The data were based on 8454 NELS seniors in 1992 in schools having at least 10

survey participants with necessary data. Ten subgroups were available for analysis: two

genders, four ethnic groups (African-American, Asian-American, Hispanic, and White), and

four school programs (Rigorous Academic, Academic, Academic-Vocational, Vocational).

The five factors were introduced on the basis of the following data and analyses.

• Factor 1: The subjects covered by grades and tests were matched by restricting the High

School Average (HSA) to the four academic areas (English, mathematics, science, social

studies) most similar to the four NELS tests and optimally weighting the tests as

predictors of grade average.

• Factor 2: Grading variations across schools were adjusted by two alternate methods—a

pooled within-school analysis corrected for range restriction and a residual method that

adjusted grades for mean over- or underprediction based on test performance. Course-

grading variations were adjusted by the residual method either within or across schools.

• Factor 3: Correlations between grade averages and test scores were corrected for

attenuation with traditional estimates of reliability, corrected for range restriction where

appropriate (the standard attenuation formula is sketched after this list).

• Factor 4: All student characteristics available in the NELS database were used where

research evidence indicated promise in predicting grade performance (family and attitude

measures and a number of behaviors such as attendance, coursetaking, and school

activities—26 variables in all, mostly composites of related information).

• Factor 5: Ratings by one or two teachers in the middle of the sophomore year focused on

five types of behavior that are often considered relevant in assigning grades (e.g., effort

and completing assignments).
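For Factor 3, the correction referred to is presumably the standard correction for attenuation. Writing $r_{XY}$ for the observed correlation between grade average and test composite, and $r_{XX'}$ and $r_{YY'}$ for the two reliability estimates, the disattenuated correlation is

$$\hat{\rho}_{XY} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}}.$$

For illustration only (the reliabilities here are hypothetical, not the study's estimates): an observed correlation of .62 with reliabilities of .90 and .85 would be corrected to $.62 / \sqrt{.90 \times .85} \approx .71$.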


The Findings. Extensive data were presented regarding many influences on grade

performance and test performance in high school. The most important results can be

summarized under four general findings. The first finding concerns the presenting question:

“Can one account for observed differences between grade performance and test

performance?” Two other significant findings concern the nature of two of the major factors

that do account for grade and test score differences: grading variations and scholastic

engagement. A fourth refers to the systematic patterns of similarities and differences that we

observed in the performance of subgroups.

1. Accounting for the Differences. The assumption that it would be possible to

account for most of the observed differences between grades and test scores by making

adjustments for the five proposed factors proved largely accurate. Thus, in an analysis based

on the test and 31 additional variables plus corrections for grading variations and unreliability,

both individual and group differences between the two measures were reduced substantially.

Reduced error in predicting the grade performance of individuals was reflected in a multiple

correlation of .90 compared to one of .62 based on the tests alone. Average differential

prediction in subgroups was reduced to two hundredths of a letter-grade. Grade performance

could be explained almost as well with only a few composite variables: a test covering a

comparable academic domain, school grading variations, student engagement in school, and

an overall teacher rating. These four major variables gave a multiple correlation with grade

average only .008 less than the analysis based on 37 variables. The remaining individual and

group differences can probably be explained largely by limitations of the analysis or other

known differences that could not be examined: curriculum variations among schools and

students, construct differences in grades and tests, temporal variations, shortcomings in the


measures and statistics used here, and limitations in information available regarding grading

variations.

2. Grading Variations. Grading variation among schools was a major source of

discrepancy between observed grades and grades predicted from test scores. Correction of

those school grading variations had a substantial effect on the correlation between grade

average and test scores. An analysis of variance showed that grading variation among courses

also accounted for a large part of the discrepancy between grade performance and test

performance; however, almost half of all observed school and course-grading variation was

associated with differences in the pattern of course grading from school to school. Such

differences in course-grading patterns could not be identified and corrected due to sparse data

within individual schools. As a result, the full effect of grading variations was underestimated

in this study. Furthermore, taking grading variations into account in actual use of grade

information in high-stakes decisions is problematic, because in actual practice, the data

necessary to estimate the effect of course—and therefore also school—grading variations on a

given student’s transcript are seldom likely to be available.
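The residual method for school grading variations (Factor 2) can be sketched in the same hypothetical terms as the earlier fragment: each school's grades are shifted by that school's mean over- or underprediction relative to the test-based regression. This is an illustrative reconstruction of the method as described, not the authors' code, and a schools label vector is assumed alongside the earlier variables.

# Illustrative sketch of the residual adjustment for school grading
# variations; reuses the hypothetical tests/hsa arrays defined earlier.
import numpy as np

def adjust_for_school_grading(tests, hsa, schools):
    X = np.column_stack([np.ones(len(hsa)), tests])
    beta, *_ = np.linalg.lstsq(X, hsa, rcond=None)
    residuals = hsa - X @ beta
    adjusted = hsa.astype(float)
    for s in np.unique(schools):
        mask = schools == s
        # Subtract the school's mean residual: leniently graded schools are
        # shifted down, strictly graded schools up, toward the common scale.
        adjusted[mask] -= residuals[mask].mean()
    return adjusted

As the text notes, an analogous course-level adjustment was also computed, but school-specific course-grading patterns could not be separated with the sparse within-school data, so a sketch of this kind understates the full correction.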

3. Scholastic Engagement. Many student characteristics—family background,

attitudes, and especially behavior—were related to both test performance and grade

performance. Particular types of behavior made large, independent contributions in predicting

differential grade performance; that is, grade performance with test scores taken into account.

These behaviors were oriented toward school and followed a logical pattern. The best single

predictor of differential grade performance was completing schoolwork assignments. Overall,

the students who tended to make higher grades than test scores were those who employed

appropriate school skills, demonstrated initiative in school, and avoided competing activities.


A weighted composite of nine such behaviors, termed Scholastic Engagement, proved to be a

major variable in this analysis of differential grade performance. Family background and

student attitudes were often strongly related to Engagement, but apparently play a secondary

role to overt behavior in determining school achievement. Because the behaviors represent a

coherent pattern and are potentially modifiable, Scholastic Engagement appears promising as

an organizing principle for studying and improving school performance. Teacher ratings

showed a similar pattern of behaviors to be relevant to test performance and especially to

grade performance.

4. Group Performance. Subgroups often differed significantly in average

achievement level, but were mostly quite similar with respect to achievement dynamics. That

is, the major variables that accounted for individual differences in grades and test scores were

similarly constituted and functioned in a similar manner across each of the six gender and

ethnic groups. For example, in all groups: “does homework” and “educational motivation”

were consistently the best grade predictors among the teacher ratings; taking advanced

electives, completing assignments, and coming to school were consistently the most effective

components of Scholastic Engagement. In predicting grades, multiple correlations and

standard regression weights for each of the major variables were quite similar for each of

these six groups. Parallel analyses based on the four school programs gave similar results,

except among vocational students where grades tended to be more closely related to teacher

judgment and less to test performance. The ten student groups differed substantially in school

performance as indicated by grades and test scores, but average results based on each of these

two measures were generally similar within each group, particularly when other group

differences were taken into account. While the groups obviously differed in interests,


background, and culture, the results here give no indication that such differences impart a

different meaning to grades, test scores, and other major variables that influence school

achievement.

The merits of grades and tests. The implications of these findings may vary

substantially depending upon the particular situation in which grades and tests are used, the

nature of the high-stakes decision, and other practical considerations. Nevertheless, the

important differences between grades and test scores appear to be characteristic of the two

measures and can therefore be expected, in some significant degree, to have generalizable

effects on their validity and fairness.

The results further demonstrate that it is possible to decrease substantially both

individual and group errors of prediction by taking into account factors that may have little

connection with the test that is being evaluated. Without correction for such artifacts, validity

coefficients and differential prediction will be unduly conservative indicators of validity and

fairness. That it is possible to so readily account for much of the apparent discrepancy

between grade performance and test performance should caution against assuming that

validity and fairness are fully and accurately described by the statistical indicators that we

commonly seek to explain and improve.

It is obvious that the substance of the measures and the consequences of their use are

critical. Being able to adjust away a major part of the differences between grades and test

scores does not mean that they are the same. To the contrary, the findings imply that grades

and tests are different, and they have different strengths with respect to various interpretations

of validity and fairness in the actual use of one measure or the other.


Two circumstances account for the fact that grades and tests have differential

strengths. First, as this study suggests, three components of grades characterize important

differences between the two measures: grading variations, scholastic engagement, and the

teacher’s judgment of the student’s performance. Second, these components are pertinent to

somewhat different though overlapping assessment objectives. In considering how each

translates into differential strengths of grades and tests, it is useful to think of the two

measures in the following way. Grades represent the teacher’s assessment as to how well the

student has fulfilled the implicit local contract between teacher and student. Tests represent

an external standard that makes it possible to compare performance across educational units.

Grading variations (Factor 2) are especially associated with fairness in assessment.

Flexible grading standards can be an arguable strength in a local contract. Classes largely

composed of academically weak or unusually talented students may not always be well served

by the same grade scale. Most high-stakes decisions, however, involve students in different

educational locales competing for the same recognition or opportunity. In such situations,

variation in grading standards compromises the fidelity of grades as a high-stakes measure.

The fairness inherent in a common yardstick is a critical strength of the external test standard.

Scholastic engagement (Factor 4) appeals to another important assessment objective;

that is, to represent as fully as possible the most critical skills. Aside from covering the

subject matter, grades and tests have the potential for distinctive strengths due to their

somewhat different content relevance to high-stakes decisions. The behaviors that define

Scholastic Engagement pertain to conative skills like volition, effort, and self-regulation—

attributes not readily represented in a test. On the other hand, a well-developed test can focus


on particular cognitive skills considered to be most significant educationally—a goal not

readily achieved in classroom grading.

That teacher ratings (Factor 5) are more closely related to grades than to test scores

suggests a familiar objective of educational assessment: to motivate effective teaching and

learning. Grades and tests do this differently. The potential strength of grades is to motivate

students by focusing specifically on the individual achievement and behavior that their

teachers can recognize and appreciate. The potential strength of tests is to motivate the

educational process by encouraging accountability and providing a more effective means of

designating what educational outcomes are most important. As Heubert and Hauser (1999)

have emphasized, high-stakes testing is a policy instrument.

The strengths of grades and tests are clearly different and often complementary.

Grades can gauge performance specific to the material that students are actually studying;

tests can focus on more significant long-term educational objectives. Tests can help in

developing the most critical cognitive skills; grades can help in developing the often neglected

but essential motivational component of learning. Teachers’ grades may be more nuanced,

but tests are more objective. As high-stakes measures, the comparability of test scores and the

breadth of grades are key complementary strengths. Common advice that the two measures

should be used together where possible is well founded. Emphasis on research and other

efforts that might enhance their complementary strengths would be well placed.


References

Adelman, C. (1999). Answers in the tool box: Academic intensity, attendance patterns, and bachelor's degree attainment. Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Astin, A. W. (1971). Predicting academic performance in college. New York: The Free Press.

Astin, A. W. (1985). Achieving educational excellence. San Francisco, CA: Jossey-Bass.

Babad, E. Y., Inbar, J., & Rosenthal, R. (1982). Pygmalion, Galatea, and the Golem: Investigations of biased and unbiased teachers. Journal of Educational Psychology, 74(4), 459-474.

Baker, E. L., & Linn, R. L. (1997). Emerging educational standards of performance in the United States (CSE Technical Report 437). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Bandura, A. (2000). A sociocognitive perspective on intellectual development and functioning. Newsletter for Educational Psychologists, 23(2), 1-4.

Beatty, A., Greenwood, M. R. C., & Linn, R. L. (Eds.). (1999). Myths and tradeoffs: The role of tests in undergraduate admissions. Washington, DC: National Academy Press.

Bejar, I. I., & Blew, E. O. (1981). Grade inflation and the validity of the Scholastic Aptitude Test. American Educational Research Journal, 18(2), 143-156.

Bennett, R. E. (1999). Using new technology to improve assessment (ETS RR-99-6). Princeton, NJ: Educational Testing Service.

Bennett, R. E., & Ward, W. C. (Eds.). (1993). Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment. Hillsdale, NJ: Lawrence Erlbaum Associates.


Bloom, B. S. (Ed.). (1956). Taxonomy of educational objectives: Handbook I. Cognitive domain. New York: David McKay Company.

Bloom, B. S., & Peters, F. R. (1961). The use of academic prediction scales for counseling and selecting college entrants. New York: The Free Press of Glencoe.

Bowen, W. G., & Bok, D. (1998). The shape of the river. Princeton: Princeton University Press.

Braun, H. I., Bennett, R. E., Frye, D., & Soloway, E. (1990). Scoring constructed responses using expert systems. Journal of Educational Measurement, 27(2), 93-108.

Braun, H. I., & Szatrowski, T. H. (1984). The scale-linkage algorithm: Construction of a universal criterion scale for families of institutions. Journal of Educational Statistics, 9(4), 311-330.

Breland, H. M. (1981). Assessing student characteristics in admissions to higher education (Research Monograph No. 9). New York: College Entrance Examination Board.

Bridgeman, B., & Lewis, C. (1994). The relationship of essay and multiple-choice scores with grades in college courses. Journal of Educational Measurement, 31(1), 37-50.

Bridgeman, B., McCamley-Jenkins, L., & Ervin, N. (2000). Predictions of freshman grade-point average from the revised and recentered SAT I: Reasoning Test (College Board Report No. 2000-1, ETS RR-00-1). New York: College Board.

Brookhart, S. M. (1993). Teachers' grading practices: Meaning and values. Journal of Educational Measurement, 30(2), 123-142.

Brophy, J. E. (1983). Research on the self-fulfilling prophecy and teacher expectations. Journal of Educational Psychology, 75(5), 631-661.

Brown, B. B. (1988). The vital agenda for research on extracurricular influences: A reply to Holland and Andre. Review of Educational Research, 58(1), 107-111.

Burnham, P. S. (1954). The evaluation of academic ability. In College admissions. New York: College Entrance Examination Board.

Byrne, B. M. (1986). Self-concept/academic achievement relations: An investigation of dimensionality, stability, and causality. Canadian Journal of Behavioral Science, 18(2), 173-185.

Byrne, B. M. (1996). Academic self-concept: Its structure, measurement, and relation to academic achievement. In B. A. Bracken (Ed.), Handbook of self-concept: Developmental, social, and clinical considerations (pp. 287-316). New York: John Wiley & Sons, Inc.

Caldwell, E., & Hartnett, R. (1967). Sex bias in college grading? Journal of Educational Measurement, 4(3), 129-132.

Cameron, R. G. (1989). The common yardstick: A case for the SAT. New York: College Entrance Examination Board.

Cannell, J. J. (1988). Nationally normed elementary achievement testing in America's public schools: How all 50 states are above the national average. Educational Measurement: Issues and Practice, 7(2), 5-9.

Chickering, A. W. (Ed.). (1981). The modern American college. San Francisco: Jossey-Bass Publishers.

Clark, M., & Grandy, J. (1984). Sex differences in the academic performance of Scholastic Aptitude Test takers (CB Rep. No. 84-8, ETS RR-84-43). New York: College Entrance Examination Board.

Cole, N. S. (1997). Understanding gender differences and fair assessment in context. In W. W. Willingham & N. S. Cole, Gender and fair assessment (pp. 157-184). Mahwah, NJ: Lawrence Erlbaum Associates.

Coleman, J. S. (1961). The adolescent society: The social life of the teenager and its impact on education. New York: Free Press.

Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M., Weinfeld, F. D., & York, R. L. (1966). Equality of educational opportunity. Washington, DC: U.S. Department of Health, Education and Welfare.

College Board. (1992). College bound seniors. New York: College Board.

College Board. (1998). High school grading policies (RN-04). New York: College Board Office of Research and Development.

Collins, W. A., Maccoby, E. E., Steinberg, L., Hetherington, E. M., & Bornstein, M. H. (2000). Contemporary research on parenting: The case for nature and nurture. American Psychologist, 55(2), 218-232.

Cooper, H., Lindsay, J. J., Nye, B., & Greathouse, S. (1998). Relationships among attitudes about homework, amount of homework assigned and completed, and student achievement. Journal of Educational Psychology, 90(1), 70-83.

Cooper, H., Valentine, J. C., Nye, B., & Lindsay, J. J. (1999). Relationships between five after-school activities and academic achievement. Journal of Educational Psychology, 91(2), 369-378.


Covington, M. V. (2000). Intrinsic versus extrinsic motivation in schools: A reconciliation. Current Directions in Psychological Science, 9(1), 22-25.

Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist, 30(1), 1-14.

Crouse, J., & Trusheim, D. (1988). The case against the SAT. Chicago: University of Chicago Press.

Cureton, L. W. (1971). The history of grading practices. Measurement in Education, 2(4), 1-8.

Davis, J. A. (1965). Faculty perceptions of students: V. A second-order structure for faculty characterizations (College Entrance Examination Board RDR-64-5, No. 14; ETS RB-65-12). Princeton, NJ: Educational Testing Service.

Dewey, J. (1900). School and society. Chicago: University of Chicago Press.

DiMaggio, P. (1982). Cultural capital and school success: The impact of status culture participation on the grades of U.S. high school students. American Sociological Review, 47, 189-201.

Dressel, P. L. (1939). Effect of the high school on college grades. Journal of Educational Psychology, 30, 612-617.

Dwyer, C. A., & Johnson, L. M. (1997). Grades, accomplishments, and correlates. In W. W. Willingham & N. S. Cole, Gender and fair assessment (pp. 127-156). Mahwah, NJ: Lawrence Erlbaum Associates.

Dyer, H. S., & King, R. G. (1955). College Board scores. New York: College Entrance Examination Board.

Eccles (Parsons), J. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), Achievement and achievement motives (pp. 75-146). San Francisco: W. H. Freeman Company.

Eccles, J. S., Wigfield, A., & Schiefele, U. (1998). Motivation to succeed. In N. Eisenberg (Ed.), Social, emotional, and personality development (Vol. 3, pp. 1017-1095). New York: John Wiley & Sons.

Ekstrom, R. (1994). Gender differences in high school grades: An exploratory study (College Board Report No. 94-3, ETS RR-94-25). New York: College Entrance Examination Board.

Ekstrom, R., Goertz, M., & Rock, D. (1988). Education & American youth. London: Falmer Press.


Ekstrom, R., & Villegas, A. (1994). College grades: An exploratory study of policies and practices (College Board Report No. 94-1, ETS RR-94-23). New York: College Entrance Examination Board.

Elliott, R., & Strenta, A. C. (1988). Effects of improving the reliability of the GPA on prediction generally and on comparative predictions for gender and race particularly. Journal of Educational Measurement, 25(4), 333-347.

Etaugh, A. F., Etaugh, C. F., & Hurd, D. E. (1972). Reliability of college grades and grade point averages: Some implications for prediction of academic performance. Educational and Psychological Measurement, 32(4), 1045-1050.

Farkas, G., Grobe, R. P., Sheehan, D., & Shuan, Y. (1990). Cultural resources and school success: Gender, ethnicity, and poverty groups within an urban school district. American Sociological Review, 55(1), 127-142.

Fetters, W. B., Stowe, P. S., & Owings, J. A. (1984). Quality of responses of high school students to questionnaire items. Washington, DC: National Center for Educational Statistics.

Findley, M. J., & Cooper, H. M. (1983). Locus of control and academic achievement: A literature review. Journal of Personality and Social Psychology, 44(2), 419-427.

Finn, J. D. (1993). School engagement and students at risk. Washington, DC: National Center for Education Statistics.

Fishman, J. A. (1958). The use of tests for admission to college: The next fifty years. In A. E. Traxler, Long range planning for education (pp. 74-79). Washington, DC: American Council on Education.

Fishman, J. A., & Pasanella, A. K. (1960). College admission selection studies. Review of Educational Research, 30(4), 298-310.

Fleming, M., & Chambers, B. (1983). Teacher-made tests: Windows on the classroom. In W. E. Hathaway (Ed.), Testing in the schools (pp. 29-38). (New Directions for Testing and Measurement, No. 19.) San Francisco: Jossey-Bass.

Frary, R. B., Cross, L. H., & Weber, L. J. (1993). Testing and grading practices and opinions of secondary teachers of academic subjects: Implications for instruction in measurement. Educational Measurement: Issues and Practice, 12(3), 23-30.

Frederiksen, J. R., & Collins, A. (1989). A systems approach to educational testing. Educational Researcher, 18(9), 27-32.


Frederiksen, N. (1984). The real test bias: Influences of testing on teaching and learning. American Psychologist, 39(3), 193-202.

Frederiksen, N., Glaser, R., Lesgold, A., & Shafto, M. G. (1990). Diagnostic monitoring of skill and knowledge acquisition. Hillsdale, NJ: Lawrence Erlbaum Associates.

Fredrick, W. C., & Walberg, H. J. (1980). Learning as a function of time. Journal of Educational Research, 73, 183-204.

Freeberg, N. E. (1967). The biographical information blank as a predictor of student achievement: A review. Psychological Reports, 20(4), 911-925.

Gamache, L. M., & Novick, M. R. (1985). Choice of variables and gender differentiated prediction within selected academic programs. Journal of Educational Measurement, 22(1), 53-70.

Geisinger, K. F. (1982). Marking systems. In H. E. Mitzell (Ed.), Encyclopedia of educational research (5th ed., pp. 1139-1149). New York: The Free Press.

Gifford, B. R., & O'Connor, M. C. (Eds.). (1992). Changing assessments: Alternate views of aptitude, achievement, and instruction. Boston: Kluwer Academic Publishers.

Gitomer, D. H. (1993). Performance assessment and educational measurement. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment (pp. 241-264). Hillsdale, NJ: Lawrence Erlbaum Associates.

Goldman, R. D., & Hewitt, B. N. (1975). Adaptation-level as an explanation for differential standards in college grading. Journal of Educational Measurement, 12(3), 149-161.

Goldman, R. D., Schmidt, D. E., Hewitt, B. N., & Fisher, R. (1974). Grading practices in different major fields. American Educational Research Journal, 11(4), 343-357.

Goldman, R. D., & Slaughter, R. E. (1976). Why college grade point average is difficult to predict. Journal of Educational Psychology, 68(1), 9-14.

Goldman, R. D., & Widawski, M. H. (1976). A within-subjects technique for comparing college grading standards: Implications in the validity of the evaluation of college achievement. Educational and Psychological Measurement, 36(2), 381-390.

Green, P. G., Dugoni, B. L., Ingels, S. J., & Camburn, E. (1995). A profile of the American high school senior in 1992. Washington, DC: U.S. Department of Education.


Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5(10), 511-517.

Haertel, E. H. (1999). Validity arguments for high stakes testing: In search of the evidence. Educational Measurement: Issues and Practice, 18(4), 5-9.

Hanks, M. P., & Eckland, B. K. (1976). Athletics and social participation in the educational attainment process. Sociology of Education, 49(4), 271-294.

Hansford, B. C., & Hattie, J. A. (1982). The relationship between self and achievement/performance measures. Review of Educational Research, 52(1), 123-142.

Hanson, S. L., & Ginsburg, A. L. (1988). Gaining ground: Values and high school success. American Educational Research Journal, 25(3), 334-365.

Harris, J. R. (1995). Where is the child's environment? A group socialization theory of development. Psychological Review, 102(3), 458-489.

Hartocollis, A. (1999, September 15). Chancellor cites test score errors. New York Times, p. A1.

Heubert, J. P., & Hauser, R. M. (1999). High stakes: Testing for tracking, promotion, and graduation. Washington, DC: National Academy Press.

Hewitt, B. N., & Goldman, R. D. (1975). Occam's razor slices through the myth that college women overachieve. Journal of Educational Psychology, 67(2), 325-330.

Hills, J. R. (1981). Measurement and evaluation in the classroom (2nd ed.). Columbus, OH: Merrill.

Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of academic achievement: A review of literature. Review of Educational Research, 59(3), 297-313.

Holland, A., & Andre, T. (1987). Participation in extracurricular activities in secondary school: What is known, what needs to be known? Review of Educational Research, 57(4), 437-466.

Holland, A., & Andre, T. (1988). Beauty is in the eye of the reviewer. Review of Educational Research, 58(1), 113-118.

Hoover, H. D., & Han, L. (1995). The effect of differential selection on gender differences in college admissions test scores. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.


Hossler, D., & Stage, F. K. (1992). Family and high school experience influences on the postsecondary educational plans of ninth-grade students. American Educational Research Journal, 29(2), 425-451.

Hulin, C. L., Henry, R. A., & Noon, S. L. (1990). Adding a dimension: Time as a factor in the generalizability of predictive relationships. Psychological Bulletin, 107(3), 328-340.

Humphreys, L. G. (1968). The fleeting nature of academic prediction. Journal of Educational Psychology, 59(5), 375-380.

Humphreys, L. G., & Taber, T. (1973). Postdiction study of the Graduate Record Examination and eight semesters of college grades. Journal of Educational Measurement, 10(3), 179-184.

Hunter, J. E., & Schmidt, F. L. (1976). Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin, 83(6), 1053-1071.

Hunter, J. E., & Schmidt, F. L. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills: Sage Publications.

Ingels, S. J., Abraham, S., Rasinski, K. A., Karr, R., Spencer, B. D., & Frankel, M. R. (1990). NELS:88 base year student component (NCES 90-464). Washington, DC: National Center for Education Statistics, U.S. Department of Education.

Ingels, S. J., Dowd, K. L., Baldridge, J. D., Stipe, J. L., Bartot, V. H., & Frankel, M. R. (1994). Second follow-up: Student component data file user's manual (NCES 94-374). Washington, DC: National Center for Education Statistics, U.S. Department of Education.

Ingels, S. J., Dowd, K. L., Taylor, J. R., Bartot, V. H., Frankel, M. R., & Pulliam, P. A. (1995). Second follow-up: Transcript component data file user's manual (NCES 95-377). Washington, DC: National Center for Education Statistics, U.S. Department of Education.

Ingels, S. J., Scott, L. A., Lindmark, J. T., Frankel, M. R., & Myers, S. L. (1992a). NELS:88 first follow-up student component (NCES 92-030). Washington, DC: National Center for Education Statistics, U.S. Department of Education.

Ingels, S. J., Scott, L. A., Lindmark, J. T., Frankel, M. R., & Myers, S. L. (1992b). NELS:88 first follow-up teacher component (NCES 92-085). Washington, DC: National Center for Education Statistics, U.S. Department of Education.


Jackson, D. N., Ahmed, S. A., & Heapy, N. A. (1976). Is achievement a unitary construct? Journal of Research in Personality, 10(1), 1-21.

Jencks, C. (1998). Racial bias in testing. In C. Jencks & M. Phillips (Eds.), The black-white test score gap (pp. 55-85). Washington, DC: Brookings Institution Press.

Jencks, C., Smith, M., Acland, H., Bane, M. J., Cohen, D., Gintis, H., Heyns, B., & Michelson, S. (1972). Inequality: A reassessment of the effect of family and schooling in America. New York: Basic Books, Inc.

Juola, A. E. (1968). Illustrative problems in college-level grading. Personnel and Guidance Journal, 47(1), 29-33.

Jussim, L. (1989). Teacher expectations: Self-fulfilling prophecies, perceptual biases, and accuracy. Journal of Personality and Social Psychology, 57(3), 469-480.

Keeton, M., & Associates. (1976). Experiential learning: Rationale, characteristics, and assessment. San Francisco: Jossey-Bass.

Keith, T. Z. (1982). Time spent on homework and high school grades: A large sample path analysis. Journal of Educational Psychology, 74(2), 248-253.

Kelly, F. J. (1914). Teachers' marks: Their variability and standardization (No. 6). New York: Teachers College, Columbia University.

Kirschenbaum, H., Simon, S. B., & Napier, R. W. (1971). Wad-ja-get? The grading game in American education. New York: Hart Publishing Company.

Kleese, E. J., & D'Onofrio, J. A. (1994). Student activities for students at risk. Reston, VA: National Association of Secondary School Principals.

Koretz, D., Stecher, B., Klein, S., & McCaffrey, D. (1994). The Vermont portfolio assessment program. Educational Measurement: Issues and Practice, 13(3), 5-16.

Krathwohl, D. R., Bloom, B. S., & Masia, B. B. (1964). Taxonomy of educational objectives: Handbook II. Affective domain. New York: David McKay Company, Inc.

Lamborn, S. D., Brown, B. B., Mounts, N. S., & Steinberg, L. (1992). Putting school in perspective: The influence of family, peers, extracurricular participation, and part-time work on academic engagement. In F. M. Newmann (Ed.), Student engagement and achievement in American secondary schools (pp. 153-181). New York: Teachers College Press.


Lamborn, S. D., Mounts, N. S., Steinberg, L., & Dornbusch, S. M. (1991). Patterns of competence and adjustment among adolescents from authoritative, authoritarian, indulgent, and neglectful families. Child Development, 62(5), 1049-1065.

Lemann, N. (1999). The big test: The secret history of American meritocracy. New York: Farrar, Straus, & Giroux.

Leonard, D. K., & Jiang, J. (1995, April). Gender bias in the college predictions of the SAT. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Lewis, C., & Willingham, W. W. (1995). The effects of sample restriction on gender differences (ETS RR-95-13). Princeton, NJ: Educational Testing Service.

Lindquist, E. F. (1963). An evaluation of a technique for scaling high school grades to improve prediction of college success. Educational and Psychological Measurement, 23(4), 623-646.

Lindsay, P. (1982). The effect of high school size on student participation, satisfaction, and attendance. Educational Evaluation and Policy Analysis, 4(1), 57-65.

Lindsay, P. (1984). High school size, participation in activities, and young adult social participation: Some enduring effects of schooling. Educational Evaluation and Policy Analysis, 6(1), 73-83.

Linn, R. L. (1966). Grade adjustments for prediction of academic performance: A review. Journal of Educational Measurement, 3(4), 313-329.

Linn, R. L. (1982a). Ability testing: Individual differences, prediction, and differential prediction. In A. K. Wigdor & W. R. Garner (Eds.), Ability testing: Uses, consequences, and controversies (Vol. 2, pp. 335-388). Washington, DC: National Academy Press.

Linn, R. L. (1982b). Admissions testing on trial. American Psychologist, 37(3), 279-291.

Linn, R. L. (1983). Predictive bias as an artifact of selection procedures. In H. Wainer & S. Messick (Eds.), Principals of modern psychological measurement: A festschrift for Frederic M. Lord (pp. 27-40). Hillsdale, NJ: Lawrence Erlbaum Associates.

Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.


Linn, R. L., Graue, E., & Sanders, N. M. (1990). Comparing state and district test results to national norms: The validity of the claims that "Everyone is above average." Educational Measurement: Issues and Practice, 9(3), 5-14.

Linn, R. L., & Werts, C. E. (1971). Considerations for studies of test bias. Journal of Educational Measurement, 8(1), 1-4.

Loyd, B. H., & Loyd, D. E. (1997). Kindergarten through grade 12 standards: A philosophy of grading. In G. D. Phye (Ed.), Handbook of classroom assessment: Learning, adjustment, and achievement (pp. 481-489). San Diego, CA: Academic Press.

Madaus, G. F. (1994). A technological and historical consideration of equity issues associated with proposals to change the nation's testing policy. Harvard Educational Review, 64(1), 76-95.

Makitalo, A. (1994). Non-comparability of female and male admission test takers (Department of Education and Educational Research Report No. 1994:06). Goteborg: Goteborg University.

Marsh, H. W. (1987). The big-fish-little-pond effect on academic self-concept. Journal of Educational Psychology, 79(3), 280-295.

Marsh, H. W. (1990a). Causal ordering of academic self-concept and academic achievement: A multiwave, longitudinal panel analysis. Journal of Educational Psychology, 82(4), 646-656.

Marsh, H. W. (1990b). A multidimensional, hierarchical model of self-concept: Theoretical and empirical justification. Educational Psychology Review, 2(2), 77-172.

Marsh, H. W. (1991). Employment during high school: Character building or a subversion of academic goals? Sociology of Education, 64(3), 172-189.

Marsh, H. W. (1992a). Content specificity of relations between academic achievement and academic self-concept. Journal of Educational Psychology, 84(1), 35-42.

Marsh, H. W. (1992b). Extracurricular activities: Beneficial extension of the traditional curriculum or subversion of academic goals? Journal of Educational Psychology, 84(4), 553-562.

Marsh, H. W. (1994). Using the National Longitudinal Study of 1988 to evaluate theoretical models of self-concept: The Self-Description Questionnaire. Journal of Educational Psychology, 86(3), 439-456.


Marsh, H. W., Byrne, B. M., & Shavelson, R. J. (1988). A multifaceted academic self-concept: Its hierarchical structure and its relation to academic achievement. Journal of Educational Psychology, 80(3), 366-380.

Marsh, H. W., Byrne, B. M., & Yeung, A. S. (1999). Causal ordering of academic self-concept and achievement: Reanalysis of a pioneering study and revised recommendations. Educational Psychologist, 34(3), 155-168.

Marsh, H. W., & Parker, J. W. (1984). Determinants of student self-concept: Is it better to be a relatively large fish in a small pond even if you don't learn to swim as well? Journal of Personality and Social Psychology, 47(1), 213-231.

Marsh, H. W., & Shavelson, R. (1985). Self-concept: Its multifaceted, hierarchical structure. Educational Psychologist, 20(3), 107-123.

Marsh, H. W., & Yeung, A. S. (1997). Causal effects of academic self-concept on academic achievement: Structural equation models of longitudinal data. Journal of Educational Psychology, 89(1), 41-54.

McClelland, D. C., Atkinson, J. W., Clark, R. A., & Lowell, E. L. (1953). The achievement motive. New York: Appleton-Century-Crofts.

McCornack, R. L., & McLeod, M. M. (1988). Gender bias in the prediction of college course performance. Journal of Educational Measurement, 25(4), 321-331.

Messick, S. (1967). Personality measurement and college performance. In D. N. Jackson & S. Messick (Eds.), Problems in human assessment (pp. 834-845). New York: McGraw-Hill.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education & Macmillan.

Messick, S. (1995). Validation of inferences from persons' responses and performance as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

Milne, A. M., Myers, D. E., Rosenthal, A. S., & Ginsburg, A. (1986). Single parents, working mothers, and the educational achievement of school children. Sociology of Education, 59(2), 125-139.

Milton, O., Pollio, H. R., & Eison, J. A. (1986). Making sense of college grades. San Francisco: Jossey-Bass.

Mislevy, R. J., Steinberg, L. S., Breyer, F. J., Almond, R. G., & Johnson, L. (1999). A cognitive task analysis, with implications for designing a simulation-based assessment system. Computers in Human Behavior, 15, 335-374.


Monk, D. H., & Haller, E. J. (1993). Predictors of high school academic course offerings: The role of school size. American Educational Research Journal, 30(1), 3-21.

Mulkey, L. M., Crain, R. L., & Harrington, A. J. C. (1992). One-parent households and achievement: Economic and behavioral explanations of a small effect. Sociology of Education, 65(1), 48-65.

National Center for Education Statistics. (1995). Extracurricular participation and student engagement. Education Policy Issues: Statistical Perspectives (NCES 95-741). Washington, DC: U.S. Department of Education, National Center for Education Statistics.

National School Public Relations Association. (1972). Grading and reporting: Current trends in school policies & programs. Arlington, VA: National School Public Relations Association.

Natriello, G. (1992). Marking systems. In M. C. Alkin (Ed.), Encyclopedia of educational research (6th ed., Vol. 3, pp. 772-776). New York: Macmillan Publishing Company.

Newmann, F. M. (Ed.). (1992). Student engagement and achievement in American secondary schools. New York: Teachers College Press.

Office of Civil Rights. (1999). Nondiscrimination in high-stakes testing: A resource guide. Washington, DC: U.S. Department of Education.

Pedulla, J. J., Airasian, P. W., & Madaus, G. F. (1980). Do teacher ratings and standardized test results of students yield the same information? American Educational Research Journal, 17(3), 303-307.

Pennock-Roman, M. (1990). Test validity and language background. New York: College Board.

Pennock-Roman, M. (1994). College major and gender differences in the prediction of college grades (College Board Report No. 94-2, ETS RR-94-24). New York: College Entrance Examination Board.

Petersen, N. S., & Novick, M. R. (1976). An evaluation of some models for culture-fair selection. Journal of Educational Measurement, 13(1), 3-29.

Pintrich, P. R., & Schunk, D. H. (1996). Motivation in education: Theory, research, and applications. Englewood Cliffs, NJ: Prentice Hall.

Public Agenda. (2000, February 16). Reality check 2000. Education Week, Special Report, pp. S1-S8.


Ramist, L., Lewis, C., & McCamley, L. (1990). Implications of using freshman GPA as the criterion for the predictive validity of the SAT. In W. W. Willingham, C. Lewis, R. Morgan, & L. Ramist, Predicting college grades: An analysis of institutional trends over two decades (pp. 253-288). Princeton, NJ: Educational Testing Service.

Ramist, L., Lewis, C., & McCamley-Jenkins, L. (1994). Student group differences in predicting college grades: Sex, language, and ethnic groups (College Board Report No. 93-1, ETS RR-94-27). New York: College Entrance Examination Board.

Resnick, L. B., & Resnick, D. P. (1992). Assessing the thinking curriculum: New tools for educational reform. In B. R. Gifford & M. C. O'Connor (Eds.), Changing assessments: Alternate views of aptitude, achievement, and instruction (pp. 37-76). Boston: Kluwer Academic Publishers.

Richards, J. M., Jr., Holland, J. L., & Lutz, S. W. (1967). Prediction of student accomplishment in college. Journal of Educational Psychology, 58(6), 343-355.

Robinson, G. E., & Craver, J. M. (1989). Assessing and grading student achievement. Arlington, VA: Educational Research Service.

Rock, D. A., Ekstrom, R. B., Goertz, M. E., & Pollack, J. (1986). Study of excellence in high school education: Longitudinal study, 1980-82 final report. Washington, DC: U.S. Department of Education, Center for Statistics, Office of Educational Research and Improvement.

Rock, D., & Evans, F. (1982). The effectiveness of several grade adjustment methods for predicting law school performance (LSAC No. 82-02). Newtown, PA: Law School Admission Services.

Rock, D. A., Pollack, J. M., & Quinn, P. (1995). Psychometric report for the NELS:88 base year through second follow-up (NCES 95-382). Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Rock, D. A., & Werts, C. E. (1979). Construct validity of the SAT across populations—an empirical confirmatory study (Research Report 79-2). Princeton, NJ: Educational Testing Service.

Rosenthal, R., & Jacobson, L. (1968). Pygmalion in the classroom: Teacher expectations and pupils' intellectual development. New York: Holt, Rinehart and Winston.

Saslow, L. (1989, May 7). Schools say inflated grades cut grants. New York Times, p. 1.


Sewell, W. H., & Shah, V. P. (1968). Social class, parental encouragement, and educational aspirations. American Journal of Sociology, 73(5), 559-572.

Shepard, L. A. (1990). Inflated test score gains: Is the problem old norms or teaching the test? Educational Measurement: Issues and Practice, 9(3), 15-22.

Shepard, L. A. (1992a). Commentary: What policy makers who mandate tests should know about the new psychology of intellectual ability and learning. In B. R. Gifford & M. C. O'Connor (Eds.), Changing assessments: Alternate views of aptitude, achievement, and instruction (pp. 301-328). Boston: Kluwer Academic Publishers.

Shepard, L. A. (1992b). Uses and abuses of testing. In M. C. Alkin (Ed.), Encyclopedia of educational research (pp. 1477-1485). New York: Macmillan Publishing Company.

Shepard, L. A. (2000). The role of classroom assessment in teaching and learning (CSE Technical Report 517). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Shepard, L. A., Flexer, R. J., Hiebert, E. H., Marion, S. F., Mayfield, V., & Weston, T. J. (1995). Effects of introducing classroom performance assessments (CSE Technical Report 394). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Snow, R. E. (1989). Toward assessment of cognitive and conative structures in learning. Educational Researcher, 18(9), 8-14.

Snow, R. E. (1998). Abilities as aptitudes and achievements in learning situations. In J. J. McArdle & R. W. Woodcock (Eds.), Human cognitive abilities in theory and practice (pp. 93-112). Mahwah, NJ: Lawrence Erlbaum Associates.

Snow, R. E., & Jackson, D. N. (1993). Assessment of conative constructs for educational research and evaluation: A catalogue (CSE Technical Report 354). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Snow, R. E., & Jackson, D. N. (1997). Individual differences in conation: Selected constructs and measures (CSE Technical Report 447). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Snow, R. E., & Mandinach, E. B. (1999). Integrating assessment and instruction for classrooms and courses: Programs and prospects for research. Princeton, NJ: Educational Testing Service.

Soares, A. T., & Soares, L. M. (1969). Self-perceptions of culturally disadvantaged children. American Educational Research Journal, 6(1), 31-45.


Spady, W. G. (1970). Lament for the letterman: Effects of peer status and extracurricular activities on goals and achievement. American Journal of Sociology, 75(4), 680-702.

Starch, D., & Elliott, E. C. (1912). Reliability of the grading of high-school work in English. School Review, 20, 442-457.

Starch, D., & Elliott, E. C. (1913). Reliability of grading work in mathematics. School Review, 21(5), 254-256.

Steinberg, L., Brown, B. B., Cider, M., Kaczmarek, N., & Lazzaro, C. (1988). Noninstructional influences on high school student achievement: The contributions of parents, peers, extracurricular activities, and part-time work. Madison, WI: National Center for Effective Secondary Schools.

Sternberg, R. J. (1985). Beyond IQ: A triarchic theory of human intelligence. New York: Cambridge University Press.

Strenta, A. C., & Elliott, R. (1987). Differential grading standards revisited. Journal of Educational Measurement, 24(4), 281-291.

Stricker, L. J., & Emmerich, W. (1999). Possible determinants of differential item functioning: Familiarity, interest, and emotional reaction. Journal of Educational Measurement, 36(4), 347-366.

Stricker, L. J., Rock, D. A., & Burton, N. W. (1991). Sex differences in SAT predictions of college grades (College Board Report No. 91-2, ETS RR-91-38). New York: College Entrance Examination Board.

Stricker, L. J., Rock, D. A., & Burton, N. W. (1993). Sex differences in predictions of college grades from Scholastic Aptitude Test scores. Journal of Educational Psychology, 85(4), 710-718.

Stricker, L. J., Rock, D. A., Burton, N. W., Muraki, E., & Jirele, T. J. (1994). Adjusting college grade point average criteria for variations in grading standards: A comparison of methods. Journal of Applied Psychology, 79(2), 178-183.

Taber, T. D., & Hackman, J. D. (1976). Dimensions of undergraduate college performance. Journal of Applied Psychology, 61(5), 546-558.

Tatsuoka, K. K., & Tatsuoka, M. M. (1992). A psychometrically sound cognitive diagnostic model: Effect of remediation as empirical validity (ETS RR-92-38). Princeton, NJ: Educational Testing Service.

Terwilliger, J. S. (1989). Classroom standard setting and grading practices. Educational Measurement: Issues and Practice, 8(2), 15-19.


Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 561-620). Washington, DC: American Council on Education.

Thorndike, R. L. (1963). The concepts of over- and underachievement. New York: Teachers College, Columbia University.

Thorndike, R. L. (1969). Marks and marking systems. In R. L. Ebel (Ed.), Encyclopedia of educational research (4th ed., pp. 759-766). New York: Macmillan.

Thurstone, L. L. (1947). Multiple-factor analysis. Chicago: University of Chicago Press.

Tucker, L. (1960). Formal models for a central prediction system (ETS RB-60-14). Princeton, NJ: Educational Testing Service.

Vars, F. E., & Bowen, W. G. (1998). Scholastic Aptitude Test scores, race, and academic performance in selective colleges and universities. In C. Jencks & M. Phillips (Eds.), The Black-White test score gap (pp. 457-479). Washington, DC: Brookings Institution Press.

Vickers, J. M. (2000). Justice and truth in grades and their averages. Research in Higher Education, 41(2), 141-164.

Warren, J. R. (1971). College grading practices: An overview (Report 9). Washington, DC: ERIC Clearinghouse on Higher Education, George Washington University.

Weiner, B. (1992). Human motivation: Metaphors, theories, and research. Newbury Park, CA: Sage Publications.

Wenglinsky, H. (1997). How money matters: The effect of school district spending on academic achievement. Sociology of Education, 70(July), 221-237.

Werts, C., Linn, R. L., & Joreskog, K. G. (1978). Reliability of college grades from longitudinal data. Educational and Psychological Measurement, 38(1), 89-95.

Werts, C. E. (1967). The many faces of intelligence. Journal of Educational Psychology, 58(4), 198-204.

Whitaker, U. (1989). Assessing learning: Standards, principles, and procedures. Philadelphia: Council for Adult and Experiential Learning.

Wigdor, A. K., & Garner, W. R. (Eds.). (1982). Ability testing: Uses, consequences, and controversies. Washington, DC: National Academy Press.


Wiggins, G. (1989). A true test: Toward more authentic and equitable assessment. Phi Delta Kappan, 70, 703-713.

Wilgoren, J. (2000, February 25). Cheating on statewide tests is reported in Massachusetts. New York Times.

Willingham, W. W. (1961). Prediction of the academic success of transfer students (RM 61-16). Atlanta: Georgia Institute of Technology.

Willingham, W. W. (1962a). Longitudinal analysis of academic performance (RM 62-5). Atlanta: Georgia Institute of Technology.

Willingham, W. W. (1962b). The analysis of grading variations (RM 62-9). Atlanta: Georgia Institute of Technology.

Willingham, W. W. (1963a). Adjusting college predictions on the basis of academic origins. In M. Katz (Ed.), The twentieth yearbook of the National Council on Measurement in Education (pp. 1-6). East Lansing, MI: National Council on Measurement in Education.

Willingham, W. W. (1963b). The application blank as a predictive instrument (RM 63-10). Atlanta: Georgia Institute of Technology.

Willingham, W. W. (1963c). The effect of grading variations on the efficiency of predicting freshman grades (RM 63-1). Atlanta: Georgia Institute of Technology.

Willingham, W. W. (1963d). Variation among the grade scales of different high schools (RM 63-3). Atlanta: Georgia Institute of Technology.

Willingham, W. W. (1965). The application blank as a predictive instrument. College and University, Spring, 271-281.

Willingham, W. W. (1974). Predicting success in graduate education. Science, 183, 273-278.

Willingham, W. W. (1985). Success in college: The role of personal qualities and academic ability. New York: College Entrance Examination Board.

Willingham, W. W. (1999). A systemic view of test fairness. In S. Messick (Ed.), Assessment in higher education (pp. 213-242). Mahwah, NJ: Lawrence Erlbaum Associates.

Willingham, W. W., & Breland, H. M. (1982). Personal qualities and college admissions. New York: College Board.


Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum Associates.

Willingham, W. W., & Lewis, C. (1990). Institutional differences in prediction trends. In W. W. Willingham, C. Lewis, R. Morgan, & L. Ramist, Predicting college grades: An analysis of institutional trends over two decades (pp. 141-160). Princeton, NJ: Educational Testing Service.

Wing, C. W., & Wallach, M. A. (1971). College admissions and the psychology of talent. New York: Holt, Rinehart, & Winston.

Young, J. W. (1990). Adjusting the cumulative GPA using item response theory. Journal of Educational Measurement, 27(2), 175-186.

Young, J. W. (1991). Gender bias in predicting college academic performance: A new approach using item response theory. Journal of Educational Measurement, 28(1), 37-47.


Author Note

Appreciation is expressed to Henry Braun, Brent Bridgeman, Nancy Burton, Robert Linn, and Lawrence Stricker for reviewing drafts of this report, and to Linda Johnson for graphical and editorial assistance. The National Center for Education Statistics provided the data for the analysis. The study was supported by Educational Testing Service.


Figure 1
A Framework of Possible Sources of Discrepancy Between Observed Grades and Test Scores

A. Content Differences Between Grades and Test Scores
   1. Domain of general knowledge and skill
      a. Subjects covered, such as science and history; broad divisions within subjects, such as physics or European history
      b. General cognitive skills, such as reasoning, writing, or performance
   2. Specific knowledge and skills as reflected in:
      a. Course-based content throughout the school district, state, or nation (especially relevant to an external test)
      b. Classroom-based content (especially relevant to a teacher's grade)
      c. Individualized content (especially relevant to personal interests and skills)
   3. Components other than subject knowledge and skills:
      a. Social objectives of education (e.g., leadership, citizenship)
      b. Academic and personal development (e.g., attendance, class participation, completing assignments, disruptive behavior, effort and coping skills)
      c. Assessment skills and debilities (pertinent to test-taking or class assignments; construct-relevant or irrelevant; general or specific to a particular assessment)

B. Individual Differences Related to Content Differences
   1. Early development and relevant learning acquired outside of school
   2. Student motivation reflected in academic behavior, attitudes, and circumstances
   3. Teacher judgment regarding the student's performance

C. Errors in Grades or Test Scores
   1. Systematic error—noncomparability
      a. Variation in grading standards (across schools, courses, teachers, and sections)
      b. Variation in test score scales (across forms; across time)
      c. Cheating (by students or schools, on class assignments or tests)

   2. Unsystematic measurement error—unreliability (in grades and in test scores)

D. Situational Differences
   1. Across contexts
   2. Over time

________________________________________________________________________


Figure 3
The Accumulating Effects of Five Factors That Help to Account for Observed Differences Between Grades and Test Scores

___________________________________________________________________________________
Status of Grade-Score Relationship                     Correlation, HSA vs. NELS,
                                                       Adjusted for 5 Factors
___________________________________________________________________________________
Transcript grade average is correlated with              [.62]
total test score

   Factor 1. Subjects Covered. Restrict grade average to 4 NAEP "new
   basic" academic areas; optimally weight the 4 corresponding test scores.

Grades and scores are based on corresponding              [.68]
subject matter

   Factor 2. Grading Variations. Subtract school means; correct for range
   restriction. Adjust HSA for the mean grading difficulty of each
   student's courses.

School and course grading variations are removed          [.76]

   Factor 3. Reliability. Correct for the unreliability of HSA and NELS
   test scores.

Measurement errors in scores and grades are taken         [.81]
into account

   Factor 4. Student Characteristics. Add 26 student variables to the
   multiple correlation between test scores and HSA.

Differential effects of student effort on grades and      [.86]
test scores are taken into account

   Factor 5. Teacher Ratings. Add 5 teacher judgments concerning the
   school behavior of individual students.

Other factors noted by teachers that may influence        [.90]
grades are taken into account
___________________________________________________________________________________
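The Factor 3 step above is the classical correction for attenuation, which estimates what the grade-score correlation would be if neither measure contained random error. A minimal sketch in Python: the HSA reliability of .97 is taken from Table 6, but the test reliability of .93 is an assumed value used only for illustration.

```python
import math

def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correction for attenuation: the correlation two measures would
    show if both were perfectly reliable, r / sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)

r_after_factor2 = 0.76  # grade-score correlation after Factors 1-2 (Figure 3)
rel_hsa = 0.97          # HSA reliability for the total sample (Table 6)
rel_test = 0.93         # assumed NELS test reliability, illustration only

print(round(disattenuate(r_after_factor2, rel_hsa, rel_test), 2))  # ~0.80
```

The report's own figure of .81 reflects the full correction applied to the study data; the sketch shows only the form of the adjustment.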


Figure 6
A Schema for Considering Possible Strengths of Grades and Test Scores in High-Stakes Decisions

Differential characteristics of grades (as performance on the local contract) and test scores (as performance on the external standard):

Grading Variations (Factor 2)
   Pertinent assessment objective: To assess fairly
   Grades: Local standards
   Test scores: Common yardstick

Scholastic Engagement (Factor 4)
   Pertinent assessment objective: To evaluate critical skills
   Grades: Conative skills
   Test scores: Cognitive skills

Teacher Judgment (Factor 5)
   Pertinent assessment objective: To motivate teaching and learning
   Grades: Assessment based on each student's learning and behavior
   Test scores: Assessment based on designated knowledge and skills


Table 1
Percentage of Faculty Reporting that Various Factors Affect Grades as a Matter of Policy or as Expected Practice*

Grading Factor                                                  Official    Expected
                                                                Policy      Practice
Late work must be graded lower                                      7          47
Attendance is included in the course grade                         11          31
Grades reflect progress toward goals of individual students         9          30
Effort is included in the course grade                              3          30
Attitude and/or behavior is included in the course grade            6          26
Students can raise grades through an extra credit project           4          15

*Reproduced with permission. Copyright 1994 by The College Entrance Examination Board. All rights reserved.


Table 2
Sample Sizes for Gender by Ethnicity and School Program*

                            Male     Female     Total
Ethnicity
  African American           268       323        591
  Asian American             251       273        524
  Hispanic                   405       399        804
  White                     3201      3270       6471
  Subtotal                  4125      4265       8390

School Program
  Rigorous Academic          907       985       1892
  Academic                  2235      2415       4650
  Academic-Vocational        345       282        627
  Vocational                 235       167        402
  Subtotal                  3722      3849       7571

Total                       4154      4300       8454

*Ethnicity was not available for 64 students (about 1% of the sample). Program counts do not include 883 students whose programs were not classified by NCES or a very small group characterized as Rigorous Academic/Vocational.


Table 3
Student Characteristics and Teacher Ratings—Illustrative Content and Missing Data*

Variable                                                                     % Data Missing
A. School Skills
 1. Attendance (6: not absent/tardy—from student and school record)               0
 2. Class participation (8: come prepared, pay attention, take notes)             0.1
 3. Discipline problems (6: trouble with school rules, suspension, fights)        0.1
 4. Work completed (5: turn in work on time, more than required)                  0.1
 5. Homework hours (2: hours per week—at home and school)                         0.6
B. Initiative
 6. Courses completed (number, irrespective of grade earned)                      0
 7. Adv. electives (5 pts: any AP course, 12th grade Math though not required)    0
 8. School activities (12 pts: with added points for awards/offices)              0
 9. School sports (5 pts: with added points for achievement/leadership)           0
10. Outside activities (6: frequency or award)                                    0
C. Competing Activities
11. Drugs/gangs (4: involvement with)                                             5.1
12. Killing time (4: TV, video games, talking, riding around)                     0.5
13. Peer sociability (3: friends like parties, popularity, going steady)          4.6
14. Employment (20+ hours per week—yes/no)                                        2.8
15. Child care (20+ hours per week—yes/no)                                        0.4
16. Leisure reading (hours/week, not school related)                              0
D. Family Background
17. SES (NELS composite: parents' education, occupation, and income)              0.1
18. Family intact (living with 2 parents/stepparents—from 1990 survey)            0.9
19. Parent relations (7: discuss school with parents, OK with parents)            0.1
20. Parent aspiration (4: want child to continue education)                       0.9
21. Stress at home (10: parent lost job, died; family member on drugs)            0.7
E. Attitudes
22. Teacher relations (3: thinks they do a good job, solicits their help)         0.1
23. Educational plans (6: plans college, plans career requiring college)          0
24. Self esteem (6: pride, optimism)                                              2.6
25. Locus of control (5: internal control—planning and effort pay off)            2.9
26. Peer studiousness (6: friends like school, studying, good grades)             4.3
F. Teacher Ratings
27. Attendance (2: seldom absent or tardy)                                        9.9
28. Class behavior (2: usually attentive, seldom disruptive)                      9.9
29. Consults teacher (talks with teacher outside class)                          10.1
30. Educational motivation (3: works hard, going to college, will not drop out)   9.9
31. Work completed (usually completes homework assignments)                      10.1

*Numbers in parentheses indicate that a variable is based on a mean or point count (pts.) for more than one response, rating, or other item of information; the following phrases illustrate the types of content.


Table 4
Regression of Total Transcript Average (HSA-T) and Academic Average (HSA) on NELS Tests*

                        HSA(T)              HSA
NELS Test             r       β           r       β
Reading              .55     .17         .56     .14
Mathematics          .64     .52         .66     .54
Science              .51    −.07         .53    −.07
Social Studies       .51     .07         .54     .10
Multiple R                   .65                 .68

*N = 8454
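The entries in Table 4 are standard multiple-regression statistics: zero-order correlations (r), standardized beta weights from regressing each grade average on all four tests together, and the resulting multiple R. A minimal sketch of that computation on randomly generated stand-in data (the study itself used the 8454 NELS students); all variable names here are hypothetical.

```python
import numpy as np

def standardized_regression(X, y):
    """Regress y on the columns of X after z-scoring everything,
    returning standardized beta weights and the multiple R."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    R = float(np.corrcoef(Xz @ beta, yz)[0, 1])
    return beta, R

rng = np.random.default_rng(0)
tests = rng.normal(size=(1000, 4))  # stand-ins for reading, math, science, social studies
hsa = tests @ np.array([0.15, 0.55, 0.0, 0.10]) + rng.normal(scale=0.8, size=1000)
beta, R = standardized_regression(tests, hsa)
print(np.round(beta, 2), round(R, 2))
```

In the real data the four tests are highly intercorrelated, which is why science can carry a small negative beta weight in Table 4 despite a positive zero-order correlation with grades.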


Table 5
Analysis of Course Grading Variations by Schools, Courses, and the Interaction of Schools and Courses*

Source of Variation                  Sum of Squares     Proportion of Total
Schools                                   8,115.3               .41
Courses, controlling schools              2,751.4               .14
Additive model                           10,866.8               .54
Course-school interaction                 9,118.4               .46
Total between cells                      19,985.2              1.00

*Based on 574 schools and 225 courses in four basic academic areas. Adjusted for overfitting; see Note 8.
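Table 5 partitions the between-cell variation hierarchically: schools first, then courses given schools (an additive model), with the remainder attributed to the school-course interaction. A rough sketch of that logic, assuming a hypothetical data frame of school-course cell means with columns named "school", "course", and "grade"; the report's adjustment for overfitting (Note 8) is omitted.

```python
import numpy as np
import pandas as pd

def grading_ss_decomposition(cells: pd.DataFrame) -> dict:
    """Hierarchical sum-of-squares decomposition over school-course cells."""
    y = cells["grade"].to_numpy(float)
    total_ss = float(((y - y.mean()) ** 2).sum())

    # Schools alone: fitted values are the school means
    fit_schools = cells.groupby("school")["grade"].transform("mean").to_numpy()
    ss_schools = float(((fit_schools - y.mean()) ** 2).sum())

    # Additive school + course model, fit by dummy-variable regression
    dummies = pd.get_dummies(cells[["school", "course"]].astype(str), drop_first=True)
    X = np.column_stack([np.ones(len(y)), dummies.to_numpy(float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_additive = float(((X @ beta - y.mean()) ** 2).sum())

    return {
        "schools": ss_schools / total_ss,
        "courses_given_schools": (ss_additive - ss_schools) / total_ss,
        "additive": ss_additive / total_ss,
        "interaction": (total_ss - ss_additive) / total_ss,  # one mean per cell
    }
```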


Table 6
Reliabilities of Grade Averages and the Credit Hours upon Which They Were Based

Reliabilities
                         Subject Area Averages               4-Year Averages
                      English   Math   Science   Social    HSA    HSA*   HSA-T*
                                                 Studies
Male                    .91      .87     .87       .91      .97    .97     .97
Female                  .90      .86     .87       .90      .96    .97     .97
African American        .84      .76     .81       .83      .94    .94     .96
Asian American          .91      .91     .85       .91      .97    .97     .96
Hispanic                .89      .82     .84       .89      .95    .96     .96
White                   .92      .87     .87       .91      .97    .97     .97
Rigorous Academic       .91      .88     .87       .89      .96    .97     .97
Academic                .91      .87     .87       .90      .96    .97     .97
Academic/Vocational     .87      .82     .79       .86      .94    .95     .95
Vocational              .80      .63     .69       .79      .89    .89     .92
Total                   .91      .87     .87       .91      .97    .97     .97

Mean Credit Hours#
                      English   Math   Science   Social    HSA     % of   Total
                                                 Studies   Hours   Total  Hours
Male                    4.2      3.6     3.2       3.6      14.7    67     21.9
Female                  4.2      3.5     3.2       3.7      14.5    65     22.4
African American        4.4      3.6     3.1       3.7      14.7    68     21.7
Asian American          4.3      3.9     3.5       3.7      15.4    69     22.3
Hispanic                4.4      3.5     2.9       3.6      14.4    65     22.0
White                   4.2      3.5     3.2       3.6      14.5    65     22.2
Rigorous Academic       4.3      3.9     3.8       3.8      15.8    69     23.0
Academic                4.3      3.6     3.3       3.7      14.9    67     22.1
Academic/Vocational     4.3      3.3     2.8       3.6      14.0    62     22.7
Vocational              4.0      2.5     2.2       3.0      11.6    54     21.3
Total                   4.2      3.5     3.2       3.6      14.6    66     22.1

*Corrected for range restriction. #Carnegie Units completed, irrespective of grades earned.
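The pattern in Table 6, where 4-year averages are more reliable than single-subject averages and groups completing fewer credit hours show lower reliabilities, is what the Spearman-Brown formula predicts for averages over more components. A small illustration with made-up numbers:

```python
def spearman_brown(rel_one: float, k: float) -> float:
    """Spearman-Brown: reliability of an average of k parallel
    components, each with reliability rel_one."""
    return k * rel_one / (1.0 + (k - 1.0) * rel_one)

# If a single course grade had reliability .40, an average over 15 such
# grades would have reliability ~0.91 (illustrative values only).
print(round(spearman_brown(0.40, 15), 2))
```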


Table 7
Correlations of Student Characteristics and Teacher Ratings with NELS Test, High School Averages, and Grading Corrections*

                           NELS-T   HSA(T)   HSA     SGF    MCGF
A. School Skills
 1. Attendance               .19      .35     .33    −.03    .12
 2. Class participation      .09      .22     .21    −.01    .12
 3. Discipline problems     −.22     −.34    −.31     .04   −.10
 4. Work completed          −.04      .19     .18    −.03   −.00
 5. Homework hours           .22      .20     .21    −.02    .13
B. Initiative
 6. Courses completed        .28      .36     .32    −.05   −.10
 7. Advanced electives       .59      .54     .58     .03    .47
 8. School activities        .19      .30     .28    −.08   −.03
 9. School sports            .04      .04     .06    −.02    .14
10. Outside activities       .15      .14     .14    −.01    .10
C. Competing Activities
11. Drugs/gangs             −.07     −.23    −.20     .07   −.00
12. Killing time            −.10     −.16    −.16     .01   −.04
13. Peer sociability        −.09     −.10    −.09     .01    .00
14. Employment              −.12     −.13    −.13     .01   −.08
15. Child care              −.10     −.06    −.07     .00   −.07
16. Leisure reading          .20      .06     .06     .01    .00
D. Family Background
17. SES                      .48      .33     .35     .08    .31
18. Family intact            .11      .13     .13    −.02    .03
19. Parent relations         .07      .18     .17    −.02    .09
20. Parent aspiration        .34      .31     .33     .02    .32
21. Stress at home          −.07     −.12    −.12     .03   −.04
E. Attitudes
22. Teacher relations        .21      .23     .23    −.00    .13
23. Educational plans        .33      .34     .35     .02    .28
24. Self esteem              .28      .29     .29    −.00    .21
25. Locus of control         .24      .25     .25    −.01    .13
26. Peer studiousness        .25      .29     .29     .00    .16
F. Teacher Ratings
27. Attendance               .22      .38     .37    −.07    .16
28. Class behavior           .35      .51     .51    −.04    .20
29. Consults teacher         .11      .15     .15    −.04    .05
30. Educ. motivation         .45      .62     .63    −.05    .33
31. Work completed           .33      .61     .61    −.11    .19

*N = 8454.


Table 8
The Relationship of Student Characteristics and Teacher Ratings to High School Average (HSA)—with Progressive Controls Applied

                                                        Beta weights (reliability
                          Correlation   R, plus         correction added):
                          with          grading     Plus test     26 student    and 5 teacher
Variable                  HSA           control     & grading     variables     ratings
                                                    control
A. School Skills
 1. Attendance               .33          .39          .31#          .11*          .07*
 2. Class participation      .21          .23          .23#          .01           .00
 3. Discipline problems     −.31         −.35         −.23#         −.03*          .00
 4. Work completed           .18          .18          .30#          .11*          .09*
 5. Homework hours           .21          .21          .06          −.04*         −.04*
B. Initiative
 6. Courses completed        .32          .45          .24#          .07*          .04*
 7. Advanced electives       .58          .65          .34#          .16*          .12*
 8. School activities        .28          .32          .18#          .03*          .03*
 9. School sports            .06          .06          .03          −.00          −.01
10. Outside activities       .14          .18          .06          −.01          −.01
C. Competing Activities
11. Drugs/gangs             −.20         −.24         −.21#         −.02          −.00
12. Killing time            −.16         −.16         −.14#         −.03*         −.03*
13. Peer sociability        −.09         −.11         −.05          −.02          −.01
14. Employment              −.13         −.13         −.08          −.01          −.01
15. Child care              −.07         −.08         −.01          −.00           .00
16. Leisure reading          .06          .07         −.08          −.06*         −.04*
D. Family Background
17. SES                      .35          .43          .11           .03*          .02
18. Family intact            .13          .15          .08           .02           .01
19. Parent relations         .17          .20          .17           .02           .01
20. Parent aspiration        .33          .39          .15          −.00          −.01
21. Stress at home          −.12         −.13         −.10          −.00           .00
E. Student Attitudes
22. Teacher relations        .23          .19          .12           .01           .01
23. Educational plans        .35          .41          .21           .04*          .03*
24. Self esteem              .29          .35          .17           .00           .00
25. Locus of control         .25          .27          .13           .02           .02
26. Peer studiousness        .29          .30          .18          −.01          −.01
F. Teacher Ratings
27. Attendance               .37          .43          .31                         .01
28. Class behavior           .51          .55          .41                         .04*
29. Consults teacher         .15          .18          .10                        −.01
30. Educ. motivation         .63          .68          .50                         .12*
31. Work completed           .61          .65          .56                         .20*

*β ≥ .03. #Variables used to define Scholastic Engagement. N = 8454.


Table 9
Partial Correlations Among Behavioral Measures That Suggest Scholastic Engagement—By Subgroup and School Program*

Partial correlations with HSA:

Student                  Rigor.           Acad./                          Afr.    Asian
Behavior                 Acad.   Acad.    Voc.    Voc.    Male   Female   Amer.   Amer.   Hispanic   White
Advanced electives        .20     .30      .20    −.06     .30     .27     .28     .27      .30       .29
Work completed            .32     .30      .30     .21     .28     .27     .28     .32      .26       .30
Attendance                .25     .27      .25     .18     .29     .28     .19     .37      .31       .27
Class participation       .22     .25      .18     .03     .21     .21     .13     .27      .21       .24
Discipline problems (−)  −.20    −.20     −.17    −.13    −.19    −.15    −.20     .01     −.15      −.21
Drugs/gangs (−)          −.22    −.19     −.25    −.21    −.16    −.20    −.12    −.18     −.11      −.21
Killing time (−)         −.12    −.16     −.13    −.08    −.11    −.12    −.15    −.02     −.10      −.14
School activities         .12     .13      .11    −.06     .12     .06     .12     .11      .12       .12
Courses completed        −.04     .07      .02     .02     .07     .11     .14     .12      .13       .12

*Student behaviors are listed by size of the partial r in the total sample. Test scores and grading variations were controlled in computing partial correlations. All coefficients were corrected for range restriction. HSA and test scores were corrected for unreliability. (−) indicates disengaged behavior.


Table 10
Patterns of Scholastic Engagement by Subgroup and School Program

Mean standard score*:

Student                  Rigor.           Acad./                          Afr.    Asian
Behavior                 Acad.   Acad.    Voc.    Voc.    Male   Female   Amer.   Amer.   Hispanic   White
Advanced electives       55.7    50.9     44.0    39.4    49.6    50.4    46.7    55.1     46.9      50.3
Work completed           50.8    49.9     50.1    49.1    48.4    51.6    51.5    50.3     50.0      49.8
Attendance               52.2    50.3     49.1    46.8    49.9    50.1    51.2    50.4     47.2      50.2
Class participation      51.2    50.5     49.2    45.5    48.2    51.7    52.1    49.6     49.8      49.9
Discipline problems (−)  30%     36%      44%     56%     51%     26%     45%     31%      41%       37%
Drugs/gangs (−)          48.9    50.1     49.6    52.1    51.6    48.5    46.2    47.3     50.4      50.5
Killing time (−)         49.2    49.9     51.2    50.8    51.0    49.0    50.5    48.6     49.5      50.1
School activities        51.1    50.5     48.5    46.5    48.1    51.8    49.9    51.5     48.5      50.1
Courses completed        54.4    49.8     51.6    45.2    48.8    51.2    46.8    51.2     48.2      50.4

*Standard scales with mean = 50 and SD = 10 were based on the total sample. (−) indicates disengaged behavior. Discipline is reported here as the percentage reporting any infraction.
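The standard scores in Table 10 are ordinary linear rescalings of each behavior measure to mean 50 and standard deviation 10 in the total sample, for example:

```python
import numpy as np

def to_standard_scale(x, mean=50.0, sd=10.0):
    """Rescale a score distribution to a given mean and SD
    (Table 10 uses mean 50, SD 10, based on the total sample)."""
    x = np.asarray(x, dtype=float)
    return mean + sd * (x - x.mean()) / x.std()
```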


Table 11
Regression of Scholastic Engagement on Family and Attitude Measures—By Gender

                                 Males              Females
                               r     Beta         r     Beta
Family Background
17. SES                       .24    .05         .23    .05
18. Family intact             .10    .03         .10    .04
19. Parent relations          .28    .09         .29    .09
20. Parent aspiration         .31    .06         .28    .02
21. Stress at home           −.22   −.10        −.17   −.06
Attitudes
22. Teacher relations         .33    .16         .30    .14
23. Educational plans         .40    .19         .38    .22
24. Self esteem               .35    .07         .35    .05
25. Locus of control          .27    .04         .30    .10
26. Peer studiousness         .44    .25         .40    .23

Multiple R                           .60                .56


Table 12
Actual and Predicted High School Average for Four Subgroups and Four School Programs—By Amount of Predictive Information

Predictive Information*
                          1. NELS Test   2. Plus grade    3. Plus 26     4. Plus teacher
                                         variations       student        judgments
Group                                    controlled       variables
Women
  Actual HSA                  2.59          2.58            2.58            2.59
  Predicted HSA               2.47          2.47            2.52            2.55
  (diff. pred.)              (+.12)        (+.11)          (+.06)          (+.03)
African-American
  Actual HSA                  2.01          2.28            2.28            2.28
  Predicted HSA               2.11          2.29            2.31            2.30
  (diff. pred.)              (−.10)        (−.01)          (−.03)          (−.02)
Asian-American
  Actual HSA                  2.82          2.76            2.76            2.76
  Predicted HSA               2.69          2.62            2.70            2.72
  (diff. pred.)              (+.13)        (+.14)          (+.06)          (+.04)
Hispanic
  Actual HSA                  2.18          2.33            2.33            2.34
  Predicted HSA               2.23          2.36            2.37            2.37
  (diff. pred.)              (−.05)        (−.03)          (−.03)          (−.02)

School Program
Rigorous Academic
  Actual HSA                  2.87          2.84            2.84            2.84
  Predicted HSA               2.75          2.71            2.78            2.80
  (diff. pred.)              (+.12)        (+.13)          (+.06)          (+.04)
Academic
  Actual HSA                  2.53          2.50            2.50            2.51
  Predicted HSA               2.53          2.51            2.51            2.52
  (diff. pred.)               (.00)        (−.00)          (−.01)          (−.01)
Academic-Vocational
  Actual HSA                  2.22          2.29            2.29            2.30
  Predicted HSA               2.20          2.29            2.25            2.28
  (diff. pred.)              (+.03)        (−.00)          (+.03)          (+.02)
Vocational
  Actual HSA                  1.84          1.97            1.97            1.97
  Predicted HSA               1.98          2.10            1.99            1.96
  (diff. pred.)              (−.14)        (−.13)          (−.02)          (+.01)

*In each column, predictions also take into account the predictors used in previous columns. In columns 2-4, school means were subtracted from all measures. Differences between actual and predicted HSA (diff. pred.) reflect rounding. Total N = 7571 in each of columns 1-3; 6853 in column 4.
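Each "(diff. pred.)" entry is differential prediction: the group's mean actual HSA minus its mean HSA predicted from a regression fit to the whole sample, so positive values indicate underprediction of the group's grades. A minimal sketch with hypothetical array names and toy values:

```python
import numpy as np

def differential_prediction(actual, predicted, groups, group):
    """Mean actual minus mean predicted criterion score for one group."""
    mask = groups == group
    return actual[mask].mean() - predicted[mask].mean()

# Toy illustration (hypothetical values, not the study data)
actual = np.array([2.6, 2.5, 2.3, 2.4])
predicted = np.array([2.5, 2.4, 2.4, 2.5])
groups = np.array(["F", "F", "M", "M"])
print(round(differential_prediction(actual, predicted, groups, "F"), 2))  # +0.10
```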


Table 13
Actual and Predicted High School Average for Males and Females Within Ethnic Groups—By Amount of Predictive Information

Predictive Information*
                          1. NELS Test   2. Plus grade    3. Plus 26     4. Plus teacher
                                         variations       student        judgments
Group                                    controlled       variables
White Male
  Actual HSA                  2.43          2.39            2.39            2.40
  Predicted HSA               2.54          2.51            2.45            2.43
  (diff. pred.)              (−.12)        (−.12)          (−.06)          (−.03)
White Female
  Actual HSA                  2.65          2.60            2.60            2.60
  Predicted HSA               2.52          2.49            2.54            2.57
  (diff. pred.)              (+.13)        (+.11)          (+.06)          (+.03)
African-American Male
  Actual HSA                  1.85          2.11            2.11            2.11
  Predicted HSA               2.09          2.27            2.22            2.17
  (diff. pred.)              (−.23)        (−.16)          (−.11)          (−.06)
African-American Female
  Actual HSA                  2.14          2.42            2.42            2.42
  Predicted HSA               2.13          2.31            2.38            2.40
  (diff. pred.)              (+.02)        (+.11)          (+.04)          (+.02)
Asian-American Male
  Actual HSA                  2.71          2.66            2.66            2.67
  Predicted HSA               2.67          2.61            2.65            2.65
  (diff. pred.)              (+.03)        (+.05)          (+.02)          (+.02)
Asian-American Female
  Actual HSA                  2.92          2.85            2.85            2.84
  Predicted HSA               2.71          2.64            2.76            2.79
  (diff. pred.)              (+.21)        (+.21)          (+.09)          (+.05)
Hispanic Male
  Actual HSA                  2.13          2.27            2.27            2.28
  Predicted HSA               2.27          2.39            2.35            2.33
  (diff. pred.)              (−.14)        (−.13)          (−.09)          (−.05)
Hispanic Female
  Actual HSA                  2.23          2.40            2.40            2.41
  Predicted HSA               2.18          2.33            2.38            2.41
  (diff. pred.)              (+.04)        (+.07)          (+.02)          (+.01)

*In each column, predictions also take into account the predictors used in previous columns. In columns 2-4, school means were subtracted from all measures. Differences between actual and predicted HSA (diff. pred.) reflect rounding. Total N = 8390 in each of columns 1-3; 7565 in column 4.


Table 14
NELS Test Means and Correlations with High School Average for Six Groups*

                       Afr.    Asian                                    Rigor.           Acad./
                       Amer.   Amer.   Hispanic  White   Male  Female   Acad.    Acad.   Voc.    Voc.   Total
Correlation with HSA:
  4 tests (Mult. R)     .76     .84      .72      .80     .78    .82     .80      .79     .73     .56    .79
  NELS composite        .76     .78      .72      .79     .77    .81     .79      .78     .70     .52    .78
  NELS Reading          .69     .62      .66      .69     .64    .71     .70      .68     .61     .50    .68
  NELS Math             .75     .78      .71      .78     .77    .80     .78      .77     .68     .50    .77
  NELS Science          .65     .70      .64      .66     .67    .72     .70      .67     .53     .37    .66
  NELS Social Studies   .63     .69      .62      .67     .65    .71     .69      .67     .54     .41    .66

Means
  NELS Reading         43.7    52.0     45.9     51.0    48.9   51.1    54.0     50.8    45.2    42.3   50.0
  NELS Math            42.6    54.3     45.0     51.0    50.6   49.4    55.1     50.9    44.7    40.3   50.0
  NELS Science         41.8    52.0     44.7     51.3    51.6   48.4    53.9     50.7    45.9    43.0   50.0
  NELS Social Studies  44.2    52.8     45.9     50.9    50.9   49.1    54.0     50.9    45.4    42.6   50.0

*Correlations were corrected for range restriction, grading variations, and unreliability of tests and grades. The grade criterion was HSA+2G. In this table test scores were scaled to a mean of 50 and a standard deviation of 10 based on the total study sample.
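Of the corrections footnoted here, the range restriction adjustment is the standard Thorndike Case 2 formula for direct restriction on the predictor; a sketch with illustrative values only:

```python
import math

def correct_range_restriction(r: float, sd_unrestricted: float, sd_restricted: float) -> float:
    """Thorndike Case 2: estimate the unrestricted correlation from a
    correlation observed in a range-restricted sample."""
    u = sd_unrestricted / sd_restricted
    return (r * u) / math.sqrt(1.0 + r * r * (u * u - 1.0))

print(round(correct_range_restriction(0.60, 10.0, 8.0), 2))  # ~0.68
```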


Table 15
A Comparison of Within-School and Across-School Regression of Grade Average on All 37 Variables

                          Beta Weights          Beta Weights
                          Within Schools        Across Schools
                          (HSA + K)           (HSA)      (HSA + 2G)
A. School Skills
 1. Attendance               .07               .06          .06
 2. Class participation      .00              −.01         −.00
 3. Discipline problems      .00              −.01         −.00
 4. Work completed           .09               .09          .09
 5. Homework hours          −.04              −.03         −.03
B. Initiative
 6. Courses completed        .04               .02         −.01
 7. Advanced electives       .12               .12          .14
 8. School activities        .03               .04          .02
 9. School sports           −.01              −.01         −.00
10. Outside activities      −.01              −.01         −.01
C. Competing Activities
11. Drugs/gangs             −.00              −.02         −.01
12. Killing time            −.03              −.03         −.03
13. Peer sociability        −.01              −.01         −.01
14. Employment              −.01              −.01         −.01
15. Child care               .00              −.00         −.00
16. Leisure reading         −.04              −.04         −.04
D. Family Background
17. SES                      .02               .00          .01
18. Family intact            .01               .02          .01
19. Parent relations         .01               .02          .02
20. Parent aspiration       −.01              −.01          .00
21. Stress at home           .00               .00          .01
E. Attitudes
22. Teacher relations        .01              −.01         −.01
23. Educational plans        .03               .02          .03
24. Self esteem              .00               .00          .01
25. Locus of control         .02               .02          .02
26. Peer studiousness       −.01              −.02         −.02
F. Teacher Ratings
27. Attendance               .01               .02          .02
28. Class behavior           .04               .03          .04
29. Consults teacher        −.01              −.01         −.01
30. Educ. motivation         .12               .11          .12
31. Work completed           .20               .20          .18
G. Grading Factors
32. SGF                      —                −.30          —
33. MCGF                     —                −.02          —
H. NELS Test
34. Reading                  .04               .09          .08
35. Mathematics              .51               .43          .48
36. Science                 −.30              −.16         −.19
37. Social Studies           .27               .16          .18

Multiple R                   .90               .88          .88


Table 16
A Condensed Analysis: Intercorrelations and Beta Weights for HSA Regressed on Four Major Factors*

                      NELS-C     SGF     Engage     TRC
School Grading          .09
Engagement              .41     −.05
Teacher Rating          .49     −.08      .52
HSA                     .71     −.29      .57       .69

Multiple R = .88. Beta weights: NELS-C = .50, SGF = −.30, Engagement = .18, Teacher Rating = .33.

*NELS-C and HSA were corrected for unreliability.
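The beta weights and multiple R in Table 16 follow directly from the correlations shown, via the normal equations (beta = Rxx^-1 * rxy, and R^2 = rxy . beta). A quick check in Python using the table's own values:

```python
import numpy as np

# Predictor intercorrelations (order: NELS-C, SGF, Engagement, Teacher rating)
Rxx = np.array([
    [1.00,  .09,  .41,  .49],
    [ .09, 1.00, -.05, -.08],
    [ .41, -.05, 1.00,  .52],
    [ .49, -.08,  .52, 1.00],
])
rxy = np.array([.71, -.29, .57, .69])   # correlations with HSA

beta = np.linalg.solve(Rxx, rxy)        # standardized regression weights
R = float(np.sqrt(rxy @ beta))          # multiple correlation
print(np.round(beta, 2), round(R, 2))   # approx [ .50 -.30  .18  .33] and 0.88
```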


Table 17
Condensed Analyses of Major Factors Related to Differences Between Grades and Test Scores—By Gender and Ethnic Groups*

                                                 African    Asian
                              Male     Female    American   American   Hispanic    White
                             (4154)    (4300)     (591)      (524)      (804)      (6471)
Correlations
With HSA:
  NELS Test                    .70       .74       .68        .72        .62         .71
  Engagement                   .55       .57       .52        .61        .51         .59
  Teacher rating               .69       .68       .64        .72        .60         .71
  School grading              −.28      −.31      −.28       −.14       −.28        −.28
With NELS Test:
  Engagement                   .41       .45       .35        .41        .34         .44
  Teacher rating               .50       .51       .44        .53        .41         .51
  School grading               .11       .06       .08        .18        .16         .11
With Engagement:
  Teacher rating               .50       .51       .49        .57        .43         .54
  School grading              −.06      −.04      −.04        .04       −.03        −.06
With Teacher rating:
  School grading              −.08      −.09      −.01       −.01       −.03        −.09

Beta weights in predicting HSA:
  NELS Test                    .49       .53       .50        .50        .48         .49
  Engagement                   .16       .17       .16        .24        .20         .17
  Teacher rating               .34       .30       .34        .31        .31         .34
  School grading              −.30      −.31      −.31       −.23       −.34        −.30

Multiple R based on:
  4 condensed variables        .86       .89       .85        .87        .83         .88
  37 original variables        .88       .89       .87        .88        .84         .89

*Corrected for range restriction in all variables and unreliability in HSA and NELS Test. (Total N = 8454 except for teacher rating, where N = 7619; subgroup Ns in parentheses.)


Table 18
Condensed Analyses of Major Factors Related to Differences Between Grades and Test Scores—By School Program*

                              Rigorous               Academic-
                              Academic    Academic   Vocational   Vocational
                               (1892)      (4650)      (627)        (402)
Correlations
With HSA:
  NELS Test                     .71         .70         .60          .40
  Engagement                    .48         .54         .42          .26
  Teacher rating                .64         .67         .62          .58
  School grading               −.30        −.30        −.35         −.31
With NELS Test:
  Engagement                    .34         .38         .15         −.03
  Teacher rating                .41         .46         .33          .21
  School grading                .10         .08         .10          .14
With Engagement:
  Teacher rating                .41         .50         .40          .32
  School grading               −.04        −.05        −.13         −.05
With Teacher rating:
  School grading               −.09        −.08        −.14         −.09

Beta weights in predicting HSA:
  NELS Test                     .57         .52         .50          .35
  Engagement                    .15         .18         .16          .11
  Teacher rating                .31         .31         .34          .44
  School grading               −.32        −.31        −.33         −.32

Multiple R based on:
  4 condensed variables         .88         .87         .83          .73
  37 original variables         .89         .88         .86          .77

*Corrected for range restriction in all variables and unreliability in HSA and NELS Test. (Total N = 8454 except for teacher rating, where N = 7619; subgroup Ns in parentheses.)


Table 19
Regression of HSA on All Variables—By Gender and Ethnic Groups*

Beta weights for:
                                               African    Asian
                          Males    Females    American    American   Hispanic    White
A. School Skills
 1. Attendance              .07      .05        .01         .13         .09        .05
 2. Class participation    −.03     −.00       −.05        −.04        −.01       −.01
 3. Discipline problems     .01     −.01       −.02         .01        −.04       −.00
 4. Work completed          .09      .09        .15         .12         .12        .08
 5. Homework hours         −.04     −.03       −.06         .01        −.05       −.03
B. Initiative
 6. Courses completed       .02      .01        .07        −.02         .08        .01
 7. Advanced electives      .13      .10        .08         .14         .14        .11
 8. School activities       .06      .01        .08         .04         .07        .03
 9. School sports          −.01      .01       −.03        −.02         .02       −.01
10. Outside activities      .00     −.02       −.01        −.01        −.05       −.01
C. Competing Activities
11. Drugs/gangs            −.00     −.03        .02        −.06         .05       −.02
12. Killing time           −.02     −.02       −.02         .02         .02       −.03
13. Peer sociability       −.01      .00        .03        −.01         .01       −.01
14. Employment             −.01     −.01       −.03         .01        −.06       −.01
15. Child care              .00     −.01        .01        −.01        −.02        .00
16. Leisure reading        −.04     −.04       −.09        −.04        −.06       −.04
D. Family Background
17. SES                    −.01      .01       −.02        −.04        −.02        .00
18. Family intact           .02      .01       −.00         .02        −.01        .01
19. Parent relations        .02      .01        .00         .00         .02        .02
20. Parent aspiration      −.01     −.00        .03         .02         .01       −.01
21. Stress at home         −.00      .00       −.01         .03        −.01        .01
E. Attitudes
22. Teacher relations      −.01      .00       −.02         .02         .04       −.01
23. Educational plans       .02      .03        .03        −.00        −.02        .04
24. Self esteem             .03     −.01       −.00         .03        −.01        .01
25. Locus of control       −.01      .04       −.01        −.00         .02        .02
26. Peer studiousness      −.02     −.03       −.04        −.02        −.04       −.02
F. Teacher Ratings
27. Attendance              .00      .04       −.00        −.00        −.01        .02
28. Class behavior          .02      .03        .03         .01         .06        .03
29. Consults teacher       −.01     −.01        .01        −.01        −.01       −.01
30. Educ. motivation        .10      .12        .08         .10         .09        .13
31. Completes work          .23      .15        .27         .24         .19        .19
G. Grading Factors
32. SGF                    −.29     −.30       −.31        −.22        −.35       −.29
33. MCGF                    .03     −.04       −.01        −.04         .06       −.02
H. NELS Test
34. Reading                 .06      .03        .25       [.50]#        .12        .09
35. Mathematics             .40      .43        .37                     .25        .43
36. Science                −.11     −.08       −.00                    −.00       −.19
37. Social Studies          .15      .19       −.06                     .10        .18

Multiple R                  .88      .89        .87         .88         .84        .89

*Across-schools analysis corrected for unreliability, range restriction, and shrinkage.
#For the Asian American group, NELS-C (the test composite, with beta weight .50) was substituted for the four separate tests because the matrix was singular with the four tests entered separately.


Table 20
Regression of HSA on All Variables—By School Program*

Beta weights for:
                          Rigorous               Academic-
                          Academic   Academic    Vocational   Vocational    Total
A. School Skills
 1. Attendance              .05        .04          .08          .08          .06
 2. Class participation    −.03        .00         −.04         −.06         −.01
 3. Discipline problems    −.00       −.01          .04         −.00         −.01
 4. Work completed          .11        .08          .14          .15          .09
 5. Homework hours         −.01       −.03         −.03         −.15         −.03
B. Initiative
 6. Courses completed      −.03        .00         −.01          .05          .02
 7. Advanced electives      .07        .11          .13         −.01          .12
 8. School activities       .04        .04          .02         −.04          .04
 9. School sports          −.00       −.02          .03          .00         −.01
10. Outside activities     −.03       −.00         −.03          .00         −.01
C. Competing Activities
11. Drugs/gangs            −.03       −.01         −.01         −.05         −.02
12. Killing time           −.01       −.05         −.01         −.00         −.03
13. Peer sociability       −.03       −.00         −.01          .00         −.01
14. Employment             −.04       −.00         −.03         −.01         −.01
15. Child care             −.02       −.01          .00          .09         −.00
16. Leisure reading        −.06       −.03         −.03          .02         −.04
D. Family Background
17. SES                    −.02        .01         −.08         −.11          .00
18. Family intact          −.02        .02          .04          .05          .02
19. Parent relations       −.00        .03          .02          .02          .02
20. Parent aspiration       .02       −.01         −.04          .00         −.01
21. Stress at home          .00        .01         −.05          .03          .00
E. Attitudes
22. Teacher relations      −.00       −.01          .01          .02         −.01
23. Educational plans       .03        .03         −.02          .00          .02
24. Self esteem            −.01        .00          .03          .08          .00
25. Locus of control        .03        .01          .01         −.09          .02
26. Peer studiousness      −.01       −.02         −.01         −.14         −.02
F. Teacher Ratings
27. Attendance              .03        .02         −.06         −.02          .02
28. Class behavior          .06        .02          .11          .09          .03
29. Consults teacher       −.01        .00          .00         −.05         −.01
30. Educ. motivation        .07        .11          .13          .26          .11
31. Work completed          .19        .19          .18          .19          .20
G. Grading Factors
32. SGF                    −.32       −.31         −.32         −.31         −.30
33. MCGF                   −.03       −.04         −.01         −.01         −.02
H. NELS Test
34. Reading                 .07        .06          .25          .32          .09
35. Mathematics             .58        .47          .57          .21          .43
36. Science                −.37       −.15         −.40         −.03         −.16
37. Social Studies          .33        .15          .15         −.06          .16

Multiple R                  .89        .88          .86          .77          .88

*Across-school analysis corrected for unreliability, range restriction, and shrinkage.


Appendices

A. Descriptive Statistics: Tables A-1 to A-8

B. Student Variables: Acronyms and Specifications

C. Notes


Appendix Table A-1
Student Characteristics for Four School Programs*

Means
                          Rigorous               Academic-                 Total     Total
                          Academic   Academic    Vocational   Vocational   Mean      S.D.
A. School Skills
 1. Attendance              3.92       3.78         3.70         3.52       3.76       .73
 2. Class participation     3.84       3.80         3.72         3.50       3.76       .59
 3. Discipline problems      .08        .12          .15          .23        .12       .22
 4. Work completed          3.26       3.20         3.21         3.15       3.20       .65
 5. Homework hours          3.61       3.49         3.15         2.71       3.40      1.67
B. Initiative
 6. Courses completed      22.93      21.74        22.20        20.54      21.78      2.59
 7. Advanced electives      2.46       1.81          .88          .26       1.69      1.35
 8. School activities       2.74       2.57         2.08         1.58       2.46      2.49
 9. School sports           2.28       2.10         1.52         1.20       1.98      2.60
10. Outside activities       .73        .68          .59          .51        .67       .49
C. Competing Activities
11. Drugs/gangs              .95       1.00          .98         1.08       1.00       .42
12. Killing time            1.97       2.00         2.06         2.04       2.01       .41
13. Peer sociability        2.01       2.03         2.03         2.10       2.03       .52
14. Employment               .09        .12          .22          .32        .14       .35
15. Child care               .04        .05          .07          .07        .05       .22
16. Leisure reading         2.13       2.20         2.06         1.99       2.14      1.76
D. Family Background
17. SES                      .34        .22         −.24         −.42        .14       .78
18. Family intact            .87        .84          .82          .79        .83       .37
19. Parent relations        2.06       2.03         2.00         1.93       2.02       .41
20. Parent aspiration       7.94       7.42         6.21         5.35       7.21      2.12
21. Stress at home          1.34       1.39         1.36         1.44       1.39       .49
E. Attitudes
22. Teacher relations       3.26       3.22         3.14         3.05       3.20       .49
23. Educational plans       2.63       2.53         2.38         2.23       2.51       .35
24. Self esteem             3.82       3.71         3.55         3.43       3.69       .52
25. Locus of control        3.13       3.06         2.96         2.91       3.05       .48
26. Peer studiousness       2.56       2.49         2.31         2.18       2.45       .43
Scholastic Engagement      54.41      50.56        47.72        42.16      50.00     10.00

*Original score scales as described in text and Appendix B.


Appendix Table A-2
Student Characteristics for Gender and Ethnic Groups*

                                           African   Asian
                         Males  Females   American  American  Hispanic  White
A. School Skills
 1. Attendance            3.75     3.77      3.85      3.79      3.56    3.78
 2. Class participation   3.66     3.87      3.89      3.74      3.75    3.76
 3. Discipline problems    .18      .07       .16       .08       .13     .12
 4. Work completed        3.10     3.31      3.30      3.23      3.21    3.19
 5. Homework hours        3.30     3.50      3.12      3.79      3.29    3.41
B. Initiative
 6. Courses completed    21.47    22.08     20.95     22.10     21.32   21.89
 7. Advanced electives    1.64     1.73      1.24      2.38      1.27    1.73
 8. School activities     1.99     2.91      2.44      2.82      2.08    2.48
 9. School sports         2.62     1.36      1.98      1.72      1.85    2.02
10. Outside activities     .72      .61       .66       .68       .61     .67
C. Competing Activities
11. Drugs/gangs           1.06      .94       .84       .88      1.02    1.02
12. Killing time          2.05     1.97      2.03      1.95      1.99    2.01
13. Peer sociability      2.08     1.98      1.95      1.94      2.01    2.04
14. Employment             .16      .12       .10       .08       .14     .15
15. Child care             .02      .08       .12       .06       .08     .04
16. Leisure reading       2.06     2.22      1.86      2.14      1.99    2.19
D. Family Background
17. SES                    .16      .11      −.32       .33      −.39     .23
18. Family intact          .83      .83       .62       .89       .81     .85
19. Parent relations      1.99     2.05      1.95      1.98      2.05    2.02
20. Parent aspiration     7.11     7.30      7.04      7.78      7.25    7.17
21. Stress at home        1.37     1.41      1.52      1.34      1.45    1.38
E. Attitudes
22. Teacher relations     3.17     3.23      3.15      3.19      3.19    3.21
23. Educational plans     2.48     2.53      2.58      2.66      2.53    2.49
24. Self esteem           3.69     3.68      3.70      3.65      3.64    3.69
25. Locus of control      3.02     3.07      2.97      2.99      3.01    3.06
26. Peer studiousness     2.38     2.52      2.42      2.56      2.39    2.46
Scholastic Engagement    47.65    52.27     49.69     53.03     47.67   50.09

* Original score scales as described in text and Appendix B.


Appendix Table A-3
Grade Averages and Grade Corrections for Four School Programs*

                                        Means
                      Rigorous            Academic-              Total  Total
                      Academic  Academic  Vocational  Vocational  Mean   S.D.
Subject Grade Average
English                  2.93      2.62      2.26        1.88     2.56    .81
Mathematics              2.69      2.38      2.12        1.77     2.34    .85
Science                  2.81      2.47      2.21        1.83     2.43    .84
Social Studies           3.03      2.66      2.30        1.87     2.60    .85
School Grade Average
HSA                      2.87      2.53      2.22        1.84     2.48    .76
HSA (T)                  2.99      2.70      2.48        2.17     2.66    .70
HSA + K                  2.85      2.50      2.17        1.77     2.45    .78
HSA + 2G                 2.74      2.36      1.97        1.59     2.30    .76
Grading Corrections
K                        −.01      −.03      −.05        −.07     −.03    .05
SGF                       .02      −.00      −.03         .00      .00    .26
MCGF                     −.14      −.17      −.22        −.26     −.18    .08

* All entries are based on a 4.0 grade scale as described in the text.


Appendix Table A-4
Grade Averages and Grade Corrections for Gender and Ethnic Groups*

                                   African   Asian
                 Males  Females   American  American  Hispanic  White
Subject Grade Average
English           2.37     2.74      2.07      2.89      2.23    2.62
Mathematics       2.27     2.40      1.87      2.66      2.04    2.40
Science           2.36     2.50      1.97      2.78      2.14    2.48
Social Studies    2.50     2.70      2.14      2.93      2.30    2.66
School Grade Average
HSA               2.37     2.59      2.01      2.82      2.18    2.54
HSA (T)           2.54     2.79      2.19      2.95      2.38    2.72
HSA + K           2.34     2.55      1.97      2.79      2.13    2.51
HSA + 2G          2.21     2.39      1.90      2.68      2.00    2.35
Grading Corrections
K                 −.03     −.04      −.04      −.02      −.05    −.03
SGF                .01     −.01       .07       .01       .01    −.01
MCGF              −.17     −.19      −.18      −.15      −.19    −.18

* All entries are based on a 4.0 grade scale as described in the text.


Appendix Table A-5
NELS Test Scores and Teacher Ratings for Four School Programs*

                                          Means
                         Rigorous            Academic-              Total  Total
                         Academic  Academic  Vocational  Vocational  Mean   S.D.
NELS Tests
Reading                     56.2      53.3      47.9        45.2     52.5   9.48
Mathematics                 58.0      54.0      48.0        43.9     53.1   9.52
Science                     56.5      53.4      48.7        45.9     52.7   9.66
Social Studies              56.3      53.4      48.2        45.6     52.6   9.47
NELS-T (total)              56.8      53.5      48.2        45.1     52.7   8.52
NELS-C (weighted comp.)     3.30      3.09      2.75        2.53     3.03    .51
Teacher Ratings
Attendance                  4.80      4.68      4.63        4.48     4.68    .47
Class behavior              4.90      4.74      4.61        4.37     4.71    .61
Consults teacher             .40       .39       .34         .32      .38    .40
Educational motivation       .92       .82       .70         .54      .80    .27
Work completed              4.40      4.15      3.97        3.69     4.12    .78
TRC (weighted comp.)        2.88      2.70      2.54        2.29     2.67    .47

* Original score scales as described in text and Appendix B.


Appendix Table A-6
NELS Test Scores and Teacher Ratings for Gender and Ethnic Groups*

                                          African   Asian
                          Male   Female  American  American  Hispanic  White
NELS Tests
Reading                   51.4     53.5     46.5      54.3      48.6    53.4
Mathematics               53.7     52.6     46.1      57.2      48.3    54.1
Science                   54.2     51.2     44.7      54.6      47.6    53.9
Social Studies            53.5     51.7     47.1      55.3      48.7    53.4
NELS-T (total)            53.2     52.2     46.1      55.4      48.3    53.7
NELS-C (weighted comp.)   3.05     3.03     2.66      3.24      2.78    3.09
Teacher Ratings
Attendance                4.69     4.66     4.63      4.81      4.54    4.69
Class behavior            4.59     4.83     4.59      4.95      4.65    4.72
Consults teacher           .36      .40      .34       .35       .33     .39
Educational motivation     .76      .83      .71       .89       .73     .80
Work completed            3.96     4.28     3.92      4.38      3.98    4.14
TRC (weighted comp.)      2.59     2.76     2.53      2.85      2.56    2.69

* Original score scales as described in text and Appendix B.


Appendix Table A-7
Means and Standard Deviations for Five Composite Measures—By Gender and Ethnic Groups*

                                   African   Asian
                  Male   Female   American  American  Hispanic  White
                 (4154)  (4300)     (591)     (524)     (804)   (6471)
Mean for:
HSA               48.6     51.4      43.8      54.4      46.0    50.8
NELS Test         50.2     49.8      42.7      54.1      45.0    51.0
Engagement        47.6     52.3      49.7      53.0      47.7    50.1
Teacher rating    48.1     51.8      47.0      53.8      47.6    50.3
School grading    50.3     49.7      52.7      50.5      50.5    49.7
S.D. for:
HSA               10.1      9.7       8.8       9.5       8.8     9.9
NELS Test         10.3      9.7       9.2       9.9       9.3     9.6
Engagement        10.4      9.0       8.6      10.2       9.4    10.1
Teacher rating    10.6      9.0      10.7       8.3      10.8     9.8
School grading    10.0     10.0       9.7       8.8       9.7    10.1

* Each measure is scaled to a mean of 50 and standard deviation of 10 for all students in the full sample. (Total N=8454 except for teacher rating, where N=7619; subgroup Ns in parentheses.)
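The scaling in the footnote is the usual linear T-score transform; stated as a formula (inferred from the footnote as a reading aid), each composite $x$ is rescaled as $x' = 50 + 10\,(x - \bar{x})/s$, where $\bar{x}$ and $s$ are the mean and standard deviation of the full sample.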


Appendix Table A-8
Means and Standard Deviations for Five Composite Measures—By School Program*

                Rigorous              Academic-
                Academic   Academic   Vocational  Vocational
                 (1892)     (4650)      (627)        (402)
Mean for:
HSA               55.1       50.7        46.6         41.5
NELS Test         55.2       51.0        44.4         40.2
Engagement        54.4       50.6        47.7         42.2
Teacher rating    54.4       50.6        47.1         41.9
School grading    50.7       49.9        48.7         50.0
S.D. for:
HSA                8.5        9.6         8.6          7.1
NELS Test          7.5        9.7         9.0          7.6
Engagement         8.6        9.5         9.1          8.7
Teacher rating     6.8        9.5        10.3         10.7
School grading    10.0        9.9        10.6          9.9

* Each measure is scaled to a mean of 50 and standard deviation of 10 for all students in the full sample. (Total N=8454 except for teacher rating, where N=7619; subgroup Ns in parentheses.)


Appendix B: Student Variables

Acronyms for variables and corrections:

CGF – Course Grading Factor, computed across schools
CGR – Course Grading Residual, computed within schools
CU – Carnegie Unit
HSA – High School Average based on four “new basic” subject areas, computed within schools or across schools as indicated
HSAw – HSA specified as a within-school deviation score (school mean removed)
HSA+K – Within-school HSA, plus a correction for course grading variations
HSA+2G – Across-school HSA, plus corrections for school and course grading variations
HSA(T) – High School Average based on all courses on the transcript, save physical education and service courses like driver training
K – Mean Course Grading Residual (computed within schools) for the courses taken by each student
MCGF – Mean Course Grading Factor (computed across schools) for the courses taken by each student
NELS-C – NELS Test Composite; 4 tests, best weighted predictors of HSA
NELS-T – NELS Test total score; 4 tests unweighted
SGF – School Grading Factor, computed across schools
SE – Scholastic Engagement; composite of 9 behavioral measures
SES – Socioeconomic Status
TR – Teacher Rating
TRC – Teacher Rating Composite; 5 ratings, best weighted predictors of HSA
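Read additively, the two corrected averages behave as simple sums. In the Total column of Appendix Table A-3, for example, HSA + K = 2.48 − .03 = 2.45, and HSA + 2G = 2.48 + .00 − .18 = 2.30; that is, HSA+K adds K to the within-school HSA, and HSA+2G adds both SGF and MCGF to the across-school HSA. (This additive reading is inferred from the definitions and the table entries; it is not stated as a formula in the report.)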



Specifications for Student Characteristics and Teacher Ratings*

* The six NELS data sources used in this study are cited at the end of this appendix (Data Sources). Most variables come from #1, the Second Followup Student Component. Other sources are indicated on the list of variables where appropriate.

School Skills

1. Attendance (average of 6 variables, reflected so 0 = poor attendance, 5 = good attendance; the average could reach 6, reflecting to −1, if only the RAB variables are present and in the highest category)
   F2RAB90 (transcript): Number of days absent 90-91, grouped as in codebook frequency distributions, but changed from a 1-7 to a 0-6 range (Source 2)
   F2RAB91 (transcript): Same as above, for 91-92 (Source 2)
   F2S9A Late for school: code 0=none, code 5=15+
   F2S9B Cut/skip classes: code 0=none, code 5=15+
   F2S11A Last unexcused absence: recoded so that code 1 (never)=0; code 2 (this term)=3; code 3 (first term this year)=2; code 4 (last year) and code 5 (2 or more years ago)=1; other or omit=0
   F2S11B Days missed, last unexcused absence: code 0=1-2; code 5=21+

2. Class participation (average of 8 variables; 1 = low participation, 5 = high participation)
   F2S15bc Copy science notes: 1=never/very rarely; 5=every day
   F2S19bc Copy math notes: 1=never/very rarely; 5=every day
   F2S17a Pay attention, science class: 1=never; 5=always
   F2S21a Pay attention, math class: 1=never; 5=always
   F2S17d Participate actively, science class: 1=never; 5=always
   F2S21d Participate actively, math class: 1=never; 5=always
   F2S24A Come to class without pencil/paper: 1=usually; 4=never
   F2S24B Come to class without books: 1=usually; 4=never

3. Discipline problems (average of 6 variables; 0 = no problem, 2 = more than 2 problems)
   F2S8f Fight at school: 0=never; 1=once or twice; 2=more than twice
   F2S8g Fight to or from school: same coding
   F2S9d Trouble re school rules: 0=never; 1=1-2 times; codes 2-6 recoded so that 2=3+ or multiple response
   F2S9e In-school suspension: same coding
   F2S9f Suspended from school: same coding
   F2S9h Was arrested: same coding

4. Work completed (average of 5 variables; 1 = never complete/more than required, 5 = always complete/more than required)
   F2S17b Complete science work on time: 1=never; 5=always
   F2S21b Complete math work on time: 1=never; 5=always
   F2S17c Do more science work than required: 1=never; 5=always
   F2S21c Do more math work than required: 1=never; 5=always
   F2S24c Come to class without homework done: 1=usually; 4=never

5. Homework hours (average of 2 variables, range 0-8)
   F2S25F1 Total homework time per week in school: codes 0-8
   F2S25F2 Total homework time per week out of school: codes 0-8
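As a concrete illustration of these recoding conventions, a minimal sketch follows showing how two of the School Skills composites might be assembled. The data frame and its values are invented for demonstration, and the reflection rule for Attendance is inferred from the scale notes above (an all-RAB average of 6 reflecting to −1 implies reflection by 5 minus the average):

    import pandas as pd

    # Invented records keyed by the NELS item codes listed above.
    students = pd.DataFrame({
        "F2RAB90": [1, 4], "F2RAB91": [0, 5], "F2S9A": [0, 3],
        "F2S9B": [0, 4], "F2S11A": [0, 2], "F2S11B": [0, 5],
        "F2S25F1": [3, 1], "F2S25F2": [4, 0],
    })

    # 1. Attendance: average the six recoded items, then reflect so that
    # 0 = poor and 5 = good attendance (reflection inferred as 5 - average).
    absence = ["F2RAB90", "F2RAB91", "F2S9A", "F2S9B", "F2S11A", "F2S11B"]
    students["attendance"] = 5 - students[absence].mean(axis=1)

    # 5. Homework hours: average of the two weekly-time items, range 0-8.
    students["homework_hours"] = students[["F2S25F1", "F2S25F2"]].mean(axis=1)
    print(students[["attendance", "homework_hours"]])

The row mean here simply averages whatever items are present; the report does not spell out its missing-data rule for these composites, so that detail is an assumption of the sketch.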

Initiative

6. Courses completed: total graded “course values,” i.e., credits for all courses graded 1-13 (credits imputed from term duration for failed courses); service courses excluded (Source 3)

7. Advanced electives (sum of 1 point each for the specified codes)
   F2S13E Ever in AP program: 1=yes
   F2S18a Taking science this term: 2=yes, but not required
   F2S22a Taking math this term: 2=yes, but not required
   F2RENG_C Units in English (NAEP): 5 units or more
   F2RFOR_C Units in Foreign Language (NAEP): 3 units or more

8. School activities (2 points for each of the responses as noted, plus 1 point for participation alone as noted; maximum = 12, scored as 12 if more than 12 points indicated)
   F2S29a Class officer: 1=yes (2 points)
   F2S29c Award in math/science: 1=yes (2 points)
   F2S29f Recognition of writing: 1=yes (2 points)
   F2S30ac School spirit group: 5=captain (2 points); code 4 (varsity) 1 point
   F2S30ba Music group: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bb Drama group: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bc Student govt: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30be Publication: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bf Service club: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bg Academic club: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bh Hobby club: 4=officer/leader (2 points); code 3 (participated) 1 point
   F2S30bi Professional club: 4=officer/leader (2 points); code 3 (participated) 1 point



9. School sports (sum of points as noted)
   F2S29g Most valuable player on a team: code 1=yes (2 points)
   F2S30aa Team sport: code 3 or 4 (jv, varsity) 2 points; code 5 (captain) 3 points
   F2S30ab Individual sport: same coding
   F2S30bj Intramural team: code 3 (participated) 1 point; code 4 (leader) 2 points
   F2S30bk Intramural individual sport: same coding

10. Outside activities (average of 6 variables coded 1-4, then the composite rescaled to 0-3)
   F2S33b How often work on hobbies: 1=never...4=every day
   F2S33c How often attend religious activities: same coding
   F2S33d How often attend youth groups: same coding
   F2S33e How often community service: same coding
   F2S33L How often play sports: same coding
   F2S29h Received a community service award: code 1 (yes) recoded to 4; codes 2 (no), 6 (multiple), and 8 (missing) recoded to 1

Competing Activities

11. Drugs/gangs (average of 4 variables; 0 = never/none, 3 = frequent/all)
   F2S81b Alcohol in 12 months: 0=none; 3=20+ occasions; legitimate skip (code 9, never in lifetime) recoded to 0
   F2S83b Marijuana in last 12 months: same coding (legitimate skip as above)
   F2S70 Number of friends in gangs: 1=none, 2=some, 3=all
   F2S71 I belong to a gang: recoded so 1 (yes) becomes 3; 2 (no) becomes 1

12. Killing time (average of 4 variables; 1 = lowest amount of time, 3 = highest amount of time)
   F2S33f Riding around: recode 1&2→1 (<1/week); 3→2 (1-2/week); 4→3 (every day)
   F2S33g Talking, doing things with friends: same coding
   F2S34a Video/computer games (weekdays): recode 0→1 (none); 1,2→2 (<2 hours); 3-5→3 (2+ hours/day)
   F2S35a TV (weekdays): recode 0,1→1 (<1 hour/day); 2,3→2 (1-3 hours/day); 4-5→3 (3+ hours/day)

13. Peer sociability (average of 3 variables; 1 = not important, 3 = very important)
   F2S68e Friends: how important to be popular
   F2S68g Friends: how important to have a steady
   F2S68m Friends: how important to go to parties



14. Employment (recoded to 0 = codes 0-4, ≤20 hours/week; 1 = codes 5+, >20 hours/week)
   F2S88 Usual hours/week during this school year: 0=none, 9=>40. Recodes: if F2S88 is missing or a legitimate skip and F2S86a (ever worked)=no, hours=0; if F2S86a (ever worked)=3 (not currently employed) and F2S86bmo/byr (last month and year worked) is NOT Sept-Dec/91 or any 1992, hours=0

15. Child care (recoded to 0 = codes 0-2, <3 hours/day; 1 = codes 3+, ≥3 hours/day)
   F2S94 Hours per day: 1=<1 hour; 6=10+ hours; impute 0 for legitimate skip (Q93=2, does not babysit); impute 1 if Q93=1 (yes, babysits) but hours per day not answered

16. Leisure reading (impute 0 if missing)
   F2S32 Reading hours not school related: 0=none, 7=10+ hours/week

Family Background

17. SES: F2SES1, the socioeconomic status composite developed by NELS

18. Family intact (2-parent family): coded 1 if living in a household with father and mother, father and stepmother, or mother and stepfather; 0 for all other combinations. Data are from the first followup; if missing, base-year data were used (Sources 4 & 5)

19. Parent relations (average of 7 variables; −1 = worst relationship, 3 = best relationship)
   F2S40h Important to live near parents: 1=not important, 3=very important
   F2S40m Important to get away from parents: reverse coded so 1=very, 3=not
   F2S99a Discussed school courses with parent: 1=never, 2=sometimes, 3=often
   F2S99b Discussed school activities with parent: same coding
   F2S99c Discussed things studied in class with parent: same coding
   F2S100f Home life will be similar: 1=false...6=true; recoded by dividing by 2
   F2S101 Ran away from home: recode 1 (yes)→−1; 2 (no)→2



20. Parent aspiration (average of 4 variables, recoded as specified)
   F2S41a Father’s desire after h.s.: recode 01 (does not apply) to blank; 02 (college) to 8.5; 03-07 (job, trade, military, marry, what I want) to 4; 08, 09 (don’t care; don’t know) to 0
   F2S41b Mother’s desire after h.s.: recoded as above
   F2S42a How far in school father wants: recode 0 (does not apply) to blank; 01-07 (<h.s. through 2+ years of college) used as coded; 08-10 (college, MA, PhD, etc.) used as coded; 11 (don’t know) to 0
   F2S42b How far in school mother wants: recoded as above

21. Stress at home (average of 10 variables; 1 = low stress (good), 5-6 = high stress (bad); if any part of question 96 is answered yes and others are blank, “no” is imputed for the blanks)
   F2S96b Parents divorced/separated: recode 1 (yes) to 5; 2 (no) to 1
   F2S96d Parent lost job: same coding
   F2S96g I became ill: same coding
   F2S96h Parent died: same coding
   F2S96k Sibling dropped out of school: same coding
   F2S96n Family member became ill: same coding
   F2S96o Family member used drugs: same coding
   F2S96p Family member in drug/alcohol rehab: same coding
   F2S96q Family member was crime victim: same coding
   F2S100e Parents get along well: original scale 1=false...6=true, reversed so that 1=true (low stress)...6=false (high stress)

Student Attitudes

22. Teacher relations (average of 3 variables, reflected so 1 = bad, 4 = good)
   F2S7c Teaching is good at school: 1=strongly agree, 4=strongly disagree
   F2S7d Teachers are interested in students: same coding
   F2S26A Teacher helped with homework: 1=yes, 2=no; before reflection, “no” recoded to 3, so that after averaging with the other variables and then reflecting, yes=4 and no=2 as before



23. Educational plans (average of 6 variables; 0 = lowest aspirations, 4 = highest aspirations)
   F2S40n Important to become expert: 1=not important, 3=very important
   F2S40o Important to get good education: same coding
   F2S43 How far in school think you will get: recoded to 1=<h.s. or h.s. only; 2=vocational through 2+ years of college; 3=college degree; 4=masters/PhD/professional
   F2S49 and F2S50a combined: 3=go to school right after h.s.; 2=no, don’t know, or omitted school plans, but “don’t like school”=no; 1=no or don’t know school plans with “don’t like school” omitted, also the “legitimate skip” group, who are entering the military; 0=no, don’t know, or omitted school plans, with “don’t like school”=yes
   F2S59L How important is reputation of college: 1=not; 3=very
   F2S64B Occupational plan at age 30: 4=code 10; 3=codes 6, 9, 11, 13, 14, 16; 2=codes 1, 7; 1=codes 2, 3, 4, 5, 8, 12, 15 (other codes omitted)

24. Self esteem (average of 6 variables; 1 = low self esteem, 4 or 5 = high)
   F2S66a Feel good about myself: reflected so 1=strongly disagree; 4=strongly agree
   F2S66j Think I am no good at all: original coding retained: 1=strongly agree; 4=strongly disagree
   F2S67 Chances I will go to college: 1=very low...5=very high
   F2S67e Chances I will have a job I enjoy: 1=very low...5=very high
   F2S67i Chances I’ll be respected in community: 1=very low...5=very high
   F2S100d Think I will be a source of pride to parents: recoded so that 1=false/mostly false/more false; 2=more true; 3=mostly true; 4=true

25. Locus of control (average of 5 variables; 1 = external locus, 4 = internal locus)
   F2S66b Don’t have control over life: original coding retained: 1=strongly agree; 4=strongly disagree
   F2S66c Good luck more important than work: same coding
   F2S66f Try to get ahead; something stops me: same coding
   F2S66g Plans hardly ever work out: same coding
   F2S66k Can make my plans work: reflected so 1=strongly disagree; 4=strongly agree



26. Peer studiousness (average of 6 variables; 1 = education not important, 3 = very important)
   F2S68a Friends attend classes regularly: 1=not important, 3=very important
   F2S68b Friends: how important to study: same coding
   F2S68d Friends: how important to get good grades: same coding
   F2S68h Friends: how important to continue education past h.s.: same coding
   F2S69a Friends dropped out of HS: recoded to 1=some/most/all; 2=few; 3=none
   F2S69c Friends plan to work after HS: recoded to 1=most/all; 2=some; 3=few/none

Teacher Ratings

27. Attendance (average of 2 ratings, reflected so 1 = poor attendance, 6 = good attendance)
   F1Tx_16 How often student is absent: original coding 1=never...5=all the time; recoded so never=6, all the time=2
   F1Tx_17 How often student is tardy: original coding 1=never...5=all the time; recoded so never=5, all the time=1

28. Class behavior (average of 2 ratings; 1 = poor behavior, 6 = good behavior)
   F1Tx_18 How often student is attentive in class: original coding 1=never...5=all the time; recoded so never=2, all the time=6
   F1Tx_20 How often student is disruptive in class: original coding 1=never...5=all the time; recoded so never=5, all the time=1

29. Consults teacher (0 = no, 1 = yes)
   F1Tx_5 Student talks with respondent (teacher) outside of class: original coding 1=yes, 2=no; recoded to 1=yes, 0=no

30. Educational motivation (average of 3 ratings; 0 = low motivation, 1 = high motivation)
   F1Tx_2 Student usually works hard: original coding 1=yes, 2=no; recoded to 1=yes, 0=no
   F1Tx_4 Student will probably go to college: original coding 1=yes, 2=no; recoded to 1=yes, 0=no
   F1Tx_22 Student is at risk of dropping out of H.S.: original coding 1=yes, 2=no; recoded to 1=no, 0=yes

31. Work completed
   F1Tx_15 How often student does homework: 1=never...5=all of the time



Data Sources

1. Second Followup Student Component (Ingels et al., 1994)
2. Transcript Component Student File (Ingels et al., 1995)
3. Transcript Component Course File
4. First Followup Student Component (Ingels et al., 1992b)
5. Base Year Student Component (Ingels et al., 1990)
6. First Followup Teacher Component (Ingels et al., 1992a)



Appendix C: Notes

1. In the case of tests external to the classroom, the data reported here suggest that it is probably never safe to view the test as a surrogate of teachers’ grades. When, if ever, are grades or scores a surrogate of the other? Technically, one might say that a course grade is a less precise surrogate of a score on a course examination when the final grade is based solely on the examination. Even here, there is room for doubt. Does a grade of B fairly represent an examination score slightly below the cut for an A grade versus a B grade? Teacher-student disputes on this point are time-honored. The important question is how the validity and fairness of assessment are affected by factors that cause observed grades and test scores to differ.

2. These two terms—content relevance and fidelity—evoke the historical watchwords for quality assessment, validity and reliability. Content relevance and fidelity are used here as broad umbrellas to cover these major public concerns in high-stakes assessment: Are we measuring the right knowledge and skills? Are we doing so without egregious errors? Technically, the two terms overlap, and they include several discernible features that are critical to valid assessment. In Messick’s (1995) framework of six aspects of construct validity, content relevance subsumes not only content features, but also substantive rationales and evidence bearing on convergent, discriminant, and criterion relatedness. Fidelity bears particularly on the appropriateness of the scoring structure, the generalizability of score meaning, and the consequences of score use on individuals and groups.



3. Using high school grade average to predict college grade average, Willingham (1961, 1962b, 1963d) showed several analytical relationships among correlations before and after the grade scale had been adjusted in ways similar to what are referred to here as the within-school and residual methods. He used a residual method (from the pooled within-school regression line, not the overall regression line used here) to develop an “Adjustment I” for the grade scales of individual high schools that might be used in college admissions. Since this method treats all between-group residual differences as systematic error, such an adjustment results in a between-group correlation of 1.00 (as illustrated here in the bottom panel of Figure 2), and an inflated overall correlation r1 due to overfitting. Willingham (1962b) developed a coefficient r* as an estimate of the overall correlation that would result if all school grade scales were comparable; i.e., assuming chance school residual effects not systematically affected by school differences. Thus, r* was derived on the assumption that the between-school estimates of variance and covariance conform to expectations based on within-school statistics. It was then shown that r* was algebraically equivalent to the pooled within-school correlation corrected for restriction in range. Finally, r* was shown to be algebraically equivalent to r1 corrected for shrinkage due to overfitting.

4. In order to avoid some occasional inconsistency in NELS records as to where a course was actually taken, computation of CGRs was based only on those students who did not change schools.

5. Due to the very large and very spotty course-by-school data matrix, many course-school cells had no data or insufficient data on which to base a course grade residual. Several exploratory analyses were undertaken in search of an effective way to stabilize estimates of course grade residuals within schools. These included trying different limits for the size of course-school datasets, different ways of collapsing schools and courses, and combinations thereof. Effectiveness of these methods was judged by whether they resulted in course grade corrections that improved the relationship between grades and test scores. None of the alternatives proved to be any better than course grade residuals based upon all students enrolled in the course irrespective of school attended. As it turned out, this was an important shortcoming in the database. As Table 5 indicates, differences in the pattern of course grading from school to school can be a major source of grading variations. As a result, correction for variations in course grading (without benefit of that interaction) played a minor role in the analyses reported here.

6. To the extent that students in a school may have taken more or fewer strictly-graded courses than have students in another school, some part of MCGF is already present in SGF. Thus the additive procedure represented in HSA+2G could be an overcorrection. That MCGF and SGF are correlated only .18 suggests that such an effect would likely be minor.

7. The reliability of HSAw + K was computed in several stages. Using the fact that HSA is the mean of four subject averages, split-half reliabilities were computed for each of these averages, using odd-even splits and the Spearman-Brown formula. Next, these reliabilities were combined to estimate the reliability of a composite by applying the general formula

$$\mathrm{rel}(Y) = 1 - \frac{\sum_i w_i^2\,\mathrm{var}(X_i)\,[1 - \mathrm{rel}(X_i)]}{\mathrm{var}(Y)},$$

where the composite is defined as $Y = \sum_i w_i X_i$. In our case, $Y$ is HSA, the $X_i$ are the four subject averages, and the $w_i$ are each 1/4.

To obtain the reliability of HSAw + K from the reliability of HSA, it is useful to observe that the former is a composite of HSA, the mean HSA for the student’s school, and K. If the school mean HSA and K are considered to be perfectly reliable, then a special case of the above formula for the reliability of a composite may be used to give

$$\mathrm{rel}(\mathrm{HSA}_w + K) = 1 - \frac{\mathrm{var}(\mathrm{HSA})\,[1 - \mathrm{rel}(\mathrm{HSA})]}{\mathrm{var}(\mathrm{HSA}_w + K)}.$$

Assuming perfect reliability for the school mean HSA and K will result in a liberal estimate for the reliability of HSAw + K. However, when this estimate is used to correct correlations for attenuation, the result will be conservative with respect to any corrected multiple correlations with HSAw + K.

Finally, to correct this reliability for the restriction of range imposed by the within-schools analysis, a version of the formula given by Ramist et al. (1994, p. 10) was used:

$$\mathrm{rel}(Y) = R_{Y.X}^2 + \left(1 - R_{Y.X}^2\right)\frac{\mathrm{rel}(y) - R_{y.x}^2}{1 - R_{y.x}^2}.$$

In this expression, lower case variables represent the restricted population, while those in capitals represent the unrestricted population. Thus rel(y) is the within-schools reliability for HSAw + K obtained in the previous step, while rel(Y) is this reliability corrected for restriction of range. The squared multiple correlation $R_{y.x}^2$ represents the proportion of within-schools variance of HSAw + K accounted for by the NELS Composite and SES, while $R_{Y.X}^2$ is the corresponding squared multiple after correcting for restriction of range.
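The three-stage computation can be summarized in a short sketch. All numerical inputs below are illustrative placeholders (the report’s actual reliabilities and variances are not reproduced here), and the function name is invented:

    import numpy as np

    def composite_reliability(w, var_x, rel_x, var_y):
        """General formula: rel(Y) = 1 - sum(w_i^2 var(X_i)(1 - rel(X_i))) / var(Y)."""
        w, var_x, rel_x = map(np.asarray, (w, var_x, rel_x))
        return 1.0 - np.sum(w**2 * var_x * (1.0 - rel_x)) / var_y

    # Stage 1: reliability of HSA as the mean (w_i = 1/4) of four subject averages.
    rel_subj = [0.80, 0.82, 0.78, 0.81]   # illustrative split-half reliabilities
    var_subj = [0.66, 0.72, 0.71, 0.72]   # illustrative subject-average variances
    var_hsa = 0.58                        # illustrative var(HSA)
    rel_hsa = composite_reliability([0.25] * 4, var_subj, rel_subj, var_hsa)

    # Stage 2: special case for HSA_w + K, treating the school-mean HSA and K
    # as perfectly reliable: rel = 1 - var(HSA)(1 - rel(HSA)) / var(HSA_w + K).
    var_hsa_wk = 0.55                     # illustrative
    rel_hsa_wk = 1.0 - var_hsa * (1.0 - rel_hsa) / var_hsa_wk

    # Stage 3: correct for restriction of range (Ramist et al., 1994):
    # rel(Y) = R2_XY + (1 - R2_XY) * (rel(y) - R2_xy) / (1 - R2_xy).
    R2_restricted, R2_unrestricted = 0.40, 0.48   # illustrative squared multiples
    rel_corrected = (R2_unrestricted
                     + (1.0 - R2_unrestricted)
                     * (rel_hsa_wk - R2_restricted) / (1.0 - R2_restricted))
    print(round(rel_hsa, 3), round(rel_hsa_wk, 3), round(rel_corrected, 3))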

8. Three different models were fit to the residual course grades: a model including only a school effect, one with school and course main effects, and a model with both main effects and school by course interaction. The fitting used weighted least squares, with each residual course grade being weighted by the credits associated with that grade. Each of the fitted models had an associated weighted sum of squares accounted for by that model.

Since each of these model fits involved the estimation of large numbers of parameters, it is reasonable to expect substantial over-fitting in this sample, relative to what might be expected in the population. To correct for this over-fitting, the expected value for each of the sums of squares was obtained and used to derive an unbiased estimate for the corresponding population variation accounted for by that model. These estimates simply involved subtracting from the original sum of squares a term consisting of the within-cells mean square times the degrees of freedom associated with the model.

The variation associated with schools was estimated by the corrected sum of squares for the first model. The variation for courses, controlling for schools, was estimated by the difference between the corrected sums of squares for the first two models. The variation associated with the interaction was estimated by the difference between the corrected sums of squares for the second and third models. Finally, the variation between cells was estimated by the corrected sum of squares for the third model.

To describe this process in more detail, some notation is helpful. Suppose there are $n_{ij}$ residual grades for course $j$ in school $i$. Also suppose there are a total of $N_s$ schools and $N_c$ courses, with a total of $N_{sc}$ school-course combinations with at least one residual grade. Denote the $k$th residual grade for course $j$ in school $i$ by $y_{ijk}$, and denote its expectation by $\mu_{ij}$. In the following development, it is assumed that the variance of this residual grade can be written as $\sigma^2 / w_{ijk}$, where the $w_{ijk} > 0$ are known (and are proportional to the credits associated with the grade), and that the residual grades are mutually independent. (These assumptions would certainly not be appropriate for the original grades obtained by any particular student, but may be more nearly appropriate for the residuals from predicted HSA based on the four NELS tests.)

The analysis begins by obtaining the weighted sum of squares within cells, which is given by

$$SS_W = \sum_{i=1}^{N_s}\sum_{j=1}^{N_c}\sum_{k=1}^{n_{ij}} w_{ijk}\,(y_{ijk} - \bar{y}_{ij\cdot})^2,$$

where $\bar{y}_{ij\cdot} = \sum_{k=1}^{n_{ij}} w_{ijk}\,y_{ijk}\big/ w_{ij\cdot}$, with $w_{ij\cdot} = \sum_{k=1}^{n_{ij}} w_{ijk}$, if $w_{ij\cdot} > 0$. (Otherwise, let $\bar{y}_{ij\cdot} = 0$.) The degrees of freedom within cells is

$$df_W = n_{\cdot\cdot} - N_{sc},$$

where $n_{\cdot\cdot} = \sum_{i=1}^{N_s}\sum_{j=1}^{N_c} n_{ij}$. The mean square within cells is $MS_W = SS_W / df_W$. Under the assumptions described above, the expected value of $MS_W$ is given by $E(MS_W) = \sigma^2$.

Now consider the first of the three models described above, namely the one including only school effects. It may be written as

Model 1: $\mu_{ij} = \mu + \alpha_i$.

The sum of squares associated with this model in the sample is given by

$$SS_S = \sum_{i=1}^{N_s} w_{i\cdot\cdot}\,(\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2,$$

where $w_{i\cdot\cdot} = \sum_{j=1}^{N_c} w_{ij\cdot}$, $\bar{y}_{i\cdot\cdot} = \sum_{j=1}^{N_c} w_{ij\cdot}\,\bar{y}_{ij\cdot}\big/ w_{i\cdot\cdot}$, and $\bar{y}_{\cdot\cdot\cdot} = \sum_{i=1}^{N_s}\sum_{j=1}^{N_c} w_{ij\cdot}\,\bar{y}_{ij\cdot}\big/ w_{\cdot\cdot\cdot}$, with $w_{\cdot\cdot\cdot} = \sum_{i=1}^{N_s}\sum_{j=1}^{N_c} w_{ij\cdot}$. The corresponding degrees of freedom are $df_S = N_s - 1$, and the “corrected” sum of squares for schools (an unbiased estimate of the school effect in the population) is given by

$$CSS_S = SS_S - df_S \cdot MS_W.$$

Turning to the second of the three models, the additive model consisting of school and course main effects, but no interactions, it may be written as

Model 2: $\mu_{ij} = \mu + \alpha_i + \beta_j$.

Fitting this model yields a sum of squares in the sample that may be written as

$$SS_{S+C} = \sum_{i=1}^{N_s}\sum_{j=1}^{N_c} w_{ij\cdot}\,(\hat{\alpha}_i + \hat{\beta}_j)^2,$$

where $\hat{\alpha}_i$ and $\hat{\beta}_j$ denote the weighted least squares estimates of the corresponding model parameters (and the corresponding weighted sums of the parameters are constrained to equal zero). The degrees of freedom for this model are $df_{S+C} = N_s + N_c - 2$, and the corrected sum of squares is

$$CSS_{S+C} = SS_{S+C} - df_{S+C} \cdot MS_W.$$

Now it is also possible to obtain an expression for the corrected sum of squares associated with courses, controlling for schools:

$$CSS_{C|S} = CSS_{S+C} - CSS_S.$$

The third model (main effects plus interaction) is equivalent to the unconstrained between cells model, and has a sum of squares given by

$$SS_B = \sum_{i=1}^{N_s}\sum_{j=1}^{N_c} w_{ij\cdot}\,(\bar{y}_{ij\cdot} - \bar{y}_{\cdot\cdot\cdot})^2,$$

with degrees of freedom $df_B = N_{sc} - 1$. As with the earlier models, its corrected sum of squares is equal to

$$CSS_B = SS_B - df_B \cdot MS_W.$$

The corrected sum of squares for interaction may now be found by subtraction:

$$CSS_{S\times C} = CSS_B - CSS_{S+C}.$$

Because of the way they were obtained, it is possible to write the corrected sum of squares between cells as the sum of three components:

$$CSS_B = CSS_S + CSS_{C|S} + CSS_{S\times C}.$$

Consequently, it is appropriate to express the components as proportions of the total, as was done in Table 5.
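To make the bookkeeping concrete, a minimal computational sketch follows. It assumes the setup of this note (residual grades weighted by credits, mutually independent); the data-frame layout and all names (the function corrected_ss and the columns school, course, resid, w) are illustrative inventions, not part of the NELS files:

    import numpy as np
    import pandas as pd

    def corrected_ss(df):
        """Decompose residual course grades into school, course-given-school,
        and school-by-course components, as in Note 8."""
        wmean = lambda g: np.average(g["resid"], weights=g["w"])

        # Weighted cell means, within-cells SS, and the within-cells mean square.
        cells = df.groupby(["school", "course"])
        ybar = cells.apply(wmean).rename("ybar")
        w_cell = cells["w"].sum()
        merged = df.join(ybar, on=["school", "course"])
        ss_w = np.sum(merged["w"] * (merged["resid"] - merged["ybar"]) ** 2)
        ms_w = ss_w / (len(df) - len(ybar))
        grand = np.average(df["resid"], weights=df["w"])

        # Model 1: school effects only.
        schools = df.groupby("school")
        ss_s = np.sum(schools["w"].sum() * (schools.apply(wmean) - grand) ** 2)
        css_s = ss_s - (schools.ngroups - 1) * ms_w

        # Model 2: additive school + course effects, fit by weighted least
        # squares on dummy variables (fitted values are constant within cells).
        X = pd.get_dummies(df["school"], prefix="s", dtype=float).join(
            pd.get_dummies(df["course"], prefix="c", dtype=float))
        rw = np.sqrt(df["w"].to_numpy())
        beta, *_ = np.linalg.lstsq(X.to_numpy() * rw[:, None],
                                   df["resid"].to_numpy() * rw, rcond=None)
        fitted = X.to_numpy() @ beta
        ss_sc = np.sum(df["w"].to_numpy() * (fitted - grand) ** 2)
        n_s, n_c = df["school"].nunique(), df["course"].nunique()
        css_sc = ss_sc - (n_s + n_c - 2) * ms_w

        # Model 3: unconstrained between-cells model, then the decomposition
        # CSS_B = CSS_S + CSS_{C|S} + CSS_{SxC}.
        ss_b = np.sum(w_cell * (ybar - grand) ** 2)
        css_b = ss_b - (len(ybar) - 1) * ms_w
        return {"schools": css_s, "courses|schools": css_sc - css_s,
                "interaction": css_b - css_sc, "between cells": css_b}

The dummy-variable fit reproduces the zero-sum parameter constraints implicitly, because only the fitted cell values, measured from the weighted grand mean, enter the sums of squares.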

9. Humphreys (1968) first pointed out the simplex nature of a correlation matrix based on semester grade averages. The highest correlations appeared on the diagonal. Correlations were successively smaller on diagonals approaching the upper right corner, where the two grade averages in question were maximally separated in time. Willingham (1962a, Table 3A) reported a similar matrix of correlations in simplex form based on 12 successive quarterly grade averages. In his data the average correlations were .50, .42, .35, and .30 where the grade averages were separated by one to two terms, three to five terms, six to eight terms, and nine to eleven terms, respectively. A variety of explanations—cognitive, curricular, grading, motivational—have been suggested for the pattern of progressive changes in grade-test correlations through the undergraduate years (Hulin, Henry, & Noon, 1990; Humphreys, 1968; Humphreys & Taber, 1973; Willingham, 1985).

10. Correcting for the unreliability of the NELS Test and HSA increased the multiple R between the two by .047. Since the reliability of the grade average was somewhat the higher of the two (.97 vs. .95), it is not unreasonable to attribute slightly less than half of the correction for attenuation to HSA.
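For reference (a standard psychometric identity, not restated in the text), the correction divides the observed correlation by the square roots of the two reliabilities, $r_c = r\big/\sqrt{\mathrm{rel}(\mathrm{Test})\,\mathrm{rel}(\mathrm{HSA})}$; the factor $1/\sqrt{.97}$ contributed by HSA is accordingly smaller than the factor $1/\sqrt{.95}$ contributed by the test.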

11. D was here defined as the subgroup mean difference divided by the square root of the unweighted average subgroup variance.
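In symbols, for two subgroups with means $\bar{y}_1$ and $\bar{y}_2$ and variances $s_1^2$ and $s_2^2$, $D = (\bar{y}_1 - \bar{y}_2)\big/\sqrt{(s_1^2 + s_2^2)/2}$.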

12. The results of differential prediction in Table 12 are moderated for African-American and Hispanic students compared to results sometimes observed in college-level analyses, where such data have received most attention (i.e., less tendency here for these groups to earn lower grades than predictions based on test scores). Bowen and Bok (1998, Figure 3.10) reported that Black students in highly selective colleges were considerably less likely to achieve a median rank in class than were White students with comparable SAT scores (see also Vars & Bowen, 1998). In a second large study, Ramist et al. found freshman GPA to be overpredicted by the SAT within colleges by an average of .13 and .23 for Hispanic and Black students, respectively (Ramist et al., 1994, Table 8, line 4). The most nearly comparable analysis in the present data shows minimal overprediction of −.03 and −.01 for Hispanic and African American students, respectively (the within-school analysis reported in Table 12, column 2, where a negative result indicates overprediction). But in a third recent study, results for differential prediction were much like those reported here (Bridgeman et al., 2000). Earlier studies tended to indicate a pattern of grade overprediction for Black students (Linn, 1982), though findings for Hispanic students have been inconsistent (Linn, 1982; Pennock-Roman, 1990). Our analysis of differential prediction for women and Asian American students in high school gave results similar to those reported for college students by Ramist et al. (1994). There are a number of possible reasons why some groups might show some discrepancy in differential prediction results for high school and college data. Perhaps the most plausible explanation stems from the selected nature of the college samples. Linn (1983) showed that college selection procedures are likely to create a spurious negative result regarding differential prediction for low scoring groups. Also, Lewis and Willingham (1995) have shown that selected samples of more able students can exhibit quite different patterns of group difference, depending upon the nature of the samples and the dynamics of the selection process. See Hoover and Han (1995), Makitalo (1994), and Willingham and Cole (1997, p. 95f) for hypothetical as well as practical examples of such effects. Furthermore, constraints on the nature of the NELS sample employed here could have affected the pattern of group differences. Results might be expected to differ because the NELS Test is not the same as the ACT or the SAT, though the pattern of group differences here is quite similar to those found in national program data for college-bound students on the SAT (College Board, 1992). Different clustering of ethnic subgroups in school and college could be a factor, due to different effects of grading variations. Finally, somewhat different results could be traced to different patterns of academic behavior among subgroups in school versus college, or perhaps in some types of colleges.



13. Subtracting school means from all scores would necessarily reduce differential prediction to zero if all students in a particular school belonged to a specific subgroup. In theory, a sharply imbalanced representation of subgroups by school could make the results of differential prediction spuriously smaller in a within-school analysis. An examination of subgroup representation showed only moderate imbalances of that sort in the NELS sample. Nevertheless, to check on the possibility of such effects, an analysis of differential prediction was repeated for each gender and ethnic group within each of five bands of schools differing in the proportion of that group in the student body; i.e., 0-20%, 21-40%, etc. No spurious effects were noted. The final results for differential prediction were within .01 of the original analysis for all groups.

14. Note that recent state initiatives to improve representation of minority students in institutions of higher education by giving more weight to school performance in admissions decisions have focused on rank in class, not grade average. Using high school class rank as a selection measure puts all schools on the same percentile scale regardless of average test score level. Thus, school differences between class rank and test scores are guaranteed, and the pattern of such differences across schools and subgroups may be quite different from the pattern of differences between grades and test scores.

15. As described in Note 3, the residual method is likely to result in a somewhat inflated multiple correlation due to overfitting associated with SGF, the School Grading Factor. In these data, correcting school grading variations through the within-school analysis raised the correlation between NELS-C and HSA by .071, while making that correction through the residual method, by adding SGF in an across-school analysis, raised the correlation by .083. The difference in the two adjusted correlations suggests an inflation of .012 in the latter. Willingham (1963d) found an inflation of .01 with a similar residual method in a similarly structured data set with yearly samples of about 1000. A shrinkage of .01 was confirmed in cross-validation.

16. In the case of the Vocational group in Table 14, the correlation of the NELS Composite with HSA is apparently lower than the multiple R because of a different pattern of correlations for the individual tests in that group as compared to the other groups. In the Asian American group (also Table 14), the multiple R is probably inflated due to collinearity among the four tests, which was higher in this group than in any of the other gender, ethnic, or program groups. Collinearity is evidently the same reason that the 37-variable regression analysis for the Asian group was inadmissible (see Table 19).

17. Because it was computed within school, the course grading factor K is arguably a somewhat closer relative of Z (from Ramist et al., 1994) than is MCGF. In any event, K and MCGF showed very similar relationships to other variables. K had somewhat higher and similarly ordered correlations with HSA and NELS-C (.51 and .62) and was most highly related to the same three student variables as was MCGF.

18. We deliberately avoided attempting a factorial definition of possible components of Scholastic Engagement. The a priori grouping of school skills, initiative, and competing activities was based on the logic of the relationships and consideration of potential usefulness. As explained earlier in the text, Competing Activities necessarily involved statistical constraints that would militate against those variables clustering. The intercorrelations verified that assumption, while the other two components appeared to be cohesive and reasonably well differentiated.



19. Student Characteristics pertaining to attendance and discipline showed correlations with grade and test performance that were closer to the corresponding Teacher Ratings than was true of measures concerning homework and motivation. This result was likely due in part to the former two measures having a factual basis in school records. Another factor may be the tendency for self-report opinion items (such as personal work habits) to have relatively low reliability and validity (Fetters, Stowe, & Owings, 1984).

20. Subject match (Factor 1) and reliability (Factor 3) can have consequential effects in particular situations. If some subjects are included in a school leaving examination at the expense of others, large group differences in graduation rate can result (Willingham & Cole, 1997, pp. 241-243). Such fairness issues are less likely in tests used as predictors in selective admissions because of the compensating effect of previous grade average, almost always used as a second predictor (Willingham & Cole, 1997, p. 321). Unreliability can also be a serious weakness in a high-stakes measure, especially if it involves untraditional assessment (Koretz et al., 1994).

21. Cannell (1988) caused considerable excitement a few years ago when he described a “Lake Wobegon” effect in the results of K-12 achievement tests; i.e., a strong tendency for most if not all states to report that their students are above average on national test norms. In an independent evaluation, Linn, Graue, and Sanders (1990) reported data supporting Cannell’s assertion, though their findings and conclusions were not so sensational. The phenomenon is most commonly attributed to the repeated use of the same test form and comparison of results with old norms. Such data raise a number of questions, particularly as to what extent the score gains may be generalizable, and whether they are due to improved achievement or simply increasing familiarity with the specific material, enhanced by teaching to the test (Shepard, 1990).

22. See particularly the wide range of topics in the series of CRESST reports (www.ucla.edu/CRESST/pages/reports.htm) and issues of Assessment in Education: Principles, Policy, and Practice.