
Scoring Direct Writing Assessments: What Are the Alternatives?

Ina V. S. Mullis
Educational Testing Service

Before a scoring alternative can be selected for a direct writing assessment, certain crucial questions must be asked and decisions made about the type of information desired. These questions must be answered before the assessment is conducted if the best scoring alternative is to be selected.

Such questions can be categorized as what, who, and why questions. The “what do we want to know” questions seek to define the major purposes for collecting achievement data. The “who do we want to know it about” questions try to determine the level of information required. The “do we want to know why” questions center on collecting data to generate or test hypotheses about why particular results occurred.

Alternatives for scoring open-ended responses to writing samples fall into three broad categories: (a) systems based on “scales” or score point values; (b) systems based on counting features or attributes (e.g., mechanical errors, cohesive ties, or syntactical structures) present in the writing; and (c) systems based on descriptive classifications. This article addresses systems based on score point values. Counting systems are expensive, and descriptive systems do not provide the qualitative results required by most direct writing assessments.

In score point systems, student responses are arranged according to established quality criteria. Raters place each response at the score point along the scale best describing that paper and others like it. Of course, the kinds of score point systems currently used in writing evaluation are quite varied. Differences include the number of score points, the detail of the criteria for the score points, and whether passing criteria are inherent in the scale. Procedures used to define score points also are varied, ranging from analysis of the corpus of papers to be scored to establish relative definitions, to using absolute definitions established prior to collecting students’ responses. Some systems consider all characteristics of writing, some consider several of them, and some focus on single characteristics.

Three major classes of score point evaluation systems are holistic scoring, primary trait scoring, and analytic scoring. Each is described in the following sections.

Ina V. S. Mullis is associate director of the National Assessment of Educational Progress (NAEP) at Educational Testing Service, Princeton. Previously she was director of Assessment Development, NAEP, at the Education Commission of the States. A specialist in assessment methods and procedures, she has coordinated NAEP’s writing assessment program for the last 12 years. She is a member of the National Conference on Research in English and the National Council of Teachers of English Standing Committee on Research.

Holistic Scoring

Holistic scoring is based on the theory that a piece of writing is greater than any of its parts and that English teachers can recognize good writing when they see it (Conlan, 1978). Readers are asked to make a single, global quality judgment about each paper, reading rapidly for total impression. They purposely do not focus upon particular aspects of a paper such as organization, mechanics, or ideas.

Originally, readers judged papers with little or no guidance as to detailed standards. They scored papers according to their own standards for the population being assessed and the topic included in the assessment. General standards were established prior to the evaluations by having readers rate some selected sample papers individually and then compare their scores.

Because little training time and minimal reading time were involved, this system was efficient, yet it had obvious flaws. Although it was more reliable than one might expect, any differences in standards among readers contributed to the measurement error. Experienced readers have internalized criteria for good and bad writing, but these criteria differ from reader to reader.

Because the reliability of scores is primarily a function of the number of different topics and the number of different readings included, efficiencies gained in actual reading time were often offset by having each paper read several times. For example, if one included as many as five different topics and each topic was read by five different readers, the reading reliability of the total score approximated .92 and the score (i.e., inter-topic) reliability approximated .84 for the samples. In contrast, for one topic read by one reader, the corresponding figures were .40 and .25, respectively (Godshalk, Swineford, & Coffman, 1966).
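
The general relationship between the number of independent readings and the reliability of the resulting score can be illustrated with the standard Spearman-Brown prophecy formula. The sketch below is only an approximation under the assumption of parallel readings; it is not the model Godshalk, Swineford, and Coffman used, and it does not reproduce their empirical estimates exactly.

    def spearman_brown(single_reliability: float, k: int) -> float:
        """Project the reliability of a score based on k parallel readings
        from the reliability of a single reading."""
        r = single_reliability
        return k * r / (1 + (k - 1) * r)

    # Score reliability for one topic read by one reader, as reported above.
    r_single = 0.25

    # Five topics, each read by five different readers, gives 25 readings.
    print(round(spearman_brown(r_single, 25), 2))  # about .89, versus the empirical .84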


As developed more recently by the Educational Testing Service (ETS) to score the English Composition Test (Conlan, 1978), the Advanced Placement test in English (Smith, 1975), and National Assessment of Educational Progress (NAEP) writing exercises (NAEP, 1980), standard papers are selected to represent the various score points and raters are carefully trained to become calibrated to reach consensus. In addition, raters are trained to use the full range of scores available to approximate a normal distribution. For example, if a 6-point scale is to be used, papers are selected to illustrate six levels of quality. These standard or “anchor” papers are sometimes accompanied by brief guidelines, rubrics, or feature lists describing general attributes of papers in each quality level.

Training involves discussion of specific features of papers, beginning with presentation of the anchors and guidelines and proceeding to practice scoring of selected sample papers. Throughout the subsequent scoring, periodic discussions and continual monitoring insure that the standards do not vary. Using these procedures, reader consistency for two readers at ETS and NAEP tends to range from .80 to .95.
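
The article does not specify the statistic behind these reader consistency figures; one common way to express consistency between two readers is the correlation between the scores they assign to the same papers. The sketch below uses invented holistic scores purely for illustration.

    from statistics import correlation  # available in Python 3.10+

    # Hypothetical holistic scores (1-6 scale) assigned by two trained readers
    # to the same ten papers; these data are invented for illustration.
    reader_a = [4, 5, 2, 6, 3, 4, 1, 5, 3, 6]
    reader_b = [4, 4, 2, 6, 3, 5, 2, 5, 3, 6]

    print(round(correlation(reader_a, reader_b), 2))  # roughly .94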

Holistic scoring has been developed primarily to rank people according to overall writing proficiency; thus, the improved standard-setting and training procedures are appropriate. Yet it should be remembered that, although experts initially rated papers according to their own standards for good and bad writing, the students themselves now set the standards. Thus, the score point values are dictated by the quality of the papers collected in the assessment.

A paper with a higher score is better than a paper with a lower score, and a paper with a particular score is, in a broad sense, similar to the standard papers (and guidelines) that represent that score point on the scale. Therefore, holistic scoring is an excellent way to demonstrate the range of quality that exists in a particular population of students and to rank those students.

The concern arises when the data are used immediately to establish proficiency in a concrete sense. Because the standards are relative to the papers collected, one cannot assume that the better papers are good or that the poorer papers are bad. For decisions about proficiency or competency, standards must be defined and additional analysis conducted to determine which score points meet various levels of quality.

Primary Trait Scoring System

The primary trait scoring system as developed by NAEP is based on the theory that most writing is addressed to an audience and is done for a purpose, and that degrees of success in accomplishing that purpose are definable in concrete terms. Hence, the major steps in implementing the primary trait scoring system are to develop a model or broad definition of the universe of writing purposes, select and refine areas of that universe to measure, design valid writing tasks that sample those areas, devise useful and workable scoring guides, and implement those guides reliably (Lloyd-Jones, 1977).

Many schemes exist for describing the reasons people write, but three major areas have formed the foundation of the primary trait scoring system. Informational or explanatory writing is used to share knowledge and convey messages, instructions, and ideas. Persuasive writing attempts to bring about some action or change. Literary or imaginative writing provides a special way of sharing experiences and understandings in a variety of forms.

In the primary trait system, the development of a writing task is more than devising an engaging prompt. A purpose for writing is determined (e.g., persuasion: convincing a reticent audience), a task that requires writing is developed (a letter persuading the landlord to let you keep your puppy), the basic trait of a successful response to the task is identified (reasons or appeals appropriate to landlords), and then a rationale describing the relationship of the task to the purpose and the primary trait is written. Finally, a scoring guide outlining levels of success on the trait is developed (Mullis, 1980).

Generally, scoring guides define four levels of proficiency in the primary skill being assessed. Level 1 indicates little or no evidence of the skill; Level 2, marginal evidence; Level 3, competent or solid performance; and Level 4, very good performance. For persuasive writing, presentation of compelling reasons and evidence is the broadly defined primary trait. Generally, a “1” paper would present little or no evidence, a “2” would have few or inappropriate reasons, a “3” would be well thought out with several appropriate reasons, and a “4” would be well organized with reasons supported by compelling details. Although guidelines are consistent across similar kinds of writing, the specific task assigned determines the exact criteria.
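
The relationship among purpose, task, trait, and score levels can be sketched as a simple data structure. Everything below is illustrative only: the field names are invented, and the wording paraphrases the landlord example and the four-level guide described above rather than quoting an actual NAEP scoring guide.

    # A minimal sketch of a primary trait scoring guide, assuming the
    # four-level structure described above; all wording is illustrative.
    scoring_guide = {
        "purpose": "persuasion (convincing a reticent audience)",
        "task": "a letter persuading the landlord to let you keep your puppy",
        "primary_trait": "reasons or appeals appropriate to landlords",
        "levels": {
            1: "little or no evidence of the skill",
            2: "few or inappropriate reasons",
            3: "several appropriate, well-thought-out reasons",
            4: "well organized, with reasons supported by compelling details",
        },
    }

    def describe(score: int) -> str:
        """Return the criterion a reader would match a paper against."""
        return scoring_guide["levels"][score]

    print(describe(3))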

Guides are developed from a priori expectations about what successful responses would include and are then adjusted and amplified on the basis of field test and assessment papers. The final guides given to scorers consist of a one-page description of the trait, rationale, and levels of proficiency, as well as sample papers typifying responses in each score point.

When a reader is rating papers for primary trait scoring, he or she is rating each paper against the criteria spelled out in the scoring guide. Thus, the scoring guides must be clearly understood, and readers must be able to make decisions based directly on the philosophy underlying each category.

Such understanding is achieved through thorough and systematic training that includes discussion of the task, the guide, and the features of papers exemplifying each category. Practice scoring is done with prescored sample papers. Questions are answered and distinctions between categories are clarified until readers are scoring sample papers consistently. By constructing tasks and scoring guides carefully and training rigorously, NAEP maintains percentages of exact score point agreement of over 90% between readers. Development and training time is somewhat greater than with the holistic method, but the actual time taken to evaluate papers is about the same. The high reader agreement rate makes use of a single reader a reasonable alternative.
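
Exact score point agreement is a simpler statistic than a correlation: it is just the proportion of papers to which two readers assign identical scores. A small sketch, using invented primary trait scores, is shown below.

    # Hypothetical primary trait scores (1-4) from two readers on the same
    # ten papers; the data are invented for illustration.
    reader_1 = [3, 2, 4, 1, 3, 3, 2, 4, 1, 2]
    reader_2 = [3, 2, 4, 1, 3, 2, 2, 4, 1, 2]

    # Exact agreement: share of papers receiving identical scores.
    exact_agreement = sum(a == b for a, b in zip(reader_1, reader_2)) / len(reader_1)
    print(f"{exact_agreement:.0%}")  # 90%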

Because scoring is conducted according to clearly defined goals, the results can provide concrete, reliable, specific, and useful information. It is clear why a paper received the score it did and what would have to be done to improve that score. In addition, primary trait scoring distributes papers according to their relationship to the scoring criteria. If few papers meet the criteria for the highest rating, then instruction in the task area is warranted. If all or most of the papers fall in the three or four range, the objective can be considered achieved.


Analytic Scoring

In analytic scoring, prominent single characteristics of writing are identified and each is rated according to quality. Scores from individual characteristics are then totaled to produce an overall score. The number of characteristics generally ranges from about four to twelve, with two to eight score points defined for each feature. For example, a well-known analytic system developed by ETS (Diederich, 1974) includes eight features: ideas, organization, wording, flavor, usage, punctuation, spelling, and handwriting. Each feature is rated on a five-point scale, and the ratings for ideas and organization receive double weight in the overall score. In contrast, the system used by Illinois and described in this issue is based on five characteristics, with each rated on a six-point scale.
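
Because the weighting and totaling rule is spelled out above, the arithmetic can be shown directly. The ratings below are invented, and the aggregation simply follows the description (eight features rated on a five-point scale, with ideas and organization double-weighted); it is not claimed to reproduce Diederich's published worksheet.

    # Sketch of an analytic total score following the description above:
    # eight features rated 1-5, ideas and organization double-weighted.
    weights = {
        "ideas": 2, "organization": 2, "wording": 1, "flavor": 1,
        "usage": 1, "punctuation": 1, "spelling": 1, "handwriting": 1,
    }

    # Invented ratings for a single paper, for illustration only.
    ratings = {
        "ideas": 4, "organization": 3, "wording": 4, "flavor": 3,
        "usage": 5, "punctuation": 4, "spelling": 5, "handwriting": 3,
    }

    total = sum(weights[f] * ratings[f] for f in weights)
    print(total)  # 38 out of a possible 50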

Whereas holistic scoring is designed to describe the overall effect of characteristics working in concert, or the sum of the parts, analytic scoring is designed to describe individual characteristics or parts and total them in a meaningful way to arrive at an overall score. Although the effect of a whole piece of writing may be different from the sum of its parts, analytic scoring provides an analysis of the strengths and weaknesses of each paper and a record of why the paper received the score it did.

The fundamental task in analytic scoring is choosing the characteristics that will be rated and then defining the score values for various levels of success or quality. Characteristics chosen are generally those important to any piece of writing in any situation (e.g., organization, content, and mechanics). However, results are often more useful if the characteristics are derived from writing done for particular purposes and audiences. For example, characteristics selected for persuasive writing might include clarity of position, support, and tone, whereas characteristics for a story might include plot, sequence, and character development. Unfortunately, there is often a tendency to choose characteristics such as grammar and mechanics (e.g., spelling, punctuation, capitalization, agreement, word usage). Such characteristics are easy to define and rate, but the net effect is one of heavily weighting or defining the quality of writing in terms of conventions rather than in terms of content, thoughtful presentation, and effectiveness of communication.

Once characteristics are selected, score values must be defined and criteria established. Readers learn to use the scales by studying the descriptions of the values and sample papers illustrative of those descriptions. Training proceeds much like that for primary trait scoring, with raters scoring previously rated papers and discussing the results until agreement is reached and raters appear to be scoring consistently. As with holistic and primary trait scoring, careful training can produce a high degree of reader reliability (Diederich, 1974). However, the time necessary for training and scoring often is increased, particularly if a large number of characteristics make up the scale.

Summary

Holistic scoring provides information about the range of overall writing quality exhibited by a population of students and determines which students are more proficient than others. However, it does not provide specific prescriptive or diagnostic information. One never knows precisely why papers received certain scores. Subsequent analysis of the papers is required to determine particular proficiencies and deficiencies of students.

Primary trait scoring provides specific information about student success in accomplishing particular kinds of writing tasks but does not provide information about mechanical aspects of writing or general fluency. It focuses on whether the content, ideas, and organization of the writing communicate the necessary information in an effective way.

Analytic scoring, if implemented thoughtfully, can be a relatively efficient way to provide diagnostic information about the specific strengths and weaknesses of each student. However, examining writing only as a collection of parts has obvious drawbacks. Characteristics such as content, focus, organization, elaboration, and clarity are often inextricably intermeshed in a single piece of writing. The whole can be better or worse than the sum of the parts. Too, the overall score tends to be reported as overall quality, whereas the analysis may have omitted some crucial characteristics.

For certain purposes, the most efficient and beneficial scoring system may be an adaptation or modification of an existing system. For example, the focused holistic system used by the Texas Assessment Program described in this issue can be thought of as a combination of the impressionistic holistic and primary trait scoring systems. Although this system examines all aspects of the writing, ability to respond appropriately to purpose and audience in the given writing situation is given primary consideration. In other modified holistic methods, characteristics important to writing are defined much as they would be for analytic scoring, yet the characteristics are rated together holistically in one set of score point values. In some cases, one of the characteristics is the primary trait.

Finally, it should be emphasized that the varying strengths and weaknesses of individual scoring systems suggest that it is best to use several systems whenever possible. For example, after the 1980 assessment, NAEP used three score point systems (primary trait, holistic, and an analytic scale for coherence) and two counting systems (mechanical error analysis and T-unit analysis). Many states are using both a holistic or primary trait system and an analytic system to obtain additional specific information about the papers in specific score point values.

References

Conlan, G. (1978). How the Essay in the College Board English Composition Test Is Scored: An Introduction to the Reading for Readers. Princeton, NJ: Educational Testing Service.

Diederich, P. B. (1974). Measuring Growth in English. Urbana, IL: National Council of Teachers of English.

Godshalk, F. I., Swineford, F., & Coffman, W. E. (1966). The Measurement of Writing Ability. Princeton, NJ: College Entrance Examination Board.

Lloyd-Jones, R. (1977). Primary Trait Scoring. In C. Cooper & L. Odell (Eds.), Evaluating Writing: Describing, Measuring, Judging. Urbana, IL: National Council of Teachers of English.

Mullis, I. V. S. (1980). Using the Primary Trait System for Evaluating Writing (No. 10-2-51). Denver, CO: NAEP, Education Commission of the States.

National Assessment of Educational Progress. (1980). Writing Achievement 1969-79: Results from the Third National Writing Assessment (Volumes I, II, and III). Denver, CO: NAEP, Education Commission of the States.

Smith, R. (1975). Grading the Advanced Placement English Examination. Princeton, NJ: College Entrance Examination Board.
