Download - Observing Chinese raters’ - British Council · Observing Chinese raters’ scoring performance ... Directions: ³ In many regions in China, industrial growth has brought about serious

Observing Chinese raters’ scoring performance and scoring results in EFL writing

Huiyuan Chen

陈慧媛

2013.11.5


Purpose of the study:

To explore:

1. Chinese raters’ (scoring teachers’) scoring tendencies by means of quantitative writing performance measures (wpm);

2. different ways of combining raters’ scoring results, compared with one another, so as to identify the best way of combining raters’ scores and to achieve comparatively better and more reliable combined scores for different groups of raters.

On the whole, it is a methodological exploration of how raters’ scoring

performance could be observed somewhat more objectively and directly, so that we

may learn more about how raters might differ specifically and what comparable and

observable results might be produced by comparing different ways of combining

raters’ scores.


The rationales:

This methodological exploration of more objective and more direct ways of observing and combining different raters’ scores was carried out in response to the following problems among raters in EFL writing:

Rater variability exists (Eckes, 2008; Knoch, 2011).

Inter-rater consistency is difficult to achieve (Engelhard, 1992; Weigle, 1994).

There is a difference between what raters think they are doing and what they actually do (McNamara, 1996).

Even with think-aloud methods, we may not be able to learn exactly what guides raters’ scoring or what actually goes on in a rater’s mind (Vaughan, 1991).

There is a need to combine multiple raters’ scores, and to predict scores, in the testing of writing.


The research questions:

1. Can raters’ performance be measured as we measure learners’ writing

performance in terms of the writing performance measures? In other words,

can raters’ scoring tendency underlying their different scoring results be

observed somewhat directly by means of wpm?

2. How can the application of the writing performance measures be of help in

finding a comparatively better way of combining different raters’ scores?


Assumptions for adopting wpm for the study

1. Quantifying learners’ performance in a measurable way may reveal more detailed, more specific and more comparable information, not only about learners’ performance but also about the scoring of their writing.

2. By properly applying statistical methods to the linguistic measures and the scoring results, we may be able to identify an individual scoring teacher’s tendency or priority in scoring students’ writing without resorting to the “think-aloud” method.

3. With quantitative linguistic performance measures, it may be possible to find out which measures are closely related to the holistic scoring results and how much of the holistic scoring the linguistic measures can account for.

4. The measures can be used to cross-check the results of holistic scoring and to obtain more detailed information about the raters in relation to their scoring, which might help increase the reliability of writing assessment or evaluation (for example, the idea of the study may be useful in rater training and rater grouping).


Defining writing performance measures (wpm):

In this study, quantitative writing performance measures (wpm) are quantitative indicators of linguistic features or traits, general or specific, found in EFL learners’ written texts.

Examples of the wpm measures:

W: the total number of words in a student’s written text

C/T: clauses per T-unit

E/W: the number of errors per word

EFC/C: error-free clauses per clause

T/T: type/token ratio, that is, the total number of different words used (types) divided by the total number of words in the text (tokens)
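To make the definitions above concrete, here is a minimal Python sketch of three of the measures (W, T/T and E/W). It is an illustration only, not the coding procedure used in the study: the function name `basic_wpm`, the tokenization rule and the sample sentence are assumptions, and the error count is taken as a hand-coded input because the sketch does not detect errors itself.

```python
import re

def basic_wpm(text, n_errors):
    """Return W, T/T and E/W for one text.

    n_errors is a hand-coded error count supplied by a human coder
    (assumption: this sketch does not identify errors automatically).
    """
    # Assumption: a token is a lower-cased run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    w = len(tokens)
    if w == 0:
        raise ValueError("text contains no words")
    return {
        "W": w,                          # total number of words
        "T/T": len(set(tokens)) / w,     # types divided by tokens
        "E/W": n_errors / w,             # errors per word
    }

# Hypothetical sample text and error count, for illustration only.
sample = "Water is precious. We must protect water and we must act now."
measures = basic_wpm(sample, n_errors=3)
```

Measures such as C/T or EFC/C would additionally require clause and T-unit segmentation, which in the study was done by human coders.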


Methods:

How the writing samples and the two sets of data were obtained:

1. The writing samples were obtained from local universities.

2. The first set of data consists of holistic scores given by three raters (teachers) from two universities, all of whom have experience of scoring in large-scale national examinations such as CET and TEM.

3. The second set of data consists of the coding results for the quantitative measures in students’ writing. The coding of students’ written texts was done by a group of teachers and graduates.


Obtaining the writing samples

Students in different years and different majors at a “211” university in the province were asked to do two writing tasks at an interval of one to two weeks.

For each of the two writing tasks, 150 students’ writings were randomly selected (see the following slides for the writing task instruments).

In the end, 300 pieces of task writing were collected and ready for analysis. The writing implementation and the collection of the writing samples are as follows:

Graph 1. Illustration of writing task implementations and the writing sample collection

[Figure: writing task implementation and writing sample collection. Groups shown: English majors and non-English majors; years of study: 1st, 2nd and 3rd; tasks: Task 1 (T1) and Task 2 (T2).]


Two writing tasks:

Task 1

Directions: “In many regions in China, industrial growth has brought about serious problems of

water pollution. If you were a policy maker and had to choose between promoting economic

development and saving your precious water, which one would you put in the first place?” Please

write an essay about the issue and make clear your position or decision with adequate examples

and explanations.


Task 2

Instructions: The table below shows the content of Cadmium (镉) in water in three districts in an

area as well as children’s physical build and growth (体格发育) in those three districts. Please

describe and discuss the findings given in the table and make your own conclusions based on the

information given in the table.

           Content of Cadmium (ppm)              Children’s physical build and growth (%)
Districts  Drinking water  Water for irrigation  Normal  Thin   Stout (short and fat)
South      0.03            0.33                  38.71   33.71  22.58
North      0.0115          0.0293                46.69   42.05  11.36
Central    0.0071          0.008                 57.57   30.79  10.27

(ppm = parts per million)


Holistic scoring of students’ writing by three raters

Six scales on the following aspects:

Content

Structure and organization

Vocabulary

Grammar and sentence structure

Writing format & mechanics

(See the chart for details)

Holistic scoring descriptions

Notes to the raters doing the holistic scoring:

1. The traits and points for the scoring or rating are given in the table below. Altogether, six scales representing six score levels from 100 down to 40 are recommended.

2. When you find it difficult to decide on or judge the piece of writing in hand, content and language are to be weighted equally.


Obtaining the data for writing performance measures

The 300 writing samples were coded for 66 measures (the 66 measures are based on a large-scale study of the wpm), yielding nearly 20,000 values (300 × 66 = 19,800).

The data were checked for reliability. Cronbach’s alpha was 0.847 (above .8), which indicates that the internal consistency of the data is acceptable.

The data were then checked for collinearity, that is, high correlation among measures. To ensure the validity or trustworthiness of the data, further data treatment was needed. For the specific data treatment, see the next slide.
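For readers unfamiliar with the reliability index mentioned above, the following is a minimal sketch of how Cronbach’s alpha can be computed from a cases-by-items table. The slides do not say which software was used or whether the 66 measures were standardized first, so the function `cronbach_alpha` and the toy data are assumptions for illustration only.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a table of cases (rows) by items (columns).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of row totals)
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item (column) across cases.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the per-case totals.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical toy data: 4 cases, 3 perfectly consistent items.
toy = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0],
]
alpha = cronbach_alpha(toy)  # perfectly consistent items give alpha = 1
```

In the study’s setting each row would be one writing sample and each column one coded measure.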


Data treatment and Statistical analysis

1. Correlation analysis was applied to all the data. If a measure showed a correlation higher than .70 with another measure, one of the two was removed.

2. After this data treatment, 37 measures were kept for further statistical analysis.

3. Regression was then applied with the 37 measures (which represent linguistic features found in students’ writing) as the independent variables and each rater’s holistic scoring results as the dependent variable.
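Step 1 of the data treatment can be sketched as follows. The slides do not say which member of a highly correlated pair was removed, so this sketch assumes the measure listed first is kept; the function names, the measure "T" and all values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def prune_correlated(columns, threshold=0.70):
    """Keep a measure only if |r| <= threshold with every measure kept so far.

    Assumption: of two highly correlated measures, the one listed first stays.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical values for five texts; "T" rises almost in step with "W".
data = {
    "W":   [120, 150, 180, 210, 240],
    "T":   [10, 13, 15, 18, 20],
    "E/W": [0.05, 0.02, 0.06, 0.03, 0.04],
}
kept = prune_correlated(data)  # "T" is dropped, "W" and "E/W" remain
```

The surviving measures would then serve as candidate predictors in the regression described in step 3.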


Results and comments:

Results concerning the first research question:

The relevant results are shown in Table 1 and Table 2.

Table 1 shows the measures that have entered the regression model for each individual

rater. The measures which have been included in the model indicate that those

measures have been selected by the model as having linear relations with that rater’s

scoring.

Table 1. Measures that entered the models for the three raters respectively

Raters   Measures that entered the model
Rater 1  W, EW/W, EFC/C, CNP/NP, Pas/FinV, EFT/T
Rater 2  W, EW/W, VT/V, CNP/NP, EFT/T, Pas/FinV
Rater 3  W, E/W, Pas/FinV, wm

Note: Measures that are present in all three raters’ models (W and Pas/FinV) were given in bold type on the original slide.


The explanations for each measure in Table 1 are given as follows:

1. W= total number of words in the text; (F)

2. EW/W= number of words involved in the errors to the total number of words; (A)

3. E/W= number of errors to the total number of words; (A)

4. EFC/C= error-free clauses per clause; (A)

5. EFT/T= error-free T-units per T-unit; (A)

6. wm= writing mechanics (sentence division problems such as punctuation, upper case, etc.); (A)

7. Pas/FinV= number of passives to the number of finite verbs; (SC)

8. CNP/NP= number of complex noun phrases per noun phrase; (LC)

9. VT/V= verb types per total verbs; (LC)

Notes: The capital letters in brackets represent the category or aspect each measure belongs to. “F” stands for “fluency”, “A” for “accuracy”, “SC” for “syntactic complexity” and “LC” for “lexical complexity”.


Explanations and comments:

There are similarities as well as differences among this group of raters in terms of wpm.

The similarities: 1) There are at least two accuracy measures in each rater’s model; for Rater 1 there are three (EW/W, EFC/C, EFT/T). 2) The number of words (W), which indicates fluency, is in all the raters’ models. 3) The ratio of passives (Pas/FinV), which indicates syntactic complexity, is in all the raters’ models, too.

Differences or variation among the raters: some raters tend to focus on writing mechanics (wm) (Rater 3), while others focus on the range of verb types (VT/V) (Rater 2).

The results here may imply: 1) Raters may have given more, and more persistent, attention to linguistic accuracy in their scoring. 2) Despite the focus on accuracy, their scoring is also balanced among the three major aspects of accuracy, fluency and complexity (SC, LC or both), because there is at least one measure of each aspect in their models.

As to the differences among raters, the results show that each rater did have their own priority or tendency, but it is not known why. Future studies are needed to answer this question.


Table 2. A comparison of the regression results for the three raters

Raters   R     R square  DW     Mean    Std. deviation  Residual minimum  Residual maximum
Rater 1  .656  .430      1.635  69.56   6.038           -20.42            18.85
Rater 2  .594  .351      1.486  67.02   6.045           -24.95            26.01
Rater 3  .632  .399      1.468  54.095  8.596           -31.4             26.47

R square: the proportion of variance in Y accounted for by X or Xs. Here, R square indicates the proportion of variance in each rater’s scores that can be accounted for by the measures entering the regression equation. It tells us how accurate the predicted scores in the regression model for each rater may be. The higher the R square, the better the predictions.

Residuals: the differences between the real scores and the predicted scores. The narrower the range between the minimum and the maximum residual, the better. Rater 1 has the narrowest range (-20.42 to 18.85), which means that for Rater 1 the gap between the real scores and the predicted scores is the smallest.
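The indices reported in Table 2 can be illustrated with a short sketch that takes actual and predicted scores and returns R square, the Durbin-Watson statistic (DW) and the residual extremes. It assumes the predictions come from a least-squares regression with an intercept, in which case 1 - SSE/SST is the proportion of variance explained. The function name and the five score pairs are hypothetical and are not data from the study.

```python
def regression_diagnostics(actual, predicted):
    """R square, Durbin-Watson and residual extremes for fitted scores."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mean_y = sum(actual) / len(actual)
    sse = sum(e * e for e in residuals)             # unexplained variation
    sst = sum((a - mean_y) ** 2 for a in actual)    # total variation
    # Durbin-Watson: squared successive residual differences over SSE;
    # values near 2 suggest little first-order autocorrelation.
    dw = sum((residuals[i] - residuals[i - 1]) ** 2
             for i in range(1, len(residuals))) / sse
    return {
        "r_square": 1 - sse / sst,
        "dw": dw,
        "residual_min": min(residuals),
        "residual_max": max(residuals),
    }

# Hypothetical scores for five texts, for illustration only.
actual = [60, 70, 65, 75, 80]
predicted = [62, 68, 66, 74, 80]
diag = regression_diagnostics(actual, predicted)
```

A narrower gap between `residual_min` and `residual_max` corresponds to the "smaller range of residuals" criterion used in the comparison of raters.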


Summary of Table 1 and Table 2:

1. Raters’ scoring tendencies can be observed and revealed with the help of wpm and adequate statistical analyses.

2. Similarities among the raters indicate that Chinese raters are accuracy oriented, because each rater has at least two measures of accuracy in their model. For Rater 1 they are EW/W, EFC/C and EFT/T; for Rater 2, EW/W and EFT/T; and for Rater 3, E/W and wm.

3. Rater 1 has the highest R square (.430), but it is not high enough (a commonly accepted value is .7). It is therefore necessary to see whether combining the three raters’ scores can achieve better models or better values in the regression. This leads to the second research question.


Three ways of combining all the raters’ scores:

1) Pure mathematical average scores marked as “average”;

2) Scores obtained by running multifaceted Rasch analysis on Facets 3.68.1, marked

as “Facets” ;

3) Weighted average with Rater 1 taking 40 percent and the other two raters each

taking 30 percent; this way of score combination is marked as “weighted.”
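The first and third ways of combining scores listed above are simple arithmetic and can be sketched as follows; the Facets (many-facet Rasch) scores come from dedicated software and are not reproduced here. The function name `combine` and the three example scores are hypothetical.

```python
def combine(scores, weights=None):
    """Combine one text's rater scores; equal weights give the plain average."""
    if weights is None:
        weights = [1 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("one weight per rater is required")
    if abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(s * w for s, w in zip(scores, weights))

# Hypothetical scores from Rater 1, Rater 2 and Rater 3 for one text.
raters = [70, 66, 54]
average = combine(raters)                     # (70 + 66 + 54) / 3
weighted = combine(raters, [0.4, 0.3, 0.3])   # 0.4*70 + 0.3*66 + 0.3*54
```

With these example scores the weighted result is pulled slightly toward Rater 1, which is the intended effect of giving that rater 40 percent.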

Table 3. Measures in the regression model for each of the three ways of score combination

Way of obtaining collective scores  Measures that entered the model in stepwise regression
Average                             W, EW/W, EFC/C, Pas/FinV, EFT/T, CNP, spE
Facets                              W, EW/W, VT/V, CNP/NP, EFT/T, Pas/FinV
Weighted                            W, EW/W, E/W, Pas/FinV, wm, CNP/NP


Comments on Table 3:

Table 3 shows the similarities and differences in the measures selected by the stepwise method for each way of score combination.

1. One particular result is that the measures appearing in the Facets and Weighted models also appear in the three raters’ individual models (see Table 1). For the Average, however, two new measures entered: CNP (the total number of complex noun phrases) and spE (spelling errors). This shows that the Average introduces new elements and thus deviates from all the other models.

2. From the results of Table 3 alone, it cannot be determined which way of combination is best and which should be selected for further statistical analyses for the next research question. For making this decision, certain indices of the regression are considered important; they are presented in Table 4.


Table 4. Important values of regression for the three ways of combining the raters’ scores

Combination  R     R square  DW     Mean    Std. deviation  Residual minimum  Residual maximum
Average      .742  .550      1.672  63.63   6.516           -16.07            16.42
Facets       .715  .511      1.672  63.926  6.496           -16.69            16.72
Weighted     .736  .542      1.704  64.20   6.424           -15.31            16.32

Comments on Table 4:

1. The highest R square is with the Average (.550), which is slightly higher than that of the Weighted (.542). By the other indices, however, the Average is not the best.

2. Though the differences in values among the three are not great, the best model seems to be the Weighted, because it has a comparatively high R square, a more reasonable Durbin-Watson value (1.704, the closest to 2) and the narrowest range of residuals (-15.31 to 16.32).


Table 5. Summary of regression results in comparison with the true scores (the weighted average)

                  Minimum  Maximum  Mean   Std. deviation  R square  Residual minimum  Residual maximum
True scores       40       86       63.64  8.831
Predicted scores  43.81    83.39    64.2   6.424           .54       -15.31            16.72


Conclusions:

1. First, the study shows that a rater’s performance or specific rating tendency can be revealed by making use of writing performance measures and statistical tools. It shows not only how raters differ but also how they may be similar. The study therefore has clear implications for rater training and for rater selection or grouping for the purpose of scoring.

2. Second, combining raters’ scores does produce better scores than depending on one individual rater’s scores, as indicated by the higher R squares.

3. Third, by comparing different ways of combining raters’ scores, the study also found that, for this group of raters, weighted averages were a little better than the Facets scores. This does not mean that Facets is not as good; rather, it shows that there are alternatives for obtaining combined scores and that the best way of combining rater scores may be case specific. That is, it is necessary to check which kind of combination is best for each different group of raters.

4. As this is an exploration of methods for revealing rater variability in relation to scoring results, the specific results of the study are not that important. What really matters is the idea of how raters’ scoring performance, and the scoring principles underlying their scoring results, can be observed.


References:

Eckes, T. 2008. Rater types in writing performance assessments: A classification

approach to rater variability. Language Testing, 25(2), 155-185.

Engelhard, G. 1992. The measurement of writing ability with a many-faceted Rasch

model. Applied Measurement in Education, 5(3), 171-191.

Knoch, U. 2011. Investigating the effectiveness of individualized feedback to rating

behavior – a longitudinal study. Language Testing, 28(2), 179-200.

McNamara, T. 1996. Measuring second language performance. London & New York:

Longman.

Vaughan, C. 1991. Holistic assessment: What goes on in the rater’s mind? In L.

Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp.

111-125). Norwood, NJ: Ablex.

Weigle, S.C. 1994. Effects of training on raters of English as a second language

compositions: Quantitative and qualitative approaches. Unpublished PhD dissertation,

University of California, Los Angeles.
