Observing Chinese raters’
scoring performance
and scoring results
in EFL writing
Huiyuan Chen
陈慧媛
2013.11.5
Purpose of the study:
To explore
1. the scoring tendencies of Chinese teacher-raters by means of
quantitative writing performance measures (wpm);
2. different ways of combining raters’ scores, comparing them in order
to identify the way that yields the most reliable combined scores for
a given group of raters.
On the whole, it is a methodological exploration of how raters’ scoring
performance can be observed more objectively and directly, so that we
may learn more about how specifically raters differ and what comparable,
observable results different ways of combining raters’ scores produce.
The rationale:
This methodological exploration of more objective and more direct ways of
observing and combining different raters’ scores was carried out in
response to the following known problems among raters in EFL writing:
the existence of rater variability (Eckes, 2008; Knoch, 2011)
the difficulty of achieving inter-rater consistency (Engelhard, 1992; Weigle,
1994)
the difference between what raters think they are doing and what they
actually do (McNamara, 1996)
the limits of think-aloud methods, which may not capture exactly what
guides raters’ scoring or what actually goes on in a rater’s mind
(Vaughan, 1991)
the need to combine multiple raters’ scores and to predict scores in
the testing of writing
The research questions:
1. Can raters’ performance be measured as we measure learners’ writing
performance in terms of the writing performance measures? In other words,
can raters’ scoring tendency underlying their different scoring results be
observed somewhat directly by means of wpm?
2. How can the application of the writing performance measures be of help in
finding a comparatively better way of combining different raters’ scores?
Assumptions for adopting wpm for the study
1. Quantifying learners’ performance in a measurable way may reveal more
detailed, more specific and more comparable information about not only
learners’ performance but also the scoring of the writing.
2. By properly applying statistical methods to the linguistic measures and
the scoring results, we may be able to identify an individual teacher-rater’s
tendency or priority in scoring students’ writing without resorting to the
“think-aloud” method.
3. With quantitative linguistic performance measures it may be possible to
find out which measures are closely related to the holistic scoring results
and how much of the holistic scoring the linguistic measures can account
for.
4. The measures may serve to cross-check the results of holistic scoring and
to yield more detailed information about raters in relation to their scoring,
which might help increase the reliability of writing assessment or
evaluation (for example, the idea of the study may be useful in rater
training and rater grouping).
Defining writing performance measures (wpm):
Quantitative writing performance measures (wpm) in this study are quantitative
indicators of linguistic features or traits, general or specific, found in
EFL learners’ written texts.
Examples of the wpm measures:
W: the total number of words in a student’s written text
C/T: clauses per T-unit
E/W: the number of errors per word
EFC/C: error-free clauses per clause
T/T: type/token ratio: the total number of different words used (types) divided by
the total number of words in the text (tokens)
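The example measures above are simple counts and ratios, so they can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `wpm_ratios` and its parameters are hypothetical, the counts of errors, clauses and T-units are assumed to come from human coding, and lowercasing words before counting types is a choice made here, not something the study specifies.

```python
def wpm_ratios(tokens, n_errors, n_clauses, n_error_free_clauses, n_t_units):
    """Return a few of the example measures for one coded text.

    All counts except the word list are assumed to be supplied by human coders.
    """
    n_words = len(tokens)                    # W: total number of words
    types = {t.lower() for t in tokens}      # distinct word forms (assumption: case-insensitive)
    return {
        "W": n_words,
        "C/T": n_clauses / n_t_units,               # clauses per T-unit
        "E/W": n_errors / n_words,                  # errors per word
        "EFC/C": n_error_free_clauses / n_clauses,  # error-free clauses per clause
        "T/T": len(types) / n_words,                # type/token ratio
    }


# Toy text: 9 tokens, 6 distinct word forms
sample = "The water is dirty and the river is dirty".split()
measures = wpm_ratios(sample, n_errors=1, n_clauses=2,
                      n_error_free_clauses=1, n_t_units=1)
```

With this toy input the type/token ratio is 6/9, since "the", "is" and "dirty" each occur twice.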
Methods:
How the writing samples and the two sets of data were obtained:
1. The writing samples were obtained from local universities.
2. The first set of data consists of holistic scores given by three raters (teachers)
from two universities; all have experience of scoring in large-scale
national examinations such as CET and TEM.
3. The second set of data consists of the coding results for the quantitative measures in
students’ writing; the coding of the students’ written texts was done by a
group of teachers and graduates.
Obtaining the writing samples
Students in different years and different majors at a 211 university in the province were
asked to do two writing tasks at an interval of one to two weeks.
For each of the two tasks, 150 students’ writings were randomly selected (see
the next slide for the writing task instruments).
In the end, 300 pieces of task writing were collected for analysis. The
implementation of the writing tasks and the collection of the writing samples were as follows:
Graph 1. Illustration of writing task implementation and writing sample collection
[Graph 1 (image not reproduced): English and Non-English groups, by year of study (1st, 2nd and 3rd year), each completing Task 1 (T1) and Task 2 (T2). Panel title: Writing task implementation & writing sample collection]
Two writing tasks:
Task 1
Directions: “In many regions in China, industrial growth has brought about serious problems of
water pollution. If you were a policy maker and had to choose between promoting economic
development and saving your precious water, which one would you put in the first place?” Please
write an essay about the issue and make clear your position or decision with adequate examples
and explanations.
Task 2
Instructions: The table below shows the content of Cadmium (镉) in water in three districts in an
area as well as children’s physical build and growth (体格发育) in those three districts. Please
describe and discuss the findings given in the table and make your own conclusions based on the
information given in the table.
            Content of Cadmium                      Children’s physical build and growth
Districts   Drinking water   Water for irrigation   Normal   Thin     Stout (short and fat)
            (ppm)            (ppm)                  (%)      (%)      (%)
South       0.03             0.33                   38.71    33.71    22.58
North       0.0115           0.0293                 46.69    42.05    11.36
Central     0.0071           0.008                  57.57    30.79    10.27
(ppm = parts per million, 百万分之一)
Holistic scoring of students’ writing by three raters
Six score levels, judged on the following aspects:
Content
Structure and organization
Vocabulary
Grammar and sentence structure
Writing format & mechanics
(See the chart for details)
Holistic scoring descriptions
Notes to the raters doing the holistic scoring:
1. The traits and points for the scoring are given in the table below. Altogether 6 scales,
representing six levels of scores from 100 down to 40, are recommended.
2. When you find it difficult to judge a piece of writing, weigh content and language
equally.
Obtaining the data for writing performance measures
The 300 writing samples were coded for 66 measures (the 66 measures are
based on a large-scale study of the wpm), which yielded nearly 20,000
values.
The data were checked for reliability; Cronbach’s alpha was
0.847 (above .8), which indicates that the internal consistency of the data is
acceptable.
The data were also checked for collinearity, that is, high correlation among measures. To
ensure the validity and trustworthiness of the data, further data treatment was
needed. For the specific data treatment, see the next slide.
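For readers unfamiliar with the reliability index mentioned above, the following is a minimal sketch of the standard Cronbach’s alpha formula, alpha = k/(k-1) × (1 − sum of item variances / variance of the row totals). The slides do not say how the data were arranged for this check, so treating each writing sample as a row and each measure as an item is an assumption, and the function name `cronbach_alpha` and the toy numbers are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table of cases (rows) by items (columns).

    Assumption: every row has the same number of items and items are on
    comparable scales; real wpm data might need standardizing first.
    """
    k = len(rows[0])
    item_variances = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data: 4 cases, 3 items (not the study's data)
toy = [[2, 3, 3], [4, 4, 5], [5, 5, 4], [1, 2, 2]]
alpha = cronbach_alpha(toy)
```

For the toy table the item variances sum to 5.0 and the variance of the row totals is 13.5, giving an alpha of about 0.94.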
Data treatment and statistical analysis
1. Correlation analysis was applied to all the data. If a measure
showed a correlation higher than .70 with another one, one of the two
was removed.
2. After this data treatment, 37 measures were kept for further statistical analysis.
3. Regression was then applied, with the 37 measures (which represent
linguistic features found in students’ writing) as the independent
variables and each rater’s holistic scores as the dependent
variable.
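The first step above, removing one measure from each highly correlated pair, can be sketched as follows. This is an illustration only: the slides do not say which member of a pair was removed or whether the .70 cut-off was applied to the absolute value, so keeping the measure encountered first and using the absolute correlation are both assumptions, and the names `pearson` and `drop_collinear` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_collinear(columns, threshold=0.70):
    """Keep a measure only if |r| <= threshold with every measure kept so far.

    columns: dict mapping measure name -> list of values (one per text).
    Assumption: the first measure of a correlated pair is the one retained.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (not the study's): T rises almost in step with W, so it is dropped
toy = {
    "W": [120, 150, 180, 210, 240],
    "T": [10, 13, 15, 18, 20],
    "E/W": [0.05, 0.02, 0.06, 0.03, 0.05],
}
kept = drop_collinear(toy)
```

In practice the choice of which measure to keep would probably be made on substantive grounds rather than by order of appearance.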
Results and comments:
Results concerning the first research question:
The relevant results are shown in Table 1 and Table 2.
Table 1 shows the measures that entered the regression model for each individual
rater. A measure’s inclusion in a model indicates that the model selected it as
having a linear relation with that rater’s scoring.
Table 1. Measures that entered the models for the three raters
Raters     Measures that entered the model
Rater 1    W, EW/W, EFC/C, CNP/NP, Pas/FinV, EFT/T
Rater 2    W, EW/W, VT/V, CNP/NP, EFT/T, Pas/FinV
Rater 3    W, E/W, Pas/FinV, wm
Note: The measures present in all three raters’ models are W and Pas/FinV.
The explanations for each measure in Table 1 are as follows:
1. W = total number of words in the text; (F)
2. EW/W = number of words involved in errors to the total number of words; (A)
3. E/W = number of errors to the total number of words; (A)
4. EFC/C = error-free clauses per clause; (A)
5. EFT/T = error-free T-units per T-unit; (A)
6. wm = writing mechanics (sentence division problems such as punctuation, upper case,
etc.); (A)
7. Pas/FinV = number of passives to the number of finite verbs; (SC)
8. CNP/NP = number of complex noun phrases per noun phrase; (LC)
9. VT/V = verb types per total verbs; (LC)
Notes: The capital letters in brackets represent the aspect
each measure belongs to. “F” stands for “fluency”, “A” for
“accuracy”, “SC” for syntactic complexity and “LC” for lexical complexity.
Explanations and comments:
There are similarities as well as differences among this group of raters in
terms of wpm.
The similarities: 1) There are at least two accuracy measures in each
rater’s model; for Rater 1 there are three (EW/W, EFC/C, EFT/T). 2)
The number of words (W), which indicates fluency, is in
all raters’ models. 3) The ratio of passives (Pas/FinV), which indicates
syntactic complexity, is in all the raters’ models, too.
Differences or variation among the raters: some raters tend to focus on
writing mechanics (wm) (Rater 3), while others focus on the range of verb types
(VT/V) (Rater 2).
The results here may imply: 1) Raters may have given more, and more persistent,
attention to linguistic accuracy in their scoring. 2) Despite the
focus on accuracy, their scoring is also balanced among the three major
aspects, accuracy, fluency and complexity (SC, LC or both), because
there is at least one measure of each aspect in every model.
As to the differences among raters, the results show that each rater did have a
priority or tendency, but it is not known why. Future studies are needed to address
that question.
Table 2. A comparison of the regression results for the three raters
Raters     R      R square   DW      Mean     Std. Deviation   Residual minimum   Residual maximum
Rater 1    .656   .430       1.635   69.56    6.038            -20.42             18.85
Rater 2    .594   .351       1.486   67.02    6.045            -24.95             26.01
Rater 3    .632   .399       1.468   54.095   8.596            -31.4              26.47
R square: the proportion of variance in Y accounted for by X or Xs. That is to say, R
square indicates the proportion of the variance in each rater’s scores that can be
accounted for by the measures entering the regression equation. It tells us how
closely the scores predicted by the regression model match each rater’s actual scores. The
higher the R square is, the better the predictions will be.
Residuals: the range between the minimum and the maximum residual; the
smaller the better. Rater 1 has the smallest range. This means that
for Rater 1, the gap between the actual scores and the predicted scores is the narrowest.
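The indices reported in the table can all be derived from a rater’s actual scores and the scores predicted by the regression model. The sketch below shows one way to compute them; it is not the procedure used in the study. The function name `regression_summary` and the toy scores are hypothetical, and residuals are assumed to be actual minus predicted. For an ordinary least-squares model with an intercept, the R square computed this way equals the square of R.

```python
def regression_summary(actual, predicted):
    """R square, Durbin-Watson statistic and residual range for one model."""
    residuals = [a - p for a, p in zip(actual, predicted)]  # assumption: actual - predicted
    mean_actual = sum(actual) / len(actual)
    ss_res = sum(e * e for e in residuals)                  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total variation
    # Durbin-Watson: squared differences of successive residuals over ss_res;
    # values near 2 suggest little autocorrelation among the residuals.
    dw = sum((residuals[i] - residuals[i - 1]) ** 2
             for i in range(1, len(residuals))) / ss_res
    return {
        "R square": 1 - ss_res / ss_tot,
        "DW": dw,
        "Residual minimum": min(residuals),
        "Residual maximum": max(residuals),
    }


# Toy scores (not the study's data)
actual = [60, 70, 80, 65, 75]
predicted = [62, 69, 77, 66, 76]
summary = regression_summary(actual, predicted)
```

For the toy scores the residuals are -2, 1, 3, -1 and -1, so the residual range runs from -2 to 3.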
Summary of Table 1 and Table 2:
1. Raters’ scoring tendencies can be observed and revealed with the help of
wpm and adequate statistical analyses.
2. Similarities among raters indicate that Chinese raters are accuracy
oriented, because each rater has at least two measures of accuracy in
their model. For Rater 1 they are EW/W, EFC/C and EFT/T; for Rater 2,
EW/W and EFT/T; and for Rater 3, E/W and wm.
3. Rater 1 has the highest R square value, but it is not high enough
(the commonly accepted level is .7). Thus it is necessary to see whether
combining the three raters’ scores can achieve better regression models or better
values. That leads to the second research
question.
Three ways of combining all the raters’ scores:
1) Pure mathematical average scores marked as “average”;
2) Scores obtained by running multifaceted Rasch analysis on Facets 3.68.1, marked
as “Facets” ;
3) Weighted average with Rater 1 taking 40 percent and the other two raters each
taking 30 percent; this way of score combination is marked as “weighted.”
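The first and third ways of combining scores listed above are simple arithmetic and can be sketched as below. The function names and the toy ratings are hypothetical; the 40/30/30 weights are the ones stated in the slides, with Rater 1 first. The Facets scores come from a many-facet Rasch analysis run in dedicated software and are not reproduced here.

```python
def combine_average(scores):
    """Plain arithmetic mean of the raters' scores for one text."""
    return sum(scores) / len(scores)


def combine_weighted(scores, weights=(0.4, 0.3, 0.3)):
    """Weighted average; scores are ordered Rater 1, Rater 2, Rater 3."""
    return sum(w * s for w, s in zip(weights, scores))


# Toy ratings for one text (not the study's data)
ratings = (70, 65, 55)
average = combine_average(ratings)    # 190 / 3, about 63.3
weighted = combine_weighted(ratings)  # 0.4*70 + 0.3*65 + 0.3*55, about 64
```

Because Rater 1 gave the highest toy score, the weighted result sits slightly above the plain average.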
Table 3. Measures in the regression model for each of the three ways of score combination
Ways of obtaining collective scores   Measures that entered the model in stepwise regression
Average                               W, EW/W, EFC/C, Pas/FinV, EFT/T, CNP, spE
Facets                                W, EW/W, VT/V, CNP/NP, EFT/T, Pas/FinV
Weighted                              W, EW/W, E/W, Pas/FinV, wm, CNP/NP
Comment on Table 3:
Table 3 shows the similarities and differences in the measures selected by the
stepwise method for each way of score combination.
1. One notable result is that the measures appearing in the Facets and
Weighted models also appear in the three raters’ individual models (see Table 1).
However, for the Average, two new measures entered: CNP (the
total number of complex noun phrases) and spE (spelling errors). This
shows that the Average introduces new elements and thus
deviates from all the other models.
2. From the results of Table 3 alone, it remains uncertain which
way of combination is the best and which one
should be selected for further statistical analyses for the next research
question. To make that decision, certain indices of the
regression are considered important; those indices are presented in
Table 4.
Table 4. Important regression values for the three ways of combining the raters’ scores
Combination   R      R square   DW      Mean     Std. Deviation   Residual minimum   Residual maximum
Average       .742   .550       1.672   63.63    6.516            -16.07             16.42
Facets        .715   .511       1.672   63.926   6.496            -16.69             16.72
Weighted      .736   .542       1.704   64.20    6.424            -15.31             16.32
Comments on Table 4:
1. The highest R square is with the Average, which is slightly
higher than that of the Weighted, but by the other indices the Average is not the
best.
2. Though the differences among the three are not great, the best
model seems to be the Weighted, because it has a comparatively
high R square, a more reasonable Durbin-Watson value (closer to 2)
and the smallest minimum residual in absolute value.
Table 5. Summary of regression results in comparison with the true scores (the weighted average)
                   Minimum   Maximum   Mean    Std. Deviation   R square   Residual minimum   Residual maximum
True scores        40        86        63.64   8.831
Predicted scores   43.81     83.39     64.2    6.424            .54        -15.31             16.72
Conclusions:
1. First, the study shows that a rater’s performance or specific rating
tendency can be revealed by making use of writing performance
measures and statistical tools. It shows not only how raters differ
but also how they may be similar. The study therefore has clear
implications for rater training and for rater selection or grouping for the
purpose of scoring.
2. Second, combining raters’ scores does produce better scores than depending
on just one individual rater’s scores, as indicated by higher R squares.
3. Third, by comparing different ways of combining raters’ scores, the study
also found that for this group of raters, weighted averages were a little
better than the Facets scores. This does not mean that Facets is not as
good; rather, it shows that there are alternatives for obtaining
combined scores and that the best way of combining rater
scores may be case specific; that is, it is necessary to check which kind of
combination is best for each different group of raters.
4. As this is an exploration of methods for
revealing rater variability in relation to scoring results, the specific
results of the study are not that important; what really matters is the idea of
how raters’ scoring performance, and the scoring principles that underlie
their scoring results, can be observed.
References:
Eckes, T. 2008. Rater types in writing performance assessments: A classification
approach to rater variability. Language Testing, 25(2), 155-185.
Engelhard, G. 1992. The measurement of writing ability with a many-faceted Rasch
model. Applied Measurement in Education, 5(3), 171-191.
Knoch, U. 2011. Investigating the effectiveness of individualized feedback to rating
behavior – a longitudinal study. Language Testing, 28(2), 179-200.
McNamara, T. 1996. Measuring second language performance. London & New York:
Longman.
Vaughan, C. 1991. Holistic assessment: What goes on in the rater’s mind? In L.
Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp.
111-125). Norwood, NJ: Ablex.
Weigle, S. C. 1994. Effects of training on raters of English as a second language
compositions: Quantitative and qualitative approaches. Unpublished PhD dissertation,
University of California, Los Angeles.