CHAPTER 4 RESEARCH DESIGN
In the previous chapters, ways of approaching how reading ability could be
defined from the perspective of test item specifications were explored. In Chapter 2,
it was examined and emphasized that, in investigating the nature of a reading test
in relation to the latent structure of reading ability, the scope of the present study
is on the “product” of FL reading as a result of the FL reading “process”. Furthermore,
Chapter 3 described a way in which a construct of reading ability could be
defined by developing test items that elicit certain types of reading product in test
takers’ reading comprehension. Reading “competence” was termed a facet that
constitutes a major part of reading “performance”, and in defining the reading
construct for the purpose of reading test item development, it was proposed that,
although a test item is defined as a tool which elicits a reading performance, that
performance should be accepted as something that allows the testers to draw
inferences and make generalizations about what sort of reading activities the test taker
might be able to do. Furthermore, this should be considered analytically as an
interaction of his competence and the context rather than as something
holistic and content-representative. To continue along the same lines of approach,
the significance of specifying the components of a test item, “question types”
in particular, in operationalizing the reading construct to be tested was discussed. This
was further explored by reflecting on item difficulty, or a quantitative aspect of a test
item. The discussion concluded by suggesting a possible link between the
question type of a test item and its difficulty, which provides the following research
questions for the present research.
4.1 Research Questions
Research Question 1:
Is it valid to employ ‘question types’ as a prime component that constructs test
items used in eliciting test takers’ L2 reading performances?
東京外国語大学 博士学位論文 Doctoral thesis (Tokyo University of Foreign Studies)
What are the factors that constitute the L2 reading performances of learners of
English at secondary education in Japan, when they are extracted from factor analytic
studies of reading products elicited using reading test items? Would they differ
across learners with different reading abilities?
In an attempt to come up with a test item specification that effectively
operationalizes different reading performances to be tested, inspired by Negishi
(1996) and Wada (2003), the present study proposes the ‘question type’ of a test item
to be a prime component of such a framework. At the same time, however,
because Negishi (1996) and Wada (2003) had not accommodated the interactions of
these components with the latent reading structure of test takers, attention will be
given to this aspect in much greater depth, as it is possible that
the prime factors could change in accordance with the test takers’ reading abilities.
Research Question 2:
Is it valid to assume a certain relationship between question types and item
difficulty in eliciting test takers’ L2 reading performances?
Is the item difficulty of a test item, calibrated using Item Response Theory,
affected by its question type? If so, how? Would this relationship differ across
learners with different reading abilities?
With an interest in suggesting the facets of a reading test item that would allow
the writers of test items to predetermine the difficulty of a test item, the present study
investigates the possibility of a link between the item difficulty of a test item and its
question type. Attention will also be given to cases with different abilities of test
takers to see if the orders of perceived difficulties across different question types
differ according to the different ability groups of test takers.
4.2 Data Collection
4.2.1 Subjects
A sample of 830 learners of English from senior high school and university in
Japan participated in the main part of the present study. Of these, 280 were
third-year high school students and 550 were first-year undergraduate students in
university.
The majority of high school students had five years of English education under
the Course of Study provided by the Ministry of Education, Culture, Sports, Science
and Technology, in a foreign language environment. They were told that
the test was administered to collect data on individuals’ English proficiency. The
students had five English classes a week; nothing was done in the classroom that
would help the students prepare for the tests administered in this study.
For the university students, the circumstances were the same as for the high school
students except that the duration of English learning was mostly six years.
All of the university students majored in one foreign language other than English and
were given the test early in April, immediately after they had entered university, as a
placement test for their English classes, which were prerequisite in the university
curriculum. This was to ensure that the test takers did not have any special
knowledge of English or of any other academic field that would distort the outcome
of data collection.
There were some variations in both high school and university students’
backgrounds of how and how long English was learned (e.g. students who had
overseas experiences); however, the variation in the number of years they had spent
abroad or the intensity with which they had learned English was so great that
it was not possible to come up with any generalizable criterion for omitting the scores.
Moreover, it could be assumed that those variations would be an inherent factor in
learners’ reading ability that enables them to score high on the test, so the present
author decided to disregard such factors in the process of data collection as long
as they did not affect the distribution of scores too greatly.
4.2.2 Materials
Two sets of test instruments were employed in the main study.
4.2.2.1 Test Set A
Test Set A (presented in Appendix A) consists of nine passages, each passage
with three multiple-choice test items (one correct option and three distracters
provided) to be responded to on the basis of its comprehension. These nine passages
were selected after an item selection was done in the pilot study, providing 27 reading
test items. The features of these nine passages are as follows:
Table 4-1 The features of passages employed in Test Set A

TEXT   Item#   R. Ease   Gr. Level   Words#
1      1-3     56.8      8.7         95
2      4-6     66.4      7.3         109
3      7-9     55.2      10          108
5      13-15   65.2      9.6         110
6      16-18   64.8      7.5         95
7      19-21   65.7      7.6         101
8      22-24   68.1      7.4         104
9      25-27   55.8      8.4         97
10     28-30   57.1      9.5         103
Mean           61.68     8.44        102.44

(Text 4, as well as Items 10, 11, and 12, is missing from the table because they were omitted after the item selection.)
All of the passages are taken from the Reading Comprehension Section (advanced
level) of the Global Test of English Communication (GTEC) developed by Benesse
Corporation. The present author determined GTEC to be an appropriate source of
reading texts since it was designed to test the English proficiency of high-intermediate
learners in senior high schools and universities in Japan, which is at an equivalent
level to that of the subjects to be tested and also of what the Course of Study provided by
the Ministry of Education, Culture, Sports, Science and Technology aims for.
In Table 4-1, “R. Ease” indicates the Flesch Reading Ease and “Gr. Level”
indicates the Flesch-Kincaid Grade Level. Both are readability indices, a
means of describing how easily written materials can be read and understood.
Although they employ the same core measures (word length and sentence length) to
calculate the index, they have different weighting factors, which sometimes creates
incoherence in the outcome of calculations. The index provided by the Flesch
Reading Ease indicates the ease of reading a passage on a scale from zero to one
hundred, zero being the most difficult and one hundred the easiest. The
Flesch-Kincaid Grade Level expresses readability as a grade level of the US
educational system, making it easier to judge the readability level of various books
and texts. Observing these indices for the nine passages used in Test Set A, the
present author assumes the difficulty of the passages was appropriate for the subjects
and for the purpose of the present research (see 4.3.1 for further explanation of how
the subject groups were predetermined for the main study).
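The two indices can be computed from word, sentence, and syllable counts using the standard published formulas; the sketch below uses invented counts for illustration, not the thesis passages.

```python
# Standard Flesch formulas; the inputs are word, sentence, and syllable
# counts. The example counts below are illustrative, not the thesis texts.

def flesch_reading_ease(words, sentences, syllables):
    """0-100 scale; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """US school grade level; higher grades mean harder text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# A hypothetical 100-word passage with 6 sentences and 140 syllables:
ease = flesch_reading_ease(100, 6, 140)
grade = flesch_kincaid_grade(100, 6, 140)
```

Because the two formulas weight sentence length and word length differently, a pair of passages can swap rank order between the two indices, which is the “incoherence” noted above.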
The number of words in each passage was counted so as to regulate the
characteristics of each passage. The present author selected passages that were
around 100 words in total, considering the time constraint of testing environments.
The numbers at the bottom of the table indicate the means for each index.
As for the three multiple-choice test items that were to be answered after
reading each passage, the present author wrote the questions and four options.
The validity of which question type (see 3.4.2 for detailed explanations) each item
represented was checked by her colleagues (two teachers at a senior high school), and
their assessments had a sufficient correlation of .76. For the items where
disagreements were found, they were discussed and revised so that all three people
(the two colleagues and I) were satisfied with the decision.
For each passage, the first item was written so that the question elicits a
“global-inferential” comprehension of the passage. These were the items numbered
1, 4, 7, 13, 16, 19, 22, 25, and 28, and they asked for the main idea of the passage.
For example, item 1 of Test Set A (“1. What is the main idea of this passage?”) can be
answered correctly if a test taker comprehends that the main idea in the passage is the
growing seam in the seafloor of the Atlantic Ocean. The wording and phrases used
in each question may vary, but all nine questions (items 1, 4, 7, 13, 16, 19, 22, 25, and
28) are made to elicit the “global-inferential” type of reading.
The second item was written so that the question asks for a “local-literal”
comprehension. These were the items numbered 2, 5, 8, 14, 17, 20, 23, 26, and 29,
and they asked for information which is directly interpreted from a relatively
small amount of text source. With regard to the first passage which appears in Test
Set A, item 2 is such a test item. Item 2 requires a test taker to complete the sentence,
“Q. The speed at which the seafloor is spreading is ___.” The correct option, “(C) half
as fast as human fingernails grow,” can be chosen if the test taker can spot and
understand the last sentence in the passage, “This spreading occurs in half of a speed
of how fast fingernails grow,” as it is, without any further inferring from the text.
The last item was composed so that the question provokes a “local-inferential”
understanding of the passage. These were items 3, 6, 9, 15, 18, 21, 27, and 30, and they
called for information which could be obtained after making an inference from a
relatively small amount of text source. With regard to the first passage which
appears in Test Set A, item 3 (“3. The break-off of Pangaea started because...”)
requires such a type of comprehension and asks for the cause of the growing seam in
the seafloor of the Atlantic Ocean. In order to choose the correct option, “(B) a plate
started to develop underwater and the land was separated,” a test taker needs to
understand the sentence, “Since that time, the Atlantic Ocean has widened along a hot,
rock-producing seam in the seafloor,” and infer that the ‘rock-producing seam’ is the
cause of the break-off of Pangaea.
The three questions for each passage were ordered so that the global-inferential
question would come first, the local-literal question second, and the local-inferential
third. The present author chose to provide them in this order because this is
the order in which the questions seem to appear in the reading sections of common
standardized proficiency tests, such as TOEFL or TOEIC.
As for the time allocated to this test, because one class period in senior high
schools is usually 50 minutes, 50 minutes was the maximum length of time allowed
to implement Test Set A. Ideally, sufficient time should be given to the test takers
since the focus of the present study is on the test takers’ ‘power’, rather than their
‘speed’. Therefore, special attention was given so that the test takers would be able
to complete the test set within the time allocated.
Prior to the test implementation for the main study, a pilot test was carried out
in order to validate the test items developed by the procedures described above. The
subjects were 143 students from a senior high school which is considered to be of an
equivalent academic level to the high school at which Test Set A was implemented in
the main study.
The main interest in carrying out the pilot test was to find and edit the test
items that exhibited problems with their item discrimination indices. Item discrimination
is “the capacity of test items to differentiate among candidates possessing more or
less of the trait that the test is designed to measure” (Davies et al. 1999: 96). In
developing a test instrument, it is essential that the test items have high levels of item
discriminability to ensure a reliable measurement of test takers’ ability. Items with a
low item discrimination index are usually eliminated from a test or edited. In the
present study, item discriminability was calculated using classical test theory
(point-biserial correlation calculated by ITEMAN) due to the small number of
subjects and items.
In Table 4-2, “PBs” indicates the point-biserial correlation, and “PC” indicates the
percentage of test takers who correctly answered each item. Point-biserial
correlation indices are used to indicate how well an item discriminates between test
takers who are more capable and those who are less capable. It is often held
that point-biserial correlations of .25 and above are acceptable (Henning 1987: 53),
and most of the items surpassed this criterion. Percentage correct is used to show
how easy (or difficult) a test item is, because the higher (lower) the percentage of
people who correctly answered a test item, the easier (more difficult) the test item had
been perceived by the test takers.
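Both statistics can be computed directly from the response data; a minimal sketch, using invented responses for five hypothetical test takers, is given below.

```python
# A minimal sketch of the two classical statistics in Table 4-2: the facility
# value ("PC", proportion correct) and the point-biserial correlation ("PBs")
# between a dichotomous item score and the total test score. The five
# response records below are invented for illustration.
import statistics

def facility(item_scores):
    """Proportion of test takers answering the item correctly."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and the total score."""
    correct = [t for t, i in zip(total_scores, item_scores) if i == 1]
    wrong = [t for t, i in zip(total_scores, item_scores) if i == 0]
    p = facility(item_scores)
    sd = statistics.pstdev(total_scores)
    return (statistics.mean(correct) - statistics.mean(wrong)) / sd * (p * (1 - p)) ** 0.5

item = [1, 1, 1, 0, 0]         # one item's 0/1 scores for five test takers
totals = [28, 25, 22, 15, 10]  # the same five test takers' total scores
pc = facility(item)
pbs = point_biserial(item, totals)
```

In this toy data the item separates high and low scorers almost perfectly, so the point-biserial value is very high; real items, as in Table 4-2, typically fall well below that.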
As is apparent, items 10, 11, and 12 were considered problematic because
they show negative or very low discrimination. These were the items provided for the
same passage, so it could be presumed that the passage itself was problematic for
this level of test takers. For this reason, the present author decided it best to
eliminate all three items along with the passage. Items 1, 2, 3, 9, and 16 also had low
discriminability, so the present author reviewed and revised each item. Test Set
A presented in Appendix A is the final version of these items after the revision. (The item
numbers were left as they were when the test set was implemented in the main study, and this
was announced orally to test takers by the proctors.)
Table 4-2 The discrimination indices of test items in the pilot version of Test Set A

ITEM#   PBs     PC
1       0.05    0.46
2       0.13    0.43
3       0.21    0.33
4       0.51    0.80
5       0.49    0.70
6       0.39    0.40
7       0.35    0.76
8       0.42    0.64
9       0.01    0.13
10      -0.02   0.19
11      -0.10   0.12
12      0.18    0.38
13      0.41    0.47
14      0.56    0.36
15      0.51    0.69
16      0.13    0.40
17      0.49    0.43
18      0.49    0.48
19      0.42    0.54
20      0.42    0.57
21      0.39    0.28
22      0.51    0.62
23      0.58    0.45
24      0.44    0.53
25      0.51    0.62
26      0.54    0.43
27      0.38    0.68
28      0.47    0.42
29      0.56    0.47
30      0.51    0.42
In order to compare the reading abilities of the test takers who took this test set
and, more importantly, to observe the alteration of the latent ability structure among test
takers with different reading abilities, items 1, 2, and 3 reappear in Test Set B as items
1, 2, and 3; items 7, 8, and 9 as items 4, 5, and 6; and items 10, 11, and 12 as items 7, 8,
and 9. However, as stated in the previous paragraph, because items 10, 11,
and 12 were omitted from Test Set A, items 7, 8, and 9 had to be omitted from Test
Set B as well.
As for the time allocated for the completion of the test, it was reported by the
teachers who proctored the pilot study that most of the test takers appeared to
have reached the last item of the test, which suggests that 50 minutes was sufficient
time for the test takers in the present study.
4.2.2.2 Test Set B
Test Set B is presented in Appendix B. In total, there are 27 test items in the
test set; nine passages are provided, each with three multiple-choice test items to test
the test takers’ comprehension. Each item has one correct option and three distracters.
These nine passages were selected after an item selection was done in the pilot study.
The features of these nine passages are presented in Table 4-3.
Table 4-3 The features of passages employed in Test Set B

TEXT   Item#   R. Ease   Gr. Level   Words#
1      1-3     56.8      8.7         95
2      4-6     55.2      10          108
4      10-12   34.1      12          157
5      13-15   35.3      12          142
6      16-18   34.8      12          160
7      19-21   37.7      12          160
8      22-24   38.9      12          152
9      25-27   33.4      12          155
10     28-30   33.6      12          151
Mean           40.0      11.4        142.22

(Text 3, as well as Items 7, 8, and 9, is missing from the table because they were omitted after the item selection.)
Text 1 is the same passage as Text 1 in Test Set A, Text 2 is the same passage as
Text 3 in Test Set A, and Text 3 is the same passage as Text 4 in Test Set A. This
was done to compare the reading abilities of the test takers who took this test set, Test
Set B, with those of the test takers who took Test Set A and, in particular, to see if any
alteration would emerge with regard to the test takers’ latent ability structure among
different ability groups. The rest of the passages were taken from the Reading
Comprehension Section of the TOEFL Test Preparation Kit Workbook (ETS 1998). The
present author determined TOEFL test preparation material to be an appropriate
source of reading passages because, since TOEFL was designed to test the English
proficiency of students who are seeking to study at an undergraduate or graduate level
in an English-speaking environment, the level of English proficiency required to
succeed in completing them would be the same as that of advanced learners in Japan,
which is at an equivalent level to that of the subjects to be tested by Test Set B.
In Table 4-3, “R. Ease” indicates the Flesch Reading Ease and “Gr. Level”
indicates the Flesch-Kincaid Grade Level. The number of words was counted so as to
regulate the characteristics of each passage. The present author selected
passages that were around 150 words in total for Texts 4 to 10, considering the time
constraint of testing environments. The numbers at the bottom indicate the means
for each index.
As for the three multiple-choice test items that were to be answered after
reading each passage, the present author wrote the questions and four options.
These questions and options were examined for their validity by her two colleagues.
After each passage, a “global-inferential” question, a “local-literal” question, and a
“local-inferential” question (see 3.4.2 for detailed explanations of ‘question types’)
are presented in the same manner as these questions are presented in Test Set A.
This means that, for each passage, a “global-inferential” question is the first item that
comes after the passage, a “local-literal” question the second, and a “local-inferential”
question the last. Therefore, items numbered 1, 4, 10, 13, 16, 19, 22, 25, and 28 are
“global-inferential” questions which asked for the main idea of the passage; items
numbered 2, 5, 11, 14, 17, 20, 23, 26, and 29 are “local-literal” questions which asked
for information which is directly interpreted from a relatively small amount of
text source; and items 3, 6, 12, 15, 18, 21, 27, and 30 are “local-inferential” questions which
asked for information which could be obtained after making an inference from a
relatively small amount of text source (see pp. 51-52 for detailed explanation and
examples of how these questions were presented). The validity of which question
type each item represented was confirmed by the two colleagues who had worked on
the question types of Test Set A, and their correlation was .71. For the items where
disagreements were found, they were discussed and revised so that all three people
(the two colleagues and I) were satisfied with the decision.
For Test Set B, the time allocated to the test was 50 minutes in order to parallel
Test Set A. In writing and revising Test Set B, special attention was also given so
that the test takers would be able to complete the test set within the time allocated.
Prior to the test implementation for the main study, a pilot test was carried out
in order to validate the test items developed by the procedures described above. The
subjects were 156 students from the same university at which Test Set B was
implemented in the main study. They were of the same academic background as the
subjects who had participated in the main study.
The main interest in carrying out the pilot test was to find and edit the test
items that exhibited problems with their item discrimination indices. As was done in
the pilot study for Test Set A, item discriminability was calculated using classical test
theory (point-biserial correlation calculated by ITEMAN) due to the small number of
subjects.
In Table 4-4, “PBs” indicates the point-biserial correlation for item discriminability,
and “PC” indicates the percentage of test takers who correctly answered each item, to
show item difficulty. Items 7, 8, and 9 were automatically eliminated because they were
the same items as those eliminated from Test Set A (items 10, 11, and 12). The
present author had originally intended to use these three items for level comparison
across different subject groups but decided to discard them for this reason and also
due to the time constraint expected in the testing environment. Furthermore, items 1
and 2, which reveal low item discrimination in Table 4-4, were revised because they
were the items presented as items 1 and 2 in Test Set A and had also shown low item
discrimination in the pilot test for Test Set A. The same was true for items 4 and 6,
which were numbered 7 and 9 in Test Set A. Items 23 and 29 also had low
discriminability, so they were reviewed and revised accordingly. Test Set B, which is
presented in Appendix B, is the final version after these revisions. (The item numbers
were left as they were when the test set was implemented in the main study, and this
was announced orally to test takers by the proctors.)
Table 4-4 The discrimination indices of test items in the pilot version of Test Set B

ITEM#   PBs     PC
1       0.27    0.94
2       0.18    0.99
3       0.30    0.86
4       0.29    0.43
5       0.43    0.81
6       0.20    0.44
7       0.38    0.80
8       0.56    0.63
9       0.18    0.67
10      0.45    0.71
11      0.39    0.84
12      0.49    0.57
13      0.42    0.84
14      0.54    0.31
15      0.41    0.36
16      0.30    0.36
17      0.33    0.63
18      0.32    0.21
19      0.20    0.97
20      0.28    0.91
21      0.33    0.81
22      0.23    0.65
23      0.15    0.36
24      0.42    0.51
25      0.47    0.52
26      0.27    0.40
27      0.36    0.22
28      0.29    0.91
29      0.18    0.84
30      0.33    0.75
As for the time allocated for the completion of the test, it was reported by the
teachers who proctored the pilot study that most of the test takers appeared to
have reached the last item of the test, which suggests that 50 minutes was sufficient
time for the test takers in the present study.
4.2.3 Test Administration
Test Set A and Test Set B were both administered in 50 minutes. Senior high
school students were given Test Set A. It was implemented as a reading proficiency
test in a 50-minute class period, proctored by the teachers who taught the class in the
regular lesson.
For university students, the test was administered as a part of a placement test
for their required English classes, which consisted of a listening comprehension
section and a reading comprehension section. They were given either Test Set A or
Test Set B, depending on the date they took the test. Those students who
took the test on the first day of the placement test were given the test which included
Test Set A as the reading comprehension section, and those who took the test on the
second day, Test Set B. The scores on the reading comprehension section of the test
were not counted in the placement itself because of the difference in difficulty
between the two test sets. In the first half of the testing time, students were given 50
items that tested their listening skills. In this part of the test, the time was regulated
by the listening material. At the end of this section, which was announced by the
listening material itself, students were told to begin the reading section. The
students were given 50 minutes for the reading section. The test was proctored by
the teachers who teach the required English classes.
Both high school students and university students were asked to provide their
answers on mark sheets. These mark sheets were scored electronically by a
mark-sheet scanner.
4.3 Data Analysis
4.3.1 Predetermining Ability Groups
Prior to the data analyses, three groups of different abilities were determined
based on the results of the data collection described above. The three groups are: Group
A-Low, Group A-High, and Group B.
Group A-Low and Group B were to represent the groups of test takers who
were responding to items with a difficulty equivalent to their reading
ability, and Group A-High to represent the test takers who were responding to
items considered to have a difficulty lower than their reading ability. In
this way, the results of Group A-Low and Group A-High could be compared to
investigate the differences exhibited by test takers with different reading abilities
tackling test items of the same difficulty. Furthermore, the results of Group
A-Low and Group B were to be compared to observe the differences presented by test
takers with different reading abilities responding to test items with a
difficulty equivalent to their ability.
Here, an explanation of what is meant by “test takers with different reading
abilities responding to the test items that had the difficulty equivalent to their ability”
for Group A-Low and Group B and “the test takers who were responding to the items
that were considered to have the difficulty lower than their reading ability” for Group
A-High may be necessary. In Item Response Theory (IRT), the theory on which the
calculation of item difficulty was based in the analyses of Section 5.3, the idea is to
find the relationship between the difficulty of a test item, the ability of a test taker,
and the probability of a test taker answering a test item correctly (Ohtomo 1996: 69).
The difficulty of a test item is determined by its “item characteristic curve”, a graph
which is drawn after the calibration using a logistic function. On this graph, the point
where the probability of a person responding correctly to that item is 0.50 (50%)
indicates the ability level of that person, the person whose probability of answering
that test item correctly is 0.50, and that ability index is employed as the difficulty of
the test item. Therefore, the index provided as “theta” in Appendices C-1, D-1, and
E-1 indicates the ability level (from -3.0 to 3.0) of a person whose probability of
responding correctly to that item is 0.50, and that also represents the difficulty of the test item.
This relationship between the ability of a test taker and the difficulty of a test item
brings the present reader to characterize each subject group as having an ability that is
“equivalent to” or “higher than” the difficulty level of test items.
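The item characteristic curve described above can be sketched with the one-parameter (Rasch) logistic model, the model underlying the RASCAL calibration used later; the ability and difficulty values below are illustrative.

```python
# A sketch of the item characteristic curve described above, using the
# one-parameter (Rasch) logistic model: the probability of a correct answer
# rises with ability theta, and crosses 0.50 exactly where theta equals the
# item difficulty b, which is why that theta is reported as the difficulty.
import math

def icc(theta, b):
    """P(correct | ability theta, item difficulty b) under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

p_at_difficulty = icc(0.8, 0.8)  # ability equals difficulty: P = 0.50
p_above = icc(2.0, 0.8)          # ability above difficulty: P > 0.50
```

This makes the group labels concrete: a group whose mean theta sits at an item's b is "equivalent to" its difficulty, while a group whose theta exceeds b is "higher than" it.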
Originally, the present author had chosen to give Test Set A to the high school
students and half of the university students, so that the high school students would
represent Group A-Low and the university students Group A-High. Test Set B was
given to the rest of the university students to represent Group B. At this point, the
author had assumed that university students would possess higher ability in English
reading comprehension since they had had an extra year of English education along
with their preparatory learning experience for university entrance examinations.
However, this method of predetermining the ability groups did not function for the
present study because virtually no difference could be found between the scores of
high school students and university students on Test Set A; the mean scores were 17.6
for the high school students and 17.9 for the university students. One possibility which
could have caused this to happen is the fact that university students were given the
reading comprehension test after they had worked on the listening comprehension
section in the placement test. The cognitive load imposed on the test
takers while working on the listening comprehension could have exhausted them
cognitively and impeded their performances on the reading section, rendering the
result above. However, when the listening test material was evaluated, it was
determined that it did not appear to exhibit a difficulty that would influence test
takers’ performance in the latter section of the test. Therefore, it was presumed that
there indeed was little difference in reading ability between the high school students and
the university students who were given Test Set A. For this reason, at this point, the
present author decided to look at the results of test takers who worked on Test Set A
as a whole, regardless of whether they were high school students or university
students, and to predetermine the ability groups based on their test scores on Test Set A.
A detailed description of how these groups were decided is presented in Chapter 5.
No change was made in predetermining Group B since the university students who
worked on Test Set B averaged 16.3, which showed that the test takers who were
given Test Set B were advanced learners who are at the same ability level as the
reading ability expected to correctly respond to the test items in Test Set B.
4.3.2 Statistical Procedures
Three statistical procedures were used to analyze the collected data.
4.3.2.1 Descriptive Statistics
For each test set, the mean and standard deviation were calculated. KR20 was
used to estimate the internal consistency of each test set to ensure its reliability in
measuring students’ reading ability. For the purpose of test validation, the facility
value (percentage correct) and discrimination index (point-biserial correlation)
calculated using Classical Test Theory by ITEMAN (Assessment Systems
Corporation) were also provided.
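The KR20 coefficient mentioned above is the Kuder-Richardson internal-consistency estimate for dichotomously scored items; a minimal sketch with an invented 4-person, 3-item response matrix follows.

```python
# A minimal sketch of the KR-20 internal-consistency estimate for
# dichotomously scored items. The response matrix below is invented.

def kr20(responses):
    """responses: one list of 0/1 item scores per test taker."""
    k = len(responses[0])                        # number of items
    totals = [sum(person) for person in responses]
    mean_t = sum(totals) / len(totals)
    var_t = sum((t - mean_t) ** 2 for t in totals) / len(totals)
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in responses) / len(responses)
        sum_pq += p * (1 - p)                    # item variance p(1-p)
    return (k / (k - 1)) * (1 - sum_pq / var_t)

r = kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
```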
4.3.2.2 Factor Analytic Studies
In an attempt to come up with a test item specification that effectively
operationalizes different reading performances to be tested, the present study
proposes that the “question type” of a test item could be a prime component of
such a framework. In order to identify the components, or factors, that
constitute L2 reading performances, factor analyses are done for the collected data in
each test set. The nature of the factors generated is examined qualitatively.
Full-information factor analysis was applied in the factor analytic studies of both test
sets via TESTFACT 2 (Scientific Software International). Although some problems
have been pointed out in using traditional factor analysis methods with binary data (i.e. items
that are scored dichotomously by judging right or wrong), full-information factor
analysis has been evaluated to accommodate such circumstances (Negishi 1996; Bock
1984).
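Full-information factor analysis fits the factor model to the raw binary response patterns directly, which is not reproduced here; as a rough, hedged stand-in for illustration, the sketch below extracts the first principal factor from the inter-item correlation matrix of simulated dichotomous scores.

```python
# NOT the TESTFACT full-information method: a crude illustrative stand-in
# that extracts the first principal factor from the inter-item correlation
# matrix of 0/1 scores. All data here are simulated.
import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(size=500)                        # latent reading ability
difficulty = np.linspace(-1.0, 1.0, 6)                # six hypothetical items
prob = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
scores = (rng.random((500, 6)) < prob).astype(float)  # dichotomous responses

corr = np.corrcoef(scores, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)               # eigenvalues ascending
loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])      # first-factor loadings
```

Because a single latent ability drives all six simulated items, the first-factor loadings come out with a common sign and the first eigenvalue dominates; it is exactly the distortion of such correlation-based extraction with binary data that motivates the full-information approach.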
4.3.2.3 Item Analyses
To discover which facets of a reading test item would allow the writers of test
items to predetermine the difficulty of a test item, the present study investigates the
possibility of a link between the item difficulty of a test item and its question type.
For this purpose, test items are analyzed by consulting their item difficulty indices,
calculated via Rasch analysis using RASCAL (Assessment Systems Corporation), in
relation to question type. Other information in the final parameter estimates, as
well as a raw score conversion table, an item-by-person distribution map, a test
characteristic curve, and a test information curve, is provided in this section of the
analysis.
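RASCAL's actual estimation procedure is not reproduced here; as a rough, hedged illustration of how Rasch item difficulties relate to facility values, the sketch below uses a simple log-odds conversion (the idea behind normal-approximation "PROX" estimation), with invented facility values.

```python
# Illustrative log-odds conversion from facility values to Rasch-style item
# difficulties (not RASCAL's algorithm): harder items (lower proportion
# correct) receive larger difficulty logits, centred so the mean is zero.
import math

def rasch_difficulties(pcs):
    """pcs: proportion correct for each item (0 < p < 1)."""
    logits = [math.log((1.0 - p) / p) for p in pcs]  # log-odds of failure
    mean = sum(logits) / len(logits)
    return [d - mean for d in logits]                # centre at zero

diffs = rasch_difficulties([0.8, 0.5, 0.3])  # easiest item gets the lowest value
```

Centring at zero mirrors the usual Rasch convention of fixing the mean item difficulty, so the resulting values sit on the same logit scale as the theta indices discussed in 4.3.1.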