CHAPTER FOUR
RESULTS
This chapter reports the results of quantitative and qualitative analyses of
examinees’ performances and their attitudes toward testing speaking in the computer
mode and the face-to-face mode. In order to answer the research questions posed in this
study, data were gathered from multiple sources: (1) examinees’ scores on the two tests,
(2) examinees’ speech samples, and (3) post-exam questionnaire results.
The analyses focused primarily on the differences in test scores, speech samples,
and examinee attitudes between the computer mode and the face-to-face mode.
Quantitatively, t-test and factor analysis results are examined to determine the
relationships between test scores across modes. The results of ANOVAs are discussed
to explore the relationships between delivery mode and speech sample. Moreover, t-test
and chi-square test results from analysis of questionnaire items are reported.
Qualitatively, examinees’ comments on open-ended questions are discussed.
This chapter is composed of four main sections. The first section presents the
results of analyzing the comparability of raw scores across the two modes. The second
section reports the comparability of underlying constructs measured across modes.
Examinees’ speech samples are examined in the third section, in which results regarding
the effect of computer delivery mode on speech samples and the relationship between
delivery mode and examinees’ proficiency are reported. The final section deals with
questionnaire results. Specifically, examinees’ responses to questionnaire items
regarding the two modes are analyzed. The comments from the examinees on
東京外国語大学 博士学位論文 Doctoral thesis (Tokyo University of Foreign Studies)
comparisons of the two modes are categorized and illustrated in detail.
4.1 Comparing the magnitude of raw scores
This section first reports the rater reliability for ratings assigned to the monologic
tasks delivered in the computer mode and the face-to-face mode. It then examines the
existence of an order effect, which is a concern when using a counterbalanced design.
Finally, it describes the results of comparing the mean scores of ratings across modes.
4.1.1 Inter-rater reliability
In speaking test scores that were obtained from two raters, a source of error
typically lies in the inconsistency of the ratings. Thus, inter-rater reliabilities using two
types of indexes were calculated to measure the consistency between the two ratings
awarded to each rating element for each task. First, Pearson product-moment correlation
coefficients were computed between the ratings. In addition, given that a high
correlation coefficient could be obtained despite relatively different ratings being
awarded by the two raters, the inter-rater agreement percentage was also calculated.
Exact agreement indicates that the two raters assigned the same score; adjacent
agreement means that the rating difference between the two raters was one.
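The two indexes described above can be illustrated with a minimal sketch. The function names and the ratings below are hypothetical and are not taken from the study; the example only shows why a correlation coefficient alone can overstate consistency, since two raters who differ by a constant amount correlate perfectly while never agreeing.

```python
from math import sqrt


def agreement_percentages(ratings_a, ratings_b):
    """Return (exact %, adjacent %, total %) for two parallel lists of ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty lists of equal length")
    n = len(ratings_a)
    diffs = [abs(a - b) for a, b in zip(ratings_a, ratings_b)]
    exact = 100.0 * sum(d == 0 for d in diffs) / n      # same score
    adjacent = 100.0 * sum(d == 1 for d in diffs) / n   # one point apart
    return exact, adjacent, exact + adjacent


def pearson_r(x, y):
    """Pearson product-moment correlation between two parallel lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings: rater 2 is always two points higher than rater 1.
rater_1 = [1, 2, 3, 4]
rater_2 = [3, 4, 5, 6]
print(pearson_r(rater_1, rater_2))              # perfectly correlated
print(agreement_percentages(rater_1, rater_2))  # yet no exact or adjacent agreement
```

Reporting both indexes, as is done in this section, guards against this case.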
As can be seen in Table 4.1, the inter-rater reliability estimates (Pearson
correlation coefficients) for the computer mode ranged from .52 to .75. Except for
vocabulary (r = .52), fluency (r = .53) and pronunciation (r = .54) for the opinion task,
these estimates are sufficiently high. Moreover, the agreement between the ratings
awarded by the two raters showed satisfactory results in total agreement, ranging from
96.2% to 100% for all cases, with moderately high exact agreement percentages
(49.5%-72.2%) and adjacent agreement percentages (26.6%-45.6%).
Table 4.1 Inter-rater reliabilities for the ratings in the computer mode
Rating element   Task        r^a    Exact         Adjacent      Total^b
                                    agreement %   agreement %   agreement %
Grammar          Narrative   .75    72.2          26.6          98.8
                 Opinion     .64    58.2          38.0          96.2
Vocabulary       Narrative   .68    57.0          40.5          97.5
                 Opinion     .52    49.4          36.7          86.1
Fluency          Narrative   .79    70.9          29.1          100
                 Opinion     .53    59.5          30.4          89.9
Pronunciation    Narrative   .59    62.0          38.0          100
                 Opinion     .54    55.7          40.5          96.2
Note. N = 79. a Pearson correlation coefficient between the two ratings. b Total
agreement percent is the sum of the exact and adjacent agreement percent.
Table 4.2 displays the results of inter-rater reliabilities for the face-to-face mode.
Pearson correlation coefficients ranged from .60 to .74. Except for the pronunciation figure
for the opinion task (r = .60) being slightly low, the other figures are sufficiently high.
Similar to the findings for the computer mode, rater agreement turned out to be satisfactory
for all cases in terms of exact agreement (54.4%-68.4%), adjacent agreement
(29.1%-45.6%), and total agreement (96.2%-100%).
The results of rater agreement indicate that the two ratings assigned to both
modes were almost all within one score difference. Taking the results of both types of
inter-rater reliability indexes into account, the consistency of the ratings in both modes
was considered to be acceptable for this study. Thus, the two ratings awarded to each
rating element were averaged for each task. They are named element scores in this study
and are used in the following quantitative analyses.
Table 4.2 Inter-rater reliabilities for the ratings in the face-to-face mode
Rating element   Task        r^a    Exact         Adjacent      Total^b
                                    agreement %   agreement %   agreement %
Grammar          Narrative   .72    60.8          39.2          100
                 Opinion     .73    65.8          31.6          97.4
Vocabulary       Narrative   .70    54.4          45.6          100
                 Opinion     .74    60.8          39.2          100
Fluency          Narrative   .74    68.4          31.6          100
                 Opinion     .65    67.1          29.1          96.2
Pronunciation    Narrative   .74    65.8          34.2          100
                 Opinion     .60    55.7          41.8          97.5
Note. N = 79. a Pearson correlation coefficient between the two ratings. b Total
agreement percent is the sum of the exact and adjacent agreement percent.
4.1.2 Order effect
Prior to comparing raw scores awarded to the two modes, order effect was
examined to address the concern about the counterbalanced research design that was
adopted in this study. Table 4.3 presents the means and standard deviations for the
average of element scores across the two tasks and total scores by mode and order. The
means of test scores in Table 4.3 showed that the two groups assigned to different test
orders seemed to perform differently across the two modes. The different trends for the
two groups were most evident in the total scores. That is, for the group who took the
computer mode first, the total score was much higher in the face-to-face mode (M =
9.05, SD = 2.44) than that in the computer mode (M = 7.68, SD = 2.58). However, total
scores for the other group indicate only a small discrepancy between the computer mode
(M = 7.55, SD = 2.30) and the face-to-face mode (M = 7.59, SD = 2.55). It seems that a
practice effect occurred for the group who took the computer mode first.
Table 4.3 Descriptive statistics of scores by mode and order
Rating element   Order   Computer        Face-to-face
                         M      SD       M      SD
Grammar          C→F     1.90   0.72     2.19   0.67
                 F→C     1.83   0.60     1.86   0.70
Vocabulary       C→F     2.06   0.70     2.48   0.73
                 F→C     1.98   0.63     2.04   0.71
Fluency          C→F     1.79   0.62     1.99   0.67
                 F→C     1.89   0.70     1.70   0.60
Pronunciation    C→F     1.94   0.67     2.39   0.55
                 F→C     1.84   0.54     1.99   0.69
Total score      C→F     7.68   2.58     9.05   2.44
                 F→C     7.55   2.30     7.59   2.55
Note. C→F: Computer test first/face-to-face test second; F→C: Face-to-face test
first/computer test second.
In order to test whether the practice effect was statistically significant, repeated
measures ANOVAs were carried out on element scores across tasks and total score.
Table 4.4 shows that the mode-by-order interactions were statistically significant on all
types of scores. There were also significant main effects of mode on the total score and
on all element scores except for fluency. However, in this case, the interpretation of
interactions between order and mode should take precedence over the main effect.
Table 4.4 ANOVA results of scores by mode and order
Rating element   Mode                 Order          Mode*order
                 df   F       p       F      p       F       p
Grammar          1    14.62   .00     1.83   .18     9.31    .00
Vocabulary       1    21.22   .00     3.13   .08     12.04   .00
Fluency          1    0.01    .92     0.47   .49     16.71   .00
Pronunciation    1    41.95   .00     3.67   .06     11.09   .00
Total score      1    25.04   .00     2.22   .14     22.31   .00
4.1.3 Comparing test scores
Given the significant interaction between delivery mode and test order, it was
decided not to combine the data from the two administrations of the tests in different
orders. Instead, in order to compare the magnitude of test scores across modes,
independent t-tests were conducted separately on the scores of the first and the second
test administered to the examinees.
Table 4.5 presents the results of the t-test on the first test. As shown in Table 4.5,
for the narrative task, the means of all the element scores in the computer mode were
slightly higher than those in the face-to-face mode. On the other hand, the opinion task
showed an opposite trend; that is, the means of all the element scores were higher in the
face-to-face mode. For the total score, the mean was higher in the computer mode (M =
7.68) than in the face-to-face mode (M = 7.59). Further, all the differences in the means
of element scores and total scores between the two modes were small, being no greater
than 0.21.
The t-test results confirmed that none of the differences in the means of element
scores and total scores between the two modes were significant. This indicates that, from
the data of the first test administered, delivery mode did not make a difference in the
magnitude of examinees’ test scores in terms of either element scores or total score.
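As a rough check on the total-score comparison for the first test, an independent t statistic can be recomputed from summary statistics alone. The sketch below assumes a pooled-variance (Student) t-test, which the text does not state explicitly, and uses the rounded means, standard deviations, and group sizes reported for the total score (computer: M = 7.68, SD = 2.58, n = 41; face-to-face: M = 7.59, SD = 2.55, n = 38). The function name is hypothetical.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for an independent t-test with pooled variance."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Total score on the first test administered (rounded values from the text).
t, df = pooled_t(7.68, 2.58, 41, 7.59, 2.55, 38)
print(round(t, 2), df)
```

Because the inputs are rounded to two decimals, the recomputed t can differ slightly from the reported value of 0.17, but it stays far below any conventional critical value for 77 degrees of freedom.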
Table 4.5 Independent t-test results of scores of the first test administered across modes
Rating element   Task        Computer (n = 41)   Face-to-face (n = 38)   t       p
                             M      SD           M      SD
Grammar          Narrative   2.05   0.81         1.86   0.76             1.09    .28
                 Opinion     1.74   0.71         1.87   0.72             -0.77   .44
Vocabulary       Narrative   2.21   0.77         2.05   0.72             0.92    .36
                 Opinion     1.91   0.77         2.03   0.75             -0.65   .52
Fluency          Narrative   2.01   0.79         1.80   0.63             1.29    .20
                 Opinion     1.56   0.54         1.59   0.62             -0.24   .81
Pronunciation    Narrative   2.05   0.68         2.05   0.75             -0.02   .98
                 Opinion     1.83   0.71         1.92   0.69             -0.58   .56
Total score                  7.68   2.58         7.59   2.55             0.17    .87
On the other hand, the results of the t-test based on the data of the second test
administered revealed a different pattern from that of the first test. Table 4.6 shows that
there were significant differences across modes in the scores of grammar on the opinion
task and in those of vocabulary and pronunciation on both the narrative task and the
opinion task. The total score was also significantly different across modes. These results
were considerably different from those of the first test, where no significant difference
was found in any type of scores across modes. The disparity of the results seems to
provide evidence for the concern about the interaction between the delivery mode and
test order. Thus, to evaluate the effects of the delivery mode on the magnitude of the test
score, it was decided to only use the results from the analysis of the first test, which are
considered to be more valid, being without the contamination of the order effect.
Table 4.6 Independent t-test results of scores of the second test administered across
modes
Rating element   Task        Computer (n = 38)   Face-to-face (n = 41)   t      p
                             M      SD           M      SD
Grammar          Narrative   1.91   0.62         2.15   0.74             1.55   .13
                 Opinion     1.75   0.70         2.23   0.73             2.97   .00
Vocabulary       Narrative   2.12   0.70         2.51   0.79             2.34   .02
                 Opinion     1.84   0.71         2.45   0.76             3.66   .00
Fluency          Narrative   2.08   0.81         2.10   0.78             0.10   .92
                 Opinion     1.71   0.74         1.89   0.70             1.11   .27
Pronunciation    Narrative   1.97   0.52         2.40   0.57             3.48   .00
                 Opinion     1.71   0.64         2.38   0.62             4.69   .00
Total score                  7.55   2.30         9.05   2.44             2.82   .01
4.2 Comparing psychometric constructs
One of the purposes of the present study was to investigate the effect of computer
delivery mode on the underlying constructs in comparison to the face-to-face mode. To
this end, a series of exploratory factor analyses were performed to explore statistically
whether there are components that are shared in common by the monologic tasks
delivered in the two modes. In this section, first, the results of checking the assumptions
of the exploratory factor analysis are presented, and then the results of the exploratory
factor analyses are reported.
Given that the analysis in the previous section revealed a practice effect, to deal with
this problem, the original data was transformed by subtracting it from the mean scores
on each variable for each group assigned to the different testing orders. For reporting
purposes, all the eight variables for each mode used in the following analyses were
assigned labels as shown in Table 4.7. For example, CGN refers to the grammar score
for the narrative task of the computer mode, and FVO represents the vocabulary score
for the opinion task of the face-to-face mode.
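The transformation used to deal with the practice effect, centering each variable on the mean of the examinee's own test-order group, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run: the function name, the group codes "CF" and "FC", and the scores are hypothetical, and the sign convention (score minus group mean) is assumed, since only the direction of the deviation would change otherwise.

```python
from collections import defaultdict


def center_within_groups(scores, groups):
    """Subtract each group's mean from the scores of that group's members."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for score, group in zip(scores, groups):
        totals[group] += score
        counts[group] += 1
    means = {group: totals[group] / counts[group] for group in totals}
    return [score - means[group] for score, group in zip(scores, groups)]


# Hypothetical scores on one variable for two test-order groups.
scores = [2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
orders = ["CF", "CF", "CF", "FC", "FC", "FC"]
print(center_within_groups(scores, orders))
```

After centering, both order groups have a mean of zero on the variable, so differences in level between the groups no longer enter the correlations that feed the factor analysis.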
Table 4.7 Measured variables used in factor analysis
Measured variable          Task        Label
Computer mode
Grammar scores in          Narrative   CGN
                           Opinion     CGO
Vocabulary scores in       Narrative   CVN
                           Opinion     CVO
Fluency scores in          Narrative   CFN
                           Opinion     CFO
Pronunciation scores in    Narrative   CPN
                           Opinion     CPO
Face-to-face mode
Grammar scores in          Narrative   FGN
                           Opinion     FGO
Vocabulary scores in       Narrative   FVN
                           Opinion     FVO
Fluency scores in          Narrative   FFN
                           Opinion     FFO
Pronunciation scores in    Narrative   FPN
                           Opinion     FPO
4.2.1 Preliminary data analyses
Table 4.8 presents descriptive statistics of all the variables. They were computed
to check the assumptions of the exploratory factor analysis.
Univariate normality of the 16 observed variables was assessed through
examination of the skewness and kurtosis for each variable. As seen in Table 4.8, none
of the observed variables deviated from normality, with values for the skewness and
kurtosis within an acceptable range of -2 to 2. The Pearson product-moment correlations
among all the variables were calculated (see Appendix I). The correlation figures ranged
from .56 to .84, indicating that all the variables correlated fairly well with each other
and none of the correlation coefficients were particularly large. This suggests that
multicollinearity is not a problem for the present data. Univariate outliers were also
examined; no subject was found to be a univariate outlier (z < -3 or z > 3) on any of the
variables.
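The screening steps described above (skewness and kurtosis within -2 to 2, and no standardized score beyond 3 in absolute value) can be sketched as below. The formulas are an assumption: the sketch uses the simple moment-based definitions of skewness and excess kurtosis, whereas statistical packages often apply small-sample corrections, so values may differ slightly from those a package would report. All names and data are hypothetical.

```python
from math import sqrt


def central_moment(values, order):
    """Return the central moment of the given order (divisor n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (zero for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


def univariate_outliers(values, cutoff=3.0):
    """Return indexes of cases whose z-score (sample SD) exceeds the cutoff."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [i for i, v in enumerate(values) if abs((v - mean) / sd) > cutoff]


# Hypothetical scores on one variable.
scores = [1, 2, 3, 4, 5]
within_range = -2 <= skewness(scores) <= 2 and -2 <= excess_kurtosis(scores) <= 2
print(within_range, univariate_outliers(scores))
```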
Table 4.8 Descriptive statistics for all variables
Variable   Min   Max   M   SD   Skewness   Kurtosis
CGN
CGO
CVN
CVO
CFN
CFO
CPN
CPO
FGN
FGO
FVN
FVO
FFN
FFO
FPN
FPO
Note. N = 79.
4.2.2 Exploratory factor analyses
First, an exploratory factor analysis, using the principal axis method (principal
factor analysis) and a varimax rotation pattern, was carried out to explore the number of
factors underlying the eight observed variables for the computer mode. The solution
revealed only one factor. The factor was produced based on eigenvalues greater than 1.0
as shown in Table 4.9. The scree plot (see Figure 4.1) confirmed the meaningfulness of
the factor. The factor accounted for about 75% of the total variance. As seen in Table
4.10, the magnitudes of all the factor loadings were substantial, ranging from .82 to .90.
This suggests that all the eight variables, which represent four rating elements in each
task, are reasonably good indicators of this factor.
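The retention rule used here (eigenvalues greater than 1.0) and the percentage of variance attributed to each factor can be illustrated with the eigenvalues reported in Table 4.9. The sketch assumes that the percentages are computed as each eigenvalue divided by the number of observed variables, which holds for the initial eigenvalues of a correlation matrix; the function names are hypothetical.

```python
def kaiser_retained(eigenvalues, threshold=1.0):
    """Count the factors whose eigenvalue exceeds the threshold."""
    return sum(1 for value in eigenvalues if value > threshold)


def percent_variance(eigenvalues):
    """Percentage of total variance per factor, assuming standardized variables."""
    n_variables = len(eigenvalues)
    return [100.0 * value / n_variables for value in eigenvalues]


# Eigenvalues for the computer mode, as reported in Table 4.9.
eigenvalues = [6.04, 0.74, 0.41, 0.30, 0.17, 0.17, 0.10, 0.08]
print(kaiser_retained(eigenvalues))
print([round(p, 1) for p in percent_variance(eigenvalues)])
```

Because the eigenvalues are rounded to two decimals, the recomputed percentages match the tabled ones only approximately.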
Table 4.9 Exploratory factor analysis of data from the computer mode
Factor   Eigenvalue   Percentage of variance   Cumulative percentage
1        6.04         75.45                    75.45
2        0.74         9.23                     84.68
3        0.41         5.14                     89.82
4        0.30         3.73                     93.55
5        0.17         2.12                     95.67
6        0.17         2.09                     97.76
7        0.10         1.29                     99.05
8        0.08         0.95                     100.00
[Scree plot; x-axis: Factor number]
Figure 4.1 Scree plot for data from the computer mode
Table 4.10 Factor loading for the data from the computer mode
Variable   Factor loading
CGN        .87
CVN        .87
CFN        .84
CPN        .88
CGO        .89
CVO        .89
CFO        .82
CPO        .90
A principal factor analysis with varimax rotation was also conducted for the
face-to-face mode, and the results obtained were similar to those for the computer mode.
As indicated in Table 4.11, the results showed that only one factor with eigenvalues
greater than 1.0 was extracted. The scree test also suggested one factor (see Figure 4.2).
Table 4.11 shows that about 77% of the variance was explained by this factor. Table
4.12 presents the factor loadings of the variables, which demonstrate that the factor was
well defined by the variables since factor loadings were high within a range of .83
to .91.
Table 4.11 Exploratory factor analysis of data from the face-to-face mode
Factor   Eigenvalue   Percentage of variance   Cumulative percentage
1        6.12         76.54                    76.54
2        0.62         7.70                     84.24
3        0.43         5.36                     89.60
4        0.25         3.18                     92.77
5        0.19         2.42                     95.19
6        0.16         1.96                     97.16
7        0.13         1.66                     98.81
8        0.09         1.19                     100.00
[Scree plot; x-axis: Factor number]
Figure 4.2 Scree plot for data from the face-to-face mode
Table 4.12 Factor loading for the data from the face-to-face mode
Variable   Factor loading
FGN        .87
FVN        .90
FFN        .86
FPN        .85
FGO        .90
FVO        .91
FFO        .83
FPO        .86
The analyses to this point revealed that both modes seemed to measure only one
factor. Further, given that factor loading reflects the portion of the total variance that
each variable contributes to the factor, a comparison of the factor loadings of 8 variables
across modes shows that each pair of observed variables generally has equivalent
loading on the factor.
In order to support the above analyses for the two modes, another principal factor
analysis with varimax rotation was performed with the 16 observed variables of both the
computer mode and the face-to-face mode. Again, the solution produced only one
component on the basis of eigenvalues greater than 1.0 (see Table 4.13), which was
confirmed by the scree plot in Figure 4.3. This single factor accounted for most of the
total variance (71%). As can be seen in Table 4.14, all the variables loaded highly on
the factor, ranging from .78 to .88. Further, they seemed to contribute similarly to the
major component with similarly high values of factor loadings when compared across
modes.
Taken together, the results described in this section indicate that monologic tasks
delivered in the computer mode and the face-to-face mode seem to measure the same
psychometric construct.
Table 4.13 Exploratory factor analysis of combined data from both modes
Factor   Eigenvalue   Percentage of variance   Cumulative percentage
1        11           71.12                    71.12
2                     5.41                     76.53
3                     4.97                     81.50
4                     3.79                     85.29
5                     2.78                     88.07
6                     2.55                     90.62
7                     2.03                     92.65
8                     1.40                     94.05
9                     1.17                     95.22
10                    1.07                     96.29
11                    0.92                     97.21
12                    0.83                     98.04
13                    0.62                     98.65
14                    0.52                     99.17
15                    0.42                     99.60
16                    0.40                     100.00
[Scree plot; x-axis: Factor number]
Figure 4.3 Scree plot for the combined data from both modes
Table 4.14 Factor loading for the combined data from both modes
Variable   Factor loading
4.3 Comparing speech samples
In this section, reliabilities of the codings of speech samples between two coders
are first reported. The results of grouping examinees according to their scores on
computer-delivered tasks are then introduced. Finally, the results of comparing speech
samples between the two modes are presented.
4.3.1 Inter-coder reliability
The inter-coder reliability was examined through agreement between the codings
from the two coders. Table 4.15 summarizes the inter-coder reliabilities for all the
coding units. As can be seen in Table 4.15, the achieved levels were high in almost all
cases, ranging from 87% to 99%.
Table 4.15 Summary of inter-coder reliability
Category   Coding units   Inter-coder agreement (%)
Fluency
Accuracy
Complexity
Speech time
Length of pause time
Unfilled pauses
Filled pauses
Words of repetition
Words of self-correction
Words of false starts
Total words
AS-unit
Independent clause
Subordinate clause
Errors
Type
Token
Grammatical words
Lexical words
High-frequency lexical words
Low-frequency lexical words
4.3.2 Grouping of examinees based on test scores
In order to examine a possible interaction effect between language proficiency
and delivery mode, participants were categorized into three groups based on their total
score on computer-delivered tasks (see footnote 17). As shown in Table 4.16, those who scored over
66.6% were assigned to the high proficiency group (n = 27; M = 20.63), and those who
scored between 66.6% and 33.3% were assigned to the middle proficiency group (n =
26; M = 15.17). The rest were assigned to the low proficiency group (n = 26; M =
9.69). The results of a one-way ANOVA revealed a significant effect for the placement
in a proficiency group, F(2, 76) = 227.82, p < .00. Post hoc analyses (Tukey) indicated
that each group was significantly different in their total scores at p < .00.
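The one-way ANOVA on the proficiency groups can be approximately reproduced from group-level summary statistics. The sketch below is an illustration, not the original analysis: it takes the group means and standard deviations from Table 4.16 (low: M = 9.69, SD = 1.50; mid: M = 15.17, SD = 1.27; high: M = 20.63, SD = 2.54) and assumes group sizes of 26, 26, and 27 for the low, middle, and high groups. The function name is hypothetical.

```python
def one_way_anova_from_summary(sizes, means, sds):
    """Return (F, df_between, df_within) from per-group n, mean, and SD."""
    n_total = sum(sizes)
    n_groups = len(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    # Between-groups sum of squares: weighted squared distance from the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    # Within-groups sum of squares recovered from each group's sample variance.
    ss_within = sum((n - 1) * sd ** 2 for n, sd in zip(sizes, sds))
    df_between = n_groups - 1
    df_within = n_total - n_groups
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Low, middle, and high proficiency groups (group sizes are an assumption).
f_value, df_between, df_within = one_way_anova_from_summary(
    [26, 26, 27], [9.69, 15.17, 20.63], [1.50, 1.27, 2.54]
)
print(round(f_value, 2), df_between, df_within)
```

With these rounded inputs the recomputed F comes out close to the reported F(2, 76) = 227.82; small differences are expected from rounding.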
Table 4.16 Descriptive statistics for proficiency groups
Proficiency group   M       SD     Min     Max
Low                 9.69    1.50   8.00    12.00
Mid                 15.17   1.27   13.00   17.50
High                20.63   2.54   17.50   26.50
Total               15.23   4.87   8.00    26.50
4.3.3 Effect of delivery mode and interaction with proficiency
In order to examine the effect of delivery mode on examinees’ speech samples and
the interaction of delivery mode with examinees’ proficiency, repeated measures ANOVAs
were conducted. The results are presented with respect to fluency, accuracy, and
complexity.
17 Given that the GTEC for STUDENTS is a computer-delivered test and the correlation figure between
the total scores for the computer and face-to-face version was quite high (r = .95), it was decided to use
the total scores from the computer version.
4.3.3.1 Fluency
Table 4.17 displays the means and standard deviations for fluency measures by
delivery mode and proficiency level. As can be seen in Table 4.17, fluency is higher in
computer-delivered monologic tasks for measures of speech rate and dysfluent words,
whereas it is higher in the face-to-face mode regarding filled pauses and unfilled pauses.
Table 4.17 Descriptive statistics of fluency measures by mode and proficiency
Measure (per 60 seconds)      Prof   Computer   Face-to-face
                                     M    SD    M    SD
No. of words
No. of unfilled pauses
No. of filled pauses
No. of dysfluent words
No. of repetition words
No. of self-correction words
No. of false start words
Note. Prof = Proficiency groups.
Table 4.18 summarizes the statistics of measures of fluency by means of the
repeated measures ANOVA. As shown in Table 4.18, a main effect of delivery mode
was found for three measures, and the results were somewhat mixed. There were
significant differences in the number of dysfluent words per 60 seconds (F(1,2) = 15.16,
p < .00) and the number of repetition words per 60 seconds (F(1,2) = 16.05, p < .00).
That is, examinees used more repetition words in face-to-face monologic tasks than in
those delivered by computer. This means that examinees were more fluent with the
computer mode than with the face-to-face mode. A significant difference was also
observed for the measure of filled pauses per 60 seconds (F(1,2) = 7.55, p < .01) but in
the opposite direction. That is, examinees used more filled pauses in computer-delivered
tasks, indicating that they were more fluent in the face-to-face mode. There was no
significant interaction effect between delivery mode and proficiency level, suggesting
that fluency of examinees’ speech produced with the two modes was not affected
differently by their proficiency.
Table 4.18 ANOVA results for fluency measures
Measure (per 60 seconds)       Mode    Level        Mode*level
                               F       df   F       df   F
No. of words                   2.14    2    38.46   2    0.16
No. of unfilled pauses         0.11    2    36.33   2    0.48
No. of filled pauses           7.55    2    0.92    2    1.69
No. of dysfluent words         15.16   2    0.77    2    0.32
No. of repetition words        16.05   2    1.35    2    0.14
No. of self-correction words   2.83    2    0.27    2    0.41
No. of false start words       2.06    2    0.05    2    1.26
4.3.3.2 Accuracy
Accuracy was examined in terms of two measures: ratio of error-free clauses and
ratio of error-free AS-units. Table 4.19 presents the means and standard deviations for
accuracy measures by delivery mode and proficiency level. According to Table 4.19,
the face-to-face tasks yielded a slightly higher accuracy on both measures. However, the
results of the repeated measures ANOVAs failed to show these differences to be
statistically significant (see Table 4.20). This means that linguistic accuracy was not
affected by the delivery mode. In addition, no interaction effect between delivery mode
and proficiency level was statistically significant. This demonstrates that linguistic
performance across modes was not different in the aspect of accuracy among the three
proficiency groups.
Table 4.19 Descriptive statistics of accuracy measures by mode and proficiency
Measure                             Prof    Computer           Face-to-face
                                            M      SD     n    M      SD     n
Percentage of error-free clauses    Low     0.28   0.21   26   0.28   0.24   26
                                    Mid     0.47   0.19   26   0.54   0.19   26
                                    High    0.65   0.19   27   0.63   0.18   27
                                    Total   0.47   0.24   79   0.49   0.25   79
Percentage of error-free AS-units   Low     0.18   0.19   26   0.19   0.23   26
                                    Mid     0.35   0.21   26   0.44   0.19   26
                                    High    0.54   0.21   27   0.56   0.18   27
                                    Total   0.36   0.25   79   0.40   0.25   79
Note. Prof = Proficiency groups.
Table 4.20 ANOVA results for accuracy measures
Measure                             Mode              Level              Mode*level
                                    df   F      p     df   F       p     df   F
Percentage of error-free clauses    1    0.51   .48   2    30.05   .00   2    1.50
Percentage of error-free AS-units   1    2.69   .10   2    30.41   .00   2    1.13
4.3.3.3 Complexity
Complexity was measured in terms of both syntactic complexity and lexical
complexity. Table 4.21 shows descriptive statistics for complexity measures by delivery
mode and proficiency level. For the measures of syntactic complexity, the means for the
computer mode were slightly higher than those for the face-to-face mode. Lexical
complexity also increased in the computer mode but only with respect to Guiraud’s
Index, while two other measures of lexical density went in the opposite direction.
Table 4.21 Descriptive statistics of complexity measures by mode and proficiency
Measure                             Prof   Computer   Face-to-face
                                           M    SD    M    SD
Percentage of clauses
Percentage of subordinate clauses
No. of words
Guiraud’s Index
Lexical density
Weighted lexical density
Note. Prof = Proficiency groups.
As shown in Table 4.22, the repeated measures ANOVAs revealed these
differences not to be significant in terms of the main effect of the delivery mode. That is,
there was no significant difference in syntactic complexity and lexical complexity of the
language produced in the two modes. Again, no significant interaction effect between
proficiency level and delivery mode could be established. This implies that examinees
at different proficiency levels did not perform differently in terms of linguistic
complexity across modes.
Table 4.22 ANOVA results for complexity measures
Measure                             Mode   Level   Mode*level
                                    F      F       F
Percentage of clauses               0.62   35.17   0.53
Percentage of subordinate clauses   0.01   27.15   1.96
No. of words                        1.16   40.33   0.14
Guiraud’s Index                     0.07   50.59   0.25
Lexical density                     1.67   11.77   1.73
Weighted lexical density            1.94   9.42    1.83
4.4 Comparing examinee attitudes
In order to explore the face validity of the computer-delivered speaking test from
the examinees’ perspective, two questionnaires, described in Section 3.2.3, were
administered immediately after the two tests were completed. The questionnaires aimed
to collect information in two areas: general attitudes toward speaking tests delivered in
the computer and the face-to-face modes (Questionnaire 1) and a direct comparison of
the two modes (Questionnaire 2) (see Appendix E and Appendix F for the
questionnaires).
4.4.1 Examinee attitudes toward the two modes
Table 4.23 presents the means and standard deviations for the five statements in
Questionnaire 1 on examinee attitudes and perceptions regarding the computer mode
and the face-to-face mode. As shown in Table 4.23, mean scores for examinees’
responses were all above 3, except for those on favorableness (Q4) for the computer
mode (M = 2.95), indicating that examinees generally showed agreement with the
statements for both modes. Specifically, examinees reported that they felt nervous on
both tests. They considered both tests to be difficult but fair. They held a slightly neutral
position toward the computer mode but showed favorable attitudes toward the
face-to-face mode. Finally, they perceived both tests to be accurate measures of their
spoken English.
Table 4.23 Comparative results on Questionnaire 1
Statement                                      Computer      Face-to-face   t
                                               M      SD     M      SD
I felt nervous when I was taking the test.     3.13   1.23   3.56   1.17    2.78**
I feel this test was difficult.                3.68   1.07   3.82   0.99    1.49
I feel the test was fair.                      3.57   0.89   3.76   0.83    1.97
I liked the format of the test.                2.95   1.05   3.16   0.98    2.20*
The test reflects accurately how well I        3.14   0.96   3.40   0.92    2.32*
speak English.
Note. *p < .05; **p < .01.
In order to evaluate differences in examinees’ responses to the statements
regarding the two modes, a dependent t-test was carried out. The results in Table 4.23
revealed that there were significant differences in examinees’ responses to three
statements. That is, examinees reported a lower level of nervousness in the computer
mode (M = 3.13, SD = 1.23) than in the face-to-face mode (M = 3.56, SD = 1.17). The
computer mode was viewed as less favorable (M = 2.95, SD = 1.05) than the
face-to-face mode (M = 3.16, SD = 0.98) and less accurate in reflecting their English
speaking level (M = 3.14, SD = 0.96) than the face-to-face mode (M = 3.40, SD = 0.92).
However, the two modes were not found to be significantly different in test difficulty
and test fairness.
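The dependent t-test used for these comparisons works on each examinee's pair of responses rather than on the two group means, so it cannot be recomputed from the tabled means and standard deviations alone. A minimal sketch of the computation, with a hypothetical function name and hypothetical five-point responses from five examinees, is given below.

```python
from math import sqrt


def paired_t(first, second):
    """Return (t, df) for a dependent t-test on the differences second - first."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (divisor n - 1); must not be zero.
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / sqrt(var_diff / n), n - 1


# Hypothetical responses to one statement in the two modes.
computer = [3, 4, 2, 5, 4]
face_to_face = [4, 4, 3, 5, 5]
t, df = paired_t(computer, face_to_face)
print(round(t, 2), df)
```

A positive t here means that responses were higher in the second list, which corresponds to the direction of the differences reported for the face-to-face mode.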
4.4.2 Direct comparisons of the two modes
Questionnaire 2 aimed to gather information enabling a direct comparison of
examinees’ attitudes and perceptions concerning testing speaking in the computer mode
and the face-to-face mode. Table 4.24 presents the portion of the examinees that chose
each of three options in the six questions.
Chi-square tests were performed to statistically test the difference in percentages
of examinees that chose each option. The results in Table 4.24 showed that the observed
frequencies differed from the expected frequencies at a statistically significant level for
all the questions except that on test difficulty. Specifically, more frequently than
expected, examinees preferred the face-to-face mode and found it more favorable and
more valid though also more nerve-racking. As for Question 2, examinees did not find
the tests in the two modes to be significantly different in difficulty. Regarding Question
3, the results revealed that significantly more examinees than expected considered the
fairness of the two tests to be the same. Overall, the results of Questionnaire 2
corroborate those of Questionnaire 1.
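The chi-square statistic for a single question can be approximately recomputed from the percentages in Table 4.24. The sketch below rests on two assumptions that are not stated in the text: that the expected frequencies are equal across the three options, and that 94 examinees answered Question 1. The function names are hypothetical.

```python
def counts_from_percentages(percentages, n_respondents):
    """Convert reported percentages back to whole-number response counts."""
    return [round(p * n_respondents / 100) for p in percentages]


def chi_square_equal_expected(observed):
    """Goodness-of-fit chi-square against equal expected frequencies."""
    expected = sum(observed) / len(observed)
    return sum((count - expected) ** 2 / expected for count in observed)


# Question 1: computer, face-to-face, both the same (percentages from Table 4.24).
observed = counts_from_percentages([19.1, 51.1, 29.8], 94)
chi_square = chi_square_equal_expected(observed)
print(observed, round(chi_square, 2))
```

The statistic has two degrees of freedom (three options minus one), and under these assumptions the recomputed value agrees closely with the one tabled for Question 1.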
Table 4.24 Comparative results on Questionnaire 2
Question                                        Computer   Face-to-face   Both the same   χ²       n
                                                (%)        (%)            (%)
1 Which test did you feel more nervous          19.1       51.1           29.8            14.89*   94
  taking?
2 Which test did you find more difficult?       26.3       31.6           42.1            3.68
3 Which test did you feel was fairer?           34.1       10.6           55.3            25.51*
4 Which test do you like better?                30.9       49.0           19.1            13.68*
5 Which test do you think reflected your        11.6       50.5           37.9            22.51*
  English level more accurately?
6 Which type of test do you prefer to take      30.2       61.6           8.2             37.28*   86
  in the future?
Note. *p < .05.
In the following, responses to each open-ended question in Questionnaire 2 are
categorized by means of content analysis, and the results for each question are reported
with examples of comments from examinees.
Q1. Which test did you feel more nervous taking?
As can be seen in Table 4.24, a majority of the examinees (51.1%) felt more
nervous in taking the face-to-face test, while a minority (19.1%) perceived the computer
mode to be more anxiety-provoking. Table 4.25 presents the percentage of detailed
comments given for each option. Of those who commented on choosing the
face-to-face mode, 75% attributed their higher level of anxiety to the presence of the
interviewer. In contrast, comments from those who chose the computer mode focused
primarily, and surprisingly, on the facilitative role of reactions from the interviewer as
opposed to the one-sided nature of the computer mode (57%). The following are some
examples of the comments from the examinees:
a. During the face-to-face test, I cared about the expression of the interviewer and
worried about whether I was speaking well or not. But when taking the
computer-delivered speaking test, I don’t need to face an interviewer directly,
which made me feel quite relaxed. (Face-to-face)
a. I felt assured during the face-to-face test when the interviewer gave reactions
with eye contact and backchannels. However, when there was no reaction from
the computer, I felt somewhat tense. (Computer)
a. Rather than to the computer, I prefer speaking in front of a human being, since
I felt somehow she could understand what I was talking about. (Computer)
Table 4.25 Summary of comments on nervousness (Q1)
Reasons                                                  Frequency (N = 68)  Percentage (%)
Computer mode                                            28                  41
  a. no reactions from computer                          [illegible]         57
  b. being the first test taken                          [illegible]         [illegible]
  c. no opportunity to clarify unclear instructions      [illegible]         [illegible]
  d. distracted by the timer on the screen               [illegible]         [illegible]
  e. concerned with quality of recording with computer   [illegible]         [illegible]
Both the same                                            3                   4
  a. having the same task contents                       [illegible]         [illegible]
  b. no confidence in English speaking ability           [illegible]         [illegible]
Face-to-face mode                                        37                  54
  a. the presence of the interviewer                     [illegible]         75
  b. being the first test taken                          [illegible]         [illegible]
  c. unable to concentrate on practice                   [illegible]         [illegible]
  d. unable to practice loudly                           [illegible]         [illegible]
Q2. Which test did you find more difficult?
According to Table 4.24, 42.1% of the examinees thought that the two tests were
not very different in difficulty. Table 4.26 indicates that among examinees who
commented on this question, those who chose “Both the same” gave as their main
reason the fact that the content and procedure of the two types of tests were the same
(86%). Here are some examples of the comments:
a. Since the format of the two types of tests and instructions from the interviewer
and the computer were all the same, I did not feel any difference in difficulty of
the tests. (Both the same)
a. There is no difference in task contents, so I feel the difficulty of the test itself
does not differ. (Both the same)
Table 4.26 Summary of comments on test difficulty (Q2)
Reasons                                                         Frequency (N = 60)  Percentage (%)
Computer mode                                                   20                  33
  a. feeling nervous without any reactions from computer        [illegible]         [illegible]
  b. being the first test taken                                 [illegible]         [illegible]
  c. unable to ask questions                                    [illegible]         [illegible]
  d. unfamiliar with recording their voices on a microphone     [illegible]         [illegible]
  e. unconfident to communicate with voice only                 [illegible]         [illegible]
  f. feeling unmotivated to perform better                      [illegible]         [illegible]
Both the same                                                   21                  35
  a. having the same task contents                              [illegible]         86
  b. no confidence in English speaking ability                  [illegible]         [illegible]
Face-to-face mode                                               19                  32
  a. more anxiety-provoking                                     [illegible]         [illegible]
  b. being the first test taken                                 [illegible]         [illegible]
  c. less control of the pace of test taking                    [illegible]         [illegible]
Q3. Which test did you feel was fairer?
Table 4.24 shows that most of the examinees (55.3%) chose “Both the same” for
fairness of the two types of speaking test. It should also be noted that about three times
as many examinees felt that the computer mode was fairer as felt that the face-to-face
mode was fairer (34.1% vs. 10.6%). Table 4.27
summarizes the percentage of detailed comments given for each option. As indicated in
Table 4.27, the fact that the test contents and procedures were the same across modes
was the main reason examinees chose “Both the same” (87%). In addition, examinees
mentioned the absence of influence by the interviewer in the computer test as the
primary reason for its fairness (72%). The detailed comments are as follows:
a. No matter which type of speaking test it is, since the preparation and response
time, the content of the tasks, and test conditions were the same, I think the
fairness was the same. (Both the same)
a. I think some interviewers may make the examinee feel nervous or relaxed. So,
the atmosphere of the interviewer may well influence how the examinee
performs in the face-to-face test. (Computer)
a. The interviewer looked quite kind, so I was pretty relaxed during the
face-to-face test. But when I took EIKEN in the past, the interviewer at that
time looked quite harsh and unfriendly. So I was quite scared and couldn’t
speak well at that time. (Computer)
Table 4.27 Summary of comments on test fairness (Q3)
Reasons                                                       Frequency (N = 48)  Percentage (%)
Computer mode                                                 25                  52
  a. no influence of the interviewer                          [illegible]         72
  b. administration under the same conditions                 [illegible]         [illegible]
  c. less anxiety-provoking                                   [illegible]         [illegible]
Both the same                                                 16                  33
  a. having the same test contents and test procedures        [illegible]         87
  b. feeling no anxiety on both tests                         [illegible]         [illegible]
Face-to-face mode                                             7                   15
  a. having opportunity to talk to real people                [illegible]         [illegible]
  b. possible to clarify instructions with the interviewer    [illegible]         [illegible]
Q4. Which test do you like better?
Table 4.24 indicates that almost half of the students (49%) reported that they
liked the face-to-face test better than the computer test, while 30.9% favored the
computer test. As can be seen in Table 4.28, comments regarding the affective appeal of
the face-to-face test focused primarily on the following: (a) it felt natural to talk in the
presence of the interviewer (27%); (b) it was less anxiety-provoking (20%); (c) it was
similar to a real communication situation (20%); (d) it was pleasant to talk with people
(16%); (e) it was possible to clarify unclear instructions with the interviewer (11%); and
(f) they felt motivated to use facial expressions and gestures to communicate (7%).
Those who chose the computer mode gave the fact that it was less anxiety-provoking as
the main reason (54%). They also mentioned better concentration (12%) and better
control of the pace of test-taking in the computer mode (12%). Comments from examinees
shed some light on these findings:
a. When I speak in front of a person, I feel like talking to that person. So, I feel I
could speak naturally in the face-to-face test. (Face-to-face)
a. I like the face-to-face test because I feel someone is listening to what I am
talking about. On the contrary, I feel lonely in the computer test. (Face-to-face)
b. I feel more relaxed when I talk to someone than when I talk to a machine.
(Face-to-face)
e. I think it is practical to test how we speak when someone is present. Personally,
I don’t like taking a test on the computer because I feel that the computer can easily
break down. But in the face-to-face test, the interviewer can deal with problems
that may happen during the test. So I feel it is more flexible. (Face-to-face)
e. In the face-to-face test, when there is anything I don’t understand or I want to
clarify, I could ask questions. The interviewer could help me cope with the
situation. (Face-to-face)
f. When the interviewer is present, I feel motivated to use non-verbal means such
as facial expression and gesture to express what I want to say. (Face-to-face)
a. I felt quite nervous in the face-to-face test. But in the computer-delivered
speaking test, since I don’t need to talk in front of someone, I could keep calm
and speak as usual. (Computer)
b. During the computer test, there was no one around. So compared with the
face-to-face test, I was able to concentrate on thinking and giving responses to
the tasks. (Computer)
c. I could give answers to the questions at my own pace in the computer test.
(Computer)
Table 4.28 Summary of comments on affective appeal of the test modes (Q4)
Reasons                                                       Frequency (N = 78)  Percentage (%)
Computer mode                                                 26                  33
  a. less anxiety-provoking                                   [illegible]         54
  b. better concentration on thinking and responding          [illegible]         12
  c. better control of the pace of test taking                [illegible]         12
  d. being a fair test                                        [illegible]         [illegible]
  e. being able to practice loudly                            [illegible]         [illegible]
  f. getting used to working on the computer                  [illegible]         [illegible]
  g. test-like                                                [illegible]         [illegible]
  h. having spare time during the tasks                       [illegible]         [illegible]
  i. not getting shy                                          [illegible]         [illegible]
Both the same                                                 7                   9
  a. having the same task contents                            [illegible]         [illegible]
  b. having both advantages and disadvantages                 [illegible]         [illegible]
Face-to-face mode                                             45                  58
  a. feeling natural to talk in front of the interviewer      12                  27
  b. less anxiety-provoking                                   9                   20
  c. similar to the real communication situation              9                   20
  d. pleasant to talk to real people                          7                   16
  e. possible to clarify instructions with the interviewer    5                   11
  f. motivated to use expression and gestures                 3                   7
Q5. Which test do you think reflected your English level more accurately?
As shown in Table 4.24, half of the students (50.5%) considered the face-to-face
test to give a better representation of their actual English speaking level, whereas only
11.6% chose the computer mode. Table 4.29 presents the percentage of comments given
for each option. According to Table 4.29, those who commented on choosing the
face-to-face mode mainly believed that it was similar to a real communication situation
(55%) and that it was less anxiety-provoking (16%). Here are some examples of the
comments:
a. In the situation of speaking English, it usually involves conversation between
people. So, I think the face-to-face test, which is more similar to real
communication than the computer test, can measure my English speaking
ability more accurately. (Face-to-face)
a. When we actually speak, there are always other people present to listen to us or
talk to us. I think it is meaningless to talk to a computer. (Face-to-face)
Table 4.29 Summary of comments on test validity (Q5)
Reasons                                                       Frequency (N = 54)  Percentage (%)
Computer mode                                                 11                  20
  a. less anxiety-provoking                                   [illegible]         [illegible]
  b. no influence of the interviewer                          [illegible]         [illegible]
  c. test-like                                                [illegible]         [illegible]
Both the same                                                 5                   9
  a. having the same task contents                            [illegible]         [illegible]
  b. having both advantages and disadvantages                 [illegible]         [illegible]
Face-to-face mode                                             38                  70
  a. similar to the real communication situation              [illegible]         55
  b. less anxiety-provoking                                   [illegible]         16
  c. possible to clarify instructions with the interviewer    [illegible]         [illegible]
  d. feeling natural to talk in front of the interviewer      [illegible]         [illegible]
  e. nonverbal performance should also be evaluated           [illegible]         [illegible]
  f. motivated to talk more                                   [illegible]         [illegible]
  g. providing a good degree of pressure to concentrate       [illegible]         [illegible]
Q6. Which type of test would you prefer to take in the future?
Table 4.24 reveals that a clear majority of the examinees (61.6%) would prefer the
face-to-face test when given a choice, while only 30.2% of the students chose the
computer test. As indicated in Table 4.30, those who preferred the face-to-face test gave
the following reasons: (a) they felt less nervous (31%); (b) it was similar to real
communication (29%); and (c) it was pleasant to talk to real people (17%). Interestingly,
those who chose the computer test also mentioned feeling less anxious as the primary
reason (77%). Specific comments included the following:
a. Since I felt quite calm in the face-to-face test, I think this type of speaking test
suits me well. (Face-to-face)
b. Although I felt a little nervous in the face-to-face test, I think it is similar to the
actual communication situation where we need to talk to native speakers. Also,
taking the test seems to be good practice. That’s why I prefer the face-to-face
test. (Face-to-face)
b. Although in daily life we have a chance to speak English in front of
someone, we seldom need to talk to a computer. So the face-to-face test seems
to be the more natural one. (Face-to-face)
a. If the only purpose of taking the test is to pass it, I would prefer the
computer-delivered speaking test because I didn’t feel very nervous during the
test. (Computer)
a. Since I felt very nervous during the face-to-face test, I prefer the computer test
in which I could perform to the best as I usually do. (Computer)
Table 4.30 Summary of comments on preference of the test modes (Q6)
Reasons                                                         Frequency (N = 72)  Percentage (%)
Computer mode                                                   22                  31
  a. less anxiety-provoking                                     [illegible]         77
  b. being a fairer test without the interviewer’s influence    [illegible]         [illegible]
  c. because he (she) is good at it                             [illegible]         [illegible]
  d. better concentration                                       [illegible]         [illegible]
Both the same                                                   2                   3
  a. having both advantages and disadvantages                   2                   100
Face-to-face mode                                               48                  67
  a. less anxiety-provoking                                     [illegible]         31
  b. similar to the real communication situation                [illegible]         29
  c. pleasant to talk to real people                            [illegible]         17
  d. motivated to strive for better performance                 [illegible]         [illegible]
  e. possible to clarify instructions with the interviewer      [illegible]         [illegible]
  f. concerned with quality of recording with computer          [illegible]         [illegible]
  g. getting used to the face-to-face test format               [illegible]         [illegible]