Are We All On the Same Page? An Exploratory Study of OPI Ratings Across NATO Countries Using the NATO STANAG 6001 Scale*. Julie J. Dubeau, Canadian Defence Academy. BILC Conference, San Antonio, Texas, May 20-24, 2007.
Are We All On the Same Page?
An Exploratory Study of OPI Ratings Across NATO Countries
Using the NATO STANAG 6001 Scale*
Julie J. Dubeau
Canadian Defence Academy
BILC CONFERENCE
SAN ANTONIO, TEXAS, May 20-24 2007
*This research was conducted as an MA thesis at Carleton University, September 2006
Presentation Outline
Context
Research Questions
Literature Review
Methodology
Results
  Ratings
  Raters
  Scale use
Conclusion
NATO Language Testing Context
Standardized Language Profile (SLP) based on the NATO Standardization Agreement (NATO STANAG) 6001 Language Proficiency Levels
26 NATO countries, 20 Partnership for Peace (PfP) countries
Interoperability is essential
Research Questions
The overarching research question was:
How comparable or consistent are ratings across NATO raters and countries?
Research Questions
Research questions pertaining to the ratings (RQ1)
Research questions pertaining to raters’ training and background (RQ2)
Research questions pertaining to the rating process and to the scale (RQ3)
Literature Review
Testing Constructs: What are we testing?
Rater Variance: How do raters vary?
Methodology
Design of study: Exploratory survey
Participants: Recruited at the 2005 BILC conference in Sofia
103 raters from 18 countries and 2 NATO units
Control group
Methodology
Instrumentation, Procedure & Analysis
Rater data questionnaire
2 Oral Proficiency Interviews (OPIs) A & B
Questionnaire accompanying each sample OPI
Methodology
Analysis
Rating comparisons
  Original ratings
  ‘Plus’ ratings
Rater comparisons
  Training
  Background
Methodology
Country-to-country comparisons
Within-country dispersion
Rating process
  Rating factors
Rater/scale interaction
  Scale user-friendliness
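The country-to-country and within-country comparisons listed above can be sketched roughly as follows. This is a minimal illustration only: the country labels and ratings are hypothetical, and coding a 'plus' level as the base level + 0.5 is an assumption, not necessarily the coding used in the study.

```python
from statistics import mean, pstdev

# Assumed numeric coding: a 'plus' level counts as the base level + 0.5.
def to_number(level: str) -> float:
    base = float(level.rstrip("+"))
    return base + 0.5 if level.endswith("+") else base

# Hypothetical ratings of one OPI sample, grouped by made-up country labels.
ratings_by_country = {
    "Country 1": ["1", "1+", "2"],
    "Country 2": ["2", "2", "2+"],
    "Country 3": ["1", "1", "1"],
}

def summarize(ratings):
    """Return {country: (mean rating, within-country standard deviation)}."""
    summary = {}
    for country, levels in ratings.items():
        values = [to_number(level) for level in levels]
        summary[country] = (mean(values), pstdev(values))
    return summary

summary = summarize(ratings_by_country)
for country, (m, sd) in summary.items():
    print(f"{country}: mean={m:.2f}, sd={sd:.2f}")
```

Comparing the means across countries gives the country-to-country view, while each standard deviation shows how far raters within one country disagree with each other.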
Results RQ1 - Summary
Ratings: To compare OPI ratings in NATO countries, and to explore the efficacy of ‘plus levels’ or plus ratings.
Some rater-to-rater differences
‘Plus’ levels brought ratings closer to the mean
Some country-to-country differences
Greater ‘within-country’ dispersion
Low correlation between samples A & B
Results
All Ratings for Sample A (level 1)
Levels    Numbers    %
1         46         44.7
1+        14         13.6
2         40         38.8
2+        2          1.9
3         1          1.0
Total     103        100.0
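As a quick check on the table above, the percentage column can be recomputed from the counts. The sketch below uses only the counts shown on the slide; the names `sample_a_counts` and `percentages` are illustrative.

```python
# Counts copied from the Sample A table above.
sample_a_counts = {"1": 46, "1+": 14, "2": 40, "2+": 2, "3": 1}

def percentages(counts):
    """Convert raw counts into percentages of the total, rounded to 1 decimal."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}

sample_a_pct = percentages(sample_a_counts)
for level, pct in sample_a_pct.items():
    print(f"{level:>2}: {pct}%")
```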
Results
All Ratings (with +) for Sample A
Levels                  Numbers    %
Within Level 1 range    70         68.0
Within Level 2 range    32         31.1
Within Level 3 range    1          1.0
Total                   103        100.0
[Figure: Stacked view of OPI ratings for Sample A, adjusted scores with ‘pluses’. Y-axis: Count (0 to 60). Legend: within L1 range, within L2 range, within L3 range. Data labels: 1, 32, 10, 60.]
[Figure: All Countries’ Means for Sample A. X-axis: Overall Country Mean (1.00 to 2.40). Y-axis: Country numbers (1 to 20).]
Results
All Ratings for Sample B (level 2)
Levels    Numbers    %
1         2          1.9
1+        1          1.0
2         47         45.6
2+        8          7.8
3         34         33.0
3+        2          1.9
4         2          1.9
Total     96         93.2
[Figure: Stacked view of OPI ratings for Sample B, adjusted ‘+’ range. Y-axis: Count (0 to 60). Legend: within L1 range, within L2 range, within L3 range, within L4 range. Data labels: 1, 1, 31, 5, 55, 12.]
[Figure: All Countries’ Means for Sample B. X-axis: country mean B (1.80 to 3.30). Y-axis: Country # (1 to 20).]
Samples A & B
A Spearman rank-order correlation coefficient ρ = .57
A Pearson product-moment correlation coefficient r = .55
= low statistical correlations between the two sets of data (Samples A & B)
= no consistency from raters
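For readers unfamiliar with the two coefficients, the sketch below shows how each can be computed for paired ratings of two samples. It is an illustration under stated assumptions: the rating lists are hypothetical, 'plus' levels are assumed to be coded as +0.5, and the code does not reproduce the reported ρ = .57 and r = .55.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical ratings by the same six raters for Samples A and B.
ratings_a = [1.0, 1.5, 2.0, 1.0, 2.0, 1.0]
ratings_b = [2.0, 2.0, 3.0, 2.5, 3.0, 2.0]
print(round(spearman(ratings_a, ratings_b), 2), round(pearson(ratings_a, ratings_b), 2))
```

Because OPI ratings have few distinct levels, ties are frequent, which is why the rank helper averages tied ranks rather than assigning them arbitrarily.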
Results RQ2- Summary
Raters: To investigate rater training and scale training and see how (or if) they impacted the ratings, and to explore how various background characteristics impacted the ratings
Trained raters scored within the mean, especially for sample B
Experienced raters did not do as well as scale-trained raters
Full-time raters closer to mean
‘New’ NATO raters closer to mean
No difference in ratings between native-speaker (NS) and non-native-speaker (NNS) raters
[Figure: Tester (Rater) Training. Y-axis: Frequency (0 to 70). None to little: 36.73%; substantial to lots: 63.27%.]
Rating B and Tester Training Crosstabulation
                    Summary of Tester Trg
Score B correct?    Little    Lots    Total
Yes                 14        44      58
No                  20        14      34
Missing             2         4       6
Total               36        62      98
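One way to read the crosstabulation above is to compare the share of correct Sample B scores in each training group. The sketch below uses the counts from the slide; keeping the 'Missing' responses in the denominator is an assumption made here for illustration, not a choice documented in the presentation.

```python
# Counts copied from the crosstabulation above (Sample B score by tester training).
crosstab = {
    "Little": {"Yes": 14, "No": 20, "Missing": 2},
    "Lots": {"Yes": 44, "No": 14, "Missing": 4},
}

def percent_correct(cells):
    """Share of raters in one training group whose Sample B score was correct.

    Assumption: 'Missing' responses are kept in the denominator.
    """
    total = sum(cells.values())
    return round(100 * cells["Yes"] / total, 1)

pct = {group: percent_correct(cells) for group, cells in crosstab.items()}
print(pct)
```

Under that assumption, about 39% of raters with little tester training scored Sample B correctly, against about 71% of those with a lot of training.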
[Figure: STANAG Scale Training. Y-axis: Percent (0 to 60). None to little: 60.0%; substantial to lots: 40.0%.]
Rating B and STANAG Training Crosstabulation
                     Summary of STANAG Trg
Rating B correct?    Little    Lots    Total
Yes                  28        29      57
No                   24        8       32
Missing              5         1       6
Total                57        38      95
[Figure: Years Experience. Y-axis: Frequency (0 to 50). Categories: 0 to 1 year, 2 to 3 years, 4 to 5 years, 5 years +. Data labels: 49.5%, 15.84%, 19.8%, 14.85%.]
Rating B and 4 Yrs Experience Crosstabulation
                     Experience
Rating B correct?    3 yrs or less    4 yrs or more    Total
Yes                  26               34               60
No                   6                29               35
Missing              3                3                6
Total                35               66               101
Results Raters’ Background
Work in Testing Full-time? Yes 34 (33.0 %)
No 67 (65.0 %)
Full-time testers more reliable
60% were NNS
53% were from ‘older’ NATO countries
‘Old’ & ‘New’ NATO Countries
             Rating B Correct?
New NATO?    Yes    No    Other/Missing    Total
Yes          27     6     4                37
No           27     26    2                55
Total        54     32    6                92
‘Old’ & ‘New’ NATO Countries
             Summary of Tester Trg
New NATO?    Little    Lots    Total
Yes          6         30      36
No           23        28      51
Total        29        58      87
Results RQ3- Summary
Scale: To explore the ways in which raters used the various STANAG statements and rating factors to arrive at their ratings.
Rating process did not affect ratings significantly
Rating factors not equal everywhere
3 main ‘types’ of raters emerged: Evidence-based, Intuitive, Extra-contextual
Results An ‘evidence-based’ rating for Sample B (level 2):
This candidate’s performance cannot be rated as 2+. Grammatical/structural control is inadequate and does not rise above (even occasionally) into the upper level. Mispronunciation detracts from the delivery and can be problematic. No evidence of well-controlled but extended discourse. No clear evidence of the use of even some complex structures that might raise the performance to the + level. Finally, there is no evidence that the performance rises and crosses into level 3. (Rater 36)
Results An ‘intuitive’ rating for Sample A (level 1):
I would say that just about every single sentence in the interpretation of the level 2 speaking could be applied to this man. And because of that I would say that he is literally at the top of level 2. He is on the verge of level 3 literally. So I would automatically up him to a low 3. (Rater 1)
Results An ‘extra-contextual’ rating for Sample A (level 1):
I wouldn’t give him a 2 plus but I would give him a 3 minus. I have to admit that I am basing that decision on the fact that by demonstrating he is a high 2 in every single aspect of the description of a level 2, I would give him a sort of vote of confidence that in any job abroad he might have a hard time at first but I think he could handle really working in the language. (Rater 1)
Results
An ‘extra-contextual’ rating for Sample A (level 1):
Yes! I would be happy to give him a 1+. Since we do not use ‘plus levels’ I am afraid that rating him as a clear 1 would disadvantage him and, for this reason, I would rather give him a very low 2. (Rater 20)
Results An ‘extra-contextual’ rating for Sample A (level 1):
I got to question 7 and re-read the STANAG document and now I think ‘2’ is more appropriate. (Rater 95)
***Level 3 is the basic level needed for officers in (my country). I think the candidate could perform the tasks required of him. He could easily be bulldozed by native speakers in a meeting, but would hold his own with non-native speakers. He makes mistakes that very rarely distort meaning and are rarely disturbing. (Rater 95)
Results
Control group:
Comparable ratings to lesser trained group of participants
Evidence-based ratings
Implications
Plus levels beneficial
Training uneven
Frequent re-training
Different grids
Institutional perspectives
Limitations & Future Research
OPIs new to some participants
Future research could:
  Get participants to test
  Investigate rating grids
  Look at other skills
Conclusion
So, are we all on the same page?
YES! BUT…
Plus levels were instrumental in bridging gap
Training was found to be key to reliability
More in-country norming should be the first step toward international benchmarking
Thank You!
Questions?
Are We All On the Same Page?
An Exploratory Study of OPI Ratings Across NATO Countries
Using the NATO STANAG 6001 Scale
Julie J. Dubeau [email protected]
The full thesis is available on the CDA website
http://cda.mil.ca/dpd/engraph/services/lang/lang_e.asp
(A condensed article is also forthcoming)