julie j. dubeau canadian defence academy bilc conference san antonio, texas, may 20-24 2007

Are We All On the Same Page?

An Exploratory Study of OPI Ratings Across NATO Countries

Using the NATO STANAG 6001 Scale*

Julie J. Dubeau

Canadian Defence Academy

BILC CONFERENCE

SAN ANTONIO, TEXAS, May 20-24 2007

*This research was conducted as an MA Thesis, Carleton University, September 06

Presentation Outline

Context

Research Questions

Literature Review

Methodology

Results: Ratings, Raters, Scale use

Conclusion

NATO Language Testing Context

Standardized Language Profile (SLP) based on the NATO Standardization Agreement (NATO STANAG) 6001 Language Proficiency Levels

26 NATO countries, 20 Partnership for Peace (PfP) countries

Interoperability is essential

Research Questions

The overarching research question was:

How comparable or consistent are ratings across NATO raters and countries?

Research Questions

Research questions pertaining to the ratings (RQ1)

Research questions pertaining to raters’ training and background (RQ2)

Research questions pertaining to the rating process and to the scale (RQ3)

Literature Review

Testing Constructs: What are we testing?

Rater Variance: How do raters vary?

Methodology

Design of study: Exploratory survey

Participants: Recruited at Sofia BILC 05

103 raters from 18 countries and 2 NATO units

Control group

Methodology

Instrumentation, Procedure & Analysis

Rater data questionnaire

2 Oral Proficiency Interviews (OPIs) A & B

Questionnaire accompanying each sample OPI

Methodology

Analysis

Rating comparisons: original ratings, ‘plus’ ratings

Rater comparisons: training, background

Methodology

Country-to-country comparisons

Within-country dispersion

Rating process: rating factors

Rater/scale interaction: scale user-friendliness

Results RQ1- Summary

Ratings: To compare OPI ratings across NATO countries, and to explore the efficacy of ‘plus levels’ or plus ratings.

Some rater-to-rater differences

‘Plus’ levels brought ratings closer to the mean

Some country-to-country differences

Greater ‘within-country’ dispersion

Low correlation between samples A & B

Results All Ratings for Sample A (level 1)

Levels Numbers %

1 46 44.7

1+ 14 13.6

2 40 38.8

2+ 2 1.9

3 1 1.0

Total 103 100.0
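The percentages in the Sample A table can be checked directly from the counts. The sketch below is only an illustration: the function names are hypothetical, and the second helper shows a plain collapse of plus levels into their base level, which is an assumption about one possible grouping and not the adjustment used in the thesis.

```python
# Counts per STANAG level for Sample A, copied from the table above.
sample_a_counts = {"1": 46, "1+": 14, "2": 40, "2+": 2, "3": 1}


def percentages(counts):
    """Return each level's share of all ratings, rounded to one decimal."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}


def base_level_totals(counts):
    """Collapse plus levels into their base level (e.g. '1+' counts as '1')."""
    totals = {}
    for level, n in counts.items():
        base = level.rstrip("+")
        totals[base] = totals.get(base, 0) + n
    return totals


sample_a_pct = percentages(sample_a_counts)
sample_a_base = base_level_totals(sample_a_counts)
```

A plain collapse gives 60, 42 and 1 ratings for levels 1, 2 and 3. The adjusted ‘plus’ view reports 70, 32 and 1, so the adjustment there is evidently more than a simple regrouping.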

Results All Ratings (with +) for Sample A

Levels Numbers %

Within Level 1 range 70 68.0

Within Level 2 range 32 31.1

Within Level 3 range 1 1.0

Total 103 100.0

[Figure: View of OPI ratings, Sample A. Stacked view of counts within the Level 1, Level 2 and Level 3 ranges; adjusted scores with ‘pluses’.]

[Figure: All Countries’ Means for Sample A. Overall country mean (axis from 1.00 to 2.40) by country number (1 to 20).]

Results All Ratings for Sample B (level 2)

Levels Numbers %

1 2 1.9

1+ 1 1.0

2 47 45.6

2+ 8 7.8

3 34 33.0

3+ 2 1.9

4 2 1.9

Total 96 93.2

[Figure: View of OPI ratings, Sample B. Stacked view of counts within the Level 1, Level 2, Level 3 and Level 4 ranges; adjusted ‘plus’ range.]

[Figure: All Countries’ Means for Sample B. Country mean for Sample B (axis from 1.80 to 3.30) by country number.]

Samples A & B

A Spearman rank-order correlation coefficient ρ = .57

A Pearson product-moment correlation coefficient r = .55

= low statistical correlations between the two sets of data (Samples A & B)

= no consistency from raters
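For readers who want to see how the two coefficients are obtained, here is a minimal sketch in plain Python. The rating lists are hypothetical (they are not the study data), and coding a plus level as +0.5 is an assumption made only for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings from six raters, plus levels coded as +0.5.
ratings_a = [1.0, 1.5, 2.0, 1.0, 2.0, 1.5]
ratings_b = [2.0, 2.0, 3.0, 2.5, 3.0, 2.0]
r = pearson(ratings_a, ratings_b)
rho = spearman(ratings_a, ratings_b)
```

With only a handful of distinct scale levels, ties are frequent, which is why the ranking helper averages tied ranks.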


Results RQ2- Summary

Raters: To investigate rater training and scale training and see how (or if) they impacted the ratings, and to explore how various background characteristics impacted the ratings.

Trained raters scored within the mean, especially for sample B

Experienced raters did not do as well as scale-trained raters

Full-time raters closer to mean

‘New’ NATO raters closer to mean

No difference in ratings between NS & NNS raters

[Figure: Tester (Rater) Training. Frequency by amount of training: none to little, 36.73%; substantial to lots, 63.27%.]

Rating B and Tester Training Crosstabulation

Summary of Tester Trg

Score B correct?   Little   Lots   Total
Yes                    14     44      58
No                     20     14      34
Missing                 2      4       6
Total                  36     62      98
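The crosstabulation is easier to read as a share of correct ratings per training group. The sketch below uses the counts from the Tester Training crosstab; leaving the missing ratings out of the denominator is an assumption, and the function name is hypothetical.

```python
# Counts from the "Rating B and Tester Training" crosstabulation.
crosstab = {
    "Little": {"yes": 14, "no": 20, "missing": 2},
    "Lots": {"yes": 44, "no": 14, "missing": 4},
}


def share_correct(cell_counts):
    """Percentage of valid (non-missing) ratings marked correct."""
    valid = cell_counts["yes"] + cell_counts["no"]
    return round(100 * cell_counts["yes"] / valid, 1)


shares = {group: share_correct(cells) for group, cells in crosstab.items()}
```

On these counts, roughly 41% of raters with little tester training rated Sample B correctly, against roughly 76% of raters with a lot of training.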

[Figure: STANAG Scale Training. Percent by amount of training: none to little, 60.0%; substantial to lots, 40.0%.]

Rating B and STANAG Training Crosstabulation

Summary of STANAG Trg

Rating B correct?   Little   Lots   Total
Yes                     28     29      57
No                      24      8      32
Missing                  5      1       6
Total                   57     38      95

[Figure: Years Experience. Frequency by category. Categories: 5 years +, 4 to 5 years, 2 to 3 years, 0 to 1 year. Percentages: 49.5%, 15.84%, 19.8%, 14.85%.]

Rating B and 4 Yrs Experience Crosstabulation

Experience

Rating B correct?   3 yrs or less   4 yrs or more   Total
Yes                            26              34      60
No                              6              29      35
Missing                         3               3       6
Total                          35              66     101
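One common way to check whether a 2x2 split like this could be due to chance is a Pearson chi-square test of independence. The sketch below applies it to the experience crosstab with the missing ratings left out. It is an illustration only: the slides do not say which test, if any, was used in the thesis, and the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Rows: Rating B correct (Yes, No); columns: 3 yrs or less, 4 yrs or more.
chi2 = chi_square_2x2(26, 34, 6, 29)

# 3.841 is the chi-square critical value for 1 degree of freedom at alpha = 0.05.
significant_at_05 = chi2 > 3.841
```

On these counts the statistic is about 6.79, above the 0.05 critical value, which is in line with the observation that more experienced raters were less often correct on Sample B.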

Results Raters’ Background

Work in Testing Full-time? Yes 34 (33.0 %)

No 67 (65.0 %)

Full-time testers more reliable

60% were NNS

53% were from ‘older’ NATO countries

‘Old’ & ‘New’ NATO Countries

Rating B Correct?

New NATO?   Yes   No   Other/Missing   Total
Yes          27    6               4      37
No           27   26               2      55
Total        54   32               6      92

‘Old’ & ‘New’ NATO Countries

Summary of Tester Trg

New NATO?   Little   Lots   Total
Yes              6     30      36
No              23     28      51
Total           29     58      87


Results RQ3- Summary

Scale: To explore the ways in which raters used the various STANAG statements and rating factors to arrive at their ratings.

Rating process did not affect ratings significantly

Rating factors not equal everywhere

3 main ‘types’ of raters emerged: Evidence-based, Intuitive, Extra-contextual

Results An ‘evidence-based’ rating for Sample B (level 2):

This candidate’s performance cannot be rated as 2+. Grammatical/structural control is inadequate and does not rise above (even occasionally) into the upper level. Mispronunciation detracts from the delivery and can be problematic. No evidence of well-controlled but extended discourse. No clear evidence of the use of even some complex structures that might raise the performance to the + level. Finally, there is no evidence that the performance rises and crosses into level 3. (Rater 36)

Results An ‘intuitive’ rating for Sample A (level 1):

I would say that just about every single sentence in the interpretation of the level 2 speaking could be applied to this man. And because of that I would say that he is literally at the top of level 2. He is on the verge of level 3 literally. So I would automatically up him to a low 3. (Rater 1)

Results An ‘extra-contextual’ rating for Sample A (level 1):

I wouldn’t give him a 2 plus but I would give him a 3 minus. I have to admit that I am basing that decision on the fact that by demonstrating he is a high 2 in every single aspect of the description of a level 2, I would give him a sort of vote of confidence that in any job abroad he might have a hard time at first but I think he could handle really working in the language. (Rater 1)

Results

An ‘extra-contextual’ rating for Sample A (level 1):

Yes! I would be happy to give him a 1+. Since we do not use ‘plus levels’ I am afraid that rating him as a clear 1 would disadvantage him and, for this reason, I would rather give him a very low 2. (Rater 20)

Results An ‘extra-contextual’ rating for Sample A (level 1):

I got to question 7 and re-read the STANAG document and now I think ‘2’ is more appropriate. (Rater 95)

***Level 3 is the basic level needed for officers in (my country). I think the candidate could perform the tasks required of him. He could easily be bulldozed by native speakers in a meeting, but would hold his own with non-native speakers. He makes mistakes that very rarely distort meaning and are rarely disturbing. (Rater 95)

Results

Control group:

Comparable ratings to lesser trained group of participants

Evidence-based ratings

Implications Plus levels beneficial

Training uneven

Frequent re-training

Different grids

Institutional perspectives

Limitations & Future Research

OPIs new to some participants

Future research could: Get participants to test Investigate rating grids Look at other skills

Conclusion

So, are we all on the same page?

YES! BUT…

Plus levels were instrumental in bridging gap

Training was found to be key to reliability

More in-country norming should be the first step toward international benchmarking

Thank You!

Questions?

Are We All On the Same Page?

An Exploratory Study of OPI Ratings Across NATO Countries

Using the NATO STANAG 6001 Scale

Julie J. Dubeau

The full thesis is available on the CDA website

http://cda.mil.ca/dpd/engraph/services/lang/lang_e.asp

(A condensed article is also forthcoming)
