This article was downloaded by: [Dicle University] on 09 November 2014, at 11:37. Publisher: Taylor & Francis. Informa Ltd, registered in England and Wales, Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.
Ergonomics. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/terg20
Predicting and interpreting identification errors in military vehicle training using multidimensional scaling
Corey J. Bohil a, Nicholas A. Higgins a & Joseph R. Keebler b
a Department of Psychology, University of Central Florida, Orlando, FL, USA; b Department of Psychology, Wichita State University, Wichita, KS, USA. Published online: 04 Apr 2014.
To cite this article: Corey J. Bohil, Nicholas A. Higgins & Joseph R. Keebler (2014) Predicting and interpreting identification errors in military vehicle training using multidimensional scaling, Ergonomics, 57:6, 844-855, DOI: 10.1080/00140139.2014.899631
To link to this article: http://dx.doi.org/10.1080/00140139.2014.899631
PLEASE SCROLL DOWN FOR ARTICLE
Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content.
This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions
Predicting and interpreting identification errors in military vehicle training using multidimensional scaling
Corey J. Bohil a*, Nicholas A. Higgins a and Joseph R. Keebler b
a Department of Psychology, University of Central Florida, Orlando, FL, USA; b Department of Psychology, Wichita State University, Wichita, KS, USA
(Received 23 September 2013; accepted 21 February 2014)
We compared methods for predicting and understanding the source of confusion errors during military vehicle identification training. Participants completed training to identify main battle tanks. They also completed card-sorting and similarity-rating tasks to express their mental representation of resemblance across the set of training items. We expected participants to selectively attend to a subset of vehicle features during these tasks, and we hypothesised that we could predict identification confusion errors based on the outcomes of the card-sort and similarity-rating tasks. Based on card-sorting results, we were able to predict about 45% of observed identification confusions. Based on multidimensional scaling of the similarity-rating data, we could predict more than 80% of identification confusions. These methods also enabled us to infer the dimensions receiving significant attention from each participant. This understanding of mental representation may be crucial in creating personalised training that directs attention to features that are critical for accurate identification.
Practitioner Summary: Participants completed military vehicle identification training and testing, along with card-sorting and similarity-rating tasks. The data enabled us to predict up to 84% of identification confusion errors and to understand the mental representation underlying these errors. These methods have potential to improve training and reduce identification errors leading to fratricide.
Keywords: vehicle identification; training; incidental learning; multidimensional scaling
Introduction
Friendly fire accidents (also known as fratricide or blue-on-blue incidents) often result from mistaken identification of
military vehicles by individual soldiers (Regan 1995). Such errors disrupt the planning of appropriate responses to rapidly
changing field conditions, often with life-threatening results (Briggs and Goldberg 1995; Keebler et al. 2010). Research
suggests that novices can overreact based on the appearance of a prominent feature such as tank treads or turrets (Biederman
and Shiffrar 1987) and that, depending on the extent of feature overlap, vehicles can be easily confused with one another
(O’Kane et al. 1997). Accurate combat identification likely requires (1) attending to multiple features and (2) not over-attending to features with little predictive value for a vehicle’s identity.
A common vehicle identification-training method is through repeated study of images – often line drawings – of
armoured vehicles until their identities are memorised (Keebler, Jentsch, and Hudson 2011). Although simple and cost
effective, there are shortcomings to this approach. There is no interactivity to promote active engagement in learning.
Training is known to be more effective when learners are deeply engaged in the training system (Kirkpatrick 1975; Malone
1981; Keebler, Jentsch, and Hudson 2011). Also, given the complexity of the training items (armoured vehicles with many
features) there is no way to know which features are most prominent in the learner’s memory.
Object identification requires learning a 1:1 mapping of stimulus features to response labels (see Figure 1a). However,
given the large number of features comprising an object as complex as a vehicle, there could be any number of feature
subsets upon which a learner fixates. Furthermore, attending to idiosyncratic – but not predictive beyond the training set –
features of training examples can lead to learning that is difficult to reverse (Biederman and Shiffrar 1987). However, given
the task of memorising several training items, focusing on a subset of outward physical features (even if unconsciously)
may be effective and efficient. Over time, experienced learners will develop a deeper, knowledge-based understanding of
the critical features for each vehicle type (i.e. other thought processes over and above perception and memory play a role).
But within the operational environment, most military personnel making identification judgements may be relatively
inexperienced. For example, upon encountering an armoured vehicle in the field, infantry must make rapid decisions to
engage or retreat, and must communicate what they see to forces elsewhere. These identification judgements are often made
in poor viewing conditions, including distance, rain or dust, and occlusion by other objects, as well as under time pressure
© 2014 Taylor & Francis
*Corresponding author. Email: [email protected]
Ergonomics, 2014
Vol. 57, No. 6, 844–855, http://dx.doi.org/10.1080/00140139.2014.899631
and the duress of battle (Keebler et al. 2007). Improved decision-making in the field could result from identification training
that is tailored to help learners focus attention on the most diagnostic features and avoid attending to less informative
features (Biederman and Shiffrar 1987; Keebler et al. 2008; Keebler, Jentsch, and Hudson 2011).
A well-known finding from the literature on classification learning is that when stimuli are complex, learners often focus
on a subset of stimulus features (Biederman and Shiffrar 1987; Yamauchi and Markman 1998). During identification
training, learners may incidentally (i.e. unintentionally) form a mental representation of vehicle classes in addition to
memorising individual items (e.g. Kemler Nelson 1984; Smith 2008; Folstein, Gauthier, and Palmeri 2010). While
progressing through a training set, learners are likely to notice recurring features across items (see Figure 1b). For example,
some tanks have treads that are covered by armour while many do not; some vehicles have various peripheral devices
mounted to their surface while others do not, and so on. Critically, this ‘many to one mapping’ mental representation may
occur despite the fact that training feedback does not suggest any sort of classification scheme across sets of training objects,
but rather supports only individual item identification.
The current research explores the possibility that participants develop mental classification schemes incidentally during
identification training, and the possibility that we can (1) determine what their mental representations are like and (2)
predict their identification errors based on this knowledge. Participants completed an identification-training task in which
they learned to identify a set of armoured military vehicles (main battle tanks). During identification training, participants
received information about the unique identity of each individual training item. The training information did not serve to
reinforce their attention to any particular subset of stimulus dimensions or class of vehicles based on common features. If
participants notice recurring differences and similarities across training items and form associations between these items
Figure 1. Representative items from the training set. (a) Identification: 1:1 mapping of stimuli to responses. (b) Classification: many stimuli treated as equivalent.
(even implicitly) then this could be considered a type of ‘unsupervised’ category formation (i.e. training feedback does not
reinforce category formation). We included both a passive ‘observational’ identification-training condition that is akin to
learning with a set of training images and a more actively engaging ‘feedback’ training condition, during which participants
had to guess the identity of each item followed by corrective feedback. Our goal was not necessarily to contrast these
training conditions, but rather to evaluate our ability to predict identification errors under a variety of training methods.
Participants also completed two tasks designed to assess their mental representation of the training set. They completed
a card-sorting task – both before and after the identification-training task – in which they placed each training item into
whatever piles made sense to them. Card-sorting is a widely used task for gaining insight into mental classification schemes
and elicitation of knowledge structures (Edwards et al. 2006). After completing these tasks, participants also completed a
similarity-rating task in which they compared the similarity of each possible pair of training images. These ratings facilitate
multidimensional scaling (MDS) analysis (details below), which we used to gain another measure of mental representation
of the stimulus set. Both tasks provide information regarding the participant’s focus of attention when examining members
of the training set.
Our main goal was to infer something about the mental representations that develop during training and to examine how well
we can predict identification confusion errors (i.e. confusing one training item for another) using that information. We
hypothesise that learners notice subsets of features during identification training and that attention to these features
contributes to their confusions (e.g. tanks with similar treads may be more confusable if this feature is the focus of attention).
If this is the case, then it may be possible to infer the dimensions that are receiving the most attention from a learner and
potentially predict some confusion errors during the training process. Knowledge of this underlying mental representation of
the training set could be used to create adaptive training that focuses attention on the most important dimensions and reduces
identification errors. After presenting the results of our comparison between several prediction methods, we consider the
possibility of using the methods explored here for developing an adaptive system for identification training.
Method
Participants
Undergraduate students from the University of Central Florida (n = 38) voluntarily participated in the experiment for
course credit. Data from two participants were removed for failure to complete the tasks as instructed, so our analyses are
based on the remaining 36 participants.
Stimuli
The stimuli used across the experimental tasks were line drawings of armoured military vehicles, selected from 54 study
cards included in Graphic Training Aid 17-02-013: Armoured Vehicle Recognition, January 1997. Figure 1 shows some of
the images used in the study. A total of 32 drawings were selected from the set of study cards. Twenty-two were drawings of
main battle tanks, including 2S1-M1974, AMX-30, Centurion, Challenger, Chieftain, Leopard-II, M1-Abrams, M48A5,
M60A3, T-62, T-64, T-72, AMX-13, ASU-85, Leopard-1A2, Jagdpanzer-Kanone, Jagdpanzer-SK105, Leopard-1A4, T-80,
BMP-2, M109, T-54/55. A subset of 12 vehicles was chosen to be used in the identification-training task (the first 12 in the
list above), while the remaining 10 were used for comparison in the identification test, card-sort and similarity-rating phases
of the study. These vehicles were chosen because they were all of the same type (i.e. main battle tanks) and because they
appeared (to the researchers) to be highly similar to each other. An additional 10 ‘non-similar’ drawings from the set of
cards were also selected, including AMX-10P, BMD, BMP, Airborne, Jaguar, M2IFV, M551A1 Sheridan, Marder,
Type531, ZSU-57-2. These were selected because they were different in appearance from the main battle tanks (e.g. many
were not main battle tanks, although all were armoured military vehicles that were potentially confusable with other items
in the study). These 10 items appeared along with the items above during the identification test.
Images were roughly 2 inches tall and 3 inches wide when presented on paper cards (during the card-sort task), and
roughly 4 × 6 inches when presented on the computer screen (during the identification and similarity-rating tasks). All
vehicles appeared at roughly the same oblique angle or rotation, and we controlled for image size and extraneous details in
the images (name information, size comparison with a human).
Procedure
Each participant completed several tasks. First was a card-sorting task; second, they were trained and tested on
identification; and third, they conducted another card-sorting task to assess changes due to identification training. We refer
to the sorting tasks as pre- and post-test sorting tasks since they came before and after the identification task. Finally,
all participants finished the session by completing a similarity-rating task. Task details are as follows.
Card-sorting task
Participants were presented with a stack of 22 cards (the ‘similar’ cards described above), each with a single vehicle image
on one side. They were instructed to spread the cards onto a table and then sort them into piles. They were instructed that the
piles could be organised however they wished as long as the piles made sense to them. No other direction was given about
how to sort the vehicles. If the participant asked questions during the sorting task, they received only encouragement to sort
the vehicles however they felt made sense. They were also instructed that they should sort into at least two piles, but could have as many piles as they wished beyond two. They were further informed that they would have to provide a name for each pile after the sort was completed, and that they would be asked to explain the basis for sorting the images and the rationale for naming the piles. Finally, for each vehicle in a sort pile, they were asked to rate the ‘goodness’ of that vehicle for the pile. These goodness ratings used a three-point scale (1 = fair, 2 = good, 3 = perfect), providing a metric of how representative each vehicle image was of the pile into which it was placed.
The sorting task was completed at the beginning of the experimental session, and again following the identification test
phase. We assumed that the post-ID sort results would reflect knowledge gained from experience with the images (i.e.
during the ID training and test phases). Each sorting task (pre- and post-test) took about 10 minutes to complete.
Identification training and testing tasks
There were two identification-training conditions. In both training conditions, participants viewed a series of 12 vehicle
images – one at a time – on a computer screen. These 12 vehicles also appeared in the sorting-task set.
In the ‘observational’ training condition, on each trial the participant would view a tank image, along with the name of
that tank. They could view the image for as long as they wished, and they pressed a key to move on to the next image. The
set of 12 images was presented in 10 training blocks (with order randomised for each block), for a total of 120 training trials.
In the ‘feedback’ condition, the participant was presented on each trial with one of the 12 vehicles. Instead of a single
label, all 12 vehicle labels were presented on the screen, along with an instruction to ‘press the corresponding keyboard key
to guess the name of the tank’. After pressing a response key, the screen was cleared, followed by ‘correct’ or ‘incorrect –
that was a [correct label of tank]’. The feedback remained on the screen for two seconds. The 12 training tanks were
randomly presented in each training block. The training continued until the participant provided the correct category label
for all 12 tanks in a row. It is typical in identification-training studies with feedback to continue training until an accuracy-
based criterion is reached (e.g. 100% accuracy one or two times through the training set). We followed this convention for
the feedback training condition. In the observational condition, we selected the number of training cycles based on a guess
as to the number of training trials that would likely be needed to reach criterion for participants in the feedback condition.
This was done to keep the number of training observations roughly equal across conditions. As shown in the results section,
the number of training trials completed in both conditions was similar.
The observational and feedback training conditions were presented between subjects. No participant completed more
than a single training condition.
Following training, each participant (in both training conditions) completed an identification test. In random order, the
12 training items, along with 10 additional similar and 10 non-similar (described above) tank images were presented on the
computer screen, one image per trial. On a trial, the test item was presented for 1 second before disappearing. Then the list
of 12 item labels from the training phase appeared, along with an additional label that said ‘other’ which participants could
press if they felt that they had not seen the image during training. The participant could take as long as they wished to select
their response. No accuracy feedback was provided during this identification test. The identification training and testing
portion of the experiment took approximately 30 minutes to complete. Following the identification training and test phases,
each participant again completed the card-sort task (described above).
Similarity-rating task
All participants finished the session by completing a similarity-rating task. On each trial of the similarity-rating task, two
tanks were presented side by side on the computer screen, along with the instructions ‘How similar are these?’ Under these
instructions was a rating scale with numbers from 1 (‘low similarity’) to 5 (‘high similarity’). The participant’s task was to
examine the two images and press the corresponding number key to rate their similarity. After a one-second inter-trial
interval, the next randomly selected pair of tanks was presented. The presented tanks included the 12 ID task training items,
along with the 10 similar items that also appeared in the card-sorting and ID training tasks. All pairwise comparisons were
rated for these 22 stimuli, resulting in 231 similarity-rating trials. This portion of the experiment took approximately
30 minutes to complete.
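The trial count follows from the number of unordered pairs among the 22 rated stimuli: 22 × 21 / 2 = 231. A minimal sketch of the pair enumeration (the generic labels below are placeholders, not the actual vehicle names):

```python
from itertools import combinations

# 22 rated stimuli: the 12 training tanks plus the 10 similar lures
# (generic labels here; the actual vehicle names are listed under Stimuli)
stimuli = [f"tank_{i:02d}" for i in range(22)]

# Each unordered pair of images is rated exactly once
pairs = list(combinations(stimuli, 2))
print(len(pairs))  # 22 * 21 / 2 = 231 similarity-rating trials
```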
Results
We begin by reviewing the outcome of the identification training and test phases. After that we evaluate our ability to
predict confusion errors committed during the identification test based on the results of the card-sorting and similarity-
rating tasks. The similarity-rating data for each participant were submitted to a MDS analysis that gave a spatial
representation of confusability for the items. We also examine methods for inferring which stimulus dimensions tended to be the focus of attention: one based on the card-sort results and one based on interpreting the dimensions of the MDS results. Finally, we compare all methods on the basis of signal detection analysis, which provides a
quantitative measure of ability to predict confusions.
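As a rough illustration of this kind of signal detection scoring, the sketch below treats every unordered item pair as a trial, scores a prediction method's hit and false-alarm rates against the observed confusions, and converts them to d'. The framing and the clamping constant are our assumptions, not necessarily the paper's exact procedure:

```python
from itertools import combinations
from statistics import NormalDist

def detection_stats(predicted, observed, all_items):
    """Hit rate, false-alarm rate and d' for a binary confusion predictor.

    `predicted` and `observed` are sets of frozenset item pairs; every
    unordered pair of `all_items` counts as one trial. Illustrative
    framing only -- not the paper's exact scoring procedure.
    """
    all_pairs = {frozenset(p) for p in combinations(all_items, 2)}
    noise = all_pairs - observed                 # pairs never confused
    hit_rate = len(predicted & observed) / len(observed)
    fa_rate = len(predicted & noise) / len(noise)
    # Clamp rates away from 0 and 1 so the z-transform stays finite
    clamp = lambda p: min(max(p, 0.01), 0.99)
    z = NormalDist().inv_cdf
    d_prime = z(clamp(hit_rate)) - z(clamp(fa_rate))
    return hit_rate, fa_rate, d_prime
```

For example, with four items and observed confusions {A,B} and {C,D}, a predictor flagging {A,B} and {A,C} scores a hit rate of 0.5 and a false-alarm rate of 0.25, yielding a positive d'.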
Identification confusion errors
During the identification test, participants were presented with the 12 training items, along with 20 additional tank images to
serve as lures (the 10 similar and 10 non-similar items described above). The lures were included to make the identification
task more difficult. Although participants completed, on average, slightly more training trials in the feedback condition, this
was not a statistically reliable difference. In the observational condition, all participants completed 120 training trials. In the
feedback condition, the median number of trials completed to reach criterion was also 120 trials (M = 135 trials, SD = 54), t(30) = 1.08, p = 0.291.
Participants committed more identification errors (including ‘other’ responses) after observational training (M = 8.24, SD = 4.72) than after feedback training (M = 4.63, SD = 2.49), t(34) = 2.91, p = 0.006. Excluding the ‘other’ response, the difference in ID confusion rates (i.e. the rate of confusing one training item for another training item) was smaller. Based only on the 12 training stimuli, participants averaged 2.94 (SD = 2.61) confusions after observational training and 2.84 (SD = 2.19) confusions after feedback training, with no significant difference between the two conditions, t(34) = 0.12, p = 0.902. It follows that more ‘other’ responses were made after observational training (M = 5.29, SD = 3.29) than after feedback training (M = 1.79, SD = 1.23), t(34) = 4.32, p < 0.001.
Although the difference in error rates seems to disappear after removing the ‘other’ responses, it is important to note that
there were far fewer errors overall in the feedback training condition (total errors across all participants ¼ 88) than in the
observational condition (total errors across participants ¼ 140), while ‘other’ responses accounted for 34 (39%) and 90
(64%) errors, respectively, in these conditions. Clearly, there was greater uncertainty about item identity after observational
training.
The ‘other’ responses could merely indicate a lack of confidence in the identification response, rather than certainty that
the item was not seen in training. As a result, in the rest of our analyses we limit attention to identification confusions based
only on the 12 training items. Our primary interest is in errors of commission (i.e. misidentifying one training item as another), because these are the errors that can lead to incidents of fratricide.
Predicting ID confusions from card-sort results
We analysed the card-sort piles with respect to their ability to predict ID confusions using the following method. When two
tanks were placed into the same card-sort pile, we interpreted this to mean that the participant considered these items to be
more similar to each other than to items in other piles. Therefore, if two items appeared in the same pile and those items
were indeed confused during the identification task, we considered that observed error to be predicted by the card-sort
results. For example, if the participant confused a T-64 with a T-72 during the identification task, and these tanks appeared
together in one of the participant’s sort piles, then this confusion was predicted by the sorting task.
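The scoring rule just described can be sketched as follows; the data structures (piles as sets of vehicle names, confusions as presented/responded pairs) are hypothetical, since the text reports only the rule itself:

```python
def predicted_confusions(piles, confusions):
    """Proportion of observed ID confusions predicted by a card-sort.

    piles: list of sets, each holding the vehicle names in one sort pile.
    confusions: iterable of (presented, responded) vehicle-name pairs.
    A confusion counts as predicted when both vehicles share a pile.
    (Hypothetical data structures for illustration.)
    """
    same_pile = lambda a, b: any(a in p and b in p for p in piles)
    hits = [c for c in confusions if same_pile(*c)]
    return len(hits) / len(confusions) if confusions else 0.0
```

With piles [{T-64, T-72, T-62}, {Leopard-II, M1-Abrams}] and observed confusions (T-64, T-72) and (T-62, M1-Abrams), the first confusion is predicted and the second is not, giving 0.5.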
We compared predictions based on training condition, and also compared the pre- and post-task sorts, using a 2 × 2
mixed-factor analysis of variance (ANOVA) with training as a between-participant factor and pre–post sort as a within-participant factor. We could predict a higher proportion of ID confusions after feedback training (M = 0.54, SD = 0.35)
than after observational training (M = 0.33, SD = 0.33; collapsed over pre- and post-task sorts); this difference fell just
short of statistical significance, F(1,27) = 4.12, p = 0.052, partial η² = 0.132. There was no reliable difference
in predictions based on pre- (M = 0.42, SD = 0.35) and post-task sorts (M = 0.45, SD = 0.36), F(1,27) = 0.10,
p = 0.752, η² = 0.004, and no interaction between training condition and change from pre- to post-task sort,
F(1,27) = 0.03, p = 0.87, η² = 0.001. It appears that the difference between feedback and observational training card-sort
predictions largely existed even before training occurred. This conclusion is supported by the non-significant difference in
ability to predict confusions based on pre- and post-test card-sorts.
The point we wish to make with this analysis is that card-sort results do a reasonable job of predicting ID confusions.
The card-sorts predict about 45% of confusion errors collapsed over training conditions and pre- and post-task sorts.
Because the main goal of this study was to predict identification confusion errors during training, rather than to champion
one training style over another, we limit our consideration of differences between training conditions in our remaining
analyses.
Before evaluating other prediction methods, we must evaluate the content of the card-sorting results. In addition to error
prediction, our goal is to interpret the psychological dimensions underlying the confusability of training items (i.e. we wish
to determine the basis for judging items as similar enough to be confusable). We summarise here the outcome of the post-
training card-sorts. This interpretation will be contrasted with another approach to error interpretation in a later section.
In the (post-training) card-sort task, participants averaged around four piles each (M = 4.37, SD = 2.28), with slightly
more than five items per pile (M = 5.43, SD = 3.05). Despite idiosyncrasies in participants’ descriptions of their
sort piles, we found substantial consistency within and across participants in terms of the features they described as guiding
their sorts. We were able to organise their sort labels into a relatively small set of approximately six categories. These
included features pertaining to the following (from most common to least): body style (e.g. shape, size), small auxiliary
guns (e.g. presence, absence, number), antennas (e.g. presence, absence, size), main turret (e.g. size, shape), attached
auxiliary equipment (e.g. presence, absence) and number of wheels.
For each participant, we counted the number of unique feature categories described. For example, in some cases a
participant’s sort piles each corresponded to a different feature (e.g. body style, wheels). In most cases, though, several sort
piles were based on different facets of the same dimension (e.g. rounded body, sectioned body and angular body). When this
was the case, we treated these as use of a single feature category (e.g. body features). Using this process, we counted 62 total
feature categories across participants. Most prevalent among these were body features (appearing for 16 participants), small
attached guns (13 cases) and antenna (11 cases), then main turret features (8), attached equipment (8), number of wheels (5)
and finally one miscellaneous feature (based on ‘modern’ appearance). These results suggest that a relatively small set of
prominent tank features formed the basis for organising training items in the minds of participants.
Predicting ID confusions from similarity ratings
After completing the identification-training and card-sorting tasks, participants provided pairwise-similarity ratings for all
the training items and the 10 similar lure items (items were rated on a five-point similarity scale). Similarity ratings should
provide another way to understand the mental representation of the training items, and could provide a means for predicting
confusion errors.
We submitted the similarity ratings to MDS analysis, which is a data reduction technique that places each item into an
N-dimensional space based on their psychological proximity (Borg and Groenen 2005). If two items are perceived as highly
similar, they should be located close together in the MDS space; if perceived as non-similar they will be located far apart.
We used the ALSCAL method to derive a two-dimensional MDS space for each individual participant (Borg and
Groenen 2005). (The number of dimensions, N, can be fixed arbitrarily or determined based on model fit.) Because we
sought to both predict ID confusion errors and deduce the primary dimensions underlying these errors, we limited the space
to two dimensions for ease of interpretation. Better predictive performance would likely result from allowing the MDS
procedure to determine the optimal number of dimensions to describe each participant’s data, although it might be more
challenging to interpret the dimensions. Overall, however, the two-dimensional MDS solutions provided a reasonable fit to
the similarity-rating data. Average fit (Young’s S-Stress) was 0.174 (SD = 0.05), and the average proportion of variance
accounted for was high (R² = 0.83).
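To give a concrete sense of how a spatial representation is recovered from ratings, the sketch below substitutes classical (Torgerson) metric MDS for the ALSCAL procedure the authors used; it is a simpler algorithm but illustrates the same idea, namely that highly similar items receive nearby coordinates:

```python
import numpy as np

def classical_mds(similarity, n_dims=2, max_rating=5):
    """Embed items in a low-dimensional space from a similarity matrix.

    Classical (Torgerson) metric MDS, used here as a stand-in for the
    ALSCAL procedure described in the text. `similarity` is a symmetric
    matrix of 1-5 ratings (diagonal ignored).
    """
    S = np.asarray(similarity, dtype=float)
    D = max_rating - S                   # convert similarity to dissimilarity
    np.fill_diagonal(D, 0.0)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J          # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_dims]
    coords = vecs[:, order] * np.sqrt(np.clip(vals[order], 0, None))
    return coords                        # (n_items, n_dims) coordinates
```

For instance, if items 0 and 1 receive mutual ratings of 5 and both are rated 1 against items 2 and 3 (and vice versa), the embedding places 0 and 1 close together and far from the 2-3 cluster.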
A separate MDS analysis was carried out for each participant. This individual-level analysis is critical since we are
interested in finding methods with potential to adaptively improve training for each learner, and because each learner’s
focus of attention is idiosyncratic. Figure 2a displays the MDS result for a representative participant (chosen at random).
All 22 tanks are plotted in the space, and those that are close together had higher pairwise-similarity ratings than those
far apart.
Based on the MDS solution for each participant, we examined two methods for predicting identification confusions. The
first approach is based on our ability to assign psychological interpretation to the dimensions of a participant’s MDS space.
This approach, which we refer to as the MDS-features method, is described in the next section. A second approach, which
we refer to as the MDS-distance method, is based on predicting confusions from inter-item distance in the MDS space (e.g.
the distance between points in the top panel of Figure 2). This approach will be considered after we evaluate the MDS-
features method.
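A minimal sketch of the MDS-distance idea: flag an item pair as a likely confusion when the two points lie close together in the derived space. The fixed distance threshold here is an illustrative assumption, not the paper's cutoff:

```python
from itertools import combinations
from math import dist

def confusable_pairs(coords, threshold):
    """Predict confusable item pairs from an MDS embedding.

    coords: dict mapping item name -> (x, y) MDS coordinates.
    A pair is flagged when the Euclidean distance between the two
    points falls below `threshold` (an illustrative assumption).
    """
    return {
        frozenset((a, b))
        for a, b in combinations(coords, 2)
        if dist(coords[a], coords[b]) < threshold
    }
```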
Predictions based on MDS-features
Apparent feature ratings. In order to interpret the MDS dimensions, the researchers rated (prior to the study) each of the
tanks in terms of their visually apparent features. For each of the tank images used, two raters assessed the following
features: sectioned body (1 = no, 5 = yes), number of wheels (number of visible wheels), degree to which armour covered
the wheels (1 = none to 5 = a lot), armour smoothness (1 = smooth to 5 = rough), presence of attached auxiliary
equipment (1 = little to 5 = a lot), presence of small auxiliary guns (1 = no, 5 = yes), size of main turret (1 = small to
5 = very large) and presence of antenna (1 = no, 5 = yes). In addition, various other dimensions were rated (also on
1-5 scales) to rule out the influence of nuisance features, including angle of the vehicle depicted in the image, size of
Figure 2. (a) Spatial representation of training items produced by multidimensional scaling for a representative participant. (b) The MDS placement of the training items summarised in panel (a) for the same participant. Coordinate axes are labelled with the dimensional interpretation produced by regressing apparent feature ratings onto MDS x, y coordinates for the items. See text for details.
C.J. Bohil et al.850
the vehicle image on the card and assumed actual vehicle size, and presence/absence of a human for size perspective. None
of these dimensions appeared to contribute to the results we describe next, so we will not consider them further. Because
this method was exploratory, we were satisfied with the level of agreement between the feature ratings of the two raters.
There was a high correlation between ratings, r(142) = 0.725, p < 0.001, and no significant difference between the
feature-rating responses, t(286) = 0.84, p = 0.401.
Next, we derived dimensional interpretations for each participant’s MDS space by regressing the apparent feature
ratings for the set of tanks onto the x, y coordinates for each tank in the MDS space. This allowed us to determine which tank
features contributed most to the perceived similarity of the tanks as indicated by their proximity in the MDS space. Features
with the smallest p-values in the regression analysis contributed most to the x, y coordinates for each item. This method has
been utilised in a variety of studies to provide psychological interpretation to latent stimulus-space dimensions (e.g. Kruskal
and Wish 1978; Markman and Makin 1998).
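The property-fitting step can be sketched as follows. The paper ranked features by regression p-values; this illustrative version reports R² instead, as a simpler numpy-only proxy for how strongly a feature aligns with the MDS coordinates (the function name is hypothetical).

```python
import numpy as np

def fit_feature_to_mds(coords, feature):
    """Regress one apparent-feature rating vector onto MDS x, y coordinates.

    coords: (n, 2) array of item coordinates; feature: (n,) array of ratings.
    Returns (beta, r_squared); beta[1:] points along the feature's direction in the space.
    """
    X = np.column_stack([np.ones(len(feature)), coords])  # intercept + x + y
    beta, *_ = np.linalg.lstsq(X, feature, rcond=None)
    resid = feature - X @ beta
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((feature - feature.mean()) ** 2)
    return beta, 1.0 - ss_res / ss_tot
```

Running this once per rated feature and keeping the one or two best-fitting features gives axis labels like the 'sectioned body' and 'antenna' dimensions shown in Figure 2b.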
Figure 2b shows the outcome of this analysis for one representative participant (i.e. the same data-set as in Figure 2a).
Regression indicated that for this participant, the appearance of a ‘sectioned body’ and the presence of ‘antenna’ contributed
most strongly to the configuration of points in the MDS space (reflecting this participant’s similarity ratings). Each of the
vehicles shown to the participant in the similarity-rating task is displayed for illustration. The vehicles on the left side
of the space tended to be main battle tanks consisting of a sectioned body with an armoured and tracked chassis and a
rotating turret supporting a large main weapon. The vehicles to the right tended to have a more unified body style in which
there is no differentiation between chassis and turret. As for the vertical dimension, it is clear that the tanks near the bottom
of the MDS space tended to have antennae, while those at the top did not. Clearly, the presence or absence of an antenna is not a
highly informative indicator of tank model, and a mental representation that focuses attention on this dimension might
contribute to hazardous identification confusions among novice decision-makers.
This approach to interpreting the dimensions underlying each participant’s mental representation of tank similarity
enables us to predict identification confusion errors. For each of the two most prominent dimensions for a given participant
(e.g. sectioned body and antennae in our example), we examined the apparent feature ratings (described above) for all
possible pairings of tanks from the ID training task. If the ratings on at least one of the two dimensions matched for a pair of
tanks, or differed in rated value by no more than one, then we predicted a confusion error for that pair. For example, if two
items rated a four on ‘number of wheels’, or if one item rated a one and the other a two on the dimension ‘sectioned body’,
then we would predict a confusion error. A pair had to be closely matched on only one of the two MDS dimensions.
Although this algorithm was a rather arbitrary preliminary attempt, it substantially increased our predictive power for ID
confusions over the card-sort method summarised above. Based on this MDS-features analysis, we could predict a higher
proportion of confusions (M = 0.85, SD = 0.26) than based on the card-sort method (M = 0.45, SD = 0.36), t(56) = 4.83,
p < 0.001. We further evaluate these two methods below.
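As a concrete sketch of this matching rule (predict a confusion whenever a pair of tanks differs by at most one rating point on at least one of the participant's two most prominent dimensions), the following might do; item labels, feature names and ratings are invented for illustration.

```python
from itertools import combinations

def predict_confusions(ratings, top_features, tol=1):
    """Predict confusable pairs from apparent-feature ratings.

    ratings: dict mapping item -> {feature: 1-5 rating}
    top_features: the participant's two most prominent MDS dimensions
    A pair is predicted confusable if it matches within `tol` on at least one dimension.
    """
    predicted = set()
    for a, b in combinations(ratings, 2):
        if any(abs(ratings[a][f] - ratings[b][f]) <= tol for f in top_features):
            predicted.add(frozenset((a, b)))
    return predicted
```

With, say, top_features = ['sectioned_body', 'antenna'], two tanks rated 1 and 2 on sectioned body would be flagged as confusable, mirroring the worked example in the text.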
Correlation between card-sort and MDS-features interpretations
There was substantial overlap between features identified by the MDS-features method and the card-sort method. The
MDS-features method identified about 58 feature categories totalled across participants. The most prevalent were features
pertaining to antenna (15 cases), number of wheels (11), small attached guns (9), armoured wheels (7), sectioned body (7),
armour smoothness (5) and attached equipment (4).
For each participant we compared the number of feature categories in common across card-sort piles and MDS-features
results. For 28 participants (those committing ID confusion errors), prediction features overlapped for at least one feature
category in 50% of cases (i.e. for 14 participants, the card-sort and MDS-features methods identified at least one of the same
features as important to the participant’s sorting or similarity-rating decisions). There was a fair amount of overlap between
the predictive features produced by the card-sort and MDS-features methods, but they were often not in agreement. This
leaves us with an important question: Which method are we to favour for inferring psychological dimensions? The next
section details another method that may inform this decision.
The psychological interpretation provided by the MDS-features method is valuable for understanding – and potentially
influencing – the outcome of identification training (i.e. for understanding the source of confusion errors). However,
predictions do not necessarily have to be linked to a psychological interpretation. Better predictive performance should be
possible when predictions are based only on inter-point distances in the MDS space, which reflect the contribution of any
number of underlying psychological dimensions.
In the next section, we consider an MDS-distance-based predictor of confusion errors (which does not rely on dimensional
interpretation), and compare its performance with the MDS-features and card-sort methods. In addition, so far we have
considered only ability to predict observed confusions (‘hits’ in the language of signal detection theory). In the next section, we
use signal detection analysis to compare all three prediction methods, taking into account both hit and false alarm rates.
MDS-distance predictions and signal detection analysis
Using the MDS representation of training item similarity, we can use inter-point distance as a method for creating new
sorting criteria for the items (i.e. for sorting more-confusable from less-confusable items). We did this by evaluating a
range of inter-point distance criteria for each participant in order to predict confusion errors. This method is described
next.
Given an MDS representation of inter-item similarity (i.e. inter-point distance), we can predict confusability of two
items based on their closeness in space. For each participant, we evaluated a range of distance criteria and made confusion
error predictions. Then, for each participant, we selected the inter-point distance value that resulted in the best predictive
performance (as determined by d′, the signal detection measure of discriminability).
Using the same participant as in Figure 2, the distance between points in the space is the
Euclidean distance computed using the x, y coordinates for each point. For this participant, inter-point distances ranged
between 0.482 and 4.328. We tried grouping items in the MDS space (i.e. treating them as confusable in a manner akin to
card-sorting) using several distance criterion values (0.5, 1, 1.5, ..., 4.5, 5). For example, when 0.5 was the
criterion, all pairs of items with inter-point distance of 0.5 or less were sorted together and considered confusable. All pairs
of items with larger inter-point distance would not be considered confusable. We repeated this process for each of the
distance criterion values in order to find one that best predicted ID confusion errors. For the participant in our example, an
inter-point distance criterion of two led to the best prediction performance (as defined below). The best inter-point distance
criterion varied by participant.
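A minimal sketch of this grouping step, assuming invented item labels and coordinates: for a given criterion, every pair of items whose Euclidean distance in the MDS space falls at or below the criterion is predicted confusable, and the sweep simply repeats this over criteria 0.5, 1.0, ..., 5.0.

```python
import math
from itertools import combinations

def confusable_pairs(coords, criterion):
    """Predict confusable pairs: items within `criterion` of each other in MDS space.

    coords: dict mapping item -> (x, y) coordinate from the MDS solution.
    """
    return {frozenset(pair) for pair in combinations(coords, 2)
            if math.dist(coords[pair[0]], coords[pair[1]]) <= criterion}

# Sweep the candidate criteria; each prediction set would then be scored against the
# observed confusions and the best-scoring criterion kept for that participant.
criteria = [c / 2 for c in range(1, 11)]  # 0.5, 1.0, ..., 5.0
```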
We computed signal detection indices of predictive performance as follows. A ‘signal’ was defined as an observed ID
confusion error (i.e. an actual confusion made by the participant during the ID confusion task). All other item pairs were
considered ‘noise’. In other words, observed confusions were treated as a signal that we tried to predict using our MDS-
distance-based detector. Other potential (but not observed in the data) confusions were treated as noise since these could
potentially be predicted as ‘signal’ by our algorithm (i.e. they could result in false alarms). A ‘hit’ was defined as a signal
that was predicted by our algorithm, and a false alarm was an ID confusion that was predicted by our algorithm but not
actually committed by the participant during the ID task.
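Given a set of predicted pairs and the set of pairs a participant actually confused, this bookkeeping reduces to a few lines. The clamping of extreme proportions (to avoid infinite z-scores when a rate is exactly 0 or 1) is our assumption, since the paper does not say how such cases were handled, and the function name is hypothetical.

```python
from statistics import NormalDist

def d_prime_for_predictions(predicted, observed, n_pairs, eps=0.005):
    """Compute d' for a confusion-prediction method.

    predicted / observed: sets of item pairs; n_pairs: total candidate pairs.
    Hits are observed confusions that were predicted; false alarms are predicted
    confusions the participant never actually committed.
    """
    hits = len(predicted & observed)
    false_alarms = len(predicted - observed)
    hit_rate = hits / len(observed)
    fa_rate = false_alarms / (n_pairs - len(observed))
    z = NormalDist().inv_cdf
    clamp = lambda p: min(max(p, eps), 1 - eps)  # keep z finite at 0 or 1
    return z(clamp(hit_rate)) - z(clamp(fa_rate))
```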
To facilitate comparison, we also used this method to compute signal detection measures for the card-sort and MDS-
features procedures. For the card-sort method, items appearing together in a pile predicted a confusion error (a ‘hit’ if the
predicted error also happened to be an observed confusion in the data; a ‘false alarm’ otherwise). For the MDS-features
method, pairs with at least one of two dimensions close to matching on apparent feature ratings (differing by no more than
one; described above) predicted a confusion error (these, too, were classified as hits or false alarms). All signal detection
analyses were carried out at the individual participant level, and aggregate results presented below are based on individual-
level outcomes.
To increase the reliability of each method’s predictive power, we base these comparisons on participants who
committed at least three confusion errors in the identification task. Table 1 shows the d′, hit rate and false alarm rate for
each method.
There was an advantage for the MDS-based approaches over the card-sort method in terms of discriminability (d′) and
hit rate. Average d′ was significantly higher for the MDS-distance method than for the MDS-features method, t(18) = 3.09,
p = 0.003, as well as higher than for the card-sorting method, t(18) = 3.498, p = 0.001. Although d′ was higher for MDS-
features than for card-sort, this difference was not statistically significant, t(18) = 0.93, p = 0.363. Both MDS-distance and
MDS-features had significantly higher hit rates than card-sort, t(18) = 4.29, p < 0.001 and t(18) = 4.49, p < 0.001,
respectively. MDS-distance and MDS-features hit rates did not significantly differ from each other, t(18) = 0.245,
p = 0.405.
On the other hand, the card-sort method produced the lowest false alarm rates. MDS-features resulted in significantly
more false alarms than card-sort, t(18) = 8.43, p < 0.001, as did MDS-distance, t(18) = 4.32, p < 0.001. The MDS-
distance method provided an intermediate level of false alarms: significantly lower than those from the MDS-features
approach, t(18) = 1.76, p = 0.048, but still higher than the card-sort approach.
Table 1. Summary of signal detection analysis on prediction of ID confusions by method.
Method          d′              Hit rate        False alarm rate
Card-sort       0.556 (1.541)   0.450 (0.274)   0.270 (0.163)
MDS-features    0.974 (1.174)   0.819 (0.231)   0.771 (0.185)
MDS-distance    1.816 (0.672)   0.836 (0.302)   0.624 (0.353)
In summary, the MDS-based approaches can lead to substantial gains in hit rate, but this trades off with an increase in
false alarm rate. However, the MDS-distance method provides some relief from this problem over the MDS-features
approach. The overall level of discriminability, d′, which takes both hit and false alarm rates into account, clearly favours
the MDS-distance method over the others.
Correlation between prediction methods
Another way to examine the relationship between prediction methods is through their correlation with each other. We
computed pairwise correlations between the methods on measures of d′, hit rate and false alarm rate. None of the
correlations reached statistical significance (p > 0.05 in all cases), but the trends were nevertheless consistent
with what we might expect based on the results described above. Based on d′, MDS-features and MDS-distance were closer
to each other than to card-sorts. There was a weak correlation between MDS-features and MDS-distance, r(17) = 0.27, but
little correlation between MDS-features and card-sorts, r(17) = 0.02, or between MDS-distance and card-sorts,
r(17) = 0.17. Similarly, for hit rates, the strongest correlation was between MDS-distance and MDS-features,
r(17) = 0.39, with virtually no correlation between MDS-features and card-sorts, r(17) = 0.002, or MDS-distance and
card-sorts, r(17) = 0.08. However, for false alarm rates, the relationship between the methods was more equivocal. There
were weak correlations between MDS-features and MDS-distance, r(17) = 0.20, and between MDS-distance and card-sorts,
r(17) = 0.21. There was very little relationship between MDS-features and card-sorts, r(17) = 0.11. These relationships
suggest that the strongest predictive relationships are between the MDS-based measures on d′ and hit rate.
Discussion
In this study, we compared methods for predicting identification confusion errors and for understanding the mental
representation underlying these confusions. Each participant completed a series of tasks, including a pre-training card-sort,
identification training and testing, a post-training card-sort and finally a pairwise-similarity-rating task. The card-sort and
similarity-rating tasks provide separate indicators of participants’ mental representation of the similarities and differences
between items in the identification-training stimulus set. Our hypothesis was that participants focus on a subset of features
during training and that we can use this tendency to predict and explain confusion errors. In doing so, we may discover a
basis for directing learners’ attention towards critical stimulus features and away from superfluous details of training items.
The ID training task was designed to mimic features of self-paced study with a set of training images (e.g. a deck of
training cards). We compared two training conditions: a passive 'observational' training condition in which participants
studied tanks and their labels together, and a more actively engaging ‘feedback’ training condition in which identification
attempts were followed by corrective feedback. We observed some differences between training types – including a lower
error rate and greater correspondence of confusions with card-sort categories in the feedback condition. Although these
differences are interesting and might warrant further study, our primary focus was on understanding and predicting
confusion errors rather than comparing training methods.
Predicting ID confusions
There was clear evidence that subsets of features were prominent in the minds of participants. Attention was often given to
important features such as body style and weapons. In many cases, however, participants were influenced by the presence of
antennas or other peripheral attachments which may not provide a reliable guide to identification of armoured vehicles in
operational environments.
We found that the card-sort method, which is straightforward to implement, accounted well for identification errors,
predicting about 45% of observed confusions. However, the MDS-based methods proved to be much more sensitive. The
MDS-features method, which combined the spatial representation of similarity judgements with regression-based
interpretation of the most prominent psychological dimensions, predicted about 82% of observed confusions. And the
MDS-distance method, which omits any psychological interpretation, predicted about 84% of confusions. On the other
hand, both MDS methods predict a higher false alarm rate than the card-sort method (although this problem was less severe
in the MDS-distance case). If the goal is to root out ID confusions, the MDS approaches seem to be worth exploring in more
detail.
It is important to point out that the methods applied here are preliminary, and that better predictive performance could
be achieved with the MDS-based methods by allowing higher dimensional representations (we limited our MDS solutions
to two dimensions in this study to simplify interpretation). Also, the MDS-distance method could be improved slightly by
using a parameter optimisation algorithm to find the most predictive distance criterion for each participant. In order to gain a
clearer picture of the trade-off between hit and false alarm rates for each method, we would likely need a study with many
more repetitions of the identification test trials (to produce more identification confusions). This would make the
identification task a more sensitive measure of mental representation. It is also known that small training sets can lead to
different learning strategies than large training sets (Rouder and Ratcliff 2004). Nevertheless, our goal here was
exploratory. We were able to demonstrate the feasibility of interpreting psychological representation and predicting
identification errors in a variety of ways.
Another prediction approach worth exploring would be to apply models from the literature on classification learning
(Pothos and Wills 2011). One thing we have not explored in the analyses reported here is whether participants
unintentionally develop mental categories (in addition to memorising individual identities) for the training items based on
similarity or whether they simply apply rules along a small subset of dimensions. If mental organisation of training items
corresponds to application of simple dimensional rules (e.g. tanks with armour-covered treads), then learners may memorise
fewer features of the training items than if their representation is based on memories for a set of features for each training
exemplar. Determining their actual strategy would likely require fitting computational models to the data.
Future research
There are of course additional avenues for future research. For example, eye tracking has been used in identification-
training studies to answer questions similar to our own about the focus of attention during learning (e.g. Lee et al. 2013).
It would be valuable to see whether eye tracking results corroborate our behaviour-based conclusions regarding the features
that drive performance.
Another important question pertains to the contribution of individual differences. Attention to certain details may be
predictable from personal preferences, personality type or experiences that learners have had. The current work did not
assess these variables. Furthermore, it will be important to understand any differences between learning performance of
novices (as evaluated here) and those who have already received some form of training in vehicle identification.
Another important consideration is the possibility that old/new recognition, rather than identification based on
memorising sets of features, influences what learners remember and respond to. Some errors might be due more to a vague
sense of recognition than reasoning based on attention to specific characteristics.
Application to adaptive training
Both the card-sort and the MDS-based methods could provide the basis for future adaptive training systems. However,
although card-sort data are simple to collect, their analysis and interpretation are much more subjective than the similarity-
rating-based MDS methods. The card-sort method might also be difficult to implement in a computer-mediated training
system. On the other hand, the similarity-based methods lend themselves much more readily to automation. It is easy to
envision an adaptive training system that teaches and tests item identification (tanks or otherwise) along with a system for
collecting similarity ratings for the purpose of tailoring subsequent training sessions to optimise attention allocation.
Such an approach makes sense given the ubiquity of hand-held computing devices (e.g. cell phones). Mobile devices
can increase the realism of training stimuli and allow interactive learning, in addition to real-time adaptive capabilities
based on the methods reported here. The system could improve training by directing attention away from features that are
unreliable or unimportant and towards features that are critical for accurate vehicle identification. Furthermore, such an
evaluation system would easily integrate into efforts at determining the ideal form-factor for training items. For example,
ongoing research focuses on investigating differences between static images, movies and 3D interactive virtual models for
training (Keebler, Jentsch, and Hudson 2011; Keebler, Jentsch, and Schuster, 2013).
Mobile device training systems such as the Army’s ROC-V (Recognition of Combat Vehicles) training programme
allow users many options for studying and testing on trained vehicles (Night Vision and Electronic Sensors Directorate
2013). By incorporating a pairwise comparison or sorting task like those evaluated here, and analysing the results using our
MDS-based approach, the system could be tuned to (1) shorten training by focusing learner attention on routinely confused
items and (2) ameliorate potential errors by emphasising dimensions that are critical for accurate identification.
Finally, future research will need to put such an adaptive training method to the test with respect to real training
outcomes. The system would need to model learner performance in real time, and compare performance with more
traditional methods.
References
Biederman, I., and M. M. Shiffrar. 1987. "Sexing Day-Old Chicks: A Case Study and Expert Systems Analysis of a Difficult Perceptual-Learning Task." Journal of Experimental Psychology: Learning, Memory, and Cognition 13: 640-645.
Borg, I., and P. Groenen. 2005. Modern Multidimensional Scaling: Theory and Applications. 2nd ed. New York: Springer-Verlag.
Briggs, R. W., and J. H. Goldberg. 1995. "Battlefield Recognition of Armored Vehicles." Human Factors 37 (3): 596-610.
Edwards, P. J., F. Sainfort, T. Kongnakorn, and J. A. Jacko. 2006. "Methods of Evaluating Outcomes." In Handbook of Human Factors and Ergonomics, 3rd ed., 1150-1187. Hoboken, NJ: Wiley.
Folstein, J. R., I. Gauthier, and T. J. Palmeri. 2010. "Mere Exposure Alters Category Learning of Novel Objects." Frontiers in Psychology 1: 40.
Keebler, J. R., M. Harper-Sciarini, M. Curtis, D. Schuster, F. Jentsch, and M. Bell-Carroll. 2007. "Effects of 2-Dimensional and 3-Dimensional Media Exposure Training on a Tank Recognition Task." Proceedings of the 51st Annual Meeting of the Human Factors and Ergonomics Society, Baltimore.
Keebler, J. R., F. Jentsch, and I. Hudson. 2011. "Developing an Effective Combat Identification Training." Proceedings of the 55th Annual Meeting of the Human Factors and Ergonomics Society, Las Vegas.
Keebler, J. R., F. Jentsch, and D. Schuster. 2013. "The Effects of Video Game Experience and Active Stereoscopy on Performance in Combat Identification Tasks." Submitted for publication.
Keebler, J. R., L. Sciarini, T. Fincannon, F. Jentsch, and D. Nicholson. 2008. "Effects of Training Modality on Target Identification in a Virtual Tank Recognition Task." Proceedings of the 52nd Annual Meeting of the Human Factors and Ergonomics Society, New York.
Keebler, J. R., L. Sciarini, T. Fincannon, F. Jentsch, and D. Nicholson. 2010. "A Cognitive Basis for Vehicle Misidentification." In Human Factors Issues in Combat Identification, edited by D. H. Andrews, R. P. Herz, and M. B. Wolf, 113-128. Burlington, VT: Ashgate.
Kemler Nelson, D. G. 1984. "The Effect of Intention on What Concepts Are Acquired." Journal of Verbal Learning and Verbal Behavior 23 (6): 734-759.
Kirkpatrick, D. L., ed. 1975. "Techniques for Evaluating Training Programs." In Evaluating Training Programs. Alexandria, VA: ASTD.
Kruskal, J. B., and M. Wish. 1978. "Multidimensional Scaling." In Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-011. Beverly Hills: Sage.
Lee, C., E. Middleton, D. Mirman, S. Kalenine, and L. J. Buxbaum. 2013. "Incidental and Context-Responsive Activation of Structure- and Function-Based Action Features during Object Identification." Journal of Experimental Psychology: Human Perception and Performance 39 (1): 257-270.
Malone, T. W. 1981. "Toward a Theory of Intrinsically Motivating Instruction." Cognitive Science 4: 333-369.
Markman, A. B., and V. S. Makin. 1998. "Referential Communication and Category Acquisition." Journal of Experimental Psychology: General 127 (4): 331-354.
Night Vision and Electronic Sensors Directorate. 2013. Army ROC-V (Version 1.0) [Mobile application software]. https://play.google.com/store/apps/details?id=gov.usa.rocv
O'Kane, B. L., I. Biederman, E. E. Cooper, and B. Nystrom. 1997. "An Account of Object Identification Confusions." Journal of Experimental Psychology: Applied 3 (1): 21-41.
Pothos, E. M., and A. J. Wills, eds. 2011. Formal Approaches in Categorization. Cambridge: Cambridge University Press.
Regan, G. 1995. Blue on Blue: A History of Friendly Fire. New York: Avon Books.
Rouder, J. N., and R. Ratcliff. 2004. "Comparing Categorization Models." Journal of Experimental Psychology: General 133: 63-82.
Smith, E. E. 2008. "The Case for Implicit Category Learning." Cognitive, Affective, & Behavioral Neuroscience 8 (1): 3-16.
Yamauchi, T., and A. B. Markman. 1998. "Category Learning by Inference and Classification." Journal of Memory and Language 39: 124-148.