
Int. J. Human-Computer Studies 62 (2005) 487–520
www.elsevier.com/locate/ijhcs
doi:10.1016/j.ijhcs.2004.12.002

An empirical comparison of use-in-motion evaluation scenarios for mobile computing devices

Leon Barnard a, Ji Soo Yi a, Julie A. Jacko a,*, Andrew Sears b,1

a Laboratory for Human-Computer Interaction and Health Care Informatics, School of Industrial and Systems Engineering, Georgia Institute of Technology, 765 Ferst Drive NW, Atlanta, GA 30332-0205, USA
b Information Systems Department, Interactive Systems Research Center, UMBC, 1000 Hilltop Circle, Baltimore, MD 21250, USA

* Corresponding author. Tel.: +1 404 385 2545; fax: +1 404 385 6115.
1 Tel.: +1 410 455 3883; fax: +1 410 455 1531.

Received 22 July 2004; received in revised form 29 November 2004; accepted 22 December 2004

Available online 2 April 2005

Communicated by Prof. J. Scholtz

Abstract

There is a clear need for evaluation methods that are specifically suited to mobile device evaluation, largely due to the vast differences between traditional desktop computing and mobile computing. One difference of particular interest that needs to be accounted for is that mobile computing devices are frequently used while the user is in motion, in contrast to desktop computing. This study aims to validate the appropriateness of two evaluation methods that vary in representativeness of mobility: one that uses a treadmill to simulate motion and another that uses a controlled walking scenario. The results lead to preliminary guidelines, based on study objectives, for researchers wishing to use more appropriate evaluation methodologies for empirical, data-driven mobile computing studies. The guidelines indicate that using a treadmill for mobile evaluation can yield representative performance measures, whereas a controlled walking scenario is more likely to adequately simulate the actual user experience.

© 2005 Published by Elsevier Ltd.

Keywords: Context; Context-aware; Mobile computing; Mobility; Evaluation; Empirical

1. Introduction

1.1. The need for more appropriate mobile evaluation methods

While numerous evaluation methods for desktop interfaces have been formally established and validated, mobile computing evaluation presents many new challenges for interface designers and usability researchers. Mobile device input and output methods, due to size and mobility constraints, test the limits of human information processing and motor abilities. Furthermore, the broad array of available environments in which mobile devices are frequently used and the variable contexts of use make evaluating mobile devices in realistic scenarios much more difficult than evaluating desktop computing products.

Despite the clear differences in interaction between mobile and desktop computing, many mobile device prototypes are investigated using traditional desktop-paradigm evaluation strategies. In particular, the evaluation of mobile devices in stationary, static desktop-like situations tends to be the norm, even though, as Kristoffersen and Ljungberg (1999) point out, mobile computing tasks tend to take place outside of the device, as opposed to inside of the device for desktop computing, implying a much broader range of use contexts. York and Pendharkar (2004) also emphasize the range of mobile scenarios available to the mobile device user and the variable work contexts that mobile device users are required to work in. However, a review of 102 recent mobile HCI studies (Kjeldskov and Graham, 2003) revealed "a clear bias towards environment independent and artificial setting research" (p. 326). While many empirical studies pay little attention to environmental considerations, it may, in fact, be the interaction between the technology and the environment that is most interesting and revealing for mobile computing, as asserted by Oulasvirta (2004).

Several researchers recognize this phenomenon and have argued for more attention to be paid to environmental and situational considerations in evaluating mobile devices. Even as far back as 1998, in the relative infancy of mobile computing, Peter Johnson (1998) observed: "HCI methods, models and techniques will need to be reconsidered if they are to address the concerns of interaction on the move" (p. 7). He further goes on to say: "Evaluation would clearly be possible, but the criteria and the methods used would need to be researched" (p. 7). Later on, Dunlop and Brewster (2002) cite designing for mobility as the number one challenge to mobile HCI investigators and designers, while Pascoe et al. (2000) encourage the development of Minimal Attention User Interfaces to cope with complex contextual demands. Since then, others have frequently discussed the environment as an important contextual factor for consideration. However, context has been noticeably absent as an independent variable in empirical studies.

1.2. The effects of mobile contexts on mobile device use

In order to understand in-context mobile device use, it is valuable to begin by looking at previous research that has investigated the effects of motion on secondary task performance. Studies on healthy adults have indicated that motion can have a significant cost when it comes to performing concurrent tasks. For example, Ebersbach et al. (1995) found that performing basic tasks while walking can lead to impaired gait rhythm and slow task performance, as compared to walking alone. Also, Lajoie et al. (1993) observed participants' performance while sitting, standing and walking and concluded that walking requires more attention than both sitting and standing. Additionally, a review of recent empirical research concluded that cognition and motor control are not independent functions (Pellecchia, 2003), indicating that exerting effort in one function affects resources available to the other. These studies confirm that the task of maintaining balance and motion while walking exerts attentional demands on humans.

Although only a few mobile HCI studies have compared performance while stationary to performance while walking, differences have been found for mobile tasks in terms of performance and/or workload while using personal digital assistants (PDAs) (Brewster, 2002; Kjeldskov and Stage, 2004) and mobile phones (Kjeldskov and Stage, 2004; Mustonen et al., 2004). In each of these studies, stationary participants performed better and/or felt less workload than walking participants, indicating that using a mobile device while moving yields a detriment to performance. Sears et al. (2003) refer to detriments to effective mobile computing interaction caused by environmental and situational factors as situationally induced impairments and disabilities (SIID).

Many researchers in the field of context awareness have acknowledged the effects of motion on behavior by using sensors to attempt to recognize when a user is moving and what type of motion they are undergoing (e.g., Schmidt et al., 1999a, b; Hinckley et al., 2000; Mantyjarvi and Seppanen, 2003; Bristow et al., 2004). Much of this research is designed to help mobile devices adapt to changes in motion by increasing the font size or adjusting other device parameters. However, while the technology for this adaptation is being developed, rules for determining when and how the device should adapt are lacking due to the deficit of studies examining human behavior with mobile devices under changing contextual conditions.

While previous work has helped to advance the state of mobile device research and evaluation, there is still a large amount of work to do in order to formally establish appropriate mobile device evaluation scenarios. Currently, mobile device evaluations are typically concocted in an ad-hoc manner, rather than rigorously developed based on established guidelines. The following quote from Pirhonen et al. (2002) is still representative of the current state of mobile device evaluation: "As yet there are no well-developed methods for testing mobile devices, … we decided to try two very different evaluation techniques to see what kinds of usability problems each brought up so that we could learn more" (p. 292).

1.3. Research questions

For mobile HCI researchers there are several ways that the effects of motion can be investigated. Most intuitively, studies can be devised where participants use mobile devices while roaming freely in a public environment. However, for researchers aiming to isolate the effects of motion from other contaminating factors, the idea of such uncontrolled studies can be daunting. Control is critical for empirical data collection methods employing the scientific method, which involve a testable hypothesis. Meister (2004) warns that data from uncontrolled studies can be virtually uninterpretable. In a mobile HCI study where controlled and uncontrolled approaches were used and compared, the controlled experiment yielded more accurate performance data, which enabled straightforward comparisons between new and existing designs (Pirhonen et al., 2002).

Simulated environments, on the other hand, can be as good as, or better than, natural environments (e.g., Kjeldskov et al., 2004; Lee et al., 2003). There can be instances, for example, where the advantages associated with control or safety outweigh the disadvantages associated with a lack of realism, depending on the goals of the study. When motion is of interest, treadmills are frequently used to simulate real-world walking (also referred to as "overground" walking) in a variety of domains. Treadmill walking, as compared with overground walking, is typically far more controllable and affords easier data collection, which is desirable in experimental studies (Alton et al., 1998).

However, the use of a treadmill can also be inadequate because it might be too unrealistic, due to the constant speed and the inherent lack of navigation required of the participant. Several studies (e.g., Murray et al., 1985; Nigg et al., 1995; Alton et al., 1998; Schache et al., 2001; Vogt et al., 2002) found differences between treadmill and overground walking in terms of human body kinematics, showing that people's physical behavior differs between the two conditions. For example, Murray et al. (1985) and Alton et al. (1998) found that treadmill participants used a significantly shorter stride length than overground walking participants. Other differences were found in pelvic motion and angle and spine movement, among others. Additionally, Patla (1997) found that difficulties can arise when visual and kinesthetic inputs do not match, such as when the body is moving but the surroundings do not correspond to that motion, as on a treadmill. The author states: "When the support surface moves … the kinesthetic output can be in error because its output is referenced to the support surface" (p. 58).

Other approaches can be devised which aim to strike a balance between these two extremes, and such approaches are likely to inherit some of the pros and cons of both the free walking and treadmill scenarios. In the absence of a perfectly controlled, yet perfectly natural, experimental scenario, new experimental scenarios need to be investigated until one is found that effectively balances control and realism.


The goal of this study was to empirically assess the viability of two specific experimental scenarios for use in mobile HCI evaluation and to capture the tradeoffs inherent in each. The study was designed to be exploratory due to the limited previous work on assessing use-in-motion mobile evaluation scenarios. Because of this, the study focuses on the effectiveness of mobile device output (conveying information from the device to the user), and attempts were made to limit the input (entering information into the device) requirements so that a cleaner picture could be gleaned from the data collected. Additionally, the goals of this study did not include reporting on the effectiveness of specific technologies used in use-in-motion scenarios. Rather, the methods of evaluation were of primary interest. The technologies used were intended to be representative of common mobile technologies, rather than representing a new approach to mobile device interaction.

The two experimental scenarios analysed in this paper are as follows:

1. Walking on a treadmill at a constant speed.
2. Free walking around a defined path in a controlled environment.

Both scenarios will be described in detail in Section 2.

It is imperative that more realistic evaluation scenarios be employed in order to produce more accurate and more appropriate results and to promote a better understanding of in-context mobile device use. The study reported here aims to investigate more precisely the differences and tradeoffs between specific mobile experimental evaluation scenarios in order to contribute to the development of mobile device evaluation guidelines. Of interest were differences in participant experiences between the two scenarios while performing basic tasks on a mobile device. User behavior, as measured by performance and subjective measures, was recorded and compared. This adds an additional degree of rigor to the studies that have been previously reported in the literature. The results can benefit all researchers and practitioners interested in incorporating motion into studies of mobile device usability or mobile device use in general.

2. Methodology

2.1. Overview

One hundred twenty-six participants were asked to perform a set of tasks on a mobile device while sitting, walking on a treadmill, or free walking along a path around a room (the demographic information of the participants and the procedure of the experiment will be described in the following sections). Data from a subset of the participants, those who performed the tasks on the treadmill ("treadmill group") or while walking around the room ("walking group"), will be examined further in this paper. The sitting condition, which involved stationary participants, is less relevant to this discussion about mobility in evaluation and is excluded from the analysis of the present study. The results from the sitting condition have been analysed separately and submitted for publication elsewhere (Barnard et al., 2004).

2.2. Apparatus

For the experiment, the Palm™ m505, a PDA with a color display, was chosen as a representative mobile device because it was one of the best-selling recent PDAs on the market. For the experimental tasks, which were mostly information retrieval in nature, PDAs were regarded as a more representative mobile device than mobile phones because PDAs are more widely used for reading eBooks and browsing web content. Detailed specifications of the PDA are given in Table 1.

For the treadmill scenario, a PaceMaster Pro Elite motorized treadmill (Aerobics, Inc., 2002) was used.

2.3. Participants

Undergraduate students (mean age = 21.8 years) were recruited from a senior-level class in the School of Industrial and Systems Engineering at the Georgia Institute of Technology (N = 126). Participants were compensated with extra credit. Given that all the participants had advanced to the junior or senior level at the same institution, within the same degree program, and had met the same admission and programmatic performance criteria, the participant sample was regarded as relatively homogeneous. Although these participants might not perfectly represent all mobile device users, the benefits of recruiting from a relatively homogeneous subject population were judged to outweigh the costs of a less homogeneous sample, because minimizing between-subject variability makes it possible to capture subtle differences in the outcome measures.

Table 2 summarizes the comparisons of demographic information between the two groups, the treadmill and walking groups. Because most of the factors did not meet the assumptions required for a parametric test, the profiles of both groups were compared mainly using a non-parametric test, the Mann–Whitney test. The age factor met the assumptions, so it was compared using a parametric test, analysis of variance (ANOVA), for better statistical power.

Table 1
Specifications of the Palm™ m505

Item                     Value
Weight                   4.9 oz (139 g)
Width × height × depth   3.1 in × 4.5 in × 0.5 in (7.87 cm × 11.43 cm × 1.27 cm)
Color depth              16-bit (65,535 colors)
Screen resolution        160 × 160
Display                  TFT active matrix

Table 2
Demographic comparisons between the treadmill and walking groups

Factor                          Treadmill (n = 44)          Walking (n = 38)            Test statistic   p-Value
Age                             21.9 (1.70)^a               21.7 (1.29)^a               F = 0.435^b      0.512
Gender                          Male = 22, Female = 22      Male = 17, Female = 21      U = 792          0.636
Native language                 English = 36, Other = 8     English = 35, Other = 3     U = 750          0.176
Dominant hand                   Right = 41, Left = 3        Right = 37, Left = 1        U = 801          0.383
Computer use frequency          >1/day = 39, ≈1/day = 5,    >1/day = 36, ≈1/day = 2,    U = 785          0.327
                                <1/day = 0                  <1/day = 0
PDA owner?                      Current = 8, Previous = 6,  Current = 5, Previous = 8,  U = 749          0.330
                                Never = 30                  Never = 25
PDA familiarity score           1.25 (1.53)^a               1.82 (1.70)^a               U = 681          0.127
Regularly read while walking?   Yes = 11, No = 33           Yes = 14, No = 24           U = 737          0.248

^a Mean (standard deviation).
^b An ANOVA test was employed.

The results of the comparisons showed that no significant differences existed at the 0.05 level between the two groups on any factor: age, gender, native language, dominant hand, frequency of computer usage, experience owning a PDA, PDA familiarity, and the tendency to read while walking. The PDA familiarity score was the sum of each participant's average frequency of use of common PDA applications and his/her overall PDA comfort level, on a scale from 0 to 8, in which 0 represented no previous exposure at all and 8 represented a very high level of expertise and familiarity. Interestingly, Table 2 shows that the majority of the participants had not owned a PDA and were not familiar with PDAs, even though they were recruited from an engineering school. However, this factor did not significantly affect the validity of the experiment because the nature of the tasks did not require any prior experience with PDAs. Additionally, detailed instructions and sufficient practice sessions were given to participants so that device familiarity effects were minimized.
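As a concrete illustration of this screening analysis (a minimal sketch, not the authors' code; the data values and variable names below are hypothetical), the group comparisons map directly onto standard tests in scipy:

```python
from scipy import stats

# Hypothetical demographic samples for the two groups.
age_treadmill = [22, 21, 23, 20, 22, 24, 21, 22]
age_walking = [21, 22, 20, 23, 21, 22, 21, 23]

# Age met the parametric assumptions, so it is compared with a one-way ANOVA.
f_stat, p_age = stats.f_oneway(age_treadmill, age_walking)

# Ordinal factors such as the 0-8 PDA familiarity score are compared with
# the non-parametric Mann-Whitney U test.
fam_treadmill = [1, 0, 2, 3, 0, 1, 2, 1]
fam_walking = [2, 1, 3, 2, 1, 4, 0, 2]
u_stat, p_fam = stats.mannwhitneyu(fam_treadmill, fam_walking,
                                   alternative="two-sided")

print(f"Age: F = {f_stat:.3f}, p = {p_age:.3f}")
print(f"PDA familiarity: U = {u_stat:.1f}, p = {p_fam:.3f}")
```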

2.4. Procedure

Participants performed two tasks on a PDA. Stylus input was required for both tasks, but the experimental focus was on the salience and effectiveness of the information delivered to the participants from the PDA rather than on the success of participants' input to the device. The input required was deliberately limited and designed to avoid errors. Observation confirmed that the input required did not interfere with the goals of the study.

The procedure consisted of five parts:

1. A background questionnaire, which was used to collect basic demographic information about each participant as well as information about his/her exposure to computers and handheld devices.
2. A reading comprehension task, where participants were asked to read passages of text and answer multiple-choice questions on a PDA while moving.
3. A word search task, also done while moving, where participants read through a short news article on a PDA and attempted to locate a specific target word as quickly as possible.
4. A subjective workload assessment (the NASA-TLX), which was given after each task was completed in order to gauge the stress each participant experienced while performing the task.
5. A post-task questionnaire, which asked participants about their feelings about the tasks and the overall experience.

The procedure for the two experimental tasks will be elaborated upon below.

2.4.1. Task 1: reading comprehension

Reading comprehension on a mobile device was investigated because of the availability of electronic reading material, such as eBooks (Open eBook Forum, 2003), and off-line web browsers, such as Avantgo (iAnywhere Solutions, 2003). Additionally, many other PDA tasks involve deeper comprehension of information, such as processing the specific details of an appointment or meeting. Text passages and multiple-choice answers were selected from 501 Reading Comprehension Questions (Learning Express, 1999), a book designed to help high school students prepare for standardized tests.

Ten reading passages were presented in a random order to each participant, each followed immediately by two multiple-choice questions about the content or meaning of the preceding passage. The order of the two multiple-choice questions was also randomized. Participants were not allowed to see the multiple-choice questions before they read the passage and could not return to the passage once they moved on to the multiple-choice question screen. This was done to ensure that participants attempted to process the reading material thoroughly, without any prior information about the following questions, and to separate the time required to read the text from the time required to read the answer choices and answer the questions. Screen shots of one of the text passages and one of the multiple-choice screens are shown in Figs. 1 and 2, respectively.

As can be seen from Fig. 1, a button labeled 'Done' was placed at the bottom of the text screen that took participants to an answer screen, like the one shown in Fig. 2. Once participants selected their answer, they pressed a button labeled 'Submit', which would take them either to the second question for the associated passage, to the next text passage, or to a screen telling them they had finished the task, depending on which answer screen they were on.

Fig. 1. Text passage screen shot.

Fig. 2. Answer choice screen shot.

2.4.2. Task 2: word search

A second task was used in order to account for other types of tasks that are typically performed on a mobile device. This task was intended to be representative of a shallower, seek-and-find task; an example would be skimming a list of contacts to find a specific person's name. Selected sentences from twenty news articles were used for the task. Participants were presented with a screen like the one shown in Fig. 3, which contained a single word in large, all-capital letters (the "target" word) at the top, with text from a news article below. Participants were asked to locate the target word within the text. The target word was a randomly selected word of three or more letters that appeared only once in the text.

Once a participant had located the target word ("DONALDSON" in the screen shot in Fig. 3), he/she was instructed to touch anywhere on the line on which it appeared (the sixth line in the case of Fig. 3). This was done to reduce the level of stylus precision required of the user. Once a participant had selected a line of text, whether correctly or not, a new text passage and target word would appear. This process was repeated for all 20 trials.

Fig. 3. Word Search screen shot.
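The target-word selection rule lends itself to a short sketch. The following Python snippet (an illustration of the stated rule, not the authors' code) picks a random word of three or more letters that occurs exactly once in the article text:

```python
import random
import re

def pick_target_word(article_text, min_len=3):
    """Return a random word of min_len+ letters that appears exactly once
    in the text (case-insensitive), per the selection rule above."""
    words = re.findall(r"[A-Za-z]+", article_text)
    counts = {}
    for w in words:
        counts[w.lower()] = counts.get(w.lower(), 0) + 1
    candidates = [w for w in words
                  if len(w) >= min_len and counts[w.lower()] == 1]
    return random.choice(candidates)
```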


2.5. Evaluation scenarios

As stated above, participants performed the Reading Comprehension and Word Search tasks while either walking on a treadmill or navigating a path around a room. Both evaluation scenarios are described in detail below.

2.5.1. Treadmill

Since the intention of using the treadmill was to simulate a normal walking scenario, the first step, once participants had been introduced to the safety features of the treadmill, was to establish a comfortable walking speed. The experimenter started the treadmill at 1.5 miles per hour (mph; 2.4 km/h) and allowed the participant to adjust the speed up or down in 0.1 mph (0.16 km/h) increments until he/she felt that it approximated his/her normal walking speed. Normal walking speed was described to participants as the rate at which he/she would typically walk with the purpose of arriving at a specific destination when not in a hurry, such as walking from one class to another. This speed was recorded by the experimenter. After a walking speed had been established, the participant stepped off of the treadmill and was introduced to the tasks to be performed on the mobile device.

After task orientation, the experimenter started the treadmill at the established speed and allowed the participant to walk for approximately 20 seconds in order to establish a stride and allow the treadmill speed to stabilize. The participant was then given the PDA and time to read the instructions. The participant began by performing practice trials (one passage and one question for the Reading Comprehension task, five trials for the Word Search task) and then verbally indicated that he/she understood the task and was ready. The experimenter then told him/her to begin by pressing a button on the PDA screen. The treadmill speed was kept constant for each participant throughout the experiment. Each participant was told that he/she could decrease the speed (or stop the experiment entirely) if it became too physically strenuous; however, no participant did so. Handrails were available for support and safety, but were not used during the tasks. Additionally, the treadmill was kept parallel to the floor, as the incline feature was not used. When the participants completed each task, they stopped the treadmill by pressing a large red stop button and handed the PDA to the experimenter. This process was repeated for each scenario. The treadmill condition is demonstrated in Fig. 4.

Fig. 4. Demonstration of the treadmill scenario.

2.5.2. Walking

The walking tasks took place in the same room as the treadmill tasks. A one-foot-wide path was taped to the floor, guiding participants around an approximately 30 ft by 30 ft room, with furniture in the room serving as obstacles. A diagram of the path is shown in Fig. 5. All elements of the diagram are to scale, except for the width of the tape, which has been exaggerated for visual clarity. A photograph of the course is shown in Fig. 6.

The walking task began by establishing a normal walking speed for each participant, similar to the treadmill condition. Each participant walked around the path twice, starting out walking counter-clockwise, then reversing direction after one lap until he/she returned to the starting point. This was done in order to familiarize participants with the path as well as to establish a reference speed for future use. The experimenter recorded the time that each participant took to complete the two laps.

Participants were then introduced to the task they would be performing on the mobile device and given a chance to perform several practice trials while following the path. Once participants had completed the practice trials, they returned to the starting point and began the recorded trials. The tasks often required participants to complete multiple laps around the course. Unlike the treadmill condition, participants were allowed to walk at varying speeds while performing the tasks. The only instructions they were given about their walking were that they had to keep moving and that they should take care not to step on or outside of the tape. The latter instruction was intended to encourage participants to pay attention to the path. The experimenter kept track of the number of times each participant stepped on the tape, but made no effort to limit the occurrences. The total number of complete and partial laps was recorded by the experimenter and converted to distance (in feet). The direction the participant walked alternated between scenarios, but stayed constant within a scenario (meaning that once he/she started walking in one direction, he/she continued in that direction until the task was complete).

Fig. 5. The path used in the walking condition.

Fig. 6. A partial view of the walking course.

2.5.3. Lighting

The lighting level was varied in both the treadmill and walking evaluation scenarios in order to simulate contextual changes in lighting experienced in actual mobile device use. Each participant performed one set of trials in each evaluation scenario with the room lights at full illumination (approximately 260 lux) and another set of trials with only one-third of the overhead lights turned on (approximately 85 lux). Both evaluation scenarios were conducted in the same room, so the lighting effect was consistent across scenarios. The order in which the lighting level changed was randomized across participants, as was the allocation of questions between lighting scenarios in the Reading Comprehension task. In other words, participants experienced the same questions overall, but the lighting scenario in which they saw a given question varied.

2.6. Objective measures recorded

Four performance measures of time and accuracy were recorded for the Reading Comprehension task and two measures were recorded for the Word Search task. Each measure is listed and described in Table 3.

Table 3
Objective performance measures gathered in the study

Reading comprehension:
  Reading time: Average duration from when the text passage was displayed until the 'Done' button was pressed.
  Response time: Average time from when the answer screen was displayed until the 'Submit' button was pressed.
  Scrolls: Average number of times a scroll arrow was pressed, either an on-screen arrow or a physical arrow button on the device.
  Score: Number of correct answers selected (out of 10).

Word search:
  Time: Average time from when the passage was displayed until a line on the screen was selected.
  Score: Number of correct selections containing the target word (out of 10).
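The timing measures in Table 3 are simple averages over per-trial event timestamps. A minimal sketch of the derivation follows (the log values are hypothetical, and this is not the study's instrumentation code):

```python
from statistics import mean

# Hypothetical per-trial timestamps, in seconds from the start of the task:
# when each passage appeared and when the participant pressed 'Done'.
passage_shown = [0.0, 35.2, 71.9, 110.4]
done_pressed = [28.1, 63.0, 99.4, 139.8]

# Reading time: average duration from passage display to the 'Done' press.
reading_time = mean(d - s for s, d in zip(passage_shown, done_pressed))
print(f"Reading time: {reading_time:.2f} s")
```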

2.7. Subjective measures recorded

2.7.1. NASA-TLX

After each task, participants filled out the NASA-TLX workload assessment (Hart and Staveland, 1988). All six subscales were used, resulting in subscores for mental demand, physical demand, temporal demand, effort, performance and frustration, as well as an overall workload score calculated from the weighted subscores.
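For readers unfamiliar with the weighting step, the sketch below illustrates the standard NASA-TLX procedure (Hart and Staveland, 1988) under the usual assumptions of 0-100 subscale ratings and weights of 0-5 obtained from the 15 pairwise subscale comparisons; it is an illustration, not the authors' scoring code. Each weighted subscore is the raw rating times its weight, and the overall score is the weighted sum divided by 15, which rescales it to 0-100:

```python
SUBSCALES = ["mental", "physical", "temporal",
             "effort", "performance", "frustration"]

def tlx_scores(ratings, weights):
    """Return (weighted subscores, overall 0-100 workload score)."""
    assert sum(weights[s] for s in SUBSCALES) == 15  # 15 pairwise comparisons
    weighted = {s: ratings[s] * weights[s] for s in SUBSCALES}
    overall = sum(weighted.values()) / 15.0
    return weighted, overall

# Hypothetical participant: raw ratings (0-100) and comparison-derived weights.
ratings = {"mental": 70, "physical": 20, "temporal": 40,
           "effort": 55, "performance": 30, "frustration": 35}
weights = {"mental": 4, "physical": 1, "temporal": 3,
           "effort": 4, "performance": 2, "frustration": 1}
weighted, overall = tlx_scores(ratings, weights)
```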

2.7.2. Post-task questionnaire

After completing both tasks, the final phase of the experiment was to fill out a questionnaire about the tasks recently completed. For each task (Reading Comprehension and Word Search), participants were asked the following questions:


Please read the following statement and indicate how strongly you agree or disagree with the statement by selecting the corresponding number on the scale, ranging from 1 (strongly agree) to 7 (strongly disagree).

  "I was able to successfully accomplish the goals of the task."
  Strongly agree 1 2 3 4 5 6 7 Strongly disagree

Please rate the overall level of difficulty of the task, ranging from 1 (very easy) to 7 (very difficult).

  Very easy 1 2 3 4 5 6 7 Very difficult

3. Results

3.1. Objective performance measures

3.1.1. Reading comprehension task

The means and standard deviations for Reading Time, Response Time, Score and Scrolls for the treadmill and walking conditions in each of the two lighting scenarios for the Reading Comprehension task are listed in Table 4. The means for Reading Time and Response Time are plotted in Fig. 7, while the means for Score and Scrolls are plotted in Figs. 8 and 9, respectively.

Table 4
Means (std. dev.) for the objective measures for the reading comprehension task

Measure               Treadmill                   Walking
                      High light    Low light     High light    Low light
Reading time (sec.)   27.93 (8.89)  28.29 (8.64)  29.08 (8.59)  29.79 (9.26)
Response time (sec.)  17.71 (4.24)  18.10 (4.50)  17.11 (4.20)  18.48 (4.28)
Score (out of 10)     7.75 (1.74)   8.02 (1.42)   7.68 (1.47)   7.47 (1.65)
Scrolls               13.48 (4.75)  14.84 (4.63)  12.76 (3.60)  14.34 (4.14)

Fig. 7. Reading time and response time for the reading comprehension task.

Fig. 8. Score for the reading comprehension task.

Fig. 9. Scrolls for the reading comprehension task.

A repeated measures ANOVA was used to compare performance for two factors, each at two levels: evaluation condition (treadmill and walking) and lighting scenario (high light and low light), plus the interaction effects between them. The data met the assumptions required for using ANOVA.
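The design is mixed: evaluation condition is a between-subjects factor and lighting is a within-subjects factor. As a hedged sketch of how such an analysis can be run (not the authors' original analysis, which the paper does not specify at the software level), the pingouin package offers a mixed-design ANOVA; the data frame and column names here are hypothetical:

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: one row per participant x lighting level,
# with the between-subjects condition repeated on each row.
df = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "condition": ["treadmill"] * 6 + ["walking"] * 6,
    "lighting": ["high", "low"] * 6,
    "reading_time": [27.1, 28.0, 28.8, 28.5, 26.9, 27.7,
                     29.2, 30.1, 28.9, 29.6, 29.5, 30.4],
})

# Mixed ANOVA: between factor 'condition', within factor 'lighting'.
aov = pg.mixed_anova(data=df, dv="reading_time", within="lighting",
                     subject="participant", between="condition")
print(aov[["Source", "F", "p-unc"]])
```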

The results of the between-subjects comparison between the treadmill and walking evaluation conditions are presented in Table 5. These results are exclusive of the effects of changes in lighting level. The results indicate that no significant differences were present between the treadmill and walking conditions for any measure.

Table 5
ANOVA results for the main effects of evaluation condition for the objective measures in the reading comprehension task

Measure         F-value   p-Value
Reading time    0.545     0.46
Response time   0.016     0.90
Score           1.267     0.26
Scrolls         0.764     0.38

Table 6
ANOVA results for the main effects of lighting level for the objective measures in the reading comprehension task

Measure         F-value   p-Value
Reading time    0.457     0.50
Response time   5.662     0.02*
Score           0.020     0.88
Scrolls         4.905     0.03*

* Statistically significant at the 0.05 level.

The results of the within-subjects comparison between the lighting levels (shown in Table 6) indicated significant main effects of lighting level on Response Time and Scrolls: both Response Time and the number of times participants used the scroll bars were significantly higher (p = 0.02 and p = 0.03, respectively; indicated by *) in the low light scenario than in the high light scenario. There were no significant interaction effects between evaluation condition and lighting level for any measure.

An additional analysis was performed to determine the effect of question order on Response Time for each evaluation condition, since participants answered two questions for each reading passage. This was intended to assess participants' ability to retain the information they processed while reading, since they were not able to return to the passage once they moved on to the questions. The average response time (in seconds) for each question by evaluation condition is listed in Table 7 and plotted in Fig. 10.

Table 7
Mean response time (std. dev.) by question for each condition

Order              Treadmill     Walking
Question 1 (sec.)  17.82 (4.78)  17.12 (4.30)
Question 2 (sec.)  17.97 (5.07)  18.45 (5.25)

Fig. 10. Average response time by condition for each question.

The results of individual ANOVA comparisons for each condition are provided in Tables 8 and 9.

Table 8
Effect of question order on response time in the treadmill condition

Measure   F-value   p-Value
Order     0.089     0.76

Table 9
Effect of question order on response time in the walking condition

Measure   F-value   p-Value
Order     6.954     0.01*

* Statistically significant at the 0.05 level.

These results clearly indicate that participants in the treadmill condition took about the same amount of time to answer the second question as the first, whereas participants in the walking condition took significantly longer (p = 0.01; indicated by *) to answer the second question than the first. Within each randomly ordered reading passage, the order of the two questions was the same for all participants, and all participants saw the same reading passages and questions. Since the ordering of the questions within a passage was the same, this result indicates that participants in the walking condition needed significantly more time to respond to a question that came later in the sequence than to one that came earlier, after reading a passage. Participants in the treadmill condition did not demonstrate this effect.

3.1.2. Word search task

The means and standard deviations for Time and Score for the treadmill and walking conditions in each of the two lighting scenarios for the Word Search task are listed in Table 10, followed by graphs of Time and Score in Figs. 11 and 12, respectively.

Table 10
Means (std. dev.) for time and score for the word search task

Measure             Treadmill                 Walking
                    High light   Low light    High light   Low light
Time (sec.)         3.90 (1.35)  3.90 (1.15)  4.03 (0.81)  4.67 (1.23)
Score (out of 10)   9.80 (0.46)  9.73 (0.54)  9.66 (0.58)  9.55 (0.64)

Fig. 11. Time for the word search task.

Fig. 12. Score for the word search task.


For the Word Search task, similar to the Reading Comprehension task, a repeated measures ANOVA was used to investigate whether any significant main effects existed for evaluation condition (treadmill vs. walking) and lighting (high light vs. low light) on performance, and to determine whether any significant interaction effects were present. The data met the assumptions required for using ANOVA.

Results of the repeated measures ANOVA between the treadmill and walking conditions are presented in Table 11. The results indicate a significant main effect of evaluation condition for the Time measure (p = 0.05); participants in the walking condition took longer than those in the treadmill condition. However, the presence of a significant lighting × evaluation condition interaction effect (p = 0.01) was also detected (this effect can be seen in Fig. 11). No significant difference in the Score measure was detected between the treadmill and walking conditions, nor was there a significant interaction effect between evaluation condition and lighting level for the Score measure.

Table 11
ANOVA results for the main effects of evaluation condition for the objective measures in the word search task

Measure   F-value   p-Value
Time      3.962     0.05*
Score     2.512     0.11

* Statistically significant at the 0.05 level.

Table 12
ANOVA results for the main effects of lighting level for the objective measures in the word search task

Measure   F-value   p-Value
Time      7.118     0.01*
Score     1.349     0.24

* Statistically significant at the 0.05 level.

The results from the within-subjects comparison between lighting levels (shown in Table 12) indicated a significant main effect of lighting level for the Time measure, which was significantly higher (p = 0.01) in the low light scenario than in the high light scenario. Score was not shown to be significantly different between lighting scenarios.

3.2. Subjective measures—NASA-TLX

3.2.1. Reading comprehension task

The average weighted NASA-TLX workload sub-scores (accompanied by the corresponding standard deviations in parentheses) for each condition in each of the lighting scenarios are listed in Table 13, and are represented graphically in Fig. 13. Average weighted responses to all six subscales are reported, as well as an overall score, which is derived from the individual sub-scores and suggests the overall workload that the participant experienced. The overall workload score is scaled to range from 0 to 100.

Table 13
Average weighted TLX scores (std. dev.) by condition for the reading comprehension task

Measure           Treadmill                Walking
                  High light   Low light   High light   Low light
Mental demand     269 (120)    274 (128)   266 (105)    276 (113)
Physical demand   93 (92)      88 (85)     70 (82)      75 (87)
Temporal demand   151 (106)    158 (105)   135 (139)    134 (138)
Effort            175 (96)     185 (104)   220 (101)    237 (102)
Performance       97 (96)      98 (92)     108 (81)     124 (88)
Frustration       98 (110)     102 (113)   163 (128)    182 (140)
Overall           59 (10)      60 (13)     64 (12)      69 (11)

Fig. 13. Comparison of TLX subscales between conditions for the reading comprehension task.

The weighted sub-scores did not meet the assumptions required for ANOVA, so the Mann–Whitney test was used to determine the presence of significant differences between the treadmill and walking conditions. The results of this analysis are presented in Table 14, where significant differences (p ≤ 0.05) are indicated by *. Significant differences between the treadmill and walking conditions were observed for Effort (p = 0.03), Frustration (p = 0.01), and the overall score (p < 0.01). In all cases of significance, participants rated the walking condition higher than the treadmill condition in terms of workload.

Table 14
Results from the Mann–Whitney test for the NASA-TLX in the reading comprehension task between the treadmill and walking conditions

Measure           Z-value   p-Value
Mental demand     -0.214    0.83
Physical demand   -1.177    0.23
Temporal demand   -1.070    0.28
Effort            -2.106    0.03*
Performance       -1.392    0.16
Frustration       -2.452    0.01*
Overall           -2.915    <0.01*

* Statistically significant at the 0.05 level.

In order to compare the two lighting scenarios, the Wilcoxon Signed Ranks test, which is used for within-subjects samples, was employed. The results are presented in Table 15 and indicate that the lower lighting level induced significantly higher workload for Physical Demand (p = 0.05), Effort (p < 0.01), Performance (p = 0.02), Frustration (p < 0.01), and the overall score (p < 0.01).

Table 15
Results from the Wilcoxon Signed Ranks test for the NASA-TLX in the reading comprehension task between lighting levels

Measure           Z-value   p-Value
Mental demand     -1.357    0.17
Physical demand   -1.951    0.05*
Temporal demand   -1.868    0.06
Effort            -3.228    <0.01*
Performance       -2.342    0.02*
Frustration       -3.094    <0.01*
Overall           -3.037    <0.01*

* Statistically significant at the 0.05 level.
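As an illustration of this within-subjects test (a sketch with hypothetical numbers, not the authors' code), scipy's Wilcoxon signed-rank test operates on the paired per-participant scores from the two lighting levels:

```python
from scipy import stats

# Hypothetical paired overall TLX scores: each participant experienced both
# lighting levels, so the samples are dependent (within-subjects).
overall_high = [55, 62, 48, 70, 58, 66, 60, 52]
overall_low = [61, 64, 55, 72, 63, 71, 62, 59]

# Wilcoxon signed-rank test on the paired differences.
w_stat, p = stats.wilcoxon(overall_high, overall_low)
print(f"W = {w_stat:.1f}, p = {p:.3f}")
```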

3.2.2. Word search task

The average weighted NASA-TLX workload scores and standard deviations for each condition in each of the lighting scenarios are listed in Table 16, and are represented graphically in Fig. 14. The overall workload score is scaled to range from 0 to 100.

Table 16
Average weighted TLX scores (std. dev.) by condition for the word search task

Measure           Treadmill                Walking
                  High light   Low light   High light   Low light
Mental demand     116 (110)    116 (116)   136 (106)    150 (106)
Physical demand   93 (102)     91 (105)    60 (68)      76 (90)
Temporal demand   151 (123)    148 (122)   144 (135)    156 (135)
Effort            118 (89)     123 (96)    145 (101)    174 (96)
Performance       52 (64)      59 (67)     84 (84)      85 (89)
Frustration       51 (64)      59 (70)     85 (94)      111 (117)
Overall           39 (17)      40 (19)     43 (17)      50 (17)

Fig. 14. Comparison of TLX subscales between conditions for the word search task.

The results from the Mann–Whitney test comparing TLX scores between the treadmill and walking conditions are presented in Table 17. These results indicate significant differences between the treadmill and walking conditions for Effort (p = 0.03), Performance (p = 0.05), Frustration (p = 0.05), and the overall score (p = 0.04). Again, in all cases of significance, participants rated the walking condition higher than the treadmill condition in terms of workload.

Table 17
Results from the Mann–Whitney test for the NASA-TLX in the word search task between the treadmill and walking conditions

Measure           Z-value   p-Value
Mental demand     -1.400    0.16
Physical demand   -1.040    0.29
Temporal demand   -0.200    0.84
Effort            -2.060    0.03*
Performance       -1.936    0.05*
Frustration       -1.949    0.05*
Overall           -1.995    0.04*

* Statistically significant at the 0.05 level.

For the comparison between lighting scenarios, the Wilcoxon Signed Ranks test indicated a significant detrimental effect of the lower light scenario on workload for Effort (p < 0.01), Frustration (p = 0.01), and the overall score (p < 0.01), as shown in Table 18.

Table 18
Results from the Wilcoxon Signed Ranks test for the NASA-TLX in the word search task between lighting levels

Measure           Z-value   p-Value
Mental demand     -1.898    0.06
Physical demand   -1.385    0.16
Temporal demand   -1.746    0.08
Effort            -3.188    <0.01*
Performance       -1.804    0.07
Frustration       -2.723    0.01*
Overall           -3.391    <0.01*

* Statistically significant at the 0.05 level.

3.3. Subjective measures—post-task questionnaire responses

The post-task questionnaire consisted of two questions about participants' ability to perform each task effectively. Responses were coded with numerical values between one and seven for comparison purposes. In all cases, a lower score indicated a more effective interaction, while a higher score indicated more difficulty and/or frustration.

3.3.1. Reading comprehension task

The average coded values for the post-task questions (out of seven), along with their associated standard deviations, for the Reading Comprehension task are presented in Table 19 and Fig. 15. In all cases, a higher value indicates a higher level of difficulty or challenge.

Table 19
Average post-task questionnaire response scores (std. dev.) by condition for the reading comprehension task

Measure              Treadmill     Walking
Degree of success    3.39 (1.51)   4.13 (1.54)
Overall difficulty   4.32 (1.32)   4.82 (1.15)

Fig. 15. Post-task questionnaire responses by condition for the reading comprehension task.

As with the NASA-TLX data, the post-task questionnaire data failed to meet the assumptions required for parametric analysis, so the Mann–Whitney test was used to compare the two conditions. A significant difference was noted in the degree of success (p = 0.03) between conditions, where participants indicated they were better able to accomplish the goals of the task in the treadmill condition. These results are shown in Table 20.

Table 20
Results from the Mann–Whitney test for responses to the post-task questionnaire in the reading comprehension task

Measure              Z-value   p-Value
Degree of success    -2.112    0.03*
Overall difficulty   -1.637    0.10

* Statistically significant at the 0.05 level.


3.3.2. Word search task

The average coded values for the post-task questions (out of seven), along with their associated standard deviations, for the Word Search task are presented in Table 21 and Fig. 16. In all cases, a higher value indicates a higher degree of difficulty or challenge.

Table 21
Average post-task questionnaire response scores (std. dev.) by condition for the word search task

Measure              Treadmill     Walking
Degree of success    2.52 (2.06)   2.47 (1.78)
Overall difficulty   1.89 (0.89)   2.37 (1.26)

Fig. 16. Post-task questionnaire responses by condition for the word search task.

For the Word Search task, the Mann–Whitney test did not identify any significant differences between conditions for either measure. These results are summarized in Table 22.

Table 22
Results from the Mann–Whitney test for responses to the post-task questionnaire in the word search task

Measure              Z-value   p-Value
Degree of success    -0.690    0.49
Overall difficulty   -1.758    0.07


4. Discussion

4.1. Differences between evaluation scenarios

Of particular interest in the objective measures was the difference between conditions in the time taken to answer the second of the two questions for each Reading Comprehension passage. Participants in the walking condition took significantly longer to answer the second Reading Comprehension question than the first, whereas participants in the treadmill condition took about the same amount of time to answer the first and second questions. However, the score on the second question, that is, whether the participant answered it correctly or not, did not show any significant difference between the two conditions, implying that participants took more time on the second question in the walking condition but that this did not translate to greater accuracy. This result supports the idea that information was not retained as well in the walking condition as time elapsed, and that it may have been more susceptible to interference due to motion (and due to reading and answering the first question).

A clearer picture of the two experiences can be uncovered by looking at the subjective measures. Results from the NASA-TLX workload assessment indicated that participants in the walking scenario experienced significantly higher levels of frustration, exerted effort, and overall workload for both tasks. Additionally, the NASA-TLX results from the Word Search task showed that participants felt that maintaining adequate performance during the walking scenario imposed more workload than in the treadmill scenario. This is consistent with the actual difference in performance between conditions, as measured by task time, in the Word Search task. Overall, participants tended to perceive a greater level of challenge in the walking scenario than in the treadmill scenario.

Looking at the effects of varying the lighting level is also very revealing. As anticipated, decreasing the lighting level available to the participants had a detrimental effect on performance, in general, and on specific workload measures. What is of particular interest is the presence of interaction effects between the lighting scenarios and evaluation conditions, where the lighting effect was not constant across evaluation conditions, being greater in the walking condition than in the treadmill condition. Several graphs show instances where the combination of low lighting and the walking condition tended to hinder performance or increase workload much more than any other combination, for example: Time for the Word Search task (Fig. 11), Frustration for the Reading Comprehension task (Fig. 13), and Effort for the Word Search task (Fig. 14).

For both tasks, the subjective factors of mental demand, physical demand, and temporal demand did not reveal any significant differences between evaluation conditions. The lack of differences in perceived physical demand may be an indication that the treadmill adequately simulates physical motion for these tasks. The lack of differences in mental and temporal demand could indicate that these factors are more task-dependent than context-dependent.


The post-task questionnaire revealed results that were consistent with the TLX results. In the Reading Comprehension task, participants indicated a significantly greater degree of disagreement with the statement "I was able to successfully accomplish the goals of the task" in the walking scenario than in the treadmill scenario. This was likely due to the additional challenge of navigating in the walking condition. Additionally, the overall level of difficulty was rated higher in the walking scenario for both tasks, although not significantly so. In the case of the Word Search task, the fact that the task was relatively simple may have muted any potential differences in either category.

This study has clearly revealed that two viable mobile evaluation scenarios yield differential influences on participant behaviors and experiences. The findings from this study enable evaluators to more clearly distinguish the similarities and differences between specific mobile evaluation scenarios. These results supplement the results from Kjeldskov and Stage (2004), who found no differences in performance between motion conditions but several differences in NASA-TLX ratings, and further show results from additional objective and subjective measures. The results from this study also support the findings of Mustonen et al. (2004), who indicate that subjective measures are more sensitive to changes in conditions than performance measures when looking at motion. These previous results, and the results from this study, indicate that for mobile computing scenarios, participants strive to maintain an adequate level of performance, even as the scenario becomes more difficult. Supplementing objective measures with subjective measures is key to uncovering all aspects of the task experience.

4.2. Additional measures afforded by the walking scenario

In addition to subjective measures available for detecting changes in required attentional resources, the walking scenario was able to contribute two additional measures for comparison. As stated in the methodology, a one-foot-wide taped path was used as a navigation guide for participants in the walking condition. Participants were instructed to keep both feet entirely inside the path at all times, and the number of times they stepped on or outside of the tape was recorded. The number of times they stepped on the tape can play a key role in determining the tradeoff that participants are making between attention to walking and attention to the task, and is a measure that cannot be collected in the treadmill scenario. Several attempts have been made to establish equivalent attentional measures for the treadmill scenario, but these attempts were not successful due to the limited degrees of freedom in the treadmill scenario. Abnormal behaviors, such as grabbing a rail while walking on the treadmill, were observed during the experiment, and these might be considered equivalent to stepping on the tape in the walking scenario. However, such behaviors were only observed on two occasions; the vast majority of participants finished the tasks without any problems or unusual behaviors. Thus, the number of instances of these behaviors could not be used as an equivalent attentional measure. In this study, the number of times participants stepped on the tape was significantly inversely correlated (r = −0.401, p = 0.01) with the number of times they used the scroll bars in the Reading Comprehension task, possibly indicating that participants who paid more attention to the path (fewer steps on the tape) tended to lose their place in the reading passage more often (resulting in more scrolling to re-establish the point where they left off). This is one of several eye-opening results that cannot be revealed through exclusive use of a treadmill scenario, and it indicates that attentional focus may be easier to measure in a controlled walking scenario.


Table 23
Means and standard deviations of walking speeds in the walking scenario (miles per hour; km/h in parentheses)

                              Baseline      Reading comprehension   Word search
Mean walking speed            2.19 (3.52)   1.53 (2.46)             1.36 (2.19)
Standard deviation of speed   0.32 (0.51)   0.29 (0.46)             0.38 (0.61)



Additionally, since participants in the walking scenario were allowed to walk at whatever speed felt most comfortable to them, as well as speed up and slow down at will, the relative speed at which they walked can also provide an indication of how they were allocating attentional resources. This speed can be easily compared to the baseline speed they established without a secondary task, and the percentage reduction in speed can be recorded. For example, in this study, participants in the walking scenario reduced their speed by an average of 30% in the Word Search task and 37% in the Reading Comprehension task, as compared to their recorded baseline speed (refer to Table 23 for the actual speeds). This measure also cannot be easily assessed using a treadmill. The ability to have participants regulate their own speed also increases the setting representativeness (also referred to as ecological validity), which is, according to Kantowitz (1992), "the coherence between the test situation in which research is performed and the target situation in which research must be applied" (p. 390).

5. Conclusions

The results from this study further the state of knowledge in mobile device evaluation by weighing the tradeoffs inherent to two practical use-in-motion evaluation scenarios. This study is most similar to Kjeldskov and Stage (2004), but contributes additional rigor by employing two separate tasks, two lighting scenarios, additional objective and subjective measures, and more than double the number of participants (a limited number of participants was listed as a shortcoming in Kjeldskov and Stage (2004)). This enables guidelines (though nascent in form) to be proposed for mobile device evaluation when device output is expected to play a significant role in the interaction. Both highly representative (i.e. the walking task in this study) and mildly representative (i.e. the treadmill scenario) scenarios can be appropriate for mobile device evaluation, but the best choice depends on the study objectives and the resources available.



A treadmill is an appropriate mobile evaluation tool when performance times or measures are of primary interest. Examples of where it would be well used include task modeling, calculating task completion times for productivity investigations, and predictive models like Card, Moran and Newell's Keystroke-Level Model (KLM) (Card et al., 1983). Exceptions could be found in cases where the mobile tasks or typical use environments are particularly difficult or stressful, or take place in time- or safety-critical situations. In such cases, a treadmill would be unlikely to induce adequate stress to affect performance as much as the actual situation would. Also, if additional contextual factors are to be studied, using a treadmill may be inappropriate. In this study, the treadmill condition was shown to be less sensitive to changes in lighting level than the walking condition. While this may be advantageous when the experimenter is concerned with fluctuating noise factors, it could dilute the effects of other contextual factors used as independent variables.

The walking scenario appeared to yield more accurate performance and subjective measures, but at an additional cost to the experimenter (in terms of setup difficulty and data collection effort). A walking scenario like the one used in this study is well suited to a comparison of mobile device designs (hardware or software), where minimal attention required of the user is desirable (as advocated by Pascoe et al., 2000) and it is valuable to capture attentional demands. In these cases, workload and other subjective measures between designs are particularly important, in addition to differences in task completion effectiveness. The walking scenario used in the study also appeared to be more sensitive to changes in other contextual factors (evidenced by lighting level, in this case), which could yield more realistic impressions of user behavior.

For example, if a new interface design is to be compared with an existing one, a walking scenario (combined with changes in lighting level) could be used to determine whether the new design is easier to use through several measures, such as performance, workload and questionnaire responses, and measures of attentional "shedding", such as navigation errors (denoted by the number of times participants stepped on the tape) and the percent reduction in speed as compared to participants' baseline speed. If the study objectives dictate that multiple prototypes be compared in terms of the minimal attention required to interact with them, a walking scenario such as the one described in this paper would be preferable to many other scenarios that have been used. Using a treadmill in this case, for example, would likely not yield accurate indicators of workload or perceived difficulty, and would prevent more realistic experiential comparisons of use in actual environments.

Varying the lighting level, to represent lighting changes that frequently occur in mobile computing use, is also a way to induce representative challenges in order to generate more representative results than common, contextually vacant evaluation scenarios. The results from this study indicate that lighting changes can have noticeable effects on behavior. Therefore, this easy-to-manipulate factor makes for an attractive independent variable choice in mobile evaluation studies. At the very least, the lighting level should match the anticipated use conditions.


Table 24
Summary of advantages and limitations of each methodology for specific study goals

Methodology: Treadmill set at comfortable walking speed
  Likely study types: Exploratory or modeling studies (more likely to be basic research)
  Likely study goals: Collect representative performance measures for modeling or baseline data collection
  Advantages:
    1. Easy to set up
    2. Much more appropriate than stationary scenarios
    3. High degree of control
    4. Objective measures in line with walking scenarios
  Limitations:
    1. Less appropriate for difficult tasks
    2. Does not adequately capture or represent attentional demands
    3. Subjective measures may not be valid
    4. Less sensitive to changes in other contextual factors
    5. Cannot easily model tasks that require stopping and starting

Methodology: Controlled walking scenario around a defined path
  Likely study types: Comparison or evaluation studies (more likely to be applied research)
  Likely study goals: Compare ease of use and/or attentional demands of mobile device designs (hardware or software)
  Advantages:
    1. Additional measures of attention can be employed
    2. More reliable subjective measures
    3. Objective measures may be more accurate, depending on tasks
    4. Can be modified to include stopping and starting tasks
  Limitations:
    1. More difficult to set up
    2. Requires adequate physical space
    3. Two experimenters preferable because of additional non-automated data collection


The preceding discussion is represented in Table 24 in the form of guidelines organized based on study goals. Again, input effectiveness was not measured, so the guidelines may be less applicable when input to the device is expected to significantly complicate the interaction.

6. Future work

Given the new results conveyed in this paper, there remain several areas of inquiry for future work that will further illuminate the role of context in mobile computing. For example, experimenting with changes to the walking course used in this study should be of interest. The changes could include using other types of path markers, different obstacles, a wider or narrower path, and so forth. Further, more contextual elements could be incorporated (similar to the lighting level that was manipulated in this study), but they must be selected and varied judiciously so as not to limit the validity of the results or lessen the degree of control. Similarly, other types of mobile devices, such as mobile phones or wearable computers, could be evaluated using the proposed evaluation scenarios. Because of the different form factors and interaction styles of these devices, different conclusions might be drawn. Finally, an outdoor walking scenario could be considered. This more realistic approach might introduce excessive experimental contamination, so further research in natural settings should be designed with extra care, and additional studies would be necessary to determine how valid results could be obtained using such a realistic approach. The results from these studies, together with the findings of the present study, would help to establish more rigorous and comprehensive guidelines for mobile computing evaluation.



Acknowledgements

The authors are greatly indebted to Young Sang Choi, Paula Edwards, Thitima Kongnakorn, V. Kathlene Leonard and Kevin Moloney for their crucial assistance and support. This material is based upon work supported by the National Science Foundation (NSF) under Grant No. IIS-0121570. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References

Aerobics Inc., 2002. Pacemaster ProElite. Retrieved June 2, 2004, from http://www.pacemaster.com/proelite.htm

Alton, F., Baldey, L., Caplan, S., Morrissey, M.C., 1998. A kinematic comparison of overground and treadmill walking. Clinical Biomechanics 13 (6), 434–440.

Barnard, L., Yi, J., Jacko, J.A., Sears, A., 2004. Capturing the effects of context on human performance in mobile computing systems. Submitted for publication.

Brewster, S., 2002. Overcoming the lack of screen space on mobile computers. Personal and Ubiquitous Computing 6, 188–205.

Bristow, H.W., Baber, C., Cross, J., Knight, J.F., Woolley, S., 2004. Defining and evaluating context for wearable computing. International Journal of Human–Computer Studies 60, 798–819.

Card, S.K., Moran, T.P., Newell, A., 1983. The Psychology of Human–Computer Interaction. Lawrence Erlbaum Associates, Inc., Hillsdale, NJ.

Dunlop, M., Brewster, S., 2002. The challenge of mobile devices for human computer interaction. Personal and Ubiquitous Computing 6, 235–236.

Ebersbach, G., Dimitrijevic, M.R., Poewe, W., 1995. Influence of concurrent tasks on gait: a dual task approach. Perceptual and Motor Skills 81, 107–113.

Hart, S.G., Staveland, L.E., 1988. Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. In: Hancock, P.A., Meshkati, N. (Eds.), Human Mental Workload. Elsevier (North-Holland), Amsterdam, The Netherlands, pp. 139–183.

Hinckley, K., Pierce, J., Sinclair, M., Horvitz, E., 2000. Sensing techniques for mobile interaction. In: Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology, San Diego, CA.

iAnywhere Solutions, 2003 (November 10). AvantGo. Retrieved May 26, 2004, from http://www.avantgo.com/frontdoor/index.html

Johnson, P., 1998. Usability and mobility: interactions on the move. Paper presented at the 1st Workshop on Human Computer Interaction with Mobile Devices, Glasgow, Scotland.

Kantowitz, B.H., 1992. Selecting measures for human factors research. Human Factors 34 (4), 387–398.

Kjeldskov, J., Graham, C., 2003. A review of mobile HCI research methods. Paper presented at the Fifth International Symposium on Human Computer Interaction with Mobile Devices and Services, Udine, Italy (September 8–11).

Kjeldskov, J., Stage, J., 2004. New techniques for usability evaluation of mobile systems. International Journal of Human–Computer Studies 60, 599–620.

Kjeldskov, J., Skov, M.B., Als, B.S., Hoegh, R.T., 2004. Is it worth the hassle? Exploring the added value of evaluating the usability of context-aware mobile systems in the field. In: Proceedings of the Sixth International Mobile HCI Conference, Glasgow, Scotland.

Kristoffersen, S., Ljungberg, F., 1999. Designing interaction styles for a mobile use context. Paper presented at HUC '99, Karlsruhe, Germany.

Lajoie, Y., Teasdale, N., Bard, C., Fleury, M., 1993. Attentional demands for static and dynamic equilibrium. Experimental Brain Research 97, 139–144.

Learning Express, 1999. 501 Reading Comprehension Questions. LearningExpress, New York, NY.

Lee, H.C., Cameron, D., Lee, A.H., 2003. Assessing the driving performance of older adult drivers: on-road versus simulated driving. Accident Analysis and Prevention 35 (5), 797–803.

Mantyjarvi, J., Seppanen, T., 2003. Adapting applications in handheld devices using fuzzy context information. Interacting with Computers 15, 521–538.

Meister, D., 2004. Conceptual Foundations of Human Factors Measurement. Lawrence Erlbaum Associates, Mahwah, NJ.

Murray, M.P., Spurr, G.B., Sepic, S.B., Gardner, G.M., Mollinger, L.A., 1985. Treadmill vs. floor walking: kinematics, electromyogram, and heart rate. Journal of Applied Physiology 59, 87–91.

Mustonen, T., Olkkonen, M., Hakkinen, J., 2004. Examining mobile phone text legibility while walking. In: Proceedings of CHI 2004. ACM Press, Vienna, pp. 1243–1246.

Nigg, B.M., De Boer, R., Fisher, V., 1995. A kinematic comparison of overground and treadmill running. Medicine and Science in Sports and Exercise 27, 98–105.

Open eBook Forum, 2003. Q3 2003 eBook statistics (December 8). Retrieved May 20, 2004, from http://www.openebook.org/pressroom/pressreleases/q303stats.htm

Oulasvirta, A., 2004. Finding meaningful uses for context-aware technologies: the humanistic research strategy. CHI Letters 6 (1), 247–254.

Pascoe, J., Ryan, N., Morse, D., 2000. Using while moving: HCI issues in fieldwork environments. ACM Transactions on Computer–Human Interaction (TOCHI) 7 (3), 417–437.

Patla, A.E., 1997. Understanding the roles of vision in the control of human locomotion. Gait and Posture 5 (1), 54–69.

Pellecchia, G.L., 2003. Postural sway increases with attentional demands of concurrent cognitive task. Gait and Posture 18, 29–34.

Pirhonen, A., Brewster, S., Holguin, C., 2002. Gestural and audio metaphors as a means of control for mobile devices. In: Proceedings of the Conference on Human Factors in Computing Systems (CHI 2002), Minneapolis, MN. ACM Press, New York, NY, pp. 291–298.

Schache, A.G., Blanch, P.D., Rath, D.A., Wrigley, T.V., Starr, R., Bennell, K.L., 2001. A comparison of overground and treadmill running for measuring the three-dimensional kinematics of the lumbo-pelvic-hip complex. Clinical Biomechanics 16 (8), 667–680.

Schmidt, A., Aidoo, K.A., Takaluoma, A., Tuomela, U., Van Laerhoven, K., Van de Velde, W., 1999a. Advanced interaction in context. In: Gellersen, H.W. (Ed.), Proceedings of the Handheld and Ubiquitous Computing Conference. Springer, Berlin, pp. 89–101.

Schmidt, A., Beigl, M., Gellersen, H.W., 1999b. There is more to context than location. Computers and Graphics 23 (6), 893–901.

Sears, A., Lin, M., Jacko, J., Xiao, Y., 2003. When computers fade… pervasive computing and situationally induced impairments and disabilities. In: Stephanidis, C., Jacko, J. (Eds.), Human–Computer Interaction: Theory and Practice (Part II). Lawrence Erlbaum Associates, London, pp. 1298–1302.

Vogt, L., Pfeifer, K., Banzer, W., 2002. Comparison of angular lumbar spine and pelvis kinematics during treadmill and overground locomotion. Clinical Biomechanics 17 (2), 162–165.

York, J., Pendharkar, P.C., 2004. Human–computer interaction issues for mobile computing in a variable work context. International Journal of Human–Computer Studies 60, 771–797.