Inferring capabilities of intelligent agents from their external traits

Bart P. Knijnenburg

Eindhoven University of Technology, Human-Technology Interaction group

Martijn C. Willemsen

Eindhoven University of Technology, Human-Technology Interaction group

RUNNING HEAD: CAPABILITIES OF INTELLIGENT AGENTS

Corresponding Author’s Contact Information: Bart Knijnenburg: [email protected] Alternatively, contact Martijn Willemsen: [email protected]

Brief Authors’ Biographies:

Bart Knijnenburg is a cognitive psychologist with an interest in human-like, intelligent and adaptive systems and their interaction paradigm; he is a researcher/teacher in the Human-Technology Interaction group of the Eindhoven University of Technology (TU/e). Martijn Willemsen is a decision-making psychologist with an interest in the cognitive and interactive processes underlying decision making; he is an assistant professor in the Human-Technology Interaction group of the Eindhoven University of Technology (TU/e).

ABSTRACT

This paper presents an online experiment investigating the usability of an agent-based travel advisor. The results of this experiment show that users of an agent-based system form an anthropomorphic use image of the system. This use image instills certain responses from the users: users act human-like towards the system and try to exploit typical human-like capabilities that they believe the system may possess. Overestimation may occur when the system does not actually possess these capabilities.

Furthermore, the results show that the anthropomorphic use image is an integrated use image, as compared to the compositional use image that is formed when using conventional graphical user interfaces. This means that feedforward cues provided by the system do not instill user responses in a one-to-one manner, but that users instead integrate all cues into a single use image. Because of this integrated use image, users may provide responses that try to exploit capabilities that were not signaled in the system’s cues.

CONTENTS

1. THE POWER OF AGENT-BASED INTERACTION
2. THE COMPOSITIONALITY OF USER INTERFACES
3. AGENT-BASED USE IMAGES
3.1. Human-like Cues and Anthropomorphism
3.2. Consequences of the Use Image
4. THE FUNDAMENTAL PRINCIPLE OF AGENT-BASED INTERACTION
4.1. An Integrated Use Image?
4.2. Effects on Usability
5. EXPERIMENTAL SETUP
5.1. Procedure
5.2. Participants
5.3. Measures
6. RESULTS
6.1. Low versus High Capabilities
6.2. Overestimation
6.3. Existence of an Anthropomorphic Use Image
Human-like Responses
Capability-exploiting Responses
6.4. Integrated Use Image
7. CONCLUSION AND DISCUSSION
7.1. Discussion
APPENDIX: TABLES WITH DETAILED RESULTS
Regression A: First Person References
Regression B: Words per Chat Request
Regression C: Grammatical Correctness
Regression D: Capability-exploiting Responses

1. THE POWER OF AGENT-BASED INTERACTION

Many researchers believe that agent-based interaction, in which the user interacts with a virtual entity or agent using natural language, is the most promising new paradigm in human-computer interaction (Quintanar, Crowell, & Moskal, 1987; Bradshaw, 1997; Erickson, 1997; Walker, Sproull, & Subramani, 1994; Nickerson, 1976; Nilsson, 1995; Norman, 1997). Because agent-based interaction is richer, finer-grained and more natural than our interaction with current, more tool-like interfaces, it should better suit the complex tasks we currently try to perform using computers (Negroponte, 1997; Laurel, 1990; Williges, Williges, & Elkerton, 1987). At the same time, however, present agent-based interfaces seem to be unable to live up to these promises (Shneiderman, 1997; Dehn & van Mulken, 2000). The late MS Office agent “Clippy” was generally believed to be no more than a superficial gimmick that caused more annoyance than fluent interaction.

Are agent-based interfaces really that much different from our traditional Graphical User Interfaces (GUIs), or are they essentially the same? Why do researchers praise their theoretical usefulness (with some notable exceptions, e.g. Lanier, 1995), while users dislike existing agent-based applications? The goal of the current paper is to identify a fundamental principle that makes agent-based interaction different from traditional GUIs: whereas the usability of GUIs is determined by the formation of a compositional use image, agent-based interfaces have an integrated use image. We will provide evidence that this integrated use image is the cause of both the theoretical advantage of agent-based interfaces and the disappointing nature of current applications (Dehn & van Mulken, 2000). Specifically, our research shows that human-like interfaces elicit user responses that try to exploit typical human-like linguistic capabilities. However, when the system’s actual capabilities cannot handle these responses, the system fails in terms of usability. Moreover, due to the “integratedness” of the use image, it may not be easy to fix this overestimation.

The paper will conclude by warning interaction designers about the pitfalls of agent-based interface design. Moreover, we will argue that usability researchers can use the integrated use image theory to investigate and improve the usability of agent-based systems.

2. THE COMPOSITIONALITY OF USER INTERFACES

Before analyzing the unique properties of agent-based interfaces, let us first consider how users interact with traditional GUIs. The first influential theory of human-computer interaction was formulated by Norman in his chapter on “Cognitive Engineering” (Norman, 1986). To explain why some systems are more usable than others, Norman argued that there are two gulfs between the user and the system: the gulf of execution and the gulf of evaluation (see Figure 1). The gulf of execution manifests itself if the user intends to perform a certain task and has to discover how to manipulate the system to accomplish this task. Based on a general intention, the user will formulate an action specification (a sequence of actions), and then perform these actions on the system. The
system then starts to do some processing, after which the gulf of evaluation emerges: this is the point where the user will interpret what the system has done, and whether this is in line with what the user wanted to happen. The user inspects the interface display, interprets its content, and evaluates it in relation to his or her goals. In effect, human-computer interaction is a sequence of bridged gulfs of execution and evaluation between the user and the system.

<FIGURE 1 ABOUT HERE>

Norman states that a system is usable if users are able to easily overcome (bridge) these gulfs. Users do this by forming a “use image”, a mental representation of the way the system works. A use image may contain some gaps and inconsistencies, and it rarely matches the actual internal workings of the system (which Norman calls the “system image”), but an appropriate use image helps users infer which interface actions will fulfill their goal (bridging the gulf of execution) and, after the system has handled the request, what the output of the system means (bridging the gulf of evaluation).

According to Norman, the formation of a useful and adequate use image is greatly facilitated by providing appropriate feedback and feedforward. For instance, clear labels on buttons (feedforward) let users infer what the system can do for them, and understandable output (feedback) allows them to see if the system actually did what they wanted it to do.

An operationalization of Norman’s use image theory is the Layered Protocol Theory (LPT) (Taylor, 1988). LPT decomposes user-system interaction into a set of layers that each have a different level of abstraction, consistent with Norman’s theory. On each successive layer, users’ intentions are broken down into smaller components. Brinkman (2003; Brinkman, Haakma, & Bouwhuis, 2007) argued that this compositional character of the interaction is reflected in the users’ use image, and that usability is therefore compositional. This implies that the use image of a graphical user interface is the sum of the use images of each widget (e.g. levers, buttons, text fields, scrollbars) that the interface is composed of. Brinkman showed that this assumption held for various interfaces, with two exceptions: if the use images were inconsistent or too numerous, the usability of the whole system was worse than the usability of its parts. In conclusion, users do not form a use image of the entire system, but rather a great number of tiny use images of the function of each widget, and they tend to lose track of these use images when the system is too big or inconsistently designed.

Although Brinkman (2003) was the first to formalize the compositional use image, it has been assumed by many other usability researchers and designers. Evaluation techniques like Heuristic Evaluation (Nielsen, 1994) explicitly evaluate the usability of parts of the interface, so the effectiveness of such techniques partially depends on the legitimacy of the compositional character of user interfaces.

3. AGENT-BASED USE IMAGES

Norman's usability theory is applicable to “real life” interfaces (e.g., doors and phones) as well as our current software interfaces (Norman, 1988). However, agent-based interfaces typically lack the common levers, buttons, text fields and scrollbars. How do users form a use image of an agent-based interface?

3.1. Human-like Cues and Anthropomorphism

Cook and Salvendy (1989) argued that users infer the use image of an agent-based system from the way it “looks” and “talks” and the apparent intelligence of its responses, just like they would do when interacting with other human beings. In the light of Norman’s theory, the agent’s cues in terms of appearance and language provide feedforward, and the actual system capabilities provide feedback. This notion is in line with Laurel (1990), who stated that a system shows signs of human-like intelligence when it shows human-like responsiveness (meaning that it is able to respond flexibly to incomplete requests), and when it shows human-like accessibility (meaning that it looks like a human being and uses grammatically correct sentences). We thus hypothesize that:

The more human-like the system looks (human-like appearance cues) and the more capabilities it displays (human-like capability cues), the more intelligent users believe the system to be (use image).

Examples of human-like appearance cues are a presentation of the agent as a human-like cartoon character and a grammatically correct sentence structure. Examples of human-like capability cues are the use of contextual references (words like “here” or “now”).

Previous studies have also found evidence that users attribute common human intelligence to systems that provide more human-like appearance and capability cues. For example, users of a system with a cartoon character that “talks” in full sentences and personifies itself believe that it shows some form of common human intelligence as well, while these users do not show a similar belief when using a system without such a cartoon character that talks “computerese” (De Laere, Lundgren, & Howe, 1998; Quintanar et al., 1987).

It is possible that this instilled use image will not hold up in the long run. Over time, the actual capabilities of the system provide feedback about the accuracy of the use image. Actual capabilities do not necessarily co-occur with capability cues; the system might display specific linguistic cues (for instance, using the word “here” to refer to the current location) without actually being able to understand such references in the user’s input. In effect, cues of human-like appearance and capabilities can underplay or overplay the agent’s actual capabilities.

What psychological mechanisms could underlie this use image of believed intelligence? Thompson (1980) found that users of a natural language-based system showed a tendency to anthropomorphize the behavior of the system: users treated the system as if it were a human being. Other research confirms this tendency to anthropomorphize systems that display human-like cues: Researchers have found that the
use of personalization, conversational tone, affective responses and diversified wording leads users to perceive agents as being more “human” (Quintanar et al., 1987; De Laere et al., 1998).

Agent-based systems are not the only systems subject to anthropomorphism. Users tend to apply the human metaphor whenever they are unable to explain the behavior of a computer, although they may not admit to engaging in negative anthropomorphism (e.g. shouting, physical threats), as it is seen as socially undesirable (Chin et al., 2005). Furthermore, users of computer systems seem to unconsciously adhere to social principles that normally apply between humans. This phenomenon has been thoroughly researched by Reeves and Nass (1996). For example, they show that computer users may exhibit a “politeness effect”: they evaluate a computer more positively when asked by that same computer than when asked by another computer.

Why would human beings try to be polite towards computers? Why would they start to “hate” malfunctioning computers, and shout at them? Bradshaw (1997) offers an explanation for this phenomenon: When a system’s behavior is too complex to understand, users are inclined to take the “intentional stance” (Dennett, 1987) when reasoning about these systems. Users attribute intentional behavior to systems as a convenient shortcut towards explaining complicated behavioral patterns, and this also leads them to adhere to human social principles.

Although the intentional stance holds for any type of system, traditional GUIs have a compositional use image that is not directly influenced by anthropomorphism. We believe, however, that:

In agent-based systems, the intentional stance is at the heart of the construction of a use image. The use image is an anthropomorphic construct, instilled by human-like cues.

In conclusion, the tendency to anthropomorphize an agent-based system allows users to infer a complex use image from the system’s appearance and capability cues. This phenomenon is depicted in the left part of Figure 2: human-like systems provide human-like cues, which implicitly allow users to construct a use image of believed intelligence (Erickson, 1997; Laurel, 1990; Murray & Bevan, 1985; Norman, 1997).

As the use image is a mental construct, we cannot directly observe whether it is actually anthropomorphic (as opposed to traditional). However, if the use image is anthropomorphic, users will interact with the agent in a way that is in accordance with human-human interaction.

<FIGURE 2 ABOUT HERE>

3.2. Consequences of the Use Image

The users’ formation of an anthropomorphized use image is of course not without consequences. We hypothesize two effects of this use image formation. First of all:

Since the use image of an agent is anthropomorphic, users will act in a more human-like way towards a system they believe to be more intelligent.

“Human-like responses” (the top right ellipse in Figure 2) are reactions that previous research has found to typically occur in our interactions with other humans, but to be lacking in human-computer interaction. Examples are the use of long and grammatically correct sentences. Some existing research seems to be in line with these expectations: verbosity is typically higher in human-human interaction than in human-computer interaction, and the use of a computer-simulated face and person-focused feedback leads users to be more verbose towards computer systems as well (Brennan, 1991; Rosé & Torrey, 2005; Walker et al., 1994; Richards & Underwood, 1984). Secondly:

Users will assume that systems they believe to be more intelligent also have more advanced capabilities, and they will try to exploit these capabilities.

If the system looks and behaves human, then these believed capabilities are typical human capabilities (the bottom right ellipse in Figure 2). An important category of human capabilities is the linguistic capability of implicit reference to the context of the conversation1 (Levinson, 1983; Halliday & Hasan, 1976). References are often used in human conversation to speed up the interaction. However, computers are notoriously bad at understanding such references (Winograd, 1972; Dey, 2001). In the case of agent-based interaction, users may believe that a human-like system, like a human being, can resolve such implicit references. Specifically, that it can infer the meaning of words referring to:

• the current or previously mentioned location, like “here” and “there” (spatial anaphora or place deixis);

• the current or previously mentioned time, like “now” or “then” (temporal anaphora or time deixis);

• a previously mentioned object, like “that trip”, “that ticket” or simply “that” (nominal anaphora or discourse deixis).

The right part of Figure 2 sums up the discussion above: the anthropomorphic use image leads users to act more human-like towards the agent-based system, and it implies the existence of certain advanced human-like capabilities (e.g. understanding implicit references), which users will in turn try to exploit.

1 This can refer to the textual context, in terms of “endophoric references” (also called anaphora for backward references or cataphora for forward references), as well as to the situational context, in terms of “exophoric references” (also called deixis).

4. THE FUNDAMENTAL PRINCIPLE OF AGENT-BASED INTERACTION

4.1. An Integrated Use Image?

If the use image of an agent-based system were just like the use image of a traditional GUI, it would also be compositional: there would be a one-to-one mapping from its cues to pieces of related functionality. Therefore, each cue would instill its own part of the use image, and induce a corresponding response in the users. In other words, the user would “mirror” the agent’s behavior and try to exploit a certain human-like capability if and only if the system provided a cue of this capability; e.g., the user would use contextual references (“here”, “now”) if and only if the system gave him or her an explicit cue that this is possible (by using contextual references in its responses to the user).

Brennan (1991) found support for such a one-to-one mapping. In her experiments with both human-human and natural language-based human-computer interaction, she found that participants were likely to show syntactic entrainment; a direct reflection of the conversation partner’s responses. According to Brennan’s findings, one could evoke a certain behavior in the users’ response by expressing the same behavior in the agent: “cue A” will instill “response A” and “cue B” will instill “response B”. This finding is in line with the theory that the use image is compositional. The left panel of Figure 3 shows a graphical representation of this theory.

However, the intentional stance (Dennett, 1987) suggests that users create an integrated use image based on the behavior of the system as a whole. If the system is considered sufficiently human-like, it will be attributed intentional behavior, and this attribution is based on the “humanness” of the agent as a whole, not on a specific part of its behavior. The intentional behavior that users attribute to the system gives them a “shortcut” to its functionality. Users can predict what the system can and cannot do because they know what their fellow humans can and cannot do; for example, they will infer that human-like systems are able to handle complex sentences with many implicit references.

In sum, users know how to ask questions in a way that the fellow human will understand, and this knowledge gives them a cue of how to interact with the agent. In terms of Norman’s usability theory: The fact that the “system” is “human” provides them instantaneously and effortlessly with a detailed use image of what it can do and how to interact with it. In the words of Laurel (1990): “[An agent-based interface] makes optimal use of our ability to make accurate inferences about how a character is likely to think, decide and act on the basis of its external traits. This marvelous cognitive shorthand is what makes plays and movies work … With interface agents, users can employ the same shorthand – with the same likelihood of success – to predict, and therefore control the actions of their agents.” (pp. 358-359)

Suppose users integrate the system cues they receive into a use image of believed intelligence. This suggests a more complex relation between system cues and user responses than the one-to-one mapping that the compositional use image theory predicts. As can be seen in the middle panel of Figure 3, in such a case, “cue B” does not only instill “response B”, but it can also instill “response C”. Furthermore, a cue that does not
have a directly related response (e.g. “cue A” in Figure 3) may still instill other responses. Also, responses not directly connected to a related cue (e.g. “response D” in Figure 3) can still be evoked by other cues. We will call this the “integrated use image theory”, since it suggests that the relation between system cues and user responses is mediated by the formation of an integrated use image. In effect, all cues about the intelligence of the system will be integrated and instill a series of possibly unrelated2 responses. The right panel of Figure 3 graphically represents this theory.

<FIGURE 3 ABOUT HERE>
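Purely as an illustration of the structural difference between the two theories (this is not a model from the paper, and all cue and response labels below are hypothetical), the compositional view can be caricatured as a direct cue-to-response lookup, whereas the integrated view routes every cue through a single intermediate judgment of believed intelligence:

```python
# Illustrative sketch only: a caricature of the compositional (one-to-one) versus
# the integrated (mediated by a single use image) cue-response mapping.
# Cue and response labels are hypothetical examples, not the study's coding scheme.

# Compositional theory: each cue directly instills exactly one matching response.
COMPOSITIONAL_MAPPING = {
    "agent_uses_place_deixis": "user_uses_place_deixis",
    "agent_uses_full_sentences": "user_uses_full_sentences",
}

def compositional_responses(cues):
    """A response is evoked only by its explicitly mapped cue (one-to-one)."""
    return {COMPOSITIONAL_MAPPING[c] for c in cues if c in COMPOSITIONAL_MAPPING}

def integrated_responses(cues):
    """All cues are first merged into one use image; that single judgment then
    licenses a whole bundle of responses, including ones that were never cued."""
    believed_intelligence = len(cues)  # crude stand-in for the integrated use image
    if believed_intelligence == 0:
        return set()
    return {
        "user_uses_place_deixis",
        "user_uses_full_sentences",
        "user_asks_multiple_questions",
    }

if __name__ == "__main__":
    appearance_only = ["agent_uses_full_sentences"]
    print(compositional_responses(appearance_only))  # only the directly cued response
    print(integrated_responses(appearance_only))     # cue-unrelated responses as well
```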

4.2. Effects on Usability

An integrated use image of agent-based interfaces can have both positive and negative effects on their usability (for an overview of selected studies in this field, see Dehn & van Mulken, 2000, and more recently Qiu & Benbasat, forthcoming).

On the positive side, the integrated use image makes agent-based interfaces especially suitable for performing complex tasks. Typical GUIs have a compositional use image with a one-to-one mapping from the layout of the interface to the functions of the system (Brinkman, 2003; Brinkman, Haakma, & Bouwhuis, 2007). The more complex the system gets, the more interface elements it requires, and the more difficult it becomes to remember the adequate use image for each widget. Therefore it might be impossible to create a really usable GUI for a complex system. If agent-based interfaces are subject to an integrated use image, they would instantly provide users with a heuristic to determine what they can and cannot do, and with an easy way to access this functionality. So instead of having to form a mini-use image for every widget in the interface, the users will identify the humanness of the interface, instantly infer its integrated use image, and act accordingly. In effect, the more human-like the system, the more usable it will be. In line with this, the experiments of Dharna, Thomas and Weinberg (2001) and Quintanar et al. (1987) found that more human-like interaction increased usability.

On the negative side, an integrated use image poses severe problems when it is incorrect. This occurs when the system cannot perform a certain function – or a number of functions – that the user expects the system to be able to perform based on the formed use image. If the system looks more capable than it actually is, users will overestimate the system’s capabilities, which will result in confusion and reduced usability, especially when it comes to learnability (Richards & Underwood, 1984; Brennan, 1991; Erickson, 1997; Forlizzi, Zimmerman, Mancuso, & Kwak, 2007; Walker et al., 1994; Shneiderman, 1981). Note that this idea of overestimation has been suggested by previous studies, but it has, to the best of our knowledge, never been explicitly tested.

Good usability only arises when the user tries to use the capabilities that the system provides, and does not try to use any capabilities the system does not provide. Since users base their interaction decisions on their use image of the system, this use image has to match the actual system capabilities (Norman, 1986). If the compositional use image theory held, it would be fairly easy to “manage” the use image so that it matches the actual system capabilities. For each actual capability, the system would have to provide a capability cue, so that the user would notice the presence of this capability and exploit it when necessary. However, the integrated use image theory implies that it is much harder to control the use image, as it suggests that there is more than just a one-to-one relation between cues and responses (see Figure 3). In effect, even human-like appearance cues may instill capability-exploiting responses: merely “looking human” may be enough to make users believe that the system has certain typically human-like capabilities (even if they are not actually present).

2 The responses are never completely unrelated to the cues, so with “unrelated” we here mean “not directly related”.

In sum, the presumed integrated use image is responsible for both the greatest advantage and the most significant drawback of agent-based interaction: due to our natural tendency to anthropomorphize, it is very easy to instantly create a complex, integrated use image from which users can effortlessly infer a myriad of complex functions they could possibly perform with the system, along with possible ways to exploit them. However, since these functions are not directly coupled to specific underlying cues, an overestimation effect can easily occur in which one or more of these functions are not actually available in the system, and it will be rather difficult to tweak this use image so that it perfectly matches the actual system capabilities.

5. EXPERIMENTAL SETUP

The main goal of the current paper is to empirically demonstrate that the use image of agent-based interfaces is more likely to be integrated rather than compositional. Moreover, we wanted to test our expectations about the consequences of an integrated use image on the users’ human-like and capability-exploiting responses. Finally, we wanted to evaluate the effect that these instilled responses have on the usability of the interaction.

How can we test whether the use image is integrated? As shown in Figure 3, we expect that in the case of an integrated use image, certain types of cues will invoke responses that are not directly related (in a one-to-one manner) to those cues. For example, a human-like appearance cue given by the agent (such as the use of full sentences) might evoke capability-exploiting responses from the user that are not directly related to the cue (such as implicit references to the context of time and place).

By independently varying cues and actual capabilities, we can test the effect of the match between instilled and actual capabilities on the usability of the system. As we discussed above, usability should go down whenever the system does not have the capabilities instilled by the cues.

Specifically, the actual “intelligence” of the system can be varied in two “actual capability” conditions. These capabilities are related to the type of user input that the system can handle: a system with “low capabilities” that can only process simple, complete requests, and a “high capabilities” system that can process complex requests
with implicit (deictic or anaphoric) references, stated in convoluted natural language, just like a human being would be able to handle.

Moreover, the cues provided by the system can be varied in three different “cues” conditions (see Figure 4): a “computer-like cues” condition in which the system looks and behaves like a computer, a “human-like appearance cues” condition in which the system looks like a human being, and a “human-like appearance and capability cues” condition in which the system not only looks human but also behaves like one. In the latter condition, the system displays capabilities in its own responses that are related to the capabilities available in the “high capabilities” condition. For instance, the agent using place deixis (e.g. the word “here”) in its own sentences (a capability-displaying cue) reflects the actual system capability to understand place deixis when used by the user3.

The combination of two “actual capability” conditions and three “cues” conditions allows us to test the four hypotheses outlined below. Hypotheses 1 and 2 are mostly instrumental; Hypotheses 3 and 4 are of main concern for the theoretical argument of this paper.

First, it should be easier for the user to use a system with high capabilities, regardless of the cue level provided. Therefore:

H1. Usability in the “high capabilities” condition should be higher than in the “low capabilities” condition.

Overestimation might occur as a possible negative consequence of the presumed integrated use image if the system does not possess the capabilities that the user might infer from the system’s cues. Therefore:

H2. Within the “low capabilities” conditions, the occurrence of “human-like appearance cues” and “human-like appearance and capability cues” should lead to lower usability than “computer-like cues”, because users will overestimate the system capabilities in these conditions.

Although we cannot measure the use image directly, the existence of an anthropomorphic use image predicts the presence of more human-like and capability-exploiting responses when more human-like cues are provided (see Figure 4).

H3. Within the “high capabilities” conditions4, the existence of an anthropomorphic use image suggests that users in the “human-like appearance cues” and the “human-like appearance and capability cues” conditions exhibit more human-like and capability-exploiting responses than in the “computer-like cues” condition.

3 Note that the cue can be present regardless of the actual capability and vice versa, as cues and capabilities are varied independently.

4 In the “low capabilities” conditions the system provides feedback in the form of errors when the user tries to exploit nonexistent capabilities. This may lead the users to adjust their use image and their behavior accordingly. Therefore, hypotheses 3 and 4 can be tested in the “high capabilities” conditions only.

Finally, if capability-exploiting responses occur even in the “human-like appearance cues” condition (the dark arrow in Figure 4), this would rule out the strictly one-to-one mapping between cues and responses that the compositional use image would predict, and thereby provide evidence for the integrated use image theory. Therefore:

H4. If the use image is integrated, then, within the “high capabilities” conditions, users exhibit more capability-exploiting responses in the “human-like appearance cues” condition than in the “computer-like cues” condition.

<FIGURE 4 ABOUT HERE>

5.1. Procedure

For our experiment we created an agent-based system for requesting travel information for the Netherlands Railways (NS). The specific capabilities and cues of the system in the 2x3 between-subjects design are described in Figure 5 and Figure 6 respectively.

<FIGURE 5 ABOUT HERE>

<FIGURE 6 ABOUT HERE>

5.2. Participants

The experiment was conducted entirely over the Internet, with university students from all over the Netherlands participating from their home computers. 92 participants (35 male, 57 female; age M = 21.8, SD = 3.55) took part in the experiment. Since only the high capabilities condition allows us to test the mechanisms of use image construction (hypotheses 3 and 4), 59 participants were assigned to the “high capabilities” condition and only 33 to the “low capabilities” condition (the latter being used only for testing hypotheses 1 and 2).

Because the systems needed to support natural language input, a Wizard of Oz technique was used (see Figure 7): users were led to believe that they were interacting with a real system, but actually the experimenter read their inputs and provided the system responses5. This pretense proved to be credible: when asked in a post-test questionnaire, almost all of the participants believed that the system worked without any human interference.

5 To the best of our knowledge, this is the first Wizard of Oz experiment ever conducted over the Internet.

Users were told that they were testing an online prototype of a new handheld travel companion for the Netherlands Railways. All users performed four predefined tasks using the system6, e.g.: “You are in Eindhoven and you want to go to Tilburg. Do you have to switch trains anywhere?” The tasks could be performed by typing questions to the system (“requests”), which would then respond with an answer. Requests were assessed by the experimenter using a strict protocol, and the provided answers were standardized in templates to keep both experimenter interference and response times to a minimum7. The tasks, as well as all conversation with the agent, were conducted in Dutch.

<FIGURE 7 ABOUT HERE>
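To make the procedure concrete, the sketch below shows one way a Wizard-of-Oz relay with standardized answer templates and a fixed apparent processing time could be organized. The template wording and function names are hypothetical; this is not the software used in the study, which only reports a strict assessment protocol, templated answers, and a 9-second response timer (footnote 7).

```python
# Hypothetical sketch of a Wizard-of-Oz response flow: the experimenter (wizard)
# selects a standardized answer template, and the reply is only released once a
# fixed apparent "processing time" (9 seconds in the study) has elapsed.
import time

ANSWER_TEMPLATES = {
    "journey": "The next train from {origin} to {destination} departs at {departure}.",
    "error": "I did not understand your request. Please state a departure and a destination.",
}

PROCESSING_TIME_S = 9  # constant apparent processing time per response

def wizard_reply(request_received_at: float, template_key: str, **slots) -> str:
    """Fill the chosen template and release it no earlier than PROCESSING_TIME_S
    after the user's request arrived, keeping response timing constant."""
    answer = ANSWER_TEMPLATES[template_key].format(**slots)
    remaining = PROCESSING_TIME_S - (time.time() - request_received_at)
    if remaining > 0:
        time.sleep(remaining)
    return answer

if __name__ == "__main__":
    received_at = time.time()
    # The wizard assesses the typed request against a strict protocol and then
    # picks the appropriate template and fills in its slots.
    print(wizard_reply(received_at, "journey",
                       origin="Eindhoven", destination="Tilburg", departure="14:32"))
```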

5.3. Measures

In order to test our expectations, we needed to measure the presence or absence of human-like and capability-exploiting responses that our participants provided during the interaction, as well as the overall usability of the interaction.

Human-like responses are behaviors that typically occur in interaction between two humans, and that do not occur when a human is interacting with a computer. Brennan (1991) found that participants used more first-person references (“I”, “me”) when talking to humans compared to when they were talking to computers. Researchers have found that participants also used significantly longer sentences when talking to other humans compared to when they were talking to computers (Rosé & Torrey, 2005; Shechtman & Horowitz, 2003; Richards & Underwood, 1984). Thompson (1980) found that participants used more grammatically correct sentences when talking to humans compared to when they were talking to computers. We therefore use the number of first-person references, the number of words per request, and the grammatical correctness of the requests (full, correct sentences versus command-style language) as measures of human-like responses.
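As an illustration of how the first two of these measures can be derived from a typed request, consider the following minimal sketch. The Dutch pronoun list is a hypothetical example, and grammatical correctness is a judgment this sketch does not attempt to automate.

```python
# Illustrative sketch (not the study's coding procedure): counting first-person
# references and words per typed chat request.
import re

# Hypothetical list of Dutch first-person pronouns ("I", "me", "my").
FIRST_PERSON = {"ik", "mij", "me", "mijn"}

def words(request: str) -> list[str]:
    """Tokenize a request into lowercase words."""
    return re.findall(r"\w+", request.lower())

def first_person_references(request: str) -> int:
    """Number of first-person pronouns in the request."""
    return sum(1 for w in words(request) if w in FIRST_PERSON)

def words_per_request(request: str) -> int:
    """Total number of words in the request (a verbosity measure)."""
    return len(words(request))

if __name__ == "__main__":
    example = "Ik wil graag weten hoe laat mijn trein naar Tilburg vertrekt"
    print(first_person_references(example), words_per_request(example))  # 2 11
```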

We define capability-exploiting responses as behaviors that exploit the typical capabilities of a human conversation partner; specifically, the linguistic capability to understand implicit references, which is commonly not available in computers. This capability includes understanding “anaphora” (a reference to an earlier question/sentence), “time deixis” (a reference to the current date or time), “place deixis” (a reference to the current place), and “multiple connected questions” (asking for multiple things within one request). Conversation partners can exploit these capabilities by using anaphora (referring to earlier questions), time deixis (using words like “now” or “tomorrow”, or not indicating a time, thereby implicitly meaning “now”), place deixis (using words like “here”, or not indicating a place, thereby implicitly meaning “here”), and by asking multiple questions at once (e.g., “how do I get to Amsterdam, and what does it cost?”). We used the presence of such user behaviors as measures of capability-exploiting responses.

6 Tasks were not counter-balanced, but the effect of task number was modeled as a within-subjects variable in our analyses.
7 A response timer was used to make the “processing time” of each response 9 seconds.
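Analogously, the sketch below shows how the capability-exploiting behaviors could be flagged in a single request with rough keyword heuristics. The Dutch keyword lists and detection rules are hypothetical illustrations; the study itself relied on the experimenter’s coding protocol, not on automatic detection.

```python
# Illustrative sketch only: flagging capability-exploiting behaviors (place deixis,
# time/date deixis, anaphora, multiple connected questions) with keyword heuristics.
# Keyword lists and rules are hypothetical, not the study's coding protocol.
import re

PLACE_DEIXIS = {"hier", "daar"}                       # "here", "there"
TIME_DEIXIS = {"nu", "straks", "vandaag", "morgen"}   # "now", "later", "today", "tomorrow"
ANAPHORA = {"die", "dat", "deze"}                     # "that (trip/ticket)", "this"

def capability_exploiting_flags(request: str) -> dict[str, bool]:
    """Return which capability-exploiting behaviors appear in one request."""
    tokens = set(re.findall(r"\w+", request.lower()))
    return {
        "place_deixis": bool(tokens & PLACE_DEIXIS),
        "time_deixis": bool(tokens & TIME_DEIXIS),
        "anaphora": bool(tokens & ANAPHORA),
        "multiple_questions": request.count("?") > 1 or " en " in request.lower(),
    }

def capability_exploiting_score(request: str) -> int:
    """Number of flagged behaviors, analogous to the per-task sum measure."""
    return sum(capability_exploiting_flags(request).values())

if __name__ == "__main__":
    example = "Hoe kom ik van hier naar Amsterdam, en wat kost dat?"
    print(capability_exploiting_flags(example), capability_exploiting_score(example))
```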

Usability is typically conceived as a multi-dimensional concept, including effectiveness (whether users are able to perform the task with the system or not), efficiency (the amount of time or number of actions users needed to accomplish a task), and satisfaction (a self-reported reflection of the users’ feelings with regard to using the system) (International Standards Organization [ISO], 1998). As participants performed multiple similar tasks with the system, we added learnability (the time it takes to learn to do a task efficiently, or the amount of efficiency that can be gained over time) to this list. We measured usability in our experiment by the following means:

• Discontinuation of the experiment (a proxy of ineffectiveness)

• The number of requests a user needed to make per task and the time per task (a proxy of efficiency)

• A satisfaction measure based on the “overall reactions to the software” section of the Questionnaire for User Interface Satisfaction (Chin, Diehl, & Norman, 1988)8, summing the five nine-point scale items (current reliability: α = .879).

• The difference in time per task between the first and the last task (a proxy of learnability).
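As a computational sketch of the satisfaction and learnability measures (the data layout and variable names below are assumptions for illustration, not the study’s analysis code), Cronbach’s alpha for the five satisfaction items and the first-to-last-task time difference could be computed as follows:

```python
# Hypothetical sketch of the satisfaction and learnability measures.
# `items` is an (n_participants x 5) array of nine-point QUIS-style ratings;
# `task_times` is an (n_participants x 4) array of task completion times in seconds.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a participants-by-items rating matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def satisfaction_scores(items: np.ndarray) -> np.ndarray:
    """Satisfaction = sum of the five nine-point scale items per participant."""
    return items.sum(axis=1)

def learnability(task_times: np.ndarray) -> np.ndarray:
    """Learnability proxy = decrease in time from the first to the last task."""
    return task_times[:, 0] - task_times[:, -1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_items = rng.integers(1, 10, size=(30, 5)).astype(float)  # fabricated demo ratings
    demo_times = rng.uniform(60, 300, size=(30, 4))               # fabricated demo timings
    print(round(cronbach_alpha(demo_items), 2))
    print(satisfaction_scores(demo_items)[:3], learnability(demo_times)[:3])
```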

6. RESULTS

6.1. Low versus High Capabilities

As a manipulation check, we first wanted to confirm that the system with high capabilities was actually more usable than the system with low capabilities (hypothesis 1). As, portrayed in Figure 8, users needed significantly fewer requests per task in the high compared to the low-capabilities conditions and were also significantly more satisfied, indicating that the highly capable system was indeed more usable (in terms of efficiency and satisfaction) than the less capable system.

<FIGURE 8 ABOUT HERE>

Interestingly, the average time needed to perform the tasks was actually higher (although not significantly) in the high capabilities condition compared to the low capabilities condition. This supports the often-voiced opinion in the field of human-computer interaction that speed and usability are not the same thing; a slower system is not necessarily less usable.

8 The questionnaire can be found at http://hcibib.org/perlman/question.cgi?form=QUIS. We excluded item 4 (inadequate power – adequate power), because it would be confounded by our “system capabilities” manipulation.
9 A factor analysis (unweighted least squares extraction) provided a single-factor solution that explained 58% of the variance, with communalities ranging between .37 and .80, and correlation residuals not exceeding .071.

6.2. Overestimation

In line with Hypothesis 2, we expected the mismatch between cues and capabilities in the “low capabilities, human-like appearance cues” and the “low capabilities, human-like appearance and capability cues” conditions to result in overestimation. As an indicator of overestimation, we expected these conditions to have a significantly lower usability than the “low capabilities, computer-like cues” condition.

Strong evidence for overestimation was found in terms of system effectiveness: 5 of the 23 participants interacting with a system with low capabilities but human-like cues (and none of the participants with computer-like cues) prematurely quit the experiment. Two participants in the “low capabilities, human-like appearance cues” condition and three participants in the “low capabilities, human-like appearance and capability cues” condition were not able to adapt their questions to the limited capabilities of the system. Probably frustrated with the numerous error messages they encountered, they either closed their browsers or skipped all remaining tasks to end the experiment. These users typically completed no tasks, or just a single task, before giving up.

Additional evidence for the overestimation effect was found in terms of learnability. A factorial ANOVA on the decrease in time10 between the first task and the last task was performed with capabilities and cues as independent variables. First, a main effect of capabilities showed that, across cue conditions, users in the low capabilities conditions exhibited a stronger decrease in time used per task than users in the high capabilities conditions (M decrease = 58.2 seconds in the low versus 9.3 seconds in the high capabilities conditions; F(1,83) = 8.56, p < .005, partial η² = .098). This result indicates that there was less need for learning in the high capabilities system.

More importantly, the interaction between capabilities and cues was significant (F(2,82) = 3.95, p < .05, partial η² = .091), which indicates that the strongest decrease in time (our measure of learnability) occurred in the “low capabilities, computer-like cues” condition. As shown in Figure 9 below, within the low capabilities conditions, users of the computer-like interface showed a larger time decrease (i.e., learned faster) than users of the human-like systems.

<FIGURE 9 ABOUT HERE>
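A sketch of how such a 2 x 3 factorial ANOVA on the time decrease could be specified is given below; the paper does not report which statistical software was used for this analysis, and the column names and demo data are purely illustrative.

```python
# Hypothetical sketch of the 2 (capabilities) x 3 (cues) factorial ANOVA on the
# decrease in task time; column names and demo values are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def factorial_anova(df: pd.DataFrame) -> pd.DataFrame:
    """Fit time_decrease ~ capabilities * cues and return the ANOVA table."""
    model = smf.ols("time_decrease ~ C(capabilities) * C(cues)", data=df).fit()
    return anova_lm(model, typ=2)  # Type II sums of squares

if __name__ == "__main__":
    # Tiny fabricated demo frame just to show the call; not the study's data.
    demo = pd.DataFrame({
        "capabilities": ["low", "low", "high", "high"] * 6,
        "cues": ["computer"] * 8 + ["appearance"] * 8 + ["appearance+capability"] * 8,
        "time_decrease": [50, 60, 10, 5, 40, 55, 12, 8] * 3,
    })
    print(factorial_anova(demo))
```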

Note that according to the findings above, overestimation occurred not only in the “human-like appearance and capability cues” condition, but also in the “human-like appearance cues” condition. This could suggest that participants tried to exploit nonexistent capabilities even when the system gave no explicit (one-to-one) cues of these capabilities. This provides preliminary evidence for the existence of an integrated use image, which we will test in more detail in the next sections.

10 A repeated measures test on the time decrease over all four tasks provided similar results.

6.3. Existence of an Anthropomorphic Use Image

In order to answer our main question of whether the use image in agent-based interaction is integrated, we first need to establish the existence of an anthropomorphic use image. In accordance with Hypothesis 3, this can be done by measuring the absence or presence of human-like and capability-exploiting responses in the participants’ interaction. As indicated before (see footnote 4), we can perform these tests only on the high capabilities systems, because the direct feedback (errors) in the low capabilities systems mitigates these behavioral differences.

Human-like Responses

In line with Hypothesis 3, we tested whether the occurrence of human-like responses in the “high capabilities” conditions increased with cue-level (going from computer-like cues, to human-like appearance cues, to human-like appearance and capability cues). Each participant performed four tasks on the interface, which provided us with repeated observations of each participant. As dependent measures of human-like responses we tested the number of first-person references, total number of words per request and the grammatical correctness of the requests. See Figure 10 for overall and task-specific numbers of human-like responses for these three dependent measures.

These measures differ in their underlying scale (a count, a continuous measure, and a nominal measure, respectively). To analyze the data we therefore used mixed (Poisson, linear, and nominal) regressions with random intercepts, using the Mixor statistical package (Hedeker & Gibbons, 1996). As independent factors we used cue-level and task number. Since the right half of Figure 10 reveals substantial effects of task on most measures, we included linear and quadratic task terms in the regressions to account for these inherent task variations. As these task effects are not central to our question of how cue-level affects human-like responses, we will concentrate our discussion on the effects of cue-level. Further details can be found in the appendix.
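The original models were fit with the Mixor package. Purely as an illustration, the linear case (words per request) could be specified as a random-intercept mixed model in Python as follows; the column names and demo data are assumptions, and the Poisson (first-person references) and nominal (grammatical correctness) models would require different estimators.

```python
# Hypothetical sketch of a random-intercept mixed regression for words per request,
# with cue level plus linear and quadratic task terms as fixed effects.
# The study's models were fit with Mixor; this merely illustrates the model structure.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def words_per_request_model(df: pd.DataFrame):
    """Mixed linear model: words ~ cue_level + task + task^2, random intercept per participant."""
    model = smf.mixedlm(
        "words ~ cue_level + task + I(task ** 2)",
        data=df,
        groups=df["participant"],  # random intercept for each participant
    )
    return model.fit()

if __name__ == "__main__":
    # Tiny fabricated demo frame just to show the call; not the study's data.
    rng = np.random.default_rng(1)
    demo = pd.DataFrame({
        "participant": np.repeat(np.arange(20), 4),
        "cue_level": np.repeat(rng.integers(0, 3, size=20), 4),  # 0, 1, or 2
        "task": np.tile(np.arange(1, 5), 20),
    })
    demo["words"] = 5 + 2 * demo["cue_level"] + rng.normal(0, 2, size=len(demo))
    print(words_per_request_model(demo).params)
```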

The number of first-person references increased with cue level, as shown in Figure 10a. This was corroborated by a significant effect of cue-level in a Poisson mixed regression (regression A in the appendix; β_cues = 1.16, SE = 0.131, z = 8.87, p < .0001).

Figure 10b shows that the more human-like the cues, the more words per chat request were used. Indeed, a significant effect of cue-level was found in a mixed linear regression (regression B in the appendix; β_cues = 2.50, SE = 0.80, z = 3.09, p < .005).

Finally, we found that the more human-like the cues, the better the grammatical correctness, as shown in Figure 10c. A nominal mixed regression (regression C in the appendix) indicated a significant effect of cue-level (β_cues = 6.47, SE = 2.05, z = 3.15, p < .005).

In conclusion, we found evidence for the existence of an anthropomorphic use image (Hypothesis 3), as several human-like responses occurred significantly more often when the system provided more human-like appearance and capability cues.

<FIGURE 10 ABOUT HERE>

Capability-exploiting Responses

Capability-exploiting responses were measured by the presence or absence of specific behaviors that exploit typically human-like linguistic capabilities, specifically:

• Implicit references to place

• Implicit references to dates

• Implicit references to time

• Implicit references to earlier requests

• Asking multiple connected questions at once

A sum measure of these five possible capability-exploiting responses was taken for each task. As can be seen in Figure 11, the number of capability-exploiting responses depends on the cue level, and varies by task. It should also be noted that even in the “computer-like cues” condition, the number of observed capability-exploiting responses is unexpectedly high. This is probably because our interface allowed users to type anything into the text field, which for some participants seemed to be a prompt in itself to try to exploit as many system capabilities as possible. More importantly, however, the number of capability-exploiting responses in the other conditions is significantly higher.

A linear mixed regression on the sum of the capability-exploiting responses was performed (regression D in the appendix), with cue-level and task number as independent factors. The effect of cue-level was significant: the more human-like the cues, the more capability-exploiting responses were used (β_cues = 0.288, SE = 0.114, z = 2.52, p < .05).

In conclusion, these results provide further evidence for the existence of an anthropomorphic use image (Hypothesis 3), as the total number of capability-exploiting responses was significantly higher when the system provided more human-like appearance and capability cues.

<FIGURE 11 ABOUT HERE>

6.4. Integrated Use Image

The integrated use image theory predicts that capability-exploiting responses can be induced even without human-like capability cues (i.e. with human-like appearance cues alone). In such a case, a compositional use image theory is ruled out, since this theory predicts that capability cues would be needed to induce capability-exploiting responses. In other words, the compositional use image theory predicts that the capability-exploiting responses in the “human-like appearance cues” condition are as low as in the “computer-like cues” condition, while the integrated use image theory predicts that they are as high as in the “human-like appearance and capability cues” condition.

As can be seen in Figure 11, the capability-exploiting responses do increase between “human-like appearance cues” and “human-like appearance and capability cues”, but the increase between the “computer-like cues” and the “human-like appearance cues” systems is larger. An ANOVA11 on the number of capability-exploiting responses with a planned contrast between “computer-like cues” and the other two conditions provided significant results (t(56) = 2.41, p < .05, r = .31). Moreover, the contrast between “human-like appearance cues” and “human-like appearance and capability cues” was not significant (p > .05). This is consistent with the integrated use image theory. By comparison, the alternative hypothesis of a compositional use image would suggest that a gap exists between the first two cue conditions and the last one: (“computer-like cues” + “human-like appearance cues”) versus “human-like appearance and capability cues”. This contrast was not significant (p > .05).
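For completeness, a planned contrast of this kind (the “computer-like cues” condition versus the average of the two human-like cue conditions, using the pooled within-group variance) can be computed roughly as sketched below; the per-participant group data shown are placeholders, not the study’s data.

```python
# Hypothetical sketch of a planned contrast on per-participant counts of
# capability-exploiting responses: "computer-like cues" versus the average of the
# two human-like cue conditions, with pooled within-group error variance.
import numpy as np
from scipy import stats

def planned_contrast(groups: list, weights: list):
    """t-test for sum(w_i * mean_i) using pooled error variance; returns (t, df, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    means = np.array([np.mean(g) for g in groups])
    ss_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    df = int(n.sum() - k)
    mse = ss_within / df
    contrast = float(np.dot(weights, means))
    se = np.sqrt(mse * np.sum(np.array(weights) ** 2 / n))
    t = contrast / se
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

if __name__ == "__main__":
    # Placeholder per-participant totals of capability-exploiting responses.
    computer = [2, 3, 1, 4, 2, 3]
    appearance = [4, 5, 3, 6, 5, 4]
    appearance_capability = [5, 4, 6, 5, 6, 5]
    # Weights contrast the two human-like conditions (averaged) against computer-like cues.
    print(planned_contrast([computer, appearance, appearance_capability], [-1.0, 0.5, 0.5]))
```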

Thus, in line with Hypothesis 4, there is a significant increase in capability-exploiting responses between the system with “computer-like cues” and the other two conditions, meaning that even with human-like appearance cues alone, people show more capability-exploiting responses than with computer-like cues. Apparently, these cues can elicit cue-unrelated responses. This suggests that the cues do not each elicit a specific response, but that they are integrated into a complex use image that can trigger any kind of response consistent with this use image, even outside the range of the provided cues (see Figure 4). This disproves the compositional use image theory for agent-based interfaces, and provides evidence for the integrated use image theory.

7. CONCLUSION AND DISCUSSION

In this study, we argued that users of human-like systems anthropomorphize the system and thereby infer a use image of human-like intelligence. We set out to show that these systems therefore incite human-like and capability-exploiting responses (H3). Results from the experiment show that users of human-like systems show more human-like and capability-exploiting responses than users of computer-like systems. This results in more first-person references, longer sentences, and more grammatically correct sentences. Users of human-like systems also (try to) make more use of human-like capabilities, such as implicit references to context and earlier parts of the conversation, and asking multiple questions at the same time.

We argued that some systems may not be able to handle such human-like and capability-exploiting responses. In these cases the actual system capabilities would be overestimated by the user, which would result in a reduction in system usability (H2). Results from the experiment indeed show that with low system capabilities and human-like cues, overestimation occurs and causes reductions in effectiveness and learnability. Effectiveness is dramatically reduced when cues indicate human-like intelligence while the system has low capabilities; participants may even give up entirely when they overestimate the system. Learnability is also better when cues match system capabilities. If a system has limited capabilities, only with computer-like cues will users eventually learn to use the system in a significantly faster way.

11 A mixed linear regression (see the section “Capability-exploiting Responses”) provided similar results.

Finally, we hypothesized that the constructed use image would not be compositional like in “normal” human-computer interaction (Brinkman, 2003), but that it would instead be integrated (H4). We set out to prove this “integratedness” by showing how certain cues could elicit unrelated responses. Results from the experiment show that “human-like appearance cues” (i.e. cues not directly related to capabilities) alone increase the number of capability-exploiting responses. This suggests that there is not a one-to-one mapping from cues to responses, but that these responses are instead mediated by an underlying integrated use image.

7.1. Discussion

Our results can be explained by the realization that human-like agents are a metaphor. Metaphors are an easy way to instill a complex, integrated use image, because capabilities are not directly related to cues, but can be inferred. This anthropomorphic use image may however cause the user to overestimate the system’s capabilities, which reduces the usability of the system. The results of our experiment confirm these statements. In a system using an agent-based interface, the usability is higher when the system is more capable. In our system with low capabilities, however, the usability decreases even further when the agent looks more human-like. This effect is caused by the anthropomorphic use image: the user thinks such a human-like agent possesses some form of human-like intelligence, and thereby overestimates the actual system capabilities.

The use image created by the agent leads the users to respond in a human-like fashion and to exploit the human-like capabilities that the user believes the system to have. For usable agent-based interaction, each cue must therefore be delicately tuned to instill the right beliefs; otherwise it will inadvertently lead the user to overestimate the capabilities of the system.

Moreover, this use image is integrated, meaning that the full set of provided cues is integrated into a single use image, which in turn produces responses that need not be directly related to the provided cues. Specifically, capability-exploiting responses can be induced not only by capability cues, but even by appearance cues alone. This last finding is the most important one, because it explains why overestimation problems are so hard to fix: capabilities are inferred rather than directly tied to individual cues. The integrated use image makes finding the right set of cues a matter of trial and error, and it might well be that no configuration of cues avoids overestimation, underestimation, or even both at the same time.

Agent-based interaction thus provides a powerful metaphor, but this metaphor is too powerful for most of our current systems. These systems cannot live up to the expectations elicited by the agent-based metaphor, and therefore suffer from poor usability.

Our findings have specific implications for several fields related to agent-based interaction. From the point of view of social psychologists, we can contend that humans interact with agent-based systems in much the same way as they interact with other humans. They use the same kinds of quick heuristic inferences to form an understanding of their interaction partner, and they act accordingly. Further research can establish the
effects of certain established strategies for self-presentation on agent-based interaction (Bailenson & Yee, 2005). The effects of appearing competent, likable, or powerful have been well-researched in social psychology (see Kenrick, Neuberg and Cialdini, 2007, chapter 4 for a review); our results suggest that these strategies should work for agents too. Conversely, agents can be used to study social psychological phenomena as well (Blascovich et al., 2002; Reeves & Nass, 1996).

Human-computer interaction researchers should realize that their conventional theories and methods for usability testing may not work on agent-based systems, because such systems do not adhere to the conventional (compositional) use image theory postulated by Norman (1986). For instance, the results of modular usability tests of agent-based interfaces cannot simply be combined, since it is impossible to present part of the functionality without affecting the other parts. Methods like Heuristic Evaluation (Nielsen, 1994), which inspect each interface widget and reason about its usability, are less suitable for agent-based interfaces, since it is not the widgets that determine the use image.

The findings also suggest that interaction designers have to be very careful when using a human-like agent in their system designs. The advantage of an agent-based system is that a complex, integrated use image can be created instantly: there is no need for numerous buttons and labels; a cartoon character alone is enough to invoke all kinds of expectations. The designer can play with this too, as we did in our railway system by giving the agent the traditional Netherlands Railways outfit. Even changing the agent's gender can have subtle but significant consequences (Forlizzi et al., 2007). The problem, however, is that the use image is very hard to control. Because the use image is integrated, it is hard to switch a small part of it "on" or "off". To prevent usability issues, the system would have to provide every bit of functionality a user might infer from the agent's appearance. As this is often impossible due to technological constraints, overestimation is likely to occur, and usability may be significantly reduced. It is likely that the other benefits of using agents (attractiveness, fun, social facilitation) do not outweigh the negative effects on usability.

There are, however, some interesting solutions to try before abandoning the human-like agent altogether. One could, for instance, use a dog character, which would probably elicit more command-like user language (Diederiks, 2002). It is also well established in social psychology that someone who wants to provide help should focus on appearing likable rather than competent (Casciaro & Lobo, 2005); if users like the agent, they will be more willing to forgive the mistakes it makes.

Our final consideration places the specific issues explained above in a broader perspective. We have shown the overestimation effect, shown that it is caused by an integrated use image, and shown that the "integratedness" of this use image makes it hard to change or amend. On the one hand, this leads us to conclude that, given the current state of the art in natural language processing and inferencing, it might be better not to use any human-like characters in agent-based systems. On the other hand, we believe that in the (near) future it might be possible to make very usable human-like agents. This, however, calls for very sophisticated fine-tuning of the use image. Such fine-tuning projects need computer scientists and artificial intelligence specialists who can
develop smarter systems, social psychologists who know the full range of self-presentation techniques, interaction designers who can subtly build these techniques into their characters, and human-computer interaction researchers who can test the correctness of the resulting use image with actual users. Probably only such a multidisciplinary team can bring about a paradigm shift from graphical user interfaces to agent-based interfaces.


8. APPENDIX: TABLES WITH DETAILED RESULTS

The following tables display the results of the mixed regressions used to measure the effect of cues on human-like and capability-exploiting responses.

8.1. Regression A: First Person References

A mixed Poisson regression on the number of first-person references, with task number (both linear and squared) and cues as predictors.

Variable          Estimate   Std. Error   Z        p-value
Intercept         -2.43      0.230        -10.55   0.000
Cues              1.16       0.131        8.87     0.000
Task              -0.11      0.054        -2.05    0.040
Cues * Task       0.14       0.064        2.26     0.024
Task²             0.21       0.108        1.97     0.049
Cues * Task²      -0.08      0.130        -0.61    0.542
Random-effect variance term (expressed as a standard deviation):
Intercept         1.91       0.163        11.74    0.000*
* 1-tailed p-value
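For readers who want to fit a comparable model to their own data, the following Python sketch illustrates the structure of Regression A. It is not the analysis reported above: all column names (participant, cues, task, first_person) and the synthetic data are hypothetical, and the plain Poisson GLM shown here omits the per-participant random intercept of the mixed model (which would require a Poisson GLMM, e.g., the kind of model MIXOR-style software estimates).

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Synthetic stand-in data: one row per chat request (all names hypothetical).
    rng = np.random.default_rng(0)
    n = 400
    df = pd.DataFrame({
        "participant": rng.integers(0, 80, n),   # participant identifier
        "cues": rng.integers(0, 2, n),           # 0 = computer-like, 1 = human-like cues
        "task": rng.integers(1, 5, n),           # task number 1-4
    })
    df["first_person"] = rng.poisson(np.exp(-2 + 1.0 * df["cues"]))  # count outcome

    # Fixed-effects-only Poisson regression with the same predictors as Regression A;
    # the mixed model in the table additionally has a per-participant random intercept.
    model = smf.glm("first_person ~ cues * task + cues * I(task ** 2)",
                    data=df, family=sm.families.Poisson())
    print(model.fit().summary())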

8.2. Regression B: Words per Chat Request

A mixed linear regression on the number of words per chat request, with task number (both linear and squared) and cues as predictors.

Variable            Estimate   Std. Error   Z       p-value
Intercept           7.99       0.666        11.99   0.000
Cues                2.50       0.809        3.09    0.002
Task                0.31       0.087        3.55    0.000
Cues * Task         0.22       0.106        2.09    0.037
Task²               0.96       0.194        4.92    0.000
Cues * Task²        0.38       0.236        1.59    0.111
Random-effect variance & covariance term(s):
Intercept           23.95      4.826        4.96    0.000*
Residual variance   8.92       0.948        9.41    0.000*
* 1-tailed p-value
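The linear mixed regressions (B and, analogously, D) can be approximated more directly, since statsmodels offers random-intercept linear mixed models. Again, the column names and synthetic data in this sketch are hypothetical; only the model structure (cues, task, task squared, their interactions, and a per-participant random intercept) mirrors the table above.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in data: one row per chat request (all names hypothetical).
    rng = np.random.default_rng(1)
    n = 400
    df = pd.DataFrame({
        "participant": rng.integers(0, 80, n),
        "cues": rng.integers(0, 2, n),
        "task": rng.integers(1, 5, n),
    })
    df["words"] = 8 + 2.5 * df["cues"] + rng.normal(0, 3, n)  # words per chat request

    # Linear mixed model with a random intercept per participant, structured like
    # Regression B (and like Regression D, with a different outcome variable).
    model = smf.mixedlm("words ~ cues * task + cues * I(task ** 2)",
                        data=df, groups=df["participant"])
    print(model.fit().summary())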


8.3. Regression C: Grammatical Correctness

A mixed nominal regression on grammatical correctness, with task number (both linear and squared) and cues as predictors.

Variable          Estimate   Std. Error   Z       p-value
Intercept         1.96       0.826        2.38    0.017
Cues              6.47       2.054        3.15    0.002
Task              -0.75      0.311        -2.41   0.016
Cues * Task       -0.50      0.274        -1.82   0.069
Task²             -0.03      0.584        -0.05   0.957
Cues * Task²      -0.77      0.747        -1.04   0.300
Random-effect variance term (expressed as a standard deviation):
Intercept         9.30       2.910        3.19    0.001*
* 1-tailed p-value
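Regression C models a categorical outcome (whether a request was grammatically correct). As a rough, hypothetical illustration, the sketch below fits an ordinary logistic regression with the same predictors; the mixed nominal regression reported above additionally contains a per-participant random intercept (the kind of model estimated by MIXOR; Hedeker & Gibbons, 1996).

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in data: one row per chat request (all names hypothetical).
    rng = np.random.default_rng(2)
    n = 400
    df = pd.DataFrame({
        "participant": rng.integers(0, 80, n),
        "cues": rng.integers(0, 2, n),
        "task": rng.integers(1, 5, n),
    })
    # 1 = grammatically correct request, 0 = not (synthetic outcome).
    p = 1 / (1 + np.exp(-(0.5 + 2.0 * df["cues"] - 0.5 * df["task"])))
    df["correct"] = rng.binomial(1, p)

    # Fixed-effects-only logistic regression with the same predictors as Regression C;
    # the mixed model in the table additionally has a per-participant random intercept.
    model = smf.logit("correct ~ cues * task + cues * I(task ** 2)", data=df)
    print(model.fit().summary())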

8.4. Regression D: Capability-exploiting Responses

A mixed linear regression on the number of capability-exploiting responses used (connectedness + context of date + context of time + context of place + multiple questions), with task number (both linear and squared) and cues as predictors.

Variable            Estimate   Std. Error   Z       p-value
Intercept           2.22       0.094        23.71   0.000
Cues                0.29       0.114        2.52    0.012
Task                0.01       0.022        0.59    0.556
Cues * Task         0.01       0.026        0.48    0.633
Task²               -0.34      0.048        -7.12   0.000
Cues * Task²        -0.01      0.059        -0.21   0.831
Random-effect variance & covariance term(s):
Intercept           0.38       0.097        3.95    0.000*
Residual variance   0.55       0.058        9.41    0.000*
* 1-tailed p-value


NOTES

Acknowledgments. We would like to thank Steven Langerwerf for his help implementing the online experiment. We would like to thank Don Bouwhuis, Wijnand IJsselsteijn, Florian Kaiser, and Iris van Rooij for their thorough review of several early drafts of the paper.

Authors’ Present Addresses.

Bart P. Knijnenburg Eindhoven University of Technology, Human-Technology Interaction group P.O. Box 513, 5600 MB Eindhoven Tel: +31 40 247 8420 Email: [email protected]

Martijn C. Willemsen Eindhoven University of Technology, Human-Technology Interaction group P.O. Box 513, 5600 MB Eindhoven Tel: +31 40 247 2561 Email: [email protected]

HCI Editorial Record. (supplied by Editor)


REFERENCES

Bailenson, J. N., & Yee, N. (2005). Digital Chameleons: Automatic Assimilation of Nonverbal Gestures in Immersive Virtual Environments. Psychological Science, 814-819.

Blascovich, J., Loomis, J., Beall, A. C., Swinth, K. R., Hoyt, C. L., & Bailenson, J. N. (2002). Immersive Virtual Environment Technology as a Methodological Tool for Social Psychology. Psychological Inquiry, 103-124.

Bradshaw, J. (1997). An Introduction to Software Agents. In J. Bradshaw (Ed.), Software Agents (pp. 3-46). London: AAAI Press.

Brennan, S. (1991). Conversation With and Through Computers. User Modeling and User-Adapted Interaction, 1, 67-86.

Brinkman, W.-P. (2003). Is usability compositional? Eindhoven: Technische Universiteit Eindhoven.

Brinkman, W.-P., Haakma, R., & Bouwhuis, D. G. (2007). Towards an empirical method of efficiency testing of system parts: A methodological study. Interacting with Computers, 342-356.

Casciaro, T., & Lobo, M. S. (2005, June). Competent Jerks, Lovable Fools, and the Formation of Social Networks. Harvard Business Review, pp. 92-99.

Chin, J., Diehl, V., & Norman, K. (1988). Development of an Instrument Measuring User Satisfaction of the Human-Computer Interface. CHI'88 Proceedings (pp. 213-218). Washington, DC, USA: ACM.

Chin, M. G., Sims, V. K., Upham Ellis, L., Yordon, R. E., Clark, B. R., Ballion, T., et al. (2005). Developing an Anthropomorphic Tendencies Scale. Human Factors and Ergonomics Society Annual Meeting Proceedings, 1266-1268.

Cook, J., & Salvendy, G. (1989). Perception of computer dialogue personality: An exploratory study. International Journal of Man-Machine Studies, 31, 717-728.

De Laere, K., Lundgren, D., & Howe, S. (1998). The Electronic Mirror: Human-Computer Interaction and Change in Self-Appraisals. Computers in Human Behavior, 14(1), 43-59.

Dehn, D. M., & van Mulken, S. (2000). The impact of animated interface agents: a review of empirical research. International Journal of Human-Computer Studies, 1-22.

Dennett, D. (1987). The Intentional Stance. Cambridge, MA: MIT Press.


Dey, A. K. (2001). Understanding and Using Context. Personal and Ubiquitous Computing, 4-7.

Dharna, N., Thomas, S., & Weinberg, J. (2001). The Effects of Animated Characters with Human Traits on Interface Usability and User’s Perception of Social Interactions. Retrieved from www.users.muohio.edu/birchmzp/csi/dharna.pdf.

Diederiks, E. M. (2002). Do TV’s listen better than dogs?: The effects of an animated character as an interface intermediary for voice-controlled interaction. Proceedings of the AISB’02 symposium on animating expressive characters for social interactions (pp. 17-20). London, UK: SSAISB.

Erickson, T. (1997). Designing Agents as if People Mattered. In J. Bradshaw (Ed.), Software Agents (pp. 79-96). London: AAAI Press.

Forlizzi, J., Zimmerman, J., Mancuso, V., & Kwak, S. (2007). How Interface Agents Affect Interaction Between Humans and Computers. Proceedings of the 2007 conference on Designing pleasurable products and interfaces (pp. 209-221). New York, NY, USA: ACM.

Halliday, M. A., & Hasan, R. (1976). Cohesion in English. London, UK: Longman Group Ltd.

Hedeker, D., & Gibbons, R. D. (1996). MIXOR: a computer program for mixed-effects ordinal regression analysis. Computer Methods and Programs in Biomedicine, 49(2), 157-176.

International Standards Organization. (1998). Ergonomic requirements for office work with visual display terminals (VDTs) - Part 11: Guidance on usability.

Kenrick, D. T., Neuberg, S. L., & Cialdini, R. B. (2007). Social Psychology: Goals in interaction (4th ed.). Boston: Allyn and Bacon.

Lanier, J. (1995). Agents of Alienation. Journal of Consciousness Studies, 76-81.

Laurel, B. (1990). Interface agents: Metaphors With Character. In B. Laurel (Ed.), The Art of Human-Computer Interface Design. Reading, MA: Addison-Wesley.

Levinson, S. C. (1983). Pragmatics. Cambridge, UK: Cambridge University Press.

Murray, D., & Bevan, N. (1985). The social psychology of computer conversations. Human-Computer Interaction - INTERACT'84. New York, NY: Elsevier.

Negroponte, N. (1997). Agents: From Direct Manipulation to Delegation. In J. Bradshaw (Ed.), Software Agents (pp. 57-66). London: AAAI Press.


Nickerson, R. (1976). On conversational interaction with computers. Conference on Human Factors in Computing Systems (pp. 101-113). New York, NY, USA: ACM.

Nielsen, J. (1994). Usability Engineering. San Francisco, CA: Morgan Kaufmann.

Nilsson, N. (1995). Eye on the Prize. AI Magazine, 16, 9-17.

Norman, D. A. (1986). Cognitive engineering. In D. A. Norman & S. Draper (Eds.), User Centered System Design: New Perspectives on Human-Computer Interaction (pp. 31-61). Hillsdale, NJ: Lawrence Erlbaum Associates.

Norman, D. A. (1997). How might people interact with agents. In J. Bradshaw (Ed.), Software Agents (pp. 49-55). London: AAAI Press.

Norman, D. A. (1988). The Design Of Everyday Things. Basic Books.

Qiu, L., & Benbasat, I. (forthcoming). Evaluating Anthropomorphic Product Recommendation Agents: A Social Relationship Perspective to Designing Information Systems. Journal of Management Information Systems.

Quintanar, L., Crowell, C., & Moskal, P. (1987). The interactive computer as a social stimulus in human-computer interactions. In G. Salvendy, S. Sauter, & J. L. Hurrell (Eds.), Social, ergonomic, and stress aspects of work with computers. Amsterdam: Elsevier Science Publishers.

Reeves, B., & Nass, C. (1996). The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge: Cambridge University Press.

Richards, M., & Underwood, K. (1984). How Should People and Computers Speak to One Another? Proceedings, Interact '84: First IFIP Conference on 'Human-Computer Interaction' (pp. 33-36). London: International Federation for Information Processing.

Rosé, C., & Torrey, C. (2005). Interactivity and Expectation: Eliciting Learning Oriented Behavior with Tutorial Dialogue Systems. Human-Computer Interaction - INTERACT 2005 (pp. 323-336). Rome: Springer.

Shechtman, N., & Horowitz, L. (2003). Media Inequality in Conversation: How People Behave Differently When Interacting with Computers and People. Proc. CHI 2003. ACM Press.

Shneiderman, B. (1981). A Note on Human Factors Issues of Natural Language Interaction with Database Systems. Information Systems, 6(2), 125-129.


Shneiderman, B. (1997). Direct manipulation versus agents: Paths to predictable, controllable, and comprehensible interfaces. In J. Bradshaw (Ed.), Software Agents (pp. 97-106). London: AAAI Press.

Taylor, M. M. (1988). Layered protocol for computer-human dialogue. International Journal of Man-Machine Studies, 28, 175-257.

Thompson, B. (1980). Linguistic analysis of natural language communication with computers. Proceedings of the 8th International Conference on Computational Linguistics COLING 80. Tokyo.

Walker, J., Sproull, L., & Subramani, R. (1994). Using a Human Face in an Interface. Proc. CHI ’94 Human Factors in Computing Systems (pp. 85-91). Boston: ACM Press.

Williges, R., Williges, B., & Elkerton, H. (1987). Software Interface Design. In G. Salvendy (Ed.), Handbook of Human Factors. New York: John Wiley.

Winograd, T. (1972). Understanding natural language. Orlando, FL, USA: Academic Press, Inc.


FOOTNOTES

1. This can refer to the textual context, in terms of “endophoric references” (also called anaphora for backward references or cataphora for forward references), as well as to the situational context, in terms of “exophoric references” (also called deixis).

2. The responses are never completely unrelated to the cues, so with “unrelated” we here mean “not directly related”.

3. Note that the cue can be present regardless of the actual capability and vice versa, as cues and conditions are varied independently.

4. In the “low capabilities” conditions the system provides feedback in the form of errors when the user tries to exploit nonexistent capabilities. This may lead the users to adjust their use image and their behavior accordingly. Therefore, hypotheses 3 and 4 can be tested in the “high capabilities” conditions only.

5. To the best of our knowledge, this is the first Wizard of Oz experiment conducted over the Internet.

6. Tasks were not counter-balanced, but the effect of task number was modeled as a within-subjects variable in our analyses.

7. A response timer was used to fix the "processing time" of each response at 9 seconds.

8. The questionnaire can be found at http://hcibib.org/perlman/question.cgi?form=QUIS. We excluded item 4 (inadequate power – adequate power), because it would be confounded by our “system capabilities” manipulation.

9. A factor analysis (unweighted least squares extraction) provided a single factor solution that explained 58% of the variance, with communalities ranging between .37 and .80, and correlation residuals not exceeding .071.

10. A repeated-measures test on the time decrease over all four tasks provided similar results.

11. A mixed linear regression (see section "Capability-exploiting Responses") provided similar results.


FIGURE CAPTIONS

Figure 1. Norman’s gulfs of execution and evaluation.

Figure 2: Feedforward (cues) and feedback (actual capabilities) together constitute a use image (believed intelligence). The use image will make the users exploit the capabilities they believe the system has. It will also lead to more human-like responses.

Figure 3: Consequences of the “compositional use image theory” and the “integrated use image theory” on the effects of system cues on user response.

Figure 4: Our experiment uses three different levels of cues to test the “integrated use image theory” of the relation between system cues and user responses. If the theory holds, human-like appearance cues alone should be able to instill capability-exploiting user responses.

Figure 5: A detailed description of the distinction between the system with low capabilities and the system with high capabilities.

Figure 6: A detailed description of the three varieties of “system cues”, the different appearances of the system.

Figure 7: A schematic representation of the remote Wizard of Oz interaction protocol used in our experiment.

Figure 8: Mean values (and standard error) of usability metrics for low versus high capabilities systems.

Figure 9: Mean (and standard error) decrease in time (seconds) needed to perform the task between task 4 and task 1.

Figure 10: People use more first-person references (a), longer sentences (b), and more grammatically correct sentences (c) towards systems with more human-like cues. The left-side graphs show the human-like response measures for each cue level. The right-side graphs show how these measures vary per task for each cue level.

Figure 11: Capability-exploiting responses are higher for systems with more human-like cues. The left-side graph shows the number of capability-exploiting responses for each cue level. The right-side graph shows how this measure varies per task for each cue level.


FIGURES

Figure 1. Norman’s gulfs of execution and evaluation.


Figure 2: Feedforward (cues) and feedback (actual capabilities) together constitute a use image (believed intelligence). The use image will make the users exploit the capabilities they believe the system has. It will also lead to more human-like responses.

[Diagram elements: human-like capability cues (feedforward), human-like appearance cues (feedforward), and actual capabilities (feedback) jointly shape the believed intelligence (use image), which in turn drives capability-exploiting responses and human-like responses.]


Figure 3: Consequences of the “compositional use image theory” and the “integrated use image theory” on the effects of system cues on user response.

[Diagram elements: cues A, B, and C and responses B, C, and D, linked either directly (compositional use image theory) or through a single intervening use image (integrated use image theory).]


Figure 4: Our experiment uses three different levels of cues to test the “integrated use image theory” of the relation between system cues and user responses. If the theory holds, human-like appearance cues alone should be able to instill capability-exploiting user responses.

[Diagram elements: the three cue levels (human-like capability cues, human-like appearance cues, computer-like cues) and the three response types (capability-exploiting responses, human-like responses, computer-like responses).]


Figure 5: A detailed description of the distinction between the system with low capabilities and the system with high capabilities.

Low capabilities: the low-capabilities system lacks several typically human capabilities, specifically:
- It cannot infer information that is only implicitly stated, e.g., if no departure time is given, it will not process the request.
- It does not understand spatial or temporal references, e.g., it will not understand the words "here" or "now".
- It treats every request as a new entity, e.g., a request like "what is the price for that trip?" will not be processed.
- It can only handle one request at a time.
- It cannot handle convoluted sentences, e.g., of a request longer than 10 words, it will only interpret the first 10 words.

High capabilities: the high-capabilities system has several typically human capabilities, specifically:
- It can infer implicitly stated information, e.g., if no departure station is given, it assumes that it is the station where the user is currently located (the "origo" of deixis).
- It can determine the meaning of deictic references, e.g., it will understand "here" as a place and "now" as a time.
- It can infer anaphoric references to times, places, or trips mentioned in previous requests, e.g., "then" and "that trip" will be interpreted in the context of previous requests.
- It can handle multiple connected requests at a time.
- It can handle requests of any length.
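As a purely illustrative aside (the system in the study was simulated by a human wizard, not implemented in code), the following toy sketch shows how two of the low-capabilities restrictions listed above might behave in practice: truncation of requests to their first 10 words and refusal of requests that leave the departure time implicit. The function name and the "at" check are hypothetical simplifications, not part of the experimental system.

    # Toy illustration only: the study's system was simulated by a human wizard.
    MAX_WORDS = 10  # the low-capabilities system only interprets the first 10 words

    def handle_request_low_capabilities(request: str) -> str:
        words = request.split()[:MAX_WORDS]  # convoluted sentences are cut off
        if "at" not in words:                # crude stand-in for "no departure time given"
            return "Error: please state a departure time explicitly."
        # A high-capabilities system would instead infer missing information,
        # e.g. assume the user's current station ("here") and the current time ("now").
        return "Looking up connections for: " + " ".join(words)

    print(handle_request_low_capabilities(
        "I would like to travel from Eindhoven to Amsterdam now"))
    print(handle_request_low_capabilities(
        "I want to go from Eindhoven to Amsterdam at nine o'clock tomorrow morning please"))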


Figure 6: A detailed description of the three varieties of “system cues”, the different appearances of the system.

Computer-like cues: The "computer-like cues" systems are used as a baseline. These systems present a logo and a textbox that displays system responses. They display "broken" sentences and a strict sentence structure, which gives users the feel of a computer-style dialogue.

Human-like appearance cues: The "human-like appearance cues" systems introduce some human-like, representational cues. These systems present a human-like cartoon character that "talks" through a speech bubble. The character uses full sentences and a more varied sentence structure and wording. Note that these systems do not use any capability-implying cues; only their appearance and interaction style are more human-like.

Human-like appearance and capability cues: The "human-like appearance and capability cues" systems have the same cues as the "human-like appearance cues" systems, but they additionally employ cues that display certain human-like capabilities. In their responses, these systems make extensive use of implicit references, in the form of personal deixis (e.g., you, I), spatial and temporal deixis (e.g., here, now, there, then), and nominal reference (e.g., that trip).


Figure 7: A schematic representation of the remote Wizard of Oz interaction protocol used in our experiment.


Figure 8: Mean values (and standard error) of usability metrics for low versus high capabilities systems.

                                         Low capabilities   High capabilities   Significance (p) and effect size (r) of difference
Number of requests needed per task       3.32 (0.267)       2.38 (0.136)        t(87) = 3.50, p < .005, r = .35
Time needed per task (seconds)           185 (9.49)         211 (17.41)         t(82.95) = -1.28, not significant
Satisfaction points (min. 9, max. 45)    24.07 (1.584)      31.47 (0.818)       t(43.42) = -4.15, p < .001, r = .53
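The non-integer degrees of freedom in two of these comparisons suggest that unequal-variance (Welch) t-tests were used where variances differed between groups. A minimal sketch of such a comparison, using made-up per-participant satisfaction scores rather than the study's data:

    import numpy as np
    from scipy import stats

    # Hypothetical per-participant satisfaction scores (not the study's data).
    rng = np.random.default_rng(3)
    low_capabilities = rng.normal(24.1, 8.0, 45)
    high_capabilities = rng.normal(31.5, 4.5, 44)

    # Welch's t-test (unequal variances), the kind of comparison the non-integer
    # degrees of freedom above point to.
    t, p = stats.ttest_ind(low_capabilities, high_capabilities, equal_var=False)
    print(f"t = {t:.2f}, p = {p:.4f}")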


Figure 9: Mean (and standard error) decrease in time (seconds) needed to perform the task between task 4 and task 1.

                     Computer-like cues   Human-like appearance cues   Human-like appearance and capability cues
Low capabilities     108.56 (42.52)       40.12 (51.52)                25.89 (19.46)
High capabilities    -6.50 (10.07)        20.42 (7.51)                 13.30 (9.86)


Figure 10: People use more first-person references (a), longer sentences (b), and more grammatically correct sentences (c) towards systems with more human-like cues. The left-side graphs show the human-like response measures for each cue level. The right-side graphs show how these measures vary per task for each cue level.

[Graphs: (a) total number of first-person references (y-axis 0-6), (b) average number of words per chat request (y-axis 0-12), and (c) average number of grammatically correct requests (y-axis 0.0-1.0), each shown per cue level (computer-like cues, human-like appearance cues, human-like appearance & capability cues) and per task (Task 1-4).]


Figure 11: Capability-exploiting responses are higher for systems with more human-like cues. The left-side graph shows the number of capability-exploiting responses for each cue level. The right-side graph shows how this measure varies per task for each cue level.

[Graphs: total number of capability-exploiting behaviors (y-axis 0.0-3.0), shown per cue level (computer-like cues, human-like appearance cues, human-like appearance & capability cues) and per task (Task 1-4).]