Designers Characterize Naturalness in Voice User Interfaces: Their Goals,
Practices, and Challenges
by
Yelim Kim
BSc., The University of Toronto, 2016
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
in
The Faculty of Graduate and Postdoctoral Studies
(Computer Science)
THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)
March 2020
© Yelim Kim, 2020

The following individuals certify that they have read, and recommend to
the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis
entitled:
Designers Characterize Naturalness in Voice User Interfaces: Their Goals,
Practices, and Challenges
submitted by Yelim Kim in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
Examining Committee:
Abstract
Conversational Voice User Interfaces (VUIs) are becoming ubiquitous through
devices that feature voice assistants
such as Apple’s Siri and Amazon Alexa. Naturalness is often considered to
be central to conversational VUI designs as it is associated with numerous
benefits such as reducing cognitive load and increasing accessibility. The lit-
erature offers several definitions for naturalness, and existing conversational
VUI design guidelines provide different suggestions for delivering a natural
experience to users. However, these suggestions are hardly comprehensive
and often fragmented. A precise characterization of naturalness is necessary
for identifying VUI designers’ needs and supporting their design practices. To
this end, we interviewed 20 VUI designers, asking what naturalness means
to them, how they incorporate the concept in their design practice, and
what challenges they face in doing so. Through inductive and deductive the-
matic analysis, we identify 12 characteristics describing naturalness in VUIs
and classify these characteristics into three groups, which are ‘Fundamental’,
‘Transactional’ and ‘Social’ depending on the purpose each characteristic
serves. Then we describe how designers pursue these characteristics under
different categories in their practices depending on the contexts of their VUIs
(e.g., target users, application purpose). We identify 10 challenges that de-
signers are currently encountering in designing natural VUIs. Our designers
reported experiencing the most challenges when creating natural-sounding
dialogues, and they required better tools and guidelines. We conclude with
implications for developing better tools and guidelines for designing natural
VUIs.
Lay Summary
Providing natural conversation experience is often considered central to designing
conversational Voice User Interfaces (VUIs), as it is expected to bring
out numerous benefits such as lower cognitive load, lower learning curve, and
higher accessibility. Despite its noted importance, naturalness is ill-defined.
There are also no comprehensive standard resources for helping designers to
pursue naturalness. In order to provide support for VUI designers in the
future, it is critical to understand how they currently perceive and pursue
naturalness in their design practices. Hence, we interviewed 20 VUI designers
to understand their notion of a natural conversational VUI and their practices
and challenges of pursuing it. In this thesis, we present 12 characteristics of
naturalness and classify these characteristics into 3 groups. We also identify
10 challenges that our designers are currently encountering and conclude with
implications for developing better tools and guidelines for designing natural
conversational VUIs.
Preface
This thesis was written based on the study approved by the UBC Behavioural
Research Ethics Board (certificate number H18-01732). This thesis extends
a conference paper that is currently under review for publication. As the first
author of the submitted paper, I designed and conducted the semi-structured
interviews and analyzed the data under the supervision of Dr. Dongwook
Yoon and Dr. Joanna McGrenere. More specifically, my two supervisors
helped me formulate research questions, design the study and analyze the
collected data. The submitted paper was written with great help from the
two supervisors as well as the help from Mohi Reza, another co-author of
the paper. Mohi Reza, an MSc student, provided a great amount of help in
writing the introduction and related work section of the submitted paper as
well as providing great insight for shaping findings and contributions. Mohi
Reza also provided English writing assistance for the submitted paper.
2.3 Characterizing Conversations
2.5 Human-likeness in Embodied Agents
2.6 Tools and Guidelines for VUI Design
5 Discussion and Implications for Design
5.1 Discussion
5.1.2 Positioning Naturalness of VUI in the Literature
5.1.3 Contrasting Naturalness of Transactional vs. Social Agents
6 Conclusion and Future Work
Bibliography
Appendix
A.2 The Recruitment Message Posted Through SNS
A.3 The Consent Form
A.5 The Semi-structured Interview Script
A.6 More Descriptions on the User Task
A.7 The Post User-task Survey
A.8 The Data Analysis Process
4.1 The twelve characteristics of naturalness that designers deem important
4.2 The 10 challenges that designers are currently encountering in designing natural VUIs
Acknowledgements
Firstly, I would like to express my sincere gratitude to my two supervisors,
Dongwook Yoon and Joanna McGrenere. They were always generous with
their time whenever I needed their help. Before I entered graduate school, I
did not have much exposure to the Human-computer Interaction (HCI) field
and research environment. My two great supervisors were always so patient
and generous with my progress and encouraged me to explore and make my
own decisions for my project. I really thank them for their extensive help
throughout the whole project. My two supervisors always amazed me with
their great professionalism and their endless passion for HCI research, and
they set an example for me to follow. Secondly, I would like to thank Mohi
Reza, an MSc student who helped me in writing a paper we submitted to a top-tier
computing conference. He dedicated two weeks to helping with my project.
I really enjoyed working with him and learned a lot from him. Thirdly, I
would like to express my appreciation to Karon MacLean for agreeing to
be the second reader of my thesis and being so generous with her time. I
also want to thank her for her insightful guidance and help for the other
project that I worked on with her student, Soheil Kianzad. Lastly, I would
like to thank the students in Multimodal User eXperience lab for providing
insightful feedback on my study and for offering much valuable advice on
graduate life.
Chapter 1
Introduction
In this chapter, we first introduce an overview of the problem space. Then,
we motivate and illustrate the contributions of our study, and outline the
overall structure of the thesis.
1.1 Problem Definition
With substantial industrial interest, conversational Voice User Interfaces
(VUIs) are becoming ubiquitous, with the plethora of everyday gadgets, from
smartphones to home control systems, that feature voice assistants (e.g., Ap-
ple’s Siri, Amazon Alexa and Google Home Assistant). Conversational VUI
systems are one of the two general types of VUI systems [1]. In a conversa-
tional VUI system, users perceive the voice agents as conversation partners
and accomplish their goals by having conversations with the agents [1]. In a
command-based VUI system, the other general type, users are instead expected
to learn and use the appropriate voice commands to accomplish their goals [2].
Hereafter, we use the term ‘VUI’ to refer to ‘conversational VUI’, and we use
the term ‘voice agent’ to refer to ‘conversational agent’ [3].
At the heart of desired properties of VUIs is naturalness. According to
prior work and industrial design guidelines, enabling users to accomplish their
goals by having natural conversations with voice agents brings out numerous
benefits such as lowering cognitive load [4, 5], lowering learning curve [5],
and increasing accessibility [5, 6]. Hence, multiple VUI design textbooks and
guidelines recommend that designers make VUIs that provide natural conversational
experiences to the users [7, 5, 8]. However, VUIs have only become popular
recently. As such, designing a conversational voice user interface can be
difficult due to the lack of standard and comprehensive design guidelines. As
Robert and Raphael noted in their book, “conversational interfaces are at
the stage that web interfaces were in 1996: the technologies are in the hands
of masses, but mature design standards have not yet emerged around them.”
In fact, currently available design resources suggest design approaches toward
naturalness, but their characterization of the term is fragmented and
hardly comprehensive. Multiple resources recommend different practices to
designers to make VUIs sound more natural [9, 10], feature natural dialogues
[11, 12, 13], or offer natural interactions [14, 15]. However, the term, natural-
ness, is an ill-defined construct lacking precision and clarity [16]. Therefore,
the field is lacking a comprehensive and substantive characterization of nat-
uralness in VUIs despite its advertised importance.
Bridging this conceptual gap is a critical step towards providing compre-
hensive guidance to designers who strive to create natural VUI experiences.
The literature in communications and social science informs the characterization
of naturalness in human dialogues. It suggests that people have
different expectations and concepts of violations in interpersonal communication,
depending on a class of situational factors, such as who is talking,
and the relationship between interlocutors [17, 18]. Given that modern voice
assistants are often situated in complex and dynamic social settings [19], it
is possible that conversational characteristics of a VUI considered natural in
one setting are not perceived the same way in another setting (e.g., an extremely
human-like voice agent can be considered deceptively anthropomorphic and
uncanny [20]). The extent to which specific characteristics of naturalness
apply in different conversational settings remains an open question.
Within the broader discourse on Natural User Interface (NUI) design,
the preliminary conceptions of naturalness offered have remained abstract
and generic. Some have described naturalness as a property that refers to
how the users “interact with and feel about a product” [21]. Others have
used it to describe devices that “adapt to our needs and preferences”, and
enable people to use technology in “whatever way is most comfortable and
natural” [22]. Such broad characterizations make sense at a conceptual level.
However, the extent to which they can be applied in the domain of VUI
design remains uncertain.
While there are some existing VUI guidelines, they too are “too high-level
and not easy to operationalise” [23]. To offer designers proper guidelines and
tools, we need to seek answers to questions on how designers characterize nat-
uralness, how they align such characteristics to their varying design goals,
and what challenges they face in this pursuit of naturalness. Doing so will
enable researchers to create conceptual and technical tools that support nat-
uralness for VUI designers.
As a first step towards characterizing naturalness, we conducted semi-
structured interviews with 20 VUI designers to understand how they define
naturalness, what design process they use to enhance naturalness, and the
challenges they face. Through reflexive thematic analysis [24], our study
revealed a comprehensive set of characteristics, which we mapped into different
categories according to the aspect of naturalness each characteristic
contributes to: achieving the basic skills required for fluent verbal
communication, providing social interactions, or helping users with their tasks.
Some of the characteristics mirror those found in the human-to-human
conversation literature, but interestingly, designers also identified
characteristics that are “beyond-human”, which reflect machine-specific
capabilities that outperform people, such as superior memory capacity and
processing power. Our VUI designers also described significant challenges in
achieving natural interaction related to a lack of adequate design tools and
guidelines and in balancing the different characteristics based on the role of
voice agent (e.g., a social companion or a personal assistant).
1.2 Contributions
In this study, we recruited VUI designers, a group that was not explored
previously in the HCI literature to our knowledge, and ran an empirical
study to uncover their perceptions of naturalness and their current practices
of pursuing it. Our work contributes the following:
1. We identified a set of 12 characteristics for naturalness as perceived
by designers and categorized them based on different aspects of nat-
uralness each characteristic contributes to: Fundamental, Social and
Transactional.
2. We identified and characterized the 10 challenges that hinder designers
from creating natural VUIs.
3. We proposed design implications for the tools and design guidelines to
support designers in creating natural VUIs based on our findings from
the interviews.
1.3 Overview
In Chapter 2, we present relevant previous work. In Chapter 3, we describe
our study design and analysis methodologies. Then, Chapter 4 introduces
our study findings, and Chapter 5 discusses our reflections on the findings
and presents design implications that we created based on our reflections
and insights. Finally, Chapter 6 provides the conclusion of the thesis and
suggestions for future work.
Chapter 2
Related Work
We set the stage by first reviewing the existing body of literature on VUIs.
Then, we look at the broad manner in which naturalness is currently con-
ceived and employed, and the ways in which people have characterized con-
versations. We then focus on the rich body of work on anthropomorphism in
conversational agents and beyond, a topic that is of particular importance to
our discussion on naturalness. Finally, we look at orthogonal, yet important
concerns, and review existing tools and guidelines for VUIs.
2.1 VUI Literature in HCI
Researchers have been investigating ways to support speech interactions since
as early as the 1950s. With rapid advancements in Natural Language Understanding
(NLU), we transitioned from rudimentary speech-recognition based
systems such as Audrey [25] and Harpy [26] in the 1950s and 1970s, to task-oriented
systems like SpeechActs [27] in the 1990s, and to the sophisticated
conversational agents that we now have.
More recently, several studies from the HCI community have been investigating
how voice assistants impact users [19, 28, 29, 30]. In these studies,
various issues have been explored, including how VUIs fit into everyday set-
tings [19], how users perceive social and functional roles in conversation [28],
and the disparity between high user expectations and low system capability
[31].
A common thread between many of these studies is that they take into
account the perspective of the users. As Wigdor [21] put it, naturalness is
a powerful word because it elicits a range of ideas in those who hear it. In
this study, we take the path less trodden and see what designers think.
2.2 Existing Definitions of Naturalness
Given that naturalness is a construct, we see several angles from which ex-
isting studies define and use the term.
As a descriptor for human-likeness: Naturalness is often seen as a “mimicry
of the real world” [21]. In the context of speech, the human is the natural
entity of concern, and hence, behavioral realism, i.e. creating VUIs that act
like real humans, has become a focus. We can trace the attribution of anthropomorphic
traits onto computers to a seminal paper by Turing [32] on
whether machines can think. In that paper, he assumes the “best strategy”
to answer this question is to seek answers from machines that would be “naturally
given by man”. The pervasive influence of such thought can be seen
in existing definitions of naturalness in the VUI literature: [33, 34, 35] all treat
naturalness in this light, as a pursuit of human-likeness.
As a distinguishing term between the novel and traditional modes of in-
put: The term is also used to contrast interfaces that leverage newer input
modalities such as speech and gestures with more classical modes of input,
namely, graphical and command-line interfaces [36]. In this definition,
the term is in essence an umbrella descriptor of countless systems involv-
ing multi-touch [37, 38], hand-gestures [39, 40], speech [41, 42], and beyond
[43, 44].
As interfaces that are unnoticeable to the user: Another usage draws from
Mark Weiser’s notion of transparency introduced in his seminal article on
ubiquitous computing [45]. In this formulation, naturalness is a descriptor
for technologies that “vanish into the background” by leveraging natural
human capabilities [46, 47].
As an external property: In this conception, the term does not refer to the
device itself, but rather the experience of using it, i.e. the focus is on what
users do and how they feel when using the device [48]. The characteristics
that we present in this thesis can also be viewed from such an angle, i.e.
designers form and utilize characteristics not because they make the VUI
more natural, but because they make the experience of using it more natural.
The existing usage of the term has drawn heavy criticism from some.
Hansen and Dalsgaard [16] describe naturalness as “unnuanced and marred
by imprecision”, and find the non-neutral nature of the term to be prob-
lematic. In their view, the term has been misused to conflate “novel and
unfamiliar” products with “positive associations”, akin to marketing propa-
ganda.
Norman [49] contests the distinction between natural and non-natural
systems and notes that there is nothing inherently more natural about newer
modalities over traditional input methods. With speech, for example, he
notes that utterances still have to be learned.
2.3 Characterizing Conversations
Existing literature [50, 51, 52] has demarcated different forms of human conversations
based on purpose. Clark et al. [28] take these forms and classify
them into two broad categories: social and transactional. In the former
category, they note that the aim is to establish and maintain long-term relationships,
whereas, in the latter, the focus is on completing tasks. In our
study, we ground some of the characteristics that designers mention on the
basis of these two categories.
Written Language
Previous studies from linguistics have identified the differences between spo-
ken and written languages [53, 54, 55]. Researchers found that people use
more complex words for writing compared to when they are speaking [53,
54, 56, 57]. Bennett reported that passive sentences are used more frequently
in written texts [58]. Also, more complex syntactic structures are used
in written language than in spoken language [56]. In our study, we analyzed
our interview data based on these previous works to find out which specific
aspects of spoken dialogues VUI designers find challenging to mimic
when they are writing VUI dialogues.
2.5 Human-likeness in Embodied Agents
A rich body of studies explores issues revolving around human-likeness in embodied
agents and our relationships with them. They investigate a plethora
of concerns such as ways to transfer human qualities onto machines [59, 60],
ways to maintain trust between users and computers [61, 62, 63], modeling
human-computer relationships [64, 65], designing for different user groups
such as older adults [66, 67] and children [68, 69], and stereotypes [70, 71].
Nass et al. conducted a series of studies on how people respond to voice
assistants. Results from these studies suggest that people apply existing
social norms to their interactions with voice assistants [72]. The “Similarity
attraction hypothesis” posits that people prefer interacting with computers
that exhibit a personality that is similar to their own [73], and that cheerful
voice agents can be undesirable to sad users.
In our study, designers reflect on issues that echo the literature by con-
sidering factors such as personality, trust, bias, and demographics in their
VUI design practice.
2.6 Tools and Guidelines for VUI Design
Many large vendors of commercial voice assistants provide their own separate
guidelines for designers [74, 12, 75]. These guidelines offer design advice
tailored to developing applications for a specific platform. With regard to
platform-independent options, some preliminary effort has been undertaken
in the form of principles [76], models [77] and design tools [78] for VUIs.
More specifically, Ross et al. provided a set of design principles for VUI
applications that take the role of a faithful servant [76], while Myers et al.
analyzed and modelled users’ behaviour patterns in interaction with unfamiliar
VUIs [77]. Lastly, Klemmer et al. introduced a tool for intuitive and fast
VUI prototyping [78]. Our study adds design implications for
tools and design guidelines that help VUI designers create natural VUI
experiences.
Chapter 3
Methods
To understand how VUI designers perceive naturalness in their design prac-
tices, we conducted semi-structured interviews with 20 VUI designers. We
designed the interview questions with a constructivist epistemological stance,
viewing the interview as a collaborative meaning-making process between the
interviewer and the interviewee [79]. This chapter will describe the design
of our study and the methodologies we used for collecting and analyzing our
data.
3.1 Participants
We recruited 20 VUI designers (7 female, 13 male) using purposeful sampling.
To draw findings from a varied set of perspectives, we interviewed both am-
ateur (N = 7) and professional (N = 13) VUI designers. We recruited the
participants through flyers (Appendix A.1) and study invitation messages
on social network services such as Facebook and LinkedIn (Appendix A.2).
Participants’ ages ranged from 17 to 73 (M = 34.3, Median = 30.5, SD =
14.7). The nationalities of our participants were as follows: 4 American, 1
Belgian, 1 Brazilian, 5 Canadian, 1 Dutch, 1 German, 5 Indian, 1 Italian,
and 1 Mexican.
In this study, we define professional VUI designers as people working
full-time on designing VUI applications regardless of their actual job titles
(e.g., VUI designer, Voice UX Manager, UX manager, CEO). Participants’
length of professional VUI design experience ranged from 9 months to 20
years (M = 4 years and 2 months, Median = 2 years, SD = 6 years). Our
participants worked in companies that ranged considerably in size, from
3-employee startups to large corporations with over 5,000 employees. Most of
the professional VUI designers we recruited (8 out of 13) were working for
relatively small companies (from 2 to 49 employees), and
2 participants were working for medium-sized companies (from 50
to 999 employees).
In addition to the information stated above, we collected the participants’
highest level of education (2 with a high school diploma, 2 with a technical
training certificate, 9 with a bachelor’s degree, 5 with a master’s degree,
and 2 with a doctoral degree), and their familiarity with Speech Synthesis
Markup Language (SSML) (Figure 3.1). We collected information about
SSML familiarity because SSML is the only available standard method to
modulate synthesized voices across different platforms.
Figure 3.1: The distribution of the participants’ familiarities with SSML (N=20)
About half of the participants considered themselves unfamiliar with
SSML, while about the same number considered themselves
familiar with SSML.
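For readers unfamiliar with SSML, the following is a minimal sketch of what an SSML-annotated VUI utterance can look like. It is an illustration only, not material from the study: the helper name `build_ssml` and the sample utterance are hypothetical, and while `speak`, `prosody`, `break` and `emphasis` are elements of the W3C SSML specification, individual platforms may support only a subset of them.

```python
import xml.etree.ElementTree as ET


def build_ssml(text_before: str, emphasized: str, text_after: str) -> str:
    """Return an SSML string that slows speech, pauses, and stresses one phrase.

    Hypothetical helper for illustration; platform support for each tag varies.
    """
    speak = ET.Element("speak")  # root element of an SSML document
    prosody = ET.SubElement(speak, "prosody", rate="slow")  # slower speaking rate
    prosody.text = text_before
    pause = ET.SubElement(prosody, "break", time="300ms")  # short silence
    pause.tail = " "
    emphasis = ET.SubElement(prosody, "emphasis", level="strong")  # stressed phrase
    emphasis.text = emphasized
    emphasis.tail = text_after
    return ET.tostring(speak, encoding="unicode")


# Sample utterance (invented for this sketch).
ssml = build_ssml("Your order is ready.", "Thank you", " for waiting.")
print(ssml)
```

Building the markup with an XML library rather than string concatenation keeps the output well-formed, which matters because speech platforms typically reject malformed SSML.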
All of our participants had previously designed at least one conversational
voice user interface, including voice applications for Amazon’s and Google’s
smart speakers, humanized Interactive Voice Response (IVR) systems, and
VUIs for a virtual nurse, a smart home appliance and a companion robot.
3.2 Interviews
For each participant, we conducted one session of a semi-structured interview
based on the interview script in Appendix A.5. The duration of interviews
varied from 30 minutes to about an hour, depending on the participants’
availability. All of the interviews were conducted by the thesis author. We
arranged online interviews for the participants (17 out of 20 participants)
who could not come to the University of British Columbia for the interview.
Prior to each interview session, the participants were asked to fill out a
survey asking about their demographic data, familiarity with SSML and
previous VUI design experiences (Appendix A.4). In this survey, we also
asked about their most memorable VUI projects to contextualize our research
questions based on their vivid memories. The data collected through the
pre-interview surveys were analyzed using descriptive statistics to ensure the
diversity of the participant demographics (i.e., gender and VUI design
professional level).
The interview questions can be broadly divided into four sections. The
goal of the first section was to understand the participants’ general VUI de-
sign practices and their previous VUI experiences in depth. In this part, we
requested them to describe their design practices for the two most memo-
rable VUI projects that they reported in the pre-interview survey. For the
second section, we sought to understand the participants’ conceptions of a
natural VUI. To achieve this goal, we requested them to provide their own
definitions of a natural VUI. Then, we asked them how important it is for
them to create a natural VUI, and if it is important, what benefits they
expect to gain by doing so. The goal of the third section was to understand
the participants’ design practices for creating natural VUIs and the
challenges our participants are currently facing in creating natural VUIs.
Therefore, we asked the participants what particular design steps they take
for creating more natural VUIs, and asked them what the most challenging
aspects of carrying out those steps are. The last section of the questions
was to understand how useful the current design guidelines and tools are for
designing natural VUIs. In this part, we asked the participants about what
tools or design guidelines they are currently using and how helpful they are
for designing natural VUIs.
3.3 User Tasks
Depending on their time availability, 15 out of 20 participants
were able to do the user task after the interview. In this user task,
participants were asked to write relatively short VUI dialogues in two ways:
by typing using the keyboard, as well as by using a voice typing tool that
we created. However, one participant, aged 73, dropped out partway through
the user task because he felt tired of typing. The duration of each
user task was about 15 minutes. This user task was designed to accomplish
two goals. The first goal was to understand the participants’ VUI dialogue
writing procedures and their perceptions of the current synthesized voices.
The second goal was to explore the possibilities of using voice typing instead
of the keyboard input for creating natural VUIs. Please note that the data
collected from the user tasks did not directly inform our study results because
the collected data were limited in amount and richness. However, to
provide a more transparent understanding of our study, we lay out the whole
procedure of the user task in Appendix A.6.
3.4 Procedure
Prior to the interview, each participant was asked to fill out the pre-interview
survey (Appendix A.4) mentioned above. Before each interview session, an
email containing the consent form (Appendix A.3) was sent to each partici-
pant, so that our participants could provide their consent by replying to the
emails.
Before recording the interview sessions, the participants were informed
that the recording would start. After asking the interview questions, the
participants who were able to spend more time carried out the user task.
After they finished the user task, they were asked to fill out the post-task
survey (Appendix A.7) to provide their feedback on the task and the current
design guidelines. We provided a link for the survey to each participant.
Lastly, we asked participants if there were any concerns or questions re-
garding this study. We addressed their questions if there were any, and if
there were no further questions, we informed them that the interview was
finished, and thanked them for their help in this study. At the end of each
interview session, each participant received $15/hour for their participation.
The payment was made electronically through PayPal or Interac e-Transfer.
Some of the participants declined payment and expressed their desire to
help the study for free.
3.5 Data Analysis
All 20 interviews were transcribed before being analyzed. We used Braun
and Clarke’s approach for reflexive thematic analysis [24] for analyzing the
interview data. Their approach was particularly suited for our study be-
cause of its theoretical flexibility and rigour. The three members of the
research team, including the thesis author and her two supervisors, had a
one-hour weekly meeting where they developed the themes over the course
of several months. Instead of seeking the objective truth, we took an
approach to crystallization [80], and developed a deeper understanding of the
data by sharing each other’s interpretations of the data during each meeting.
To facilitate productive discussion, we organized the themes in several
ways concurrently, including post-its, ‘Miro’ (https://miro.com/), an online
collaborative whiteboard platform, and ‘airtable’ (https://airtable.com/), an
online collaborative spreadsheet application (Appendix A.8). All of the members
in the team coded the interview data using the open coding methodology [81],
“where the text is read reflectively to identify relevant categories.” The other
two members of the team each went through at least three interview transcripts,
and the thesis author went through them all. We took both inductive and
deductive approaches for coding the data and developed a set of coherent themes
that form the basis of our findings. Our deductive approach for coding was
derived from the previous works on “the classification of human conversation”
[52, 50, 51].
3.6 Apparatus
For the participants who were not able to visit the University of British
Columbia for an in-person interview, we used ‘Zoom’, an online communication
software (https://zoom.us/), to conduct the interviews and record their
audio. For the in-person interviews, ‘Easy Voice Recorder’, an Android
application for audio recording (https://shorturl.at/mwzHK), was used to
record the audio. The interviewer also took interview notes on paper.
For the user task, the participants were asked to write VUI dialogues
using Google Docs. Then, the interviewer used a software tool named ‘TTS
reader’, which we built for playing the sound of VUI dialogues
(https://github.com/usetheimpe/ttsReader). TTS reader used Amazon Polly
voices (https://aws.amazon.com/polly/) to read the VUI dialogues that
tiny.cc/sqjxjz), was used to let the participants add voice comments on
the dialogues that they wrote.
Results
In this section, we first describe how our VUI designers characterize natu-
ralness, followed by the challenges that they are facing in creating a natural
VUI. We also describe how each challenge relates to the characteristics of
naturalness defined by our designers.
4.1 Designers Characterize Naturalness
To contextualize our findings, we briefly summarize the most memorable
projects described by our 20 participants. In total, we collected data about
38 VUI projects. Only one was a chat-oriented dialogue system (e.g.,
[82, 83]), a conversational agent built by a participant to reduce elders’
loneliness; the rest were task-oriented systems according to the definition
provided by [84]. Among the types of applications
mentioned during the interviews, there were 27 Intelligent Personal Assistant
(IPA) systems for smart home speakers (23 Amazon Alexa, 4 Google Home),
8 Interactive Voice Response (IVR) phone systems, 1 voice agent system for
a smart air-conditioner, 1 voice agent system for a mobile application, and 1
voice agent system for a humanoid robot.
When asked to provide their definitions of a natural VUI, our participants
responded in terms of the characteristics of human conversation that they
consider important for creating a natural VUI. Our thematic analysis later
revealed three categories for classifying these characteristics of human
conversation: (1) those that are fundamental to any good conversation,
(2) those that promote good social interactions, and (3) those that
help users accomplish their tasks. Perhaps not surprisingly, the latter two
categories echo classifications for human-to-human conversations in existing
literature [52, 50, 51], labeled as “social conversation” and “transactional
conversation”. To be consistent with that literature, we adopted those la-
bels.
We found that instead of pursuing all the characteristics under the three
categories, our designers selectively choose a particular set of characteris-
tics depending on the context of their VUI applications and their design
purposes. We also found that pursuing multiple characteristics at once can
create conflicts.
The three categories we use in this study can help readers conceptualize
the 12 characteristics that we found in this study. They can also serve as
a lens to understand why designers pursue these characteristics and how
these characteristics can often conflict with each other. The following section
provides detailed descriptions of each characteristic.
4.1.1 Fundamental Conversation Characteristics
Among the conversation characteristics mentioned by our participants, there
was a set of fundamental verbal communication characteristics that a natu-
ral VUI should have, regardless of whether the aim is to support social con-
versations or transactional conversations. The participants consider a VUI
that does not achieve these elements as unnatural. For example, synthesized
voices that do not show appropriate prosodies (e.g., having a monotonous
intonation) were frequently referred to as “robotic” (P1, P4, P6) or being a
“machine” (P18).
Figure 4.1: The twelve characteristics of naturalness that designers deem important
Fundamental Characteristics
§ Sound like a human speech.
§ Understand and use variations in human language.
§ Use appropriate prosody and intonation.
§ Collaboratively repair conversation breakdowns.
Social Characteristics
§ Express sympathy and empathy.
§ Express interest to users.
§ Be interesting, charming and lovable.
Transactional Characteristics
§ Proactively help users.
§ Present a task-appropriate persona.
§ Be capable to handle a wide range of topics in the task domain.
§ Deliver information with machine-like speed and accuracy.*
§ Maintain user profiles to deliver personalized services.*
*beyond-human aspects
Sound Like a Human Speech
Six participants (P2, P5, P6, P7, P13, P15) mentioned that utterances of
a natural VUI should have characteristics of spoken language as opposed
to written text. For example, people tend to use more abstract words and
complex sentence structures when writing [85, 86]. Specifically, a natural VUI
should use simple words: “When you’re writing it down, and you say it out
loud, sometimes you realize that whatever you’ve written down is way too
long or has way too many big words.” (P6) However, our participants said
that the simple words do not mean less formal words such as slang or vulgar
expressions, but rather more typical words that the persona of the VUI would
use. P4 mentioned that he avoided using words that are “too casual” for his
virtual doctor application because “people would take it more seriously if
they felt that it was a natural doctor.”
A natural VUI should also incorporate filler words [87], breathing, and
pauses:“You need to introduce those pauses...conversation bits, like, ‘Um’,
‘Like’, ‘You know’...it makes the conversation more natural.” (P13) The par-
ticipants mentioned that the patterns for breathing and pausing should give
the impression of a ‘mind’ in the VUI, and make the user feel more like they
are talking to a human:“Pauses before something, like a joke...we need to
create an anticipation for the final jokes.” (P15)
Understand and Use Variations in Human Language
Human language is immensely flexible, and we can express the same request
in countless ways. Thirteen participants (P1, P2, P5, P6, P7, P9, P10, P11,
P16, P17, P18, P19, P20) mentioned that a natural VUI should understand
various synonymous expressions spoken by users, such as:“‘Increase the vol-
ume.’, ‘Turn up the volume.’...at the end of the day, you just want [it] to
increase the volume.” (P16) The participants considered VUIs that heavily
restrict what users can say to be unnatural and of diminished value:“...if you
are instructing people to speak in a certain way, that’s not how, I feel, voice
In addition to understanding varied expressions, 4 participants (P6, P7,
P17, P20) mentioned that a natural VUI should be able to respond using
a varied set of expressions to avoid sounding repetitive. “...we don’t always
say, ‘Good choice!’, we say, ‘Great choice!’, or we say, ‘Awesome!’” (P6)
When users repeat the same input, a natural VUI should respond to it using
different expressions:“So if you [the user] go through the application more
than once, the structure [of the dialogue] is similar, but you will always hear
different sentences.” (P7)
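The response variation described by P6 and P7 can be sketched as follows. This is a minimal illustration only, not a tool any participant reported using: the function name `pick_varied` and the `CONFIRMATIONS` pool are hypothetical, and the phrasings are taken from P6's quote.

```python
import random

# Hypothetical pool of interchangeable confirmations (phrasings from P6's quote).
CONFIRMATIONS = ["Good choice!", "Great choice!", "Awesome!"]


def pick_varied(options, last_used=None, rng=random):
    """Pick a phrasing at random while avoiding an immediate repeat.

    If every option equals last_used (e.g. a pool of one), fall back to
    the full pool so the function always returns something.
    """
    candidates = [o for o in options if o != last_used] or list(options)
    return rng.choice(candidates)


# Usage: remember the previous phrasing between turns.
previous = None
for _ in range(3):
    previous = pick_varied(CONFIRMATIONS, previous)
    print(previous)
```

The dialogue structure stays the same across sessions; only the surface wording changes, which is the behaviour P7 describes.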
The participants mentioned that variation of expressions in human lan-
guage should be considered within the context of the target user. For ex-
ample, factors such as different age groups and even individual differences
need to be considered:“...we found out like hanging up the house phone, and
dial the phone clockwise. These are all words that older people use, while we
don’t anymore.” (P8) Age aside, the same expression can mean very different
things when coming from different individuals:“If I say ‘it’s hot’, then it’s
different from you say ‘it’s hot’, right?” (P5)
However, there are certain use cases where language variation is
undesirable. This characteristic is less important when the primary purpose
of a VUI application is to hold a transactional conversation that helps users
with their tasks, and the target users are not the general public but
people from certain professions, such as police officers or firefighters.
Such target users are often trained to use special keywords and follow a
fixed workflow for faster and more efficient communication. Hence, to
help them effectively, the application should stick to a fixed
vocabulary: “So one of the main users of this type of application will
be police or fire ambulance [drivers]...They’re used to a very rigid command
set. So they’re always saying things in the same way.” (P2)
Use Appropriate Prosody and Intonation
Prosody “refers to the intonation contour, stress pattern, and tempo of an ut-
terance, the acoustic correlates of which are pitch, amplitude, and duration.”
[88]. Eleven participants (P2, P3, P4, P5, P6, P8, P9, P11, P12, P13, P18)
said that a natural VUI should present messages clearly with the appropriate
prosody:“For me, it’s important to put the right intonation in some parts of
the text to make it clear.” (P3) However, they highlighted that the appropriate
prosody can differ across age groups. Many participants reported that
Amazon Alexa, a popular commercial voice assistant, is “way too fast” (P11)
by default and that prosodies should be modified for seniors:“...for seniors,
you may want to slow down the speed and potentially increase the volume
or put an emphasis on the words.” (P2) Since there is no voice customized
for seniors, P11 had to manually insert a break after each sentence: “...it’s like,
‘Okay, here’s your calendar!’, another break. ‘Today you have...’, a slight
pause, like point two, point three-second pause.” (P11)
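The manual adjustment P11 describes (slower speech with short pauses between sentences) could be automated along the following lines. This is a sketch under stated assumptions: the helper name `to_senior_ssml` and its default values are hypothetical. The `<speak>`, `<prosody>` and `<break>` elements come from the SSML standard, but each platform supports its own subset of tags and attribute values, so the output would need checking against the target platform.

```python
import re
from xml.sax.saxutils import escape


def to_senior_ssml(text, pause_ms=300, rate="slow"):
    """Wrap text in SSML with a slower speaking rate and a pause between sentences.

    Sentences are split naively on whitespace that follows '.', '!' or '?'.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so the dialogue text cannot break the markup.
    body = pause.join(escape(s) for s in sentences)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


print(to_senior_ssml("Okay, here's your calendar! Today you have two events."))
```

Generating the markup from plain dialogue text keeps the pause length in one place, so it can be tuned (e.g., between 200 and 300 ms, as in P11's description) without editing every prompt by hand.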
Collaboratively Repair Conversation Breakdowns
During verbal communication, we often encounter small conversation
breakdowns when people do not respond in a timely way or do not understand
what the other person said. Four participants (P3, P6, P7, P8) mentioned
that a natural VUI should resolve these breakdowns in the same way that
humans collaboratively resolve them, by asking each other: “Like
in this conversation, how often did we already say, ‘I don’t understand you’
or ‘What do you mean?’ or ‘Can you explain more?’. That’s already for
humans that way...and the robots and the user interfaces have to learn from
humans.” (P7)
A VUI is considered very unnatural and machine-like if it repeats
the same information when the conversation breaks down:“...if you don’t
understand something sometimes, [and if] the system just keeps repeating the
same information [to get the response from you], just like a robot.” (P3)
4.1.2 Social Conversation Characteristics
While the main focus of our participants was on task-oriented applications as
mentioned in section 4.1, they also emphasized the importance of providing
positive social interactions to users. Humans have social conversations to
build a positive relationship with each other [89]. In the VUI context, de-
signers incorporate the social conversation characteristics for providing har-
monious and positive interactions, a more realistic feeling of conversation,
and a feeling of being heard.
Express Sympathy and Empathy
Ten participants (P2, P3, P4, P5, P6, P8, P9, P11, P12, P13) mentioned
the importance of providing empathetic responses to users’ sentiments to
maintain harmonious interactions:“...if I know your favorite team won, I’d
have a happy voice. If I know your favorite team lost, I’d have a sad voice.”
(P2)
Most of the participants’ elaborations on this part were focused on show-
ing sympathy when users experience negative sentiments. The participants
try to make the voice assistants console the users and present empathetic
voices when users feel negative:“If they respond negatively, the Alexa re-
sponds, ‘Oh, I’m sorry to hear that.’” (P4)
Beyond being sympathetic, the participants even actively try to soothe
users’ feelings in situations when they feel heightened emotions such as
anger:“You have a calm reassuring voice when they’re upset because there’s
a traffic.” (P9)
To find out if users feel negative, the participants use user responses, their
personal information (e.g., their favorite sports teams) and the location of
the conversations (e.g., hospital). No participant mentioned having
used a sentiment analysis approach, and one designer
specifically mentioned that using such an approach required too much of a
time commitment:“I don’t have time to know the APIs that can do sentiment
detection.” (P11)
If there is no way to detect users’ real-time sentiments, our designers
choose to use a “flat voice” (P3) so that a voice agent’s happy voice does
not upset users who are currently feeling down, as suggested in [72]: “You
have to control the tone of voice, because you can’t sound very enthusiastic,
things like that, because you never know the situation of the person on the
other side.” (P3)
Express Interest to Users
Four participants (P4, P6, P11, P12) said that they incorporate greetings,
compliments, and words that express interest to the user. These words make
the conversations appear “real-ish” (P11), and make users feel important:“I
think the benefit of providing this type of responses, instead of just blank ones,
is that it actually helps the person feel like their responses actually got heard.”
(P4)
P11 and P6 mentioned that VUIs can even make users feel as if they
have personal connections with the applications by providing daily greetings or
feedback on users’ actions:“I’ll say, ‘See you tomorrow!’. Little snippets of
humanity.” (P11) “We could just say the recipe steps and all of that, and not
have to ask questions like, ‘How’s the spice?’ and all of that, but if we do,
then there’s some kind of personal connection.” (P6)
Be Interesting, Charming and Lovable
Social conversations include humour and gossip which fulfill hedonic values
[90, 91] that transactional conversations do not contain. To increase user
engagement in task-oriented applications, 4 participants (P1, P7,
P13, P17) reported trying to write more interesting dialogues and to create
a charming persona:“Interactive means using some good words. Something
which sounds interesting to the user.” (P17)
The importance of being entertaining was emphasized, especially when
the target users are children. “So, when it’s a kid’s application, you respond
back in a very funny way. You use, terms like ‘Okie Dokie’.” (P13)
P7, who created an Alexa application for resolving the conflicts between
children, mentioned that the persona does not need to be loving and nice to
be charming. It can be sarcastic and funny, instead:“She’s not loving and
caring, but she’s maybe a little sarcastic. She makes fun of what they say and
I would say she’s lovable, not loving.” (P7)
When the system fails to accomplish what the user asked for, design-
ers mentioned that a persona of a VUI that presents socially preferable be-
haviours can abate negative emotions from users:“...if I had the sense that
it understood its own limitation as opposed to telling me it can not do some-
thing...I’ll be more flexible with it.” (P1)
4.1.3 Transactional Conversation Characteristics
As mentioned in section 4.1, all participants except one grounded their
answers in their design experiences with task-oriented applications. For
transactional conversations, our designers want to achieve naturalness by
leveraging the way users are already accustomed to using verbal interaction
with others to get things done. P9’s example is especially illustrative: “To me, the greatest benefit [of
voice interaction] is that it’s an interface that someone will naturally know
how to use and won’t have to learn how to use a new [command]. Man, I
know people, especially kids, are great at using phones and all the things they
learn, but I really like how we can do [many things] with voice. One example
I like to give is that our internet was down, and we had to reboot the router
or whatever, and I was wondering if it worked, so I just said ‘Alexa, are you
working?’ and it says ‘Yes, I’m working’ before I even thought about it.”
(P9)
designers considered machine-like speed and memory as characteristics that
a natural VUI should have for transactional conversations, which we label
as “Beyond-human” aspects. This was surprising given that existing notions
of naturalness are primarily based on being human-like. In other words,
the designers’ expectations of naturalness in VUIs extend beyond providing
realistic conversation experiences, and include machine-specific benefits:“So
I guess a natural agent would be close to having a real conversation with
someone, but with all the added benefits of an actual application.” (P12)
Proactively Help Users
Eleven participants (P2, P3, P8, P9, P10, P12, P13, P15, P16, P18, P20)
mentioned that a natural VUI should be efficient, and proactively “detect or
even ask for the things that [the user] needs.” (P12)
P18 said a VUI should not wait for a command. If someone says they
have a problem, a natural response for humans is to ask if the person needs
help, even when not explicitly asked:“From a linguistic perspective, ‘Could
you help me with my software?’ is a yes-no question. ‘I have a problem
with my software.’ is not even a question yet. So for ‘I have a problem’,
bots need to be more proactive and ask a question, ‘Could I help you with
the software?’” (P18) Hence, a natural VUI should understand the meaning
behind the statement and take action proactively to help the users.
The efficiency of a VUI was not strictly measured by the total time taken
for the task. Instead, designers weighed the quality of the results that
users obtain against the number of conversation turns needed to
finish: “...not necessarily as short of a time as possible. But, something
that makes sense and it is value-driven for me as a user. So that means if
I’m engaging in multiple [conversation] turns just to get more, I guess, more
valuable information on my end, I’m okay with that.” (P12)
Relatedly, a natural VUI should avoid overloading users with excess
information and should minimize the number of conversation turns: “...you
do not become a robot who can keep going on and on and on and on about
all this information...You should not overload the user with a lot of informa-
tion. You should try to cut down as many decisions for the user as possible.”
(P13) The number of conversation turns can be minimized by proactively
“asking them [users] less and less and assuming more...” (P13) To ask fewer
questions, a natural VUI should actively make decisions based on contextual
information:“But if the user tells me the zip code correctly, I don’t ask him for
city and state, I use some libraries to find the name of the city and state...We
need to have a record of the entire conversation from top to bottom.” (P13)
Even though minimizing the number of questions is important, if the
consequence of failing the task is considerable, a natural VUI should ask the
user to confirm:“...so if I say things like ‘You wanted your checking account.
Is that correct?’ and I say ‘No, I want my savings account’ then that to me,
that confirm and correct [strategy] is a very important part in making it more
conversational.” (P10)
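The "confirm and correct" strategy in P10's example can be sketched as two small steps: ask for confirmation only when a mistake would be costly, then accept either a "yes" or a correction that names another known value. The function names, the `high_stakes` flag and the keyword matching below are illustrative assumptions, not a description of any participant's system.

```python
def confirmation_prompt(slot_name, heard_value, high_stakes):
    """Return a confirmation question for high-stakes slots, else None (proceed)."""
    if high_stakes:
        return f"You wanted your {heard_value} {slot_name}. Is that correct?"
    return None


def apply_reply(heard_value, reply, known_values):
    """Resolve the user's reply to a confirmation question.

    A reply naming a different known value is treated as a correction;
    a reply starting with 'yes' keeps the heard value; anything else is
    unresolved (None), so the caller should ask again.
    """
    text = reply.lower()
    for value in known_values:
        if value != heard_value and value in text:
            return value
    if text.startswith("yes"):
        return heard_value
    return None


accounts = ["checking", "savings"]
print(confirmation_prompt("account", "checking", high_stakes=True))
print(apply_reply("checking", "No, I want my savings account", accounts))
```

Letting the user correct the value within the same turn avoids an extra round of questions, which keeps the exchange consistent with the goal of minimizing conversation turns.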
Present a Task-appropriate Persona
Four participants (P3, P4, P5, P13) said that a natural VUI application
should present an appropriate persona for certain tasks. The tone of voice
should match the application’s purpose to increase user trust and elicit more
useful responses. For example, P5 mentioned that for financial applications,
the voice agent should sound serious to make the application feel more
reliable: “So when you’re creating these prompts, every company has a different
tone of voice...Like do you want the machine to be quirky? Do you want a
machine to be very serious? If you’re talking about your wealth management,
you don’t want to have a fun guy. It has to be serious.” (P5)
As another example, P4 designed an application for collecting elders’
health status. He tried to make the voice agent sound like a real doctor as
much as possible. This was done to ensure that users take the task more
seriously and report their status correctly. “...as if someone was visiting
their doctor and asking the questions...it was better than making it seems like
you were having a conversation with a friend, because it was kind of a serious
topic dealing with...people would take it more seriously if they felt that it was
a natural doctor, something like that.” (P4)
Be Capable to Handle a Wide Range of Topics in the Task Domain
Nine participants (P2, P5, P6, P7, P9, P10, P11, P16, P17) mentioned that
a natural VUI should not only be able to respond to the questions directly
related to its task but also be able to handle a wide range of topics within
the domain of its task:“I would think it [a natural VUI] would need to handle
anything that is specific to that institution, right? If I call Bank of America
and ask about my Bank of America go-card, you know you need to understand
me.” (P10)
A natural VUI should also be able to handle changes in conversation
topics as long as the topic belongs within the task domain. “Let’s say I
want to book a table for three people, and I said [to a waiter] ‘I want to book
a table.’ and [a waiter asked] ‘For how many people?’ and what if I say,
‘What do you have outdoor seating?’ I didn’t answer the question. I didn’t
say like six people or two people, because I need to know this other piece of
information, but I also didn’t say like ‘How tall is Barack Obama?’” (P9)
When a user brings up a topic that is beyond the task domain, a natural
VUI should still continue the conversation and remind the user of the task
domain in which it can help: “...if a person says ‘I want to order a pizza’,
and your skill [Amazon VUI application] has no idea what that is...give them
a helpful prompt saying ‘This is the senior housing voice assistant. I can help
you with finding when the next bus is, or finding when the next garbage day
is, or this or this.” (P2)
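The out-of-domain handling in P2's example can be sketched as a router that falls back to a prompt restating what the application can do. The intent names, keywords and the `route` function are hypothetical, and the substring matching is deliberately naive; a deployed skill would rely on the platform's own intent recognition.

```python
# Hypothetical intent table for the senior housing assistant in P2's example.
SUPPORTED = {
    "next_bus": ["bus"],
    "garbage_day": ["garbage"],
}

# Fallback prompt that reminds the user of the serviceable task domain.
FALLBACK = (
    "This is the senior housing voice assistant. I can help you with "
    "finding when the next bus is, or finding when the next garbage day is."
)


def route(utterance):
    """Return (intent, None) when a keyword matches, else (None, FALLBACK)."""
    text = utterance.lower()
    for intent, keywords in SUPPORTED.items():
        if any(keyword in text for keyword in keywords):
            return intent, None
    return None, FALLBACK


print(route("When is the next bus?"))
print(route("I want to order a pizza"))
```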
To make users aware of the boundaries of the serviceable topic domain, the
designers recommended preemptively providing context to users to help them
understand what they can do with the application: “A lot of people make a
mistake in the design by saying ‘Welcome to Toyota. How can I help you?’
And it’s like you’re going to fail right there because that’s so open-ended.
No one will have an idea of what they can or can’t say. They will probably
fail. So you have to be really clear...like ‘Welcome to Toyota’s repair center!
Would you like to schedule an appointment?’” (P9)
However, when the target users are children, the designers should expect
them to ask a lot of questions outside of the task domain: “‘What’s your
favorite color, Alexa?’ and they [children] would like to shift the conversation
or just like go to a totally different topic.” (P7)
Beyond-human Aspect #1:
Deliver Information With Machine-Like Speed and Accuracy
Our designers mentioned that, to accomplish its tasks in an efficient manner,
a natural VUI should incorporate machine-specific attributes such as high
processing powers, and selectively mimic certain parts of human conversa-
tion instead of pursuing every aspect of a natural human conversation:“This
machine will be able to talk to us as if it was a human of course...more effi-
cient, of course, more...you know you have a couple of these rules, but that’s
how I personally define natural.” (P5)
P12 mentioned that a natural VUI should attain the human-level ability
to maintain conversational context while being able to deliver accurate in-
formation in a blazing fast manner:“So it’s just super-fast processing times,
being able to deliver information while maintaining conversational context.”
(P12)
Our participants described natural human speech as often being indirect
and inefficient, so these aspects of human conversation should be left out
when designing a natural VUI.
“There are so much more words like in the human version of asking, but
it sounds human. You know, it’s not as direct or not as efficient, but there
is a kind of like a personality behind it, I guess?” (P1)
“Oh, no less conversational, because you don’t want...something that you’re
using every day. You don’t want to have that be chatty and friendly right?
You want to get your work done. so you know concentrating on being effi-
cient and giving them the information and exactly the way that they want
it.” (P10)
Beyond-human Aspect #2:
Maintain User Profiles to Deliver Personalized Services
Human memory is volatile, in contrast to machine memory. Designers
mentioned that a natural VUI should store a large amount of users’ personal
information, such as personal histories or family relationships, and combine
this information to provide a customized experience for users.
“We customize all the knowledge of the user.” (P8)
“We personalize things and make things fit each user. Suppose you have
an allergy or specific dietary requirements, then we could filter out all of those
recipes and only suggest you the recipes that fit your needs.” (P6)
However, designers are, of course, aware that storing a huge amount of
information comes with concerns about privacy. So the importance of trans-
parency on what data is stored was highlighted:“You need to be transparent
about the collected and stored data.” (P8)
4.2 Designers Experience Challenges
In order to inform where and how we should invest our efforts for future
technological and theoretical advancement, we asked our participants what
makes designing for a natural VUI the most challenging. We grouped the
challenges they described and ordered them in a list by the number of par-
ticipants who mentioned the issue. To contextualize their challenges, we also
asked them to describe their design practices.
We found that the designers largely follow a user-centered design process
that includes three phases, namely a user research phase, a high-level design
phase, and a testing phase, each described next. This matches the VUI
design process laid out by Google [92]. During the user research phase, they
conduct user research to determine requirements for their applications, collect
the user utterances, and create the personas for their applications. This
phase involves multiple user observations and interviews. In the high-level
design phase, designers create high-level designs such as sample dialogues and
flowcharts of dialogues for their applications. Designers interactively develop
high-level designs through collaborations with other designers. During this
phase, while writing the dialogues, designers create the audio outputs of the
dialogues and modify the audio and dialogues more or less in parallel. In the
testing phase, designers create prototypes and conduct user tests to validate
their high-level designs and iteratively develop real VUI applications through
multiple rounds of user testing.
Here we present the most prevalent challenges, as reported by our par-
ticipants (Figure 4.2), in designing a natural VUI. For each challenge, we
illustrate how it is related to the twelve characteristics of naturalness de-
scribed in Figure 4.1.
1. Synthesized Voice Fails to Convey Nuances and Emotion
The same sentence can convey different meanings depending on the way one
narrates the text. Paralinguistic features, such as intonation, pauses,
volume, and prosody, are essential components in expressing subtle nuances
and emotions in speech. During the high-level design phase, to make a VUI sound natural, the
designers want to have control over the way the speech synthesizer will nar-
rate their dialogues to the users. However, 9 participants (P1, P4, P5, P7,
P8, P10, P13, P18, P20) reported that current speech synthesis technology
is lacking the expressivity to interpret the intended meaning of the dialogue
text and convey it to the user. They felt that even the best speech
synthesizer still sounds like “just a robo-voice” (P4) or like “just putting
the sounds together” (P18) rather than “really meaning it [the script].” (P18)
They think that the voice synthesis technology has a large gap to bridge,
saying “there’s a long way to go for it to become very expressive.” (P7)
Figure 4.2: The 10 challenges that designers are currently encountering in designing natural VUIs
1. Synthesized voice fails to convey nuances and emotion.
2. SSML is time-consuming to use while not producing the desired results.
3. Existing VUI guidelines lack concrete and useful recommendations on how to design for naturalness.
4. Writing for spoken language is difficult.
5. Reconciling between “social” and “transactional” is hard.
6. Conveying messages clearly is difficult due to the limitation of synthesized voice.
7. Handling various spoken inputs from the users is difficult.
8. Impossible to capture all the possible situations.
9. Difficult to capture the users’ emotions.
10. Difficult to understand users’ perceived naturalness.
Hiring voice actors who can narrate the script in a natural tone and flow of
a “real voice” was reported by many participants (P3, P6, P7, P8, P9, P13)
as a common solution to make a VUI sound natural. However, recording
audio is considered to be significantly limited in flexibility and scalability.
When there is a need to change the narration, editing audio of a recorded
speech is more laborious than re-synthesizing audio from an edited text:“...if
we discover during research there are more words, then we have to hire that
actor again to speak those words again. So it was not practical at all.” (P8)
Moreover, using pre-baked recordings was not a scalable solution to gen-
erate spoken utterances of modern VUIs that are required to handle a wide
variety of data and conversation context:“Of course, when you are using a
real voice actor, it’s impossible because I should record every single street
name that we have in Brazil.” (P3)
This challenge is more severe for non-English languages. “Also the way
she [Amazon Alexa] is speaking for us in German it sounds very ironic and
sarcastic, her tone of the voice.” (P18) “[The] Dutch language is not so de-
veloped yet for Google...English words, sometimes Dutch people use English
words, but if Google is [speaking] in Dutch, then it’s sometimes complicated
[not correct].” (P8)
Due to the limited expressivity of synthesized voices, for applications
that need to convey human-like emotions, our designers reported that
they often had no choice but to record the audio with their own voices.
For example, P1 found that the currently available synthesized voices
were not good enough to express the nuanced emotions that he wanted to convey
in his storytelling application: “...there are some subtleties that I couldn’t get
Alexa to feel nostalgic about, you know, there is no command like nostalgia
about the house party that you first met this guy that you are still in love with
at, you know?” (P1)
Interestingly, for applications that do not need to express emotions,
the importance of an expressive voice was not highlighted, and expressiveness
was even considered undesirable. Being humorous is often considered to be
relatively human-specific behaviour. P1 said that using Alexa’s “robotic” voice
to tell a joke creates irony and makes the situation funnier. Hence, he used Alexa’s
voice for his VUI application where Alexa is asking about the users’ feces
every day:“Alexa asks a question that only a human would ask and it’s like
a very human written skill [Amazon VUI application] that sounds like very
robotic and because you’re talking about poop because it’s like a joke...There’s
something I think funnier about it.” (P1)
Relation with the characteristics of naturalness: The difficulty in
expressing nuanced emotions inhibits the designers from achieving the social
characteristics of the VUIs listed in Figure 4.1. P8 found that the synthesized
voice used by IBM Watson was not able to produce natural laughing sounds
and posed a risk of misrepresenting itself with unintended negative social
expressions:“The robot can not laugh, because if the robot laughs and you
just say, ‘ha, ha, ha’ it sounds sarcastic...older adults actually feels like Alice
[the robot] is laughing at them. So that’s bad.”
2. SSML is Time-Consuming to Use While Not Producing the
Desired Results
Designers can emphasize parts of a dialogue or modify the prosody of the
audio, such as pitch, volume, and speed, by using Speech Synthesis Markup
Language (SSML) during the high-level design phase. Tech giants such as
Google, Amazon and IBM continuously develop and support their own sets of
SSML tags. However, many of our participants (P2, P3, P7, P9, P10, P11,
P13, P15, P16) pointed out that writing and editing SSML tags is “time
digging” (P13) and frequently fails to yield the desired result: “I haven’t
been very successful at doing this with SSML. It makes very minor changes,
but it doesn’t come close to what it would be if you use a voice actor, for
example.” (P7)
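To make the kind of markup under discussion concrete, the following is a minimal sketch of how a prompt with emphasis and prosody changes might be assembled. It assumes only the standard `<speak>`, `<emphasis>` and `<prosody>` elements from the W3C SSML specification; the helper names and the example prompt are hypothetical and are not taken from any participant's project, and individual platforms may render or restrict these attributes differently.

```python
from xml.sax.saxutils import escape


def prosody(text, pitch=None, rate=None, volume=None):
    """Wrap text in an SSML <prosody> element, emitting only the attributes given."""
    attrs = {"pitch": pitch, "rate": rate, "volume": volume}
    attr_text = "".join(
        f' {name}="{value}"' for name, value in attrs.items() if value is not None
    )
    return f"<prosody{attr_text}>{escape(text)}</prosody>"


def emphasis(text, level="moderate"):
    """Wrap text in an SSML <emphasis> element."""
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def speak(*parts):
    """Join already-escaped fragments into one <speak> document."""
    return "<speak>" + "".join(parts) + "</speak>"


# Hypothetical prompt: each prosodic feature has to be set by hand, one value at a time.
prompt = speak(
    escape("Your order is "),
    emphasis("almost", level="strong"),
    escape(" ready. "),
    prosody("Please wait a moment.", pitch="-10%", rate="slow", volume="soft"),
)
print(prompt)
```

Even in this small sketch, every adjustment is a separate hand-chosen value, which is consistent with the effort our participants described.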
It was difficult to make a whole sentence flow naturally, and the result felt
“still too mechanical” (P5), even after fine-tuning speech timings by meticulously
entering numerical values (e.g., 0.5 seconds of whispering): “I think it’s not
very natural, like another 0.5-second break here, another somewhat slower
here, all those things.” (P15)
The poor design of SSML authoring interfaces, which made SSML too
time-consuming to use, was another point of frustration. Most of
our designers were using a simple text editor or generic XML mark-up tools
to write SSML tags. Hence, they had to (re)write and (re)listen to the
whole sentence or paragraph even when making only a small change to their
dialogues: “I mean even just in the best-case scenario, let’s say you listen to
a prompt, you decided that you wanted to change one thing by using SSML.
You change that thing. You listen to it again. That’s your best-case scenario,
and right there you just spent, I don’t know, couple minutes maybe, and if
you have a hundred prompts to do, it’s just not worth it for the small benefit
you’ll get.” (P9) Also, it was hard to judge when an SSML tag had reached
the optimal level of expressiveness. Hence, our designers often spent a lot of
time iteratively modifying SSML tags without knowing when to stop: “Hard
to stop, like, I’m not satisfied with what I got there, so I just keep on changing
something here and there.” (P15)
Our designers had a hard time using SSML when trying to imbue a narration
with emotional expressivity. Expressing a subjective experience of emotion
requires holistic control of all prosodic features at the same time. However,
SSML only offers control over one prosodic element at a time: “The technology
should be mature a little bit for us to have
an SSML tag that is empathetic. It has a very subjective nature. The thing
is that it’s not very objective. The objective things are being loud, silent...but
this is completely different.” (P13)
Due to these limits of the current SSML, most of the participants had
abandoned SSML except for making simple changes to the audio, such
as inserting breaks, slowing the speed, and correcting the pronunciation
of mispronounced words: “In the SSML, I spent some time a while ago, like
there is a prosody tag like that. I just tried and tried and, usually it just didn’t
do what I wanted it to do.” (P9) “The prosody tag can be really difficult to
deal with right?...I haven’t played with SSML for a couple [of] years.” (P10)
“It’s been a while since I played with SSML. Well...[the SSML tag for putting
a] break I use it all the time still.” (P7)
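The three simple uses that participants kept (breaks, slower speech, and pronunciation fixes) can be sketched as follows. This is an illustrative sketch only: it assumes the standard `<break>`, `<prosody rate>` and `<sub alias>` elements from the W3C SSML specification, the helper names are hypothetical, and the respelling used for "ethereum" is an invented example rather than a verified fix for any particular synthesizer.

```python
from xml.sax.saxutils import escape, quoteattr


def pause(milliseconds):
    """Insert a pause of the given length."""
    return f'<break time="{int(milliseconds)}ms"/>'


def slow(text, rate="slow"):
    """Speak the text at a reduced rate."""
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


def substitute(written, spoken):
    """Ask the synthesizer to speak `spoken` in place of the written form."""
    return f"<sub alias={quoteattr(spoken)}>{escape(written)}</sub>"


ssml = (
    "<speak>"
    + escape("The price of ")
    + substitute("ethereum", "ee-theer-ee-um")  # hypothetical respelling
    + escape(" went up.")
    + pause(500)
    + slow("Would you like to hear more?")
    + "</speak>"
)
print(ssml)
```

These edits are local and have a predictable effect, which may explain why they survived while the more open-ended prosody tuning was abandoned.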
If the application targets multiple platforms (e.g., Google Home and
Amazon Alexa), using SSML takes even more time. Even for the same tag, the
resulting voice can differ between platforms, and the same set of SSML
tags might not be available on every platform. Designers therefore have to test
their SSML tags on each platform: “Different speech synthesizers are going
to have different packages, so I want to be able to play with the SSML before
I decide on how this is going to work.” (P10)
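Part of this per-platform checking could in principle be automated. The sketch below flags tags in a prompt that a target platform does not list as supported. The platform names, the support table, and the `vendor:effect` tag are all hypothetical placeholders; a real table would have to be filled in from each vendor's SSML documentation, and such a check cannot tell a designer how a supported tag will actually sound.

```python
import re

# HYPOTHETICAL support table, for illustration only.
SUPPORTED_TAGS = {
    "platform_a": {"speak", "break", "prosody", "emphasis", "sub", "vendor:effect"},
    "platform_b": {"speak", "break", "prosody", "emphasis", "sub"},
}

# Matches the name of an opening (or self-closing) tag; closing tags start with "</".
TAG_PATTERN = re.compile(r"<\s*([A-Za-z][\w:.-]*)")


def tags_used(ssml):
    """Return the set of element names that appear in an SSML string."""
    return set(TAG_PATTERN.findall(ssml))


def unsupported_tags(ssml, platform):
    """Return the sorted element names the given platform does not list as supported."""
    return sorted(tags_used(ssml) - SUPPORTED_TAGS[platform])


ssml = (
    '<speak><vendor:effect name="whispered">Psst.</vendor:effect>'
    '<break time="300ms"/>Hello.</speak>'
)
for platform in SUPPORTED_TAGS:
    print(platform, unsupported_tags(ssml, platform))
```

A check like this only covers tag availability. Differences in the resulting voice still have to be judged by listening on each platform.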
Relation with the characteristics of naturalness: Being able to
produce an empathetic voice is essential for having positive social interactions
with users, as stated in Figure 4.1. Even though SSML is the only available
way of modifying synthesized voices, our designers reported that its limitations
in creating a truly empathetic voice and the significant amount of
time it requires are obstacles to achieving the social characteristics.
3. Existing VUI Guidelines Lack Concrete and Useful Recommen-
dations on How to Design for Naturalness
In terms of writing VUI dialogues during the high-level design phase, 6 par-
ticipants (P2, P5, P6, P10, P11, P19, and P20) mentioned 3 problems with
the existing VUI guidelines.
First, they found the existing VUI guidelines easy to dismiss as clichéd.
More specifically, our designers mentioned that some of the existing guidelines
are “somewhat common sense in terms of avoiding using technical language,
try and make it casual and simple”, and thus easy to set aside: “I feel like
it’s kind of obvious and you know that when you’re creating something like a
voice skill...I probably read it once, and I just left it.” (P6)
Secondly, our designers found that the existing design guidelines do not always
apply to a given VUI, depending on the context of the project: “At the same
time, I think every company will have its own set of these [Design guide-
lines]...I mean, some apps are made to comfort people and make them feel
less alone, and those guidelines are completely irrelevant, so it does depend
on the context.” (P5)
Lastly, they pointed out that the guidelines are lacking useful linguistic
insights:“I feel like linguistic principles are a little more difficult to come
by for voice interaction designers.” (P2) Linguists have been discovering
hidden patterns of natural human conversations that people use without
realizing it. To create VUIs that enable natural conversations with users,
VUI designers should know how to use these linguistic insights on natural
conversations as well as how to incorporate them into their design
strategies for better user experiences. However, our designers reported that
there is a “disconnection” between the linguistic insights and their design
strategies.
“So there’s a lot you can do with language, with pragmatics [a sub-field
of linguistics] and this is not the rocket science. I mean, we have a lot of
research about pragmatics already, since the 70s, since the 80s. Well, they’re
explored in research well enough, but they’re not connected well enough with
the IT department.” (P18)
The designers who found the existing design guidelines useful reported
that whenever they design VUI applications, they need to make an
intentional effort to return to the guidelines and refresh their memory of
them: “So for example, I need to work on confirmations. Let me go to refresh
my memory on how to do confirmation style...I don’t have to like constantly
go back to them [the design guidelines], but I certainly do go back in and look
[at them]” (P9), or that they skip applying them because of time constraints: “So there
are principles that I work with, but that’s just going to have to come to me
at the moment. I’m not going to go back.” (P11)
Relation with the characteristics of naturalness: This challenge
hinders designers from creating a VUI that provides a great user experience
while achieving the fundamental characteristic of a natural VUI, ‘Sound like
a human is speaking’. Our designers requested more specific design guidelines
that provide actionable advice on how to connect these linguistic components
with design intentions.
“I have no idea on how to use this construct behind these phrases, and
how to break it down to use in my favorite design process...The contrac-
tion, phonetics, and ‘Umm’...I think understanding how meaning can change
depending on how things are structured and organized in the dialogue” [is
important]. (P20)
When writing dialogues during the high-level design phase, participants
often found it hard to write for spoken language. When people
talk, they adopt spoken language characteristics such as filler words,
personal pronouns and colloquial words [53, 54, 55]. However, they exhibit
these characteristics subconsciously. Hence, when our designers are writing,
they often forget to incorporate these characteristics.
“What I want to say is when you just started [as a VUI designer], it’s
difficult because you have to really understand that the text you’re writing
is not something to be read. It’s something to be spoken. It’s for a spoken
interface, so you have to really try to imagine yourself in that situation and,
like I said, just write like you are writing a screenplay or something like that.”
(P3)
Designers write dialogues by typing on a keyboard rather than speaking
them, and our designers reported that it is often hard to detect unnaturalness
in a dialogue just by reading it: “Challenges? So the platform limitation is
definitely a challenge. A lot of times, the conversation sounds good on paper,
but you really have to just say it.” (P12)
Since we subconsciously use the characteristics of spoken language when
conversing with others, P10 mentioned that the unnaturalness caused by
writing can be hard to detect, and many people treat this problem as
insignificant and hence do not put effort into fixing it: “I
think the hard thing about these kinds of interfaces is that people feel like
because they can talk, because they speak English, so they can write one of
these interfaces.” (P10)
From our survey of the existing VUI design guidelines, we found that multiple
guidelines already exist to support designers in writing spoken
dialogues [12, 13]. These guidelines tell VUI designers which characteristics
of spoken language they should incorporate in their VUI dialogues.
Since people use these characteristics unconsciously, designers who do not
return to the guidelines and verify their dialogues manually often forget to
apply the rules from design guidelines when writing their dialogues.
For example, even though there is a design guideline asking designers to
avoid putting too much information in one line [93], designers often make
this mistake: “The frequent mistake is people are giving a big amount of the texts
and yeah just read it.” (P18)
This makes designing a natural VUI challenging. However, P3, a profes-
sional VUI designer, mentioned that this could be overcome as designers gain
more experience:“At first, it is difficult, because, like I said, normally when
you’re writing, you’re writing for someone to read it, not to speak it...[It]
took something like maybe three months or so to really get used to this way
of writing, the practice of writing dialogue.” (P3)
Relation with the characteristics of naturalness: This difficulty
hinders designers from creating VUI dialogues that sound like spoken dialogues,
and thus runs counter to the fundamental characteristic of a natural
VUI, ‘Sound like a human is speaking’.
5. Reconciling Between “Social” and “Transactional” Is Hard
Our designers tried to embrace the characteristics of social conversations
to provide more realistic and interactive conversation experiences to their
users. However, since most of them were designing task-oriented applications,
they found that the goals of task-oriented applications often conflict with
the desire to embrace social conversation characteristics.
Five designers (P5, P11, P15, P17, P20) mentioned their desire to add
more components of social conversations to their VUI designs to make them
more realistic. However, in attempting to do so, they found that the dialogue
gets longer, which conflicts with the task-oriented goal of completing tasks
efficiently: “So obviously I wanted to write the dialogues that felt [like a] human
[is speaking and] didn’t feel robotic, but I soon realized that things are more
complicated. The more you want to add personality to things, then the longer
becomes your dialogue.” (P20)
So, our designers felt that the two categories of the naturalness characteristics
often do not get along with one another: “...efficient, but it has to come
up as like friendly [and] conversational.” (P5) “Challenges, I told you earlier,
challenges are keeping it simple, yet interactive. It should sound familiar. It
should sound friendly. It should not go out of the voice, so like that.” (P17)
Relation with the characteristics of naturalness: This often con-
fused our designers when they were writing the dialogues and often made
them give up incorporating social characteristics. “But again we’re still
thinking, we’re still in the process of, ‘Should we actually put in those lit-
tle sentences [for having social interactions] or not?’” (P6) “I would prefer,
right now, to focus more on helping people achieve their goals and move on
with their lives more than a kind of having these artificial entities talking to
6. Conveying Messages Clearly Is Difficult Due to the Limitation
of Synthesized Voice
Our designers (P4, P8, P9, P12, P14) faced a challenge in conveying
precise meanings through synthesized voices. This is because synthesized
voices often mispronounce certain types of words and sentences, and are not
able to produce proper intonations and tones.
Our designers elaborated on three specific problematic cases of mispronunciation.
Synthesized voices often fail to: (1) put proper breaks when pronouncing
long sentences: “...some words just kind of got mashed together, like
into one whole long word, and it [the long sentence] sounded weird.” (P4); (2)
pronounce contractions clearly: “That’s what it’s supposed to say, ‘What’ll
it be?’, and then when it [Amazon Alexa] actually says it, it will be like,
‘Whatill it be?’” (P12); and (3) pronounce proper nouns such as names of
cryptocurrencies and people: “...she can’t pronounce ethereum [a cryptocurrency]
and mmm that’s a popular one.” (P14), “Names of people, Google says
it differently.” (P8)
Our participants also elaborated on two problematic cases of inadequate
tones and intonations: (1) sentences with a question mark, and (2) non-lexical
words. P9 mentioned the difficulty of producing natural sounds for
interrogative sentences ending with a question mark: “It’s so hard sometimes
to get the text to speech to ask a question in a way that makes sound natu-
ral...so I changed the question mark to a period, and then it said ‘Which one
would you like?’, which is more how a person would actually say it, because a
lot of times when we ask a question, we’re not doing a rising intonation.” (P9) P8
reported that synthesized voices do not produce proper tones for non-lexical
words (i.e., words that do not have a defined meaning) such as laughter. Her
design intention was to make her robot laugh with a happy tone, but it had
a sarcastic tone instead:“The robot can not laugh, because if the robot laughs,
Relation with the characteristics of naturalness: The limitations
of speech synthesis systems (i.e., mispronouncing words and sentences and
not being able to produce proper intonations) hinder designers not only from
attaining the fundamental characteristics of a natural VUI (‘Sound like a
human is speaking’ and ‘Use appropriate prosody and intonation’) but also
from achieving a social characteristic of it (‘Express sympathy and empa-
thy’). This is because non-lexical words are essential in human conversations
for social interactions. People feel mutual understanding and compassion
towards each other when they laugh together. Hence, when a VUI cannot
produce natural laughing sounds, it is more challenging to provide harmonious
interactions with users.
7. Handling Various Spoken Inputs From the Users Is Difficult
Four participants (P2, P6, P9, P11) mentioned that the current NLU engine
is still not good enough to understand various expressions of our language,
and this requires VUI designers to provide possible synonyms and expressions
when they are writing dialogues for training the NLU engine:“Yeah, I still
need to interview them [the VUI users]. I will, just because I don’t think
the natural language engine is good enough to be able to figure out all the
different ways you can ask for a bus.” (P2) However, they often found that
the collected synonyms and expressions do not cover all the possible ones: “I
don’t necessarily know what the users are going to say back. So I’ll make
up something that you might say but that’s certainly not the same as a user
who’s never used it [the VUI application] before.” (P9)
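The practice of supplying synonyms and alternative phrasings can be illustrated with a small sketch that expands utterance templates into training examples. The templates, slot names and synonym lists below are hypothetical, loosely inspired by P2's bus example, and the sketch does not follow any particular NLU engine's training format.

```python
from itertools import product


def expand_utterances(templates, slots):
    """Fill each template with every combination of the synonyms for its slots."""
    utterances = []
    for template in templates:
        # Only expand the slots that this template actually mentions.
        names = [name for name in slots if "{" + name + "}" in template]
        for combo in product(*(slots[name] for name in names)):
            utterances.append(template.format(**dict(zip(names, combo))))
    return utterances


# Hypothetical designer-written templates and synonym lists.
templates = ["{ask} the next {vehicle}", "is there a {vehicle} soon"]
slots = {
    "ask": ["when is", "what time is"],
    "vehicle": ["bus", "coach"],
}

utterances = expand_utterances(templates, slots)
for utterance in utterances:
    print(utterance)
```

The output can only be as varied as the lists the designer thought of, which mirrors P9's concern that made-up phrasings are not the same as what a first-time user would actually say.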
Designers pointed out that understanding human vocabulary can be
especially challenging due to its personal nature. The same word can be used
to express different meanings depending on the conversational context (e.g., age
and individual differences). Hence, designers mentioned that user utterances
should be understood in a personalized context: “You need to personalize
the VUI to a user...You can not talk with Alexa if you’re older because Alexa
won’t understand you.”