
Distributed Multimodal Interaction in a Smart Home Environment

Dilip Roy

June 22, 2009
Master's Thesis in Computing Science, 30 ECTS credits

Supervisor at CS-UmU: Dipak Surie
Examiner: Per Lindström

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN


Abstract

This thesis describes an infrastructure that explores distributed multimodal interaction in a ubiquitous computing environment. The proposed DMI (Distributed Multimodal Interaction) was designed to avoid device-centric interaction. Multimodal interaction also extends the system to new user groups, since the user remains mobile while interacting with it. During the design phase, two input modalities, speech and gesture, and two output modalities, visual and audio, were chosen to provide explicit interaction, while Wizard-of-Oz software was implemented to provide implicit interaction. DMI was designed for interaction with the proposed SEOs (Smart Everyday Objects) embedded in a living laboratory home environment. During the implementation, different software components were developed, such as GBI (Gesture Based Interaction) and SUI (Speech User Interface), for recognizing human-initiated gesture and speech commands. A quantitative evaluation was conducted to measure precision and recall. The proposed GBI scored a precision of 89.38% and a recall of 92.19%, while the SUI scored a precision of 89.28% and a recall of 90.13%.


Contents

1 Introduction
  1.1 Ubiquitous Computing
  1.2 Multimodal Interaction

2 Problem Description
  2.1 Background of the problem
  2.2 Goals

3 Related Work
  3.1 Multimodal Interaction
  3.2 Speech Recognition
  3.3 Hand Gesture Recognition
    3.3.1 DataGlove
    3.3.2 PowerGlove
    3.3.3 GestureWrist
    3.3.4 GesturePad
    3.3.5 InfoStick
    3.3.6 DigitEyes
    3.3.7 VBA (Vision Based Approach)
    3.3.8 GestureVR
    3.3.9 Finger-Pointer
  3.4 Ambient Display
  3.5 Context-Aware Computing
  3.6 Smart Home System

4 Conceptual Design
  4.1 Distributed architecture
  4.2 Multimodal Design
    4.2.1 Gesture Input
    4.2.2 Speech Input
    4.2.3 Visual Output
    4.2.4 Audio Output
    4.2.5 How DMI works
  4.3 Designing Interaction States
  4.4 Interaction Languages
  4.5 Addressing Implicit and Explicit Interaction

5 System Implementation
  5.1 Hardware Infrastructure
    5.1.1 Smart Everyday Objects
  5.2 Software Infrastructure
    5.2.1 Gesture Recognition
    5.2.2 Speech Recognition and Synthesis
    5.2.3 Thin Client Software
    5.2.4 Mock-up applications
    5.2.5 Wizard-of-Oz Software
    5.2.6 Scheduler
  5.3 Database Design

6 Evaluation, discussion and limitations
  6.1 Evaluation
    6.1.1 Experimental Setup
    6.1.2 Scenarios
    6.1.3 Results
  6.2 Discussion and limitations

7 Conclusion

8 Acknowledgements

References


List of Figures

4.1 Left hand gesture user interface functionalities [41]
4.2 Right hand gesture user interface functionalities [41]
4.3 Speech user interface functionalities [41]
4.4 Visual outputs represented through Bookshelf SEO [41]
4.5 Conceptual model describing DMI (Distributed Multimodal Interaction)
4.6 Interaction states between the user and the SEOs [41]
4.7 Speech and gesture recognition areas for recognizing and providing an explicit input to the system by a user
4.8 User's current activity based on three scenarios, which provides implicit input to the system through a wizard

5.1 Wearable hardware equipment
5.2 WLAN Access Point Belt
5.3 Representing SEOs in a smart home
5.4 Visual output for open/closing gestures and controlling the audio channel
5.5 SpeechListBox components
5.6 Data flow diagram for speech recognition based on commands
5.7 Proposed thin client along with different components
5.8 Representing Dining Table SEO designed to interact with ten applications
5.9 Wizard-of-Oz software with different components
5.10 Scheduler table for identifying the time dimension
5.11 Database relationship table

6.1 User holding (1) Bluetooth headset, (2) accelerometers, (3) access point belt
6.2 User's activities based on the weekday morning scenario
6.3 User's activities based on the weekday evening scenario
6.4 User's activities based on the weekends scenario


List of Tables

3.1 Speech engines with various functionalities

4.1 Dialog conversation between the human and the computer

5.1 Left hand recognized gestures and associated operations based on acceleration values
5.2 Right hand recognized gestures and associated operations based on acceleration values
5.3 Nineteen applications with respect to the SEOs

6.1 Subjects' daily activities based on scenarios
6.2 The average quantitative evaluation result for the GBI (Gesture Based Interaction) and the SUI (Speech User Interface) [41]


Chapter 1

Introduction

1.1 Ubiquitous Computing

The WIMP (Windows, Icons, Menus, Pointers) paradigm characterizes the desktop computing era, in which a user interacts with a computer through a pointing device such as a mouse or touch pad, or by manipulating graphical icons. The WIMP approach was first introduced at Xerox PARC and is now found in most graphical user interfaces, such as the Apple Macintosh, Linux and Microsoft Windows. Mark Weiser [1] argued that "the world is not a desktop" and emphasized that, thanks to technological advances in sensing, distributed computing, communication protocols, and advanced software and hardware components, traditional desktop computing can be reshaped into invisible computing, the third era of computing termed "Ubiquitous Computing", or "Ubicomp" for short. In a ubiquitous environment, a user can invisibly interact with multiple computing devices while performing everyday activities, based on the preferred context [2]. Weiser et al. [3] also noted that such a system should behave as calm technology, engaging both the center of the user's attention and the periphery in order to provide ambient interaction. Ubiquitous computing inherits from two core areas: distributed systems and mobile computing [4]. Distributed systems mainly deal with remote communication, high availability of resources and access to remote information (distributed file systems or distributed databases), whereas mobile computing mainly deals with mobile networking and wireless information access through ad-hoc protocols. Weiser [5] summarized his view of ubiquitous computing with the statement that "the most profound technologies are those that disappear". Such disappearance can be accomplished by embedding computing technology in the everyday objects available in the environment. In this work, a similar approach was taken by embedding computing technology in a living laboratory home environment, in objects termed SEOs (Smart Everyday Objects).


1.2 Multimodal Interaction

Human beings do not communicate with each other through a single channel; they use many different modes. These channels include speech, gestures, handwriting (digits, drawings, characters), touch screens, vibration, sign languages, facial expressions (gaze and eye movement), and even traditional input channels such as the keyboard or pointing device [6]. Multimodal interaction therefore offers a much greater variety of design options than the traditional WIMP paradigm. Every multimodal system is concerned mostly with recognition techniques (speech, gesture or handwriting) for interpreting the different input channels. The availability of multiple modalities overcomes the weakness of a single modality: when one channel is occupied, for example while driving or talking, another modality can be used instead. A user might, for instance, interact with the system through speech while cooking or driving a car. In general, for input modalities (speech, gestures, facial or body expression) the user decides which modality to use, based on their freedom of choice [7]. For output modalities (visual, tactile, aural), on the other hand, the system may select an appropriate channel by checking the context of the user's attention. For example, when a user is attending a weekly meeting, the system selects tactile (vibration) or visual feedback, based on contextual information (the user's location, identity, time or activities), rather than aural feedback. The following examples show how multimodal interaction increases the accessibility of a system:

- A user interacts with the interface attached to the fridge through speech or gestures (because of the distance) or through a pointing device (touch screen), and receives feedback through the visual or audio channel according to preference.
- A user at a bus stop or railway station cannot talk with a friend because of the surrounding noise. In this context, gesture commands or handwriting input can be used to communicate with the system.
- The system is presenting unnecessary information through the aural or visual channel. Here, human-initiated facial expressions such as gaze or eye movement, or body movement (sign language), can affect the behaviour of the output channel.
- Visually impaired users can communicate through a limited set of speech commands and may prefer the aural or tactile output channel over the visual modality.
- Hearing impaired users can provide input through speech or gestures and may prefer the visual or tactile output channel.

The next chapter presents the problem description and clarifies the background of the problem and the overall goals. The third chapter reviews related work in the areas of multimodal interaction, speech recognition, hand gesture recognition, ambient displays, context-aware computing and smart home systems. Chapters 4 and 5 explain the conceptual design and the implementation of the proposed system, respectively. Chapter 6 deals with the system evaluation, general discussion and limitations. Finally, chapter 7 presents the conclusion.


Chapter 2

Problem Description

2.1 Background of the problem

The WIMP interaction paradigm has been dominant since the early 1980s. It has several limitations related to user mobility and its device-centric approach, and overcoming these limitations was the motivation behind this work. Designing a multimodal approach, whether for the traditional computing environment or for a ubiquitous environment, is always a challenging task in the HCI field. Multimodal interaction helps in the following ways:
(a) A system with a multimodal approach can support a user-centric approach, where the interaction depends on the user's choice.
(b) The user has the freedom to choose between input and output modalities.
(c) A multimodal system supports natural communication channels (speech, gestures, body expression etc.), which might reduce cognition errors [8].
(d) The user can recover from errors that occur while interacting through a single modality, and can avoid such errors by switching to another modality [8].
(e) A multimodal system can increase the accessibility of the system (explained in section 1.2).

In this work, such a system, DMI (Distributed Multimodal Interaction), was introduced, where a user can interact with multiple virtual objects embedded in the SEOs spread throughout a smart home. DMI was intended to address the limitations with respect to user mobility: a user can interact through different input (speech and gesture) and output (visual and aural) channels based on the preferred context. Different SEOs were selected, designed and experimented with in the living laboratory home environment. They comprised virtual objects (mock-up applications such as weather and daily schedule) and wearable objects or mediators (a Bluetooth headset for speech and accelerometer wristbands for gesture). The proposed interaction channels realize the multimodal approach in a smart home environment, which addresses the limitations of the traditional WIMP interaction paradigm. From a design perspective, the user can decide which modality (speech or gesture) to select, based on their preferred context. During the implementation, the Wizard-of-Oz software was constructed to maintain synchronization between the user's personalized content and the designed SEOs.


2.2 Goals

The following goals were considered:

– Exploring multimodal interaction in a distributed architecture where a user can interact with the proposed SEOs through different input and output channels. The following subgoals were defined:
  - Selecting or filtering the SEOs for explicit as well as implicit interaction.
  - Managing synchronization between the user's personalized content, managed through a personal server in the mock-up database, and the mock-up applications running on the SEOs.

– Understanding the user's central attention, as well as the peripheral attention, for initiating ambient interaction.


Chapter 3

Related Work

Related work was reviewed in terms of the following keywords:

– Multimodal Interaction

– Speech Recognition

– Hand Gesture Recognition

– Ambient Display

– Context-Aware Computing

– Smart Home System

3.1 Multimodal Interaction

Multimodal interaction combines traditional graphical interfaces, as used in the desktop computing environment, with other modalities such as speech-oriented dialog or gesture-based interaction. H. Rossler et al. [9] describe how multimodal interaction can be brought to existing mobile environments through a new dialog-based MML (Multimodal Mark-up Language). MML was designed to work with different modalities such as speech, graphics, handwriting and touch-sensitive input. Traditional mobile phones have limitations due to their small display size, limited number of keys and small memory. These limitations can be mitigated by adding multimodality to the existing technology; for example, correctly recognized speech input can launch an application without any navigation or pointing. A system with multimodal orientation can also support effective context awareness. Synchronization between modalities is not required when only a single modality is recognized, but it is required when more than one modality must be recognized together. H. Rossler et al. [9] built two prototypes at Alcatel and evaluated them by introducing multimodality into a browser environment. The first prototype is a GUI (Graphical User Interface) that integrates speech synthesis and handwriting recognition as applets and can run in any standard browser. The second prototype implements a multimodal browser by integrating speech and handwriting recognition.


Modality selection demands a calculated situation [10], or context, for interacting with the resources that are available in that context. This paper mainly deals with modalities in different contexts and the relative effectiveness of interaction methods. During the design phase, a messaging application was selected and evaluated in two simulated scenarios: driving a car and walking. A speech-enabled browser running on a Nokia N800 represented the visual and audio modalities, while 2D and 3D (tilt) gestures served as the input modality. Simple 3D tilting gestures such as left, right, up and down were detected where required. Both contexts, driving and walking, were simulated for evaluation purposes. In the walking scenario, the user held the Nokia N800 Internet Tablet to interact with the application, wore a head-worn microphone for giving voice commands, and received aural output through headphones. In the car scenario, the application ran on a Tablet PC, the user gave voice commands through a head-worn microphone, and aural output was played through the car's speakers. The results showed that users preferred gestures (2D and 3D) in the walking context, whereas they preferred speech in the driving context.

It is always a challenge to select the best modality [11] in mobile as well as pervasive environments. The presentation of information depends on the environment and has to be adapted to it. This paper examines how the user's attention, or cognitive load, plays an important role when selecting a particular input or output modality. When designing multimodal input, the system sometimes offers combinations and sometimes a single modality, and it is up to the user to decide which one to use. In the case of multimodal output, however, the system itself decides where and how to present the information. While interacting with a multimodal system, the user should know its characteristics: how many ways there are to interact, which language to use, the error prevention techniques, and how to obtain feedback in case the user gets lost. Different contexts have been suggested for designing output modalities. For example, when users are in public places or in offices, the visual or graphics modality might be effective, whereas audio feedback might be more effective during outdoor activities such as playing football, drawing or walking. The haptic modality (vibration, pressure, temperature etc.) might be effective in noisy situations such as a railway station, bus stop or airport. The paper also suggests contexts in which input modalities are used. For example, when users are watching TV or observing some device, a visual input modality (camera vision or a proximity sensor) might be effective. Aural input (speech through a microphone) suits situations where the hands are busy, for instance while writing. Gesture-based interaction (2D or 3D hand gestures) might be a good choice when users are talking with somebody or eating.

MATCH (Multimodal Access to City Help) [12] introduced a multimodal dialogue system for New York City. The interface uses different input modalities, namely speech, handwriting and gestures, while output is presented through a visual display running on PDAs or smart phones, and through speech. MATCH was designed to locate restaurants on city maps when someone communicates through speech or handwriting. For example, if users draw a circle around a location presented on the PDA and add a query such as "cheap pizza" under the circle, they are immediately shown relevant information such as the restaurant location and telephone number. Users can also combine modalities, for instance asking for "Italian pizza" by speech while indicating the location with the pen by drawing a circle on the map. The system was designed to allow a dynamic conversation and also shows complete subway information related to a restaurant location. For example, if the user asks "How do I go there?" and circles one of the restaurants displayed on the map, the system asks "Where is your location?"; the user can then reply by speaking an address or by writing it, and the system responds with a complete map showing how to reach the desired destination. MATCH uses AT&T's speech recognition engine. The pen-based commands, gesture and handwriting recognition, use Rubine's classic template-based gesture recognition algorithm. The handwriting recognizer recognizes 256 words in total, including words like 'cheap' and 'italian', and the gesture recognizer recognizes approximately 10 gestures, including arrows, circles and lines.

3.2 Speech Recognition

Speech Recognition (SR) is a technology that recognizes spoken input captured through different channels: a body-worn microphone, a microphone connected to a body-worn computer, a traditional mobile phone, or an environmental microphone attached to an everyday object. Recognition is performed by a software component called a speech recognition engine, which typically follows algorithms such as HMM (Hidden Markov Models), ANN (Artificial Neural Networks) or DTW (Dynamic Time Warping). The SRGS (Speech Recognition Grammar Specification), defined by the W3C [13], describes how a grammar should be prepared and used for effective speech recognition. The W3C proposes two grammar formats: an Augmented BNF (ABNF) form and an XML form. The ABNF form is plain text, whereas the XML form represents the grammar using XML elements. The table below presents different speech engines and their characteristics.
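To make the two SRGS forms concrete, the sketch below shows the same minimal command grammar written once in ABNF and once in XML. This is an illustration only; the command words are hypothetical and are not the grammar used in this thesis.

# Illustrative only: one tiny command grammar in the two SRGS forms
# defined by the W3C (ABNF and XML). The command words are hypothetical
# examples, not the grammar used in this work.
ABNF_FORM = """#ABNF 1.0 UTF-8;
language en-US;
root $command;

public $command = open | close | pause | resume;
"""

XML_FORM = """<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="command">
  <rule id="command" scope="public">
    <one-of>
      <item>open</item> <item>close</item>
      <item>pause</item> <item>resume</item>
    </one-of>
  </rule>
</grammar>
"""

# Both strings describe the same language; an SRGS-compliant recognizer
# accepts either form.
print(ABNF_FORM)
print(XML_FORM)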


LumenVox [14]
Description: LumenVox is known as an accurate speech recognizer because of its real-time accuracy. It was built for both the Windows and Linux platforms, and offers developers an effective platform for changing the grammar, customizing different components during design, choosing the audio format, and so on. LumenVox has two versions: Speech Engine Lite (500 pronunciations) and Speech Engine (unlimited).
Advantages/Disadvantages: LumenVox supports multiple languages for the purpose of globalization. It supports MRCP (Media Resource Control Protocol) versions 1 and 2 and SRGS (Speech Recognition Grammar Specification) as defined by the W3C. It has server-side grammars, which allow a client to send pre-loaded grammars to the server. LumenVox is not freeware.

Dragon NaturallySpeaking 10 SDK [15]
Description: Dragon NaturallySpeaking is a powerful speech recognition product developed by Nuance. Different sectors such as medicine, insurance, government and education, as well as people with disabilities (for accessibility reasons), use this software for their daily purposes. It has received 175 awards for its accuracy and usability. Every user must go through a training phase to get used to it. It has two editions, Client and Server, and also provides a package that supports rapid prototyping by application developers.
Advantages/Disadvantages: Dragon NaturallySpeaking has good support for both Windows and Macintosh users. It works with all kinds of Windows applications, such as MS Word for document creation, MS Excel for spreadsheets, or writing an email through Internet Explorer. Blind users can use the software for daily activities such as opening or closing a document. Dragon NaturallySpeaking does not have a freeware version, but it is still very popular for its accuracy and usability.

Embedded ViaVoice SDK [16]
Description: Embedded ViaVoice is a speech technology for mobile and automotive components developed by IBM. It has a powerful engine architecture for speech recognition and speech synthesis, and introduces a phoneme-based model responsible for highly accurate recognition and noise detection.
Advantages/Disadvantages: Embedded ViaVoice has broad language coverage through the support of the IBM speech research team. It recognizes a vocabulary of approximately 200,000 words and provides a toolkit for application developers. It is not free to use.

SpeechStudio Suite [17]
Description: SpeechStudio is a complete utility for rapid prototyping by application developers working on speech-based solutions. It also works well with telephone-based applications, and with desktop as well as Tablet PC applications, showing high accuracy. It has four versions: SpeechStudio Suite, Profile Manager, Lexicon Developer and Lexicon Lite.
Advantages/Disadvantages: SpeechStudio offers different kinds of telephony support, such as speech recognition, audio output and recording, and touch-tone recognition. It has a recording option for recognizing specific speech and also works as a synthesizer (Text-To-Speech). Its API is free for application developers, whereas the versions mentioned above are not free to use.

Table 3.1: Speech Engines with various functionalities


3.3 Hand Gesture Recognition

A number of hand gesture recognition systems were surveyed, as described below.

3.3.1 DataGlove

The DataGlove [18] was developed by VPL Research Inc. in 1989. It uses optical fiber technology to identify flexion and a magnetic sensor to detect the corresponding position [19]. The DataGlove allows 16 degrees of freedom: 10 for flexion and 6 for positional data. Some DataGloves also include sensors for finger abduction. The fiber-optic sensors allow computers to measure finger and thumb bending, making interaction through gestures possible. The main drawbacks were the lack of tactile feedback, that it cannot be worn in certain situations (for example, when bathing or cleaning), and the difficulty of accommodating different hand sizes.

3.3.2 PowerGlove

The PowerGlove (Nintendo Entertainment System, 1989) [20] is a low-cost version of the VPL DataGlove. It uses ink bend sensors to track the positions of the fingers and ultrasonic tracking to obtain the corresponding x, y and z coordinates.

The PowerGlove appeared at a very early stage of the video game industry. It was able to detect finger motion and "wrist roll" through an ultrasonic detection system, and became "infamous" during its time on the market [21]. Its main drawback was that it was not technologically designed to handle 3D environments.

3.3.3 GestureWrist

Rekimoto's GestureWrist [22] is a wristwatch-type device that allows hands-free operation of both hands. It recognizes human hand gestures by capacitively measuring changes in wrist shape, and an acceleration sensor mounted on the wristband lets it act as a command-input device. GestureWrist is thus an input device that recognizes hand gestures and also measures forearm movements. A further advantage is that, when a gesture is recognized, GestureWrist gives feedback to the user through tactile sensation [22]. The device was designed to be as unobtrusive as possible, so that people can control wearable computers in any social situation.

3.3.4 GesturePad

Rekimoto's GesturePad is another input device, which acts like "interactive clothing". It is a module containing a layer of sensors (an array of capacitive sensors, a combination of transmitters and receivers) that can be attached to the inside of clothes [22]. With a module inside a lapel, for example, the wearer may control the volume of a worn MP3 player with a finger-stroke gesture. Four configurations, Type (A), Type (B), Type (B') and a combination of Type (B) and GestureWrist [22], were investigated. Since the wearer controls the module from outside the body, part of the clothing becomes interactive without changing its appearance. GesturePad requires specially designed clothes, however, and it is difficult to apply it to normal clothing.


3.3.5 InfoStick

The InfoStick is a small portable device that initiates drag-and-drop operations by pointing at target objects, using a small video camera, a button and a microprocessor [23]. Although the results were positive when interacting with different types of objects, it does not allow hands-free operation: the device has to be held in the hand.

3.3.6 DigitEyes

DigitEyes [24] is a complete vision-oriented hand gesture recognition system based on a 3D cylindrical kinematic model of the human hand. The system was tested in a 3D graphical mouse application (using a single camera) and for hand joint angle estimation (using stereo cameras). A modified Gauss-Newton minimization method was used for feature tracking and model-parameter estimation. The system has difficulties with complex environments, and it was also quite expensive because of its equipment, namely the cameras.

3.3.7 VBA(Vision Based Approach)

Darrell and Pentland [25] introduced a vision-based technique to model both objects and behaviour. Each hand gesture was represented by its own set of view angles and later matched to the gesture image sequence using temporal correlation and dynamic time warping. A straightforward camera-based approach achieved a speed of 10 fps, and the approach allowed the model to be learned by observation. The disadvantage of this method is that complex articulated objects have a very large range of appearances. The experiment was also user-dependent, because each of the 7 users was involved in both the testing and training phases.

3.3.8 GestureVR

J. Segen and S. Kumar [26] introduced a system called GestureVR at Bell Laboratories. It is a novel video-based hand gesture recognition interface that controls up to 10 input parameters. It was tested in a desktop environment with two video cameras placed about a meter above the table. The system is quite robust and fast: it recognizes three simple gestures and tracks the thumb and the pointing finger in real time, with the cameras continuously observing the scene at a rate of 60 Hz [20]. The most noticeable restriction is a narrow limit on the range of the elevation angle, about ±40°; when this limit is reached, no pose is computed. Adding another camera could provide a real solution for this kind of system.

3.3.9 Finger-Pointer

Fukumoto, Mase and Suenaga [27] of NTT Human Interface Laboratories introduced a system called "Finger-Pointer" that can detect the 3D position of the fingertip, pointing actions, thumb clicking, and the number of shown fingers in real time, using a simple and fast image processing method. Two small stereoscopic TV cameras were attached to the wall and ceiling during the experiments. Finger-Pointer is a glove-free interface that identifies the 3D position of the fingertip. It also provides stable measurement via the VPO (Virtual Projection Origin): by calculating the VPO, the system can identify stable and accurate pointing regardless of the operator's pointing style. It also integrates multiple channels through a "timing tag". The main limitation is that the cameras cannot be worn, so the operations performed by this system are tied to the instrumented environment.

3.4 Ambient Display

Breakaway [28] introduced an ambient display intended to shift a person's attention; it can be placed on an office desk to alert users when they have been sitting for a long time. Breakaway collects the relevant information from a sensor embedded in the office chair and takes on different poses to remind users that they should take a break. A limitation of this system was the lack of a wireless communication channel between the chair sensor and the Breakaway sculpture: a long cable was used between them. The evaluation ran for two weeks with a local staff member (a 55-year-old woman) as the subject, and the results were as expected.
The ambientROOM [29] introduces a space where information is provided as background processing. It presents information through different media such as motion, light and sound, and mainly focuses on peripheral awareness. The Information Percolator [30] was designed to present ambient information through expressions displayed on decorative objects. It explores two design options: one creates an expressive decorated object built from a number of vertically aligned tubes, and the other shows pixel-based meaningful expressions on the tubes by releasing air, controlled by a micro-controller, for short durations.
Digital Family Portraits [31] is another example of ambient interaction. It provides a qualitative sense of a person's activity through a digital family portrait, hung like a traditional portrait or placed among other household objects. The displayed information changes dynamically according to the activities performed by the family member it portrays. The work was proposed for the future home environment and was evaluated in the BI (Broadband Institute) Residential Laboratory, mainly with aging adults, with the aim of providing peace of mind.

3.5 Context-Aware Computing

Context-aware computing is another area that shapes the interaction between humans and technology; mobile phone interaction is a typical example. Louise Barkhuus and Anind Dey [32] describe how context awareness spans three levels of interactivity: personalization, active context-awareness and passive context-awareness. They investigated whether context-aware computing takes away users' control, examined through the levels mentioned above. In the evaluation, 23 participants (aged 19 to 35, all mobile phone owners) took part, divided into groups and exposed to different services. The services, private and public ringing profiles, were presented to them at the three levels of context. At the personalization level, users chose the ringing profile themselves. With passive context-awareness, the user was prompted to switch to the private ringing profile when the phone sensed a location such as an office meeting. With active context-awareness, the phone changed to the private ringing profile automatically by sensing the location, for example being at a restaurant. The evaluation showed that users felt a loss of control when interacting at the active and passive levels, but still preferred these over the personalization-oriented application.
Another example of context-aware computing concerns determining context [33] through wireless communication. It is used to identify the user's location and was developed and evaluated at Carnegie Mellon University. Wireless communication was chosen because GPS (Global Positioning System) does not work for indoor applications. In the setup, 400 wireless access points were used for reliable signal strength measurement. A mobile client uses these measurements to map the information onto a 2D view of the campus on the user's mobile device. The final goal was to derive location information from signal strength measurements. Experiments were conducted both during the day and at night, and the evaluation of this location tracking was found to be quite promising for use in future applications.

3.6 Smart Home System

INSPIRE [34] was developed as a smart home system supporting two languages, German and Greek. Its design comprises components for noise detection, speech and speaker recognition, dialog control and speech output. INSPIRE was designed to work in an ordinary home and supports activities such as operating the TV, lamps and video recorder through speech. Speech input is picked up through a microphone embedded in the wall, and the system can also be operated from a remote location via the existing telephone network. The evaluation was conducted with 3 scenarios, and results were derived from a questionnaire; 24 test subjects (aged 19 to 29) took part in a Wizard-of-Oz test. The overall quality of the system was rated on a scale from 1 to 5 (5 for excellent and 1 for bad), and the mean result was approximately 3.3.
Another smart home implementation is a context-oriented personalized health monitoring system [35]. The system detects the user's current condition through wearable sensors, and the collected information can be used by the doctor or caretaker. It informs the user about their current condition or diseases on a PDA, as written by the doctor. During the design phase, contextual elements such as who, how, when and what were chosen for engaging the user's attention. The complete scenario was tested in the UbiHome testbed [36].
The system called Smart Home [37] addressed how a future home could be 'smart and intelligent'. The authors worked with a 'smart' home environment that included household objects such as doors, alarms, a fridge, a bread toaster and a video recorder, and controlled them using a service called JINI [38]. JINI is an open-source software architecture created by Sun Microsystems. The following example explains how their system worked: a user can open or close the door or fridge after checking its state (bolted or unbolted) from indoors or from outdoors (for example, while shopping or playing) by connecting to the home network. Similarly, the alarm can be switched on or off after checking its internal state.


Chapter 4

Conceptual Design

During the design phase, the following issues related to DMI were addressed.

4.1 Distributed architecture

Distributed computing is an area where a defined problem can be split into a group of sub-problems that communicate through networking protocols. Want et al. [39] introduced the personal server concept for mobile devices, in which the personal server accesses application screens embedded in the local environment through wireless communication. In the proposed system, a similar distributed architecture was adopted: thin clients, identified as SEOs, were distributed over the network, and the server acted as a personal server. The data communication between the SEOs and the personal server was handled through a wireless network.
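As a rough sketch of this architecture (not the thesis implementation; the port number and the line-based text protocol are assumptions made here for illustration, and the actual components are described in chapter 5), a thin client requesting application content from the personal server over the network could look like this:

# Illustrative sketch only: a minimal personal-server / thin-client (SEO)
# exchange over TCP. Port and message format are assumptions.
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9000  # hypothetical values, not from the thesis

def personal_server():
    # Accepts one SEO connection and answers a single content request.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            request = conn.recv(1024).decode().strip()   # e.g. "GET reminder"
            print("personal server received:", request)
            conn.sendall(b"reminder: dentist appointment at 15:00\n")

def seo_thin_client():
    # A thin client (SEO) asking the personal server for application data.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))
        cli.sendall(b"GET reminder\n")
        print("SEO received:", cli.recv(1024).decode().strip())

if __name__ == "__main__":
    threading.Thread(target=personal_server, daemon=True).start()
    time.sleep(0.2)   # crude wait so the server is listening before the client connects
    seo_thin_client()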

4.2 Multimodal Design

Multimodal interaction offers a great variety of design approaches compared with the limited approaches of WIMP interaction design. According to [40], designers should know about the users' psychological characteristics, such as their understanding abilities and depth of experience, environmental factors, and their physical abilities, such as hearing or low-vision problems. Multimodality extends the situations in which users can interact while moving between locations (users can use speech input in a dark place, or gesture or touch input in a noisy environment). Two input modalities, gesture and speech, and two output modalities, visual and aural, were considered. During the design phase, the functionality of each modality when interacting with the designed SEOs was specified. Users were free to choose between modalities: for example, they could use a single input modality, such as speech, and also switch to another, such as gesture input, without closing the speech modality. The following cases illustrate different contexts.


1) Case 1: User is interacting with the ’Bookshelf’ SEO

– User has selected speech input for listening to music.

– The user's attention shifts to the mobile phone when it rings; he or she answers it but cannot stop the music through the speech input.

– So, in this context, the user can select the gesture input to pause the music without exiting the selected speech modality.

2) Case 2: User is interacting with the ’Toilet’ SEO

– User has selected speech input for viewing the news headlines.

– The user's mouth is busy because he or she is brushing or shaving.

– So, the user can select the gesture input for interacting with other applications such as the daily schedule or reminder.

3) Case 3: User is interacting with the ’Kitchen’ SEO

– User has selected gesture input for the medicine information.

– The user's hands are busy because he or she is making bread.

– So, the user can select the speech input to open the menu together with the recipe for bread.

According to the design, the following functionalities are associated with the individual modalities.

4.2.1 Gesture Input

Different functionalities and their operations related to the gesture input are as follows (a sketch of the open/close timing rule is given after this list):

(a) Open/Close Gesture Input. This function was used to begin and end gesture input. The left hand was chosen to perform these operations, and the function was designed as a combination of two gestures: the gesture input channel opens if the user combines a left-hand left navigation and a left-hand right navigation with a gap of at most 500 milliseconds between them. The user closes the gesture input channel by performing the same actions again. The time frame was included to avoid recognizing more than one gesture at the same time. The purpose of Open/Close is to control access to the gesture input channel; this is necessary when the user cannot use gestures because of other physical activities, such as writing or carrying something.

(b) Modality Control. This was used for controlling the way in which the mediator accesses the system. The left hand was selected to perform these operations, again as a combination of two gestures: a left-hand up gesture and a left-hand down gesture. Using these, users can control the volume of the speakers embedded in the SEOs.

(c) Command based. Gesture-based commands were used for accessing applications presenting information related to news, music, books and so on. The left hand's forward gesture was bound directly to the 'Reminder' application, without interacting with the visual display embedded in the SEOs. This is useful when users are performing gestures in an outdoor environment and have no visual display to interact with. The right-hand forward gesture was designed to open an individual application after navigation through the visual display embedded in the SEOs, and the right-hand backward gesture performs an exit operation.

(d) Navigation based. Navigation gestures enable users to interact with the visual displays embedded in the SEOs. Four right-hand gestures, left, right, up and down, were designed for these operations.
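The open/close combination can be checked with a simple timing rule. The sketch below is illustrative only; the gesture labels and the structure of the recognizer output are assumptions, and the actual accelerometer processing is described in section 5.2.1.

# Illustrative sketch: detecting the open/close combination for the
# gesture input channel. A "left" navigation followed by a "right"
# navigation on the left hand, with at most 500 ms between them,
# toggles the channel. Gesture labels and timestamps are assumed to
# come from the gesture recognizer described in chapter 5.
from dataclasses import dataclass

MAX_GAP_MS = 500

@dataclass
class Gesture:
    hand: str       # "left" or "right"
    direction: str  # "left", "right", "up", "down", "forward", "backward"
    time_ms: int    # timestamp of recognition

class GestureChannel:
    def __init__(self):
        self.open = False
        self._pending = None  # last left-hand "left" navigation, if any

    def on_gesture(self, g: Gesture) -> None:
        if g.hand != "left":
            return
        if g.direction == "left":
            self._pending = g
        elif g.direction == "right" and self._pending is not None:
            if g.time_ms - self._pending.time_ms <= MAX_GAP_MS:
                self.open = not self.open   # toggle open/close
            self._pending = None

if __name__ == "__main__":
    ch = GestureChannel()
    ch.on_gesture(Gesture("left", "left", 0))
    ch.on_gesture(Gesture("left", "right", 300))   # within 500 ms, so the channel opens
    print("gesture channel open:", ch.open)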

Figure 4.1: Left hand gesture user interface functionalities [41]


Figure 4.2: Right hand gesture user interface functionalities [41]

4.2.2 Speech Input

The following functionalities and their operations relate to the speech input:

(a) Open/Close Speech Input. This provides a mechanism for users who are engaged in conversation and not willing or able to give speech input. A dialog-based approach was chosen: "Select Speech User Interface" opens the speech input channel and "Exit Speech User Interface" closes it.

(b) Modality Control. This is used to know and control the current state of the output modality in use. Modality control is useful when users are in a meeting, listening to another conversation, or do not want to receive feedback through the environmental speakers. In this situation, users can switch the feedback to their Bluetooth headset (personal earpiece) with the dialog "Close Home Audio Output"; the dialog "Open Home Audio Output" transfers control back to the environmental speakers embedded in the SEOs.

(c) Smart Object (SEO) Control. This was designed for selecting and de-selecting the smart objects, identified as SEOs. The user may overrule a decision taken by the personal server by de-selecting an SEO. For example, if the photo frame SEO is selected by the personal server for displaying reminder information, the user may deselect it for their own reasons (distance, reading ability) and select the bedroom SEO for reading the same reminder information instead. Selecting or de-selecting SEOs through gesture was not included in the design.

(d) Application Control. This is used when users interact with the personal applications through dialog-based speech input. The dialog "Select Personal Application" was chosen for opening personal applications such as news, reminder and culture, and the dialog "Exit Personal Application" for closing an application.
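As an illustration (not the thesis implementation), recognized command phrases such as those above can be mapped to handler actions with a simple dispatch table; the handler functions below are hypothetical placeholders.

# Illustrative sketch: dispatching recognized speech commands to actions.
# The command phrases follow section 4.2.2; the handlers are hypothetical.

def open_speech_input():      print("speech input channel opened")
def close_speech_input():     print("speech input channel closed")
def route_audio_to_headset(): print("audio output routed to Bluetooth headset")
def route_audio_to_home():    print("audio output routed to environmental speakers")
def select_personal_app():    print("personal application selected")
def exit_personal_app():      print("personal application closed")

COMMAND_HANDLERS = {
    "select speech user interface": open_speech_input,
    "exit speech user interface":   close_speech_input,
    "close home audio output":      route_audio_to_headset,
    "open home audio output":       route_audio_to_home,
    "select personal application":  select_personal_app,
    "exit personal application":    exit_personal_app,
}

def on_recognized(phrase: str) -> None:
    # Look up the normalized phrase and run its action, if any.
    handler = COMMAND_HANDLERS.get(phrase.strip().lower())
    if handler:
        handler()
    else:
        print("unrecognized command:", phrase)

if __name__ == "__main__":
    on_recognized("Select Speech User Interface")
    on_recognized("Close Home Audio Output")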

Figure 4.3: Speech user interface functionalities [41]

4.2.3 Visual Output

The following functionalities were designed for the visual output:

(a) Ambient Notification. During the design phase, three states were identified for ambient notification: neutral, active and passive. Figure 4.4(a) represents the neutral state, which applies when there is no interaction between the user and the system. The active and passive states present the information window shown in figure 4.4(b) in order to address the peripheral attention of the user.

(b) Visual Messaging. This shows information related to the personal applications presented through the SEOs.

(c) Visual Feedback. Visual feedback indicates which application is active or selected. For example, in figure 4.4(b) the reminder application is selected.

(d) Modality Status. This shows the status of the currently used input and output modalities. Different icons were used to convey the modality status (figure 4.4(b)).

(e) Application Status. This represents the number of applications running on the individual SEO.

(f) Smart Object Status. The smart object's status appears in the title bar of the information window (figure 4.4(b)).


Figure 4.4: Visual outputs represented through Bookshelf SEO [41]


4.2.4 Audio Output

The following functionalities were designed for the audio output:

(a) Ambient Notification. An ambient notification is given by playing a short sound file (.wav) to draw the user's peripheral attention, without delivering the actual aural message through the speech synthesizer.

(b) Aural Messaging. This presents information related to the personal applications through the speech synthesizer in the SEOs.

(c) Aural Feedback. This indicates through the audio channel which application is active or selected.

(d) Modality Status. This conveys the modality status through the audio domain.

(e) Smart Object Status. This confirms the SEO's name through the synthesizer.

4.2.5 How DMI works:

DMI was designed to work in a smart home, where a user interacts with an individual SEO (for example, Entrance, Bookshelf or Toilet) through either speech or gesture input. To initiate speech input, the user wears a Bluetooth headset, termed BTH in figure 4.5. To use gesture input, the user wears accelerometer wristbands on the left and right hands, indicated by LHS (Left Hand Accelerometer) and RHS (Right Hand Accelerometer) in figure 4.5. The user also carries a personal server (PS in figure 4.5), running on a laptop or PDA, which hosts several software components: a speech recognition engine for processing speech input, a gesture recognition engine for processing gesture input, and a personalized database that communicates with the applications running on the SEOs. Every SEO has a user interface that provides output through the visual and audio channels; a speech synthesizer is embedded alongside each user interface to drive the audio output channel. Nineteen mock-up applications (for example, weather, news and medicine) were designed by considering different contexts (location, time etc.). A customizable database was developed on the personal server, responsible for storing and retrieving data relevant to the applications embedded in the SEOs. The complete data communication between the personal server and the SEOs is maintained through a WLAN, and a separate software component on the personal server activates the WLAN.
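As a rough illustration of this wiring (the component and application names are assumptions drawn from the description above, not the thesis code), the composition of the conceptual model might be sketched as follows.

# Illustrative sketch of the DMI components described above. Names and
# structure are assumptions; the real components are described in chapter 5.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PersonalServer:
    speech_engine: str = "speech recognition engine"
    gesture_engine: str = "gesture recognition engine"
    database: str = "personalized mock-up database"

@dataclass
class SEO:
    name: str                        # e.g. "Bookshelf", "Toilet", "Kitchen"
    applications: List[str] = field(default_factory=list)
    has_display: bool = True         # visual output channel
    has_synthesizer: bool = True     # audio output channel

@dataclass
class Wearables:
    bluetooth_headset: bool = True   # BTH: speech input
    left_accelerometer: bool = True  # LHS: left-hand gestures
    right_accelerometer: bool = True # RHS: right-hand gestures

@dataclass
class SmartHome:
    server: PersonalServer
    wearables: Wearables
    seos: List[SEO]

if __name__ == "__main__":
    home = SmartHome(
        server=PersonalServer(),
        wearables=Wearables(),
        seos=[SEO("Bookshelf", ["music", "news"]),
              SEO("Kitchen", ["recipe", "medicine"])],
    )
    print(f"{len(home.seos)} SEOs connected to the personal server over WLAN")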


Figure 4.5: Conceptual model describing DMI (Distributed Multimodal Interaction)
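To make the communication path concrete, the following minimal C# sketch illustrates how the personal server could forward a recognition result to an SEO over the WLAN. The thesis does not document the wire format, so the pipe-delimited message, the port number and the SendToSeo() helper are assumptions made for illustration only.

using System;
using System.Net.Sockets;
using System.Text;

static class SeoMessenger
{
    // Hypothetical plain-text message: category|application|payload.
    // The real thesis protocol is not documented; this only illustrates the idea.
    public static void SendToSeo(string seoIp, string category, string application, string payload)
    {
        string message = string.Format("{0}|{1}|{2}\n", category, application, payload);
        using (TcpClient client = new TcpClient(seoIp, 9000))   // port 9000 is an assumption
        using (NetworkStream stream = client.GetStream())
        {
            byte[] bytes = Encoding.UTF8.GetBytes(message);
            stream.Write(bytes, 0, bytes.Length);
        }
    }
}

// Example: forward a recognized speech command to the Kitchen SEO.
// SeoMessenger.SendToSeo("192.168.1.30", "SPEECH", "Weather", "Select weather application");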


4.3 Designing Interaction States

Weiser et al. [3] defined the term "Calm Technology", where a user interacts through two kinds of attention: (a) the center of the user's attention and (b) the peripheral attention. Peripheral attention allows the user to take in background information and to decide whether that information needs their full attention or not [41]. During the design phase, three interaction states were introduced for the interaction between the users and the SEOs:
(i) Neutral state (no attention),
(ii) Passive state (peripheral attention) and
(iii) Active state (center of attention).

Figure 4.6: Interaction states between the user and the SEOs [41].


According to figure 4.6, in the neutral state there is no exchange of information between the user and the SEOs in the smart home. In the passive state, the SEOs grab the user's attention in an implicit manner with a one-directional exchange of information. In the active state, the user interacts explicitly, at the center of their attention, with the designed SEOs, resulting in a bi-directional exchange of information. These states were controlled through the personal server's orientation with respect to the corresponding SEOs. In this work, the personal server initiated an interaction with the SEOs, which changed the state from neutral to passive (transition arrow a, figure 4.6). However, the system was designed to give the highest preference to the user: a user-initiated interaction changed the state from passive to active (transition arrow f, figure 4.6), overruling the decisions made by the personal server.
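The two transition rules described above can be summarized in a short C# sketch. The enum and method names are illustrative and do not come from the thesis code; in particular, the transition back to the neutral state is an assumption.

enum InteractionState { Neutral, Passive, Active }

class SeoInteraction
{
    private InteractionState state = InteractionState.Neutral;

    public InteractionState State { get { return state; } }

    // Transition a (figure 4.6): the personal server grabs peripheral attention.
    public void OnServerInitiated()
    {
        if (state == InteractionState.Neutral)
            state = InteractionState.Passive;
    }

    // Transition f (figure 4.6): explicit user input always wins and
    // overrules the personal server's decision.
    public void OnUserInitiated()
    {
        state = InteractionState.Active;
    }

    // Assumed fallback: no exchange of information returns the SEO to neutral.
    public void OnInteractionEnded()
    {
        state = InteractionState.Neutral;
    }
}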

4.4 Interaction Languages

To maintain the dialog between the user and the system through the speech and gesture modalities, the following interaction languages were defined (table 4.1):


Thin-client(s) - SEOs:
H (Speech): Select display -> C (Voice): Provide display name
H (Speech): Bedroom -> C (Voice): Bedroom display selected; C (Visual): Bedroom display "SELECTED"
H (Speech): Exit display -> C (Voice): Bedroom display exited; C (Visual): Bedroom display closed

Modality: SUI (physical to virtual to physical):
H (Speech): Select speech -> C (Voice): Speech input is open; C (Visual): Speech icon (green)
H (Speech): Open voice feedback -> C (Voice): Voice output is open; C (Visual): Speaker icon (not crossed)
H (Speech): Close voice feedback -> C (Voice): Voice output is closed (headset); C (Visual): Speaker icon (crossed)
H (Speech): Exit speech -> C (Voice): Speech input is closed; C (Visual): Speech icon (red)

Modality: GBI (physical to virtual to physical, left hand):
H (Gesture): Left hand (left to right to left) -> C (Voice): Gesture input is open; C (Visual): Gesture icon (green)
H (Gesture): Left hand (left to right to left) -> C (Voice): Gesture input is closed; C (Visual): Gesture icon (red)

Application(s) - SUI:
H (Speech): Application name -> C (Voice): Application information; C (Visual): Application information
H (Speech): Exit application -> C (Voice): playing "ir_end.wav"; C (Visual): Gesture icon (red)

Application(s) - GBI (navigation, right hand):
H (Gesture): Navigation to application (left, right) -> C (Visual): Application tab highlighted
H (Gesture): Select application (forward) -> C (Voice): Application information; C (Visual): Application information
H (Gesture): Navigation within application (up, down) -> C (Voice): Application information; C (Visual): Application information
H (Gesture): Exit application (backward) -> C (Voice): playing "ir_end.wav"; C (Visual): Application information disappears

Application(s) - GBI (association, left hand):
H (Gesture): Select Reminder application (forward) -> C (Voice): Application information

Special case - Gesture (outdoor):
H (Gesture): Left hand (left to right to left) -> C (Voice): Gesture input is open
H (Gesture): Select Reminder application (left hand forward) -> C (Voice): Application information
H (Gesture): Left hand (left to right to left) -> C (Voice): Gesture input is closed

Available modalities and applications (indoor):
H (Speech): Available modalities -> C (Voice): Speech, gesture input and voice output
H (Speech): Available applications -> C (Voice): Available applications are weather, schedule and shopping
C (Visual): Available modalities (icons); C (Visual): Available applications (tab)

Special case - Speech (outdoor):
H (Speech): Select speech -> C (Voice): Speech input is open
H (Speech): Available modalities -> C (Voice): Speech, gesture input and voice output
H (Speech): Available applications -> C (Voice): Available applications are weather, schedule, transportation and shopping
H (Speech): Application name -> C (Voice): Application information
H (Speech): Exit speech -> C (Voice): Speech input is closed

H: Human, C: Computer, SUI: Speech User Interface, GBI: Gesture Based Interaction

Table 4.1: Dialog conversation between human and the computer.


4.5 Addressing Implicit and Explicit Interaction

Contextual information often provides an implicit input when human beings communicate with each other. Such contextual information, like the angle of a gesture or spoken words like "yes" or "no", is commonly known as body language [42]; it can also come from the user's everyday activities, like cooking, listening to music or reading books. This information can be tracked through the sensors embedded in the SEOs or through the user's wearable devices. Contextual information obtained through sensors provides implicit input to the computing system, which in turn facilitates, or opens a way for, further interaction. In this work the Wizard-of-Oz method was used for addressing implicit interaction: a wizard selects the user's present activity, standing in for an activity recognition system (a system that recognizes the user's everyday activities through sensors worn on the body). Explicit interaction provides an explicit input to the computing system through channels like speech, gestures, keyboard, mouse or touch screen. In this work, speech and gesture based explicit inputs were used for interacting with the SEOs. The following figures address both the implicit and the explicit input to the system:

Figure 4.7: Speech and gesture recognition areas for recognizing and providing an explicit input to the system by a user.


Figure 4.8: User's current activity based on three scenarios, which provides implicit input to the system through a wizard.


Chapter 5

System Implementation

5.1 Hardware Infrastructure

During the hardware setup, the system combined computing technology with different household appliances (photo frame, toilet mirror, bookshelf etc.), termed SEOs. The SEOs use the WLAN 802.11b/g protocol for communicating with the personal server. Nine thin clients (Intel Pentium 4 machines running Windows XP) were placed in the smart home. For the SEO development, 8 LCD screens and a projector (figure 5.6), along with seven audio speakers for producing aural feedback, were chosen. Three wearable SEOs were also included: a BTH-8 Bluetooth headset (used for aural input and output) and two wristbands with embedded Phidgets 1059 3-axis accelerometers for tracking 3D gesture based acceleration values. The following figure shows the wearable hardware:

Figure 5.1: Wearable hardware equipment.


During the hardware setup, a Bluetooth dongle and a WLAN access point were embedded in the user's belt and connected to the personal server through long wires. The access point belt was used in order to obtain WLAN signal strength values [43]. The belt had another two ports for plugging in the 1059 3-axis accelerometers, which collected the acceleration values and directed them to the personal server for calculating the intended gestures. The access point belt is shown below:

Figure 5.2: WLAN Access Point Belt


5.1.1 Smart Everyday Objects

During the design phase, nine locations were decided for identifying different SEOs in the smart home. The locations were identified based on the user's context and their daily activities. These locations were the entrance, toilet, fridge, kitchen, dining table, bedroom, wall, photo frame and the bookshelf (figure 5.3).

Figure 5.3: Representing SEOs in a smart home.

5.2 Software Infrastructure

5.2.1 Gesture Recognition

The Phidget21 MSI (version 2.1.4.20080821) [44] was used for recognizing 3D gestures, implemented through two methods, accel1_AccelerationChange() and accel2_AccelerationChange(). Both of these methods compared the incoming acceleration values against threshold values in order to decide the left and right hand gestures. The accel1_AccelerationChange() method


calculated the left hand gestures (for example open/close and up/down), while the accel2_AccelerationChange() method calculated the right hand acceleration values for deciding the navigation based gestures (left/right/up/down/forward/backward). Background noise was handled with a timer concept, i.e. a window of 500 milliseconds for recognizing a single gesture, introduced in order to avoid unnecessary recognitions. The Timer_Tick() method fired every 100 ms and updated the global variables (used for filtering the acceleration data from the accelerometer) for the open/close gesture input. For example, the user opens gesture input by providing the pattern (left hand left gesture + left hand right gesture), which changes the textbox (gesture_status.text) value from stop-gesture to start-gesture, and closes the gesture input by performing the same actions (left hand left gesture + left hand right gesture), which updates the text value from start-gesture back to stop-gesture. The change_modalities() method was called every time the user tried to open or close the gesture input. A similar approach was used to recognize the gestures, like left hand down gesture + left hand up gesture, with which the user controlled the audio output channel (personal audio output and home audio output), changing the variable from maximized to minimized or from minimized to maximized. The following figure shows how the modalities change during an interaction with the SEOs:

Figure 5.4: Visual output for open/closing gestures and controlling audio channel


The technique was implemented by having an open/close gesture input channel activated through the left hand. For example, if the user wants to navigate the applications in the SEOs, this can only be done after activating the gesture input channel. The following tables explain the acceleration value ranges that determine the gesture recognition:

Left hand gesture recognition
Acceleration value                             Recognized gesture   Associated operation
attached.axes[e.Index].Acceleration > 2        Right                Open/Close gestures
attached.axes[e.Index].Acceleration < -2       Left                 Open/Close gestures
attached.axes[e.Index].Acceleration > 3        Up                   Maximize/Minimize aural output channel
attached.axes[e.Index].Acceleration < -1       Down                 Maximize/Minimize aural output channel

Table 5.1: Left hand recognized gestures and associated operations based on acceleration values

Right hand gesture recognition
Acceleration value                             Recognized gesture   Associated operation
attached.axes[e.Index].Acceleration > 2        Right                Right navigation
attached.axes[e.Index].Acceleration < -2       Left                 Left navigation
attached.axes[e.Index].Acceleration < -0.8     Forward              Presents application information
attached.axes[e.Index].Acceleration > 1.8      Backward             Exiting application
attached.axes[e.Index].Acceleration > 3        Up                   Up navigation
attached.axes[e.Index].Acceleration < -1       Down                 Down navigation

Table 5.2: Right hand recognized gestures and associated operations based on acceleration values
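The following minimal C# sketch illustrates the threshold test behind Tables 5.1 and 5.2, combined with the 500 millisecond debounce used against background noise. The acceleration values are assumed to come from the Phidget21 AccelerationChange events, and the axis-to-gesture mapping together with the class and method names are illustrative rather than the thesis code.

using System;

enum Gesture { None, Left, Right, Up, Down, Forward, Backward }

class GestureClassifier
{
    private DateTime lastRecognition = DateTime.MinValue;
    private static readonly TimeSpan Debounce = TimeSpan.FromMilliseconds(500);

    // Left hand (Table 5.1): axis 0 assumed left/right, axis 1 assumed up/down.
    public Gesture ClassifyLeftHand(int axisIndex, double acceleration)
    {
        if (!DebounceElapsed()) return Gesture.None;
        if (axisIndex == 0 && acceleration > 2) return Mark(Gesture.Right);   // open/close
        if (axisIndex == 0 && acceleration < -2) return Mark(Gesture.Left);   // open/close
        if (axisIndex == 1 && acceleration > 3) return Mark(Gesture.Up);      // maximize aural output
        if (axisIndex == 1 && acceleration < -1) return Mark(Gesture.Down);   // minimize aural output
        return Gesture.None;
    }

    // Right hand (Table 5.2): axis 2 assumed forward/backward.
    public Gesture ClassifyRightHand(int axisIndex, double acceleration)
    {
        if (!DebounceElapsed()) return Gesture.None;
        if (axisIndex == 0 && acceleration > 2) return Mark(Gesture.Right);      // right navigation
        if (axisIndex == 0 && acceleration < -2) return Mark(Gesture.Left);      // left navigation
        if (axisIndex == 1 && acceleration > 3) return Mark(Gesture.Up);         // up navigation
        if (axisIndex == 1 && acceleration < -1) return Mark(Gesture.Down);      // down navigation
        if (axisIndex == 2 && acceleration < -0.8) return Mark(Gesture.Forward); // show application info
        if (axisIndex == 2 && acceleration > 1.8) return Mark(Gesture.Backward); // exit application
        return Gesture.None;
    }

    private bool DebounceElapsed()
    {
        return DateTime.Now - lastRecognition >= Debounce;
    }

    private Gesture Mark(Gesture gesture)
    {
        lastRecognition = DateTime.Now;   // allow a single gesture per 500 ms window
        return gesture;
    }
}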


5.2.2 Speech Recognition and Synthesis

The Microsoft Speech SDK 5.1 API was used for both speech recognition and speech synthesis; both ran in the personal server on a stand-alone PC and were implemented in C#. The SDK provides a TTS (Text-To-Speech) and an SR (Speech Recognition) engine and supports speech applications that work in a real context: developers can make use of the SR or TTS engine when producing speech oriented applications for a desktop PC, PDA or traditional mobile phone. During the implementation, a speech supported listbox control, termed SpeechListBox and shown in figure 5.5, was used. SpeechListBox inherits from System.Windows.Forms.ListBox and exposes its standard properties, methods and events. The SpeechListBox control was used in the main class, called mainTester, and managed the designed speech based dialogues. The following figure shows the components of the SpeechListBox:

Figure 5.5: SpeechListBox components


The different functions of the SpeechListBox control are explained as follows:

AddItem()
It takes a single parameter and adds a single word or phrase to the listbox items. It updates the grammar list and returns the index of the newly added text.

RemoveItem()
It also takes a single parameter, removes the item at the selected index and updates the grammar list.

RecoContext_Hypothesis()
This method takes three parameters and is used as an event handler for the SpSharedRecoContext object.

RecoContext_Recognition()
It has four parameters in total and is called every time the speech engine recognizes a word, phrase or designed language.

InitializeSpeech()
It creates an SpSharedRecoContext object, sets up the grammar rules and calls RebuildGrammar() for updating the grammar list. This function also handles exceptions related to the speech recognition.

EnableSpeech()
It initializes all speech objects, rebuilds the grammar and starts recognizing speech.

RebuildGrammar()
It updates the grammar objects and is called by AddItem() or RemoveItem().

DisableSpeech()
It stops the speech recognition.
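The following sketch shows how such a control can keep the SAPI command grammar in sync with its item list. It is a minimal example assuming the SpeechLib COM interop installed by the Microsoft Speech SDK 5.1, following the SDK's SpeechListBox sample; the class name and rule name are illustrative, not the thesis code.

using System.Collections.Generic;
using SpeechLib;   // COM interop installed with Microsoft Speech SDK 5.1

class SpeechCommandList
{
    private SpSharedRecoContext recoContext;
    private ISpeechRecoGrammar grammar;
    private ISpeechGrammarRule rule;
    private readonly List<string> items = new List<string>();

    public void InitializeSpeech()
    {
        recoContext = new SpSharedRecoContext();
        recoContext.Recognition += RecoContext_Recognition;   // full recognition event
        grammar = recoContext.CreateGrammar(0);
        rule = grammar.Rules.Add("ListItemRule",
            SpeechRuleAttributes.SRATopLevel | SpeechRuleAttributes.SRADynamic, 1);
        RebuildGrammar();
        grammar.CmdSetRuleState("ListItemRule", SpeechRuleState.SGDSActive);
    }

    // AddItem(): add a phrase and refresh the grammar, as described above.
    public int AddItem(string phrase)
    {
        items.Add(phrase);
        RebuildGrammar();
        return items.Count - 1;
    }

    // RebuildGrammar(): recreate one word transition per list item.
    private void RebuildGrammar()
    {
        rule.Clear();
        object propValue = "";
        foreach (string phrase in items)
            rule.InitialState.AddWordTransition(null, phrase, " ",
                SpeechGrammarWordType.SGLexical, phrase, 0, ref propValue, 1f);
        grammar.Rules.Commit();
    }

    // Called by SAPI when a phrase from the grammar is recognized.
    private void RecoContext_Recognition(int streamNumber, object streamPosition,
        SpeechRecognitionType recognitionType, ISpeechRecoResult result)
    {
        string heard = result.PhraseInfo.GetText(0, -1, true);
        // Dispatch 'heard' to the dialog logic (e.g. "Select speech", "Select Home Display").
    }
}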


During the implementation, SpeechListBox_SelectedIndexChanged() was called every time the selected index of the listbox control changed. This function works based on dialogues like the ones below:

Figure 5.6: Data flow diagram for speech recognition based on commands

If the user’s speech based dialogs matches with the designed commands, according tofigure (5.5), the system tries to communicate with the running applications, embedded inthe SEOs.


The different functionalities related to the speech commands are explained below:

Select Speech User Interface
Initializes speech recognition as follows:
User: Select Speech User Interface
Aural feedback: Confirm Select Speech User Interface
Visual feedback: Confirm Yes/No
User: A Yes confirmation loads the commands from the .xml file; a No confirmation moves the control one step back.

Select Personal Application
This command loads the .xml file related to the application functionalities and works as follows:
User: Select Personal Application
Visual feedback: The system loads different commands such as refresh personal application, select "name" application etc.

Select Home Display
This command loads the .xml file related to the displays in the SEOs and works as follows:
User: Select Home Display
Visual feedback: The system loads different commands such as refresh home display, select "name" display etc.

Open Home Audio Output
The command enables the environmental speakers and disables the user's personal mike.
Aural feedback: "Home audio output is open and personal audio output is close."
Visual feedback: Shown in figure 5.4.

Close Home Audio Output
The command disables the environmental speakers and enables the user's personal mike.
Aural feedback: "Home audio output is close and personal audio output is open."
Visual feedback: Shown in figure 5.4.

Refresh Personal Application
With this command the user can again select an application of their choice.
Aural feedback: "Select personal application."

Select "name" Application
Through this command the user selects an application by providing its "name". The user cannot select another application without first using the Refresh Personal Application command.
Visual feedback: Shown in figure 4.4(b).

Exit Personal Application
This command is used for closing the applications.

Select Home Display
This command is useful in a situation where the user wants to select the best display for him/herself rather than the one selected by the wizard.

Exit Home Display
Through this command the user can exit from the selected thin client, and the control goes back to the display selected by the wizard.
User: Exit Home Display
Aural feedback: Confirm exit display
Visual feedback: Confirm Yes/No
User: A Yes confirmation closes the selected thin client.

Refresh Home Display
This command refreshes the display selection mechanism, so that the user has the chance to select some other display or SEO.
Aural feedback: Select Home Display

Select "name" Display
Through this command the user provides the name of the thin client or display and locks the display selection mechanism. For example, "Select Photoframe display".

Exit Speech User Interface
This command closes speech recognition as follows:
User: Exit Speech User Interface
Aural feedback: Confirm Exit Speech User Interface
Visual feedback: Confirm Yes/No
User: A Yes confirmation closes the speech recognition; a No confirmation keeps the control idle.

Speech synthesis was included in the system through the SpeechListBox control and was invoked through a simple function called test(). The required information was sent through this function to the SpeechListBox control class and spoken by the system with the defined Voice object. The system could use different voices by changing the properties through the GetVoices() function. Finally, the Speak() function was called for the system to speak the necessary information through the user's Bluetooth headset.
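A minimal sketch of that synthesis path is given below, assuming the SAPI 5.1 SpVoice automation object; the Speak() wrapper is illustrative, while GetVoices() and Speak() are the SDK calls mentioned above.

using SpeechLib;

static class AuralOutput
{
    public static void Speak(string message)
    {
        SpVoice voice = new SpVoice();
        // GetVoices() enumerates the installed TTS voices; here the first one is used.
        voice.Voice = voice.GetVoices("", "").Item(0);
        // SVSFlagsAsync returns immediately so the interaction is not blocked.
        voice.Speak(message, SpeechVoiceSpeakFlags.SVSFlagsAsync);
    }
}

// Example: AuralOutput.Speak("Bedroom display selected");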


5.2.3 Thin Client Software

The Eclipse IDE was used during the development of the proposed thin client. Eclipse is a well-known Java IDE (Integrated Development Environment) with a built-in JRE (Java Runtime Environment). A simple GUI with various components was designed during the implementation, with the following appearance:

Figure 5.7: Proposed thin client along with different components.


During the SEO development, the following methods were implemented:

connectToEasyADLWlan()This method was used to connect to the easyADL WLAN with the jwrapi object[43].

interfaceDesign()This was used to arrange the controls in the application window as well as for the modalityicon setup.

Send()This was used to send data to a personal server.

parseMessage()
This function was responsible for parsing the messages received from the personal server.

Voicetest()
This method took two parameters and processed them for the required aural output. A synthesizer object, synth, was created, then allocated and resumed through the allocate() and resume() methods. A voice object was created in order to select the predefined voice tone. Finally, the synthesizer function speakPlainText() was used to speak the given text in the defined tone.

getMessageList()This method was used for separating the messages, storing them into an array, and return-ing them as needed.

setData()This was used to set the text in the available control.

setTab()Through this method panels for different applications were identified.

displayMessage()This method had an important role while displaying a message. It checked different param-eters like recognition type, message type, client state etc.


5.2.4 Mock-up applications

Nineteen applications were implemented by considering different contextual information such as the user's daily activities, time, identity and location. For example, the user might interact with the news application while being at the Toilet SEO, or would like to know about their daily schedule or bus timetable at the Dining Table SEO while having dinner. The Reminder application was designed to run in all the SEOs. The following table lists the applications designed for the individual SEOs:

SEO            Applications
Dining Table   News, Culture, Weather, Diet, Medicine, Schedule, Emails, Transport, Buy, Reminder
Wall           TV schedule, News, Culture, Emails, Transport, Schedule, Offers, Weather, Buy, Reminder
Fridge         Buy, Diet, Reminder
Bookshelf      Music, Books, Movies and CD, Offers, Reminder
Bedroom        Alarm, Schedule, Buy, Offers, Weather, Clothing, Reminder
Kitchen        Menu, Medicine, Diet, Emails, Buy, Offers, Schedule, Reminder
Toilet         Buy, News, Emails, Offers, Schedule, Reminder
Photoframe     Messages, Reminder
Entrance       Reminder

Table 5.3: Nineteen applications with respect to the SEOs.


Figure 5.8: Representing Dining Table SEO designed to interact with ten applications.

5.2.5 Wizard-of-Oz Software

The Wizard-of-Oz software was implemented in the personal server, which ran on a stand-alone PC, and was developed in C#. The following figure shows the components of the Wizard-of-Oz software:


Figure 5.9: Wizard-of-Oz software with different components.

According to figure 5.9, the wizard selects the object's IP address from those available in the listbox shown in figure 5.9(a). Every IP address had its own object identification name; for example, 192.168.1.30 referred to the Kitchen SEO. Speech recognition is indicated in figure 5.9(b) and gesture recognition in figure 5.9(c). Figure 5.9(d) shows the voice limit parameters (0.0-0.99) for speech recognition, and the speech/gesture recognition identifier is shown in figure 5.9(e). Different message categories like news, reminder and weather are indicated in figure 5.9(f), and the applications related to the information are presented in figure 5.9(g). The time dimension, i.e. the schedule, is identified through figure 5.9(h), and the detailed object properties are shown in figure 5.9(i). In this implementation the user's activity was not recognized automatically; this was left for future work and handled instead by having the wizard select the user's activity based on the three defined scenarios, shown in figure 5.9(j).


5.2.6 Scheduler

During the implementation, the scheduler class identified the proposed time dimension. It was calculated from the system time, divided into three different scenarios: weekday morning (Monday-Friday), weekday evening (Monday-Friday) and weekend (Saturday, Sunday). The resulting schedule looks as follows:

Figure 5.10: Scheduler Table for identifying time dimension.
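A minimal C# sketch of this lookup is given below. The boundary hour between the morning and evening scenarios is an assumption, since the exact scheduler table is only given as figure 5.10.

using System;

enum Scenario { WeekdayMorning, WeekdayEvening, Weekend }

static class Scheduler
{
    public static Scenario GetScenario(DateTime now)
    {
        // Saturdays and Sundays map to the weekend scenario.
        if (now.DayOfWeek == DayOfWeek.Saturday || now.DayOfWeek == DayOfWeek.Sunday)
            return Scenario.Weekend;

        // Assumed split at noon between the weekday morning and evening scenarios.
        return now.Hour < 12 ? Scenario.WeekdayMorning : Scenario.WeekdayEvening;
    }
}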


5.3 Database Design

The purpose of having a database is to support real-time storage and updates by the user or by a caretaker. For example, a doctor might update the required medicine list or the diet information for the user, or the user might update personal information like the daily schedule or the alarm settings. A caretaker can also update the personal information of a user with a disability. A mock-up database, named "profile-activity", was constructed and ran in the personal server for communicating with the applications embedded in the SEOs through the designed queries. During the database design, tables and queries were created for retrieving different information related to the applications. The database relationship tables are as follows:

Figure 5.11: Database Relationship Table
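As an illustration of how the personal server might run such a query, the following sketch reads entries from the profile-activity database. The thesis does not name the database engine, so an Access file accessed through OleDb is assumed, and the table and column names are hypothetical.

using System.Collections.Generic;
using System.Data.OleDb;

static class ProfileActivityDb
{
    // Hypothetical table 'reminders(user_id, reminder_text)'; the real schema
    // is only shown as the relationship diagram in figure 5.11.
    public static List<string> GetReminders(string userId)
    {
        var reminders = new List<string>();
        string connStr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=profile-activity.mdb";

        using (var conn = new OleDbConnection(connStr))
        using (var cmd = new OleDbCommand(
            "SELECT reminder_text FROM reminders WHERE user_id = ?", conn))
        {
            cmd.Parameters.AddWithValue("?", userId);
            conn.Open();
            using (OleDbDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    reminders.Add(reader.GetString(0));
            }
        }
        return reminders;
    }
}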


Chapter 6

Evaluation, discussion and limitations

6.1 Evaluation

6.1.1 Experimental Setup

The experiment was conducted as a scenario based evaluation in a living laboratory home environment designed for ubiquitous computing research, as shown in figure 6.2. 20 subjects (14 male and 6 female) participated in the quantitative evaluation aimed at obtaining precision and recall values. The youngest subject was 18 years old and the oldest 52. The subjects were asked to perform their regular activities based on the scenarios (weekday morning, weekday evening and weekend) while interacting with the SEOs. The following activities were performed by the subjects during the evaluation:

Weekday morning scenario based activities:
a) Getting up from the bed
b) Going to the toilet, brushing teeth and shaving
c) Dressing
d) Having breakfast
e) Leaving the home

Weekday evening scenario based activities:
a) Arriving home
b) Selecting a music CD and playing it
c) Fika/light food
d) Cooking
e) Having dinner
f) Watching TV/movies
g) Going to sleep

Weekend scenario based activities:
a) Preparing the shopping list
b) Weekend cooking
c) Having weekend dinner
d) Leaving the home at the weekend

Table 6.1: Subjects' daily activities based on the scenarios.

On average it took approximately 12 minutes for the subjects to get used to the gesture based interaction; a manual was provided to guide them through the different explicit gesture techniques. On average it took another 10 minutes for every subject to train their voice profile so it could be recognized through Microsoft's speech recognition wizard tool. Each subject's voice profile was stored and activated during the evaluation. The subjects were asked to wear the Bluetooth headset (used for controlling the speech user interface)


and the accelerometer embedded wristbands (used for the gesture based interaction) during the experiment.

Figure 6.1: User holding (1) Bluetooth headset, (2) accelerometers, (3) access point belt


6.1.2 Scenarios

The complete evaluation was conducted based on three scenarios: weekday morning, weekday evening and weekend.

Weekday Morning

The alarm wakes the user up in the morning (figure 6.2(a)) and he gives a speech input for the news headlines. After getting out of bed, he goes to the toilet, starts brushing his teeth and shaving (figure 6.2(b)), and initiates a gesture input for interacting with the weather application. He prepares his breakfast (figure 6.2(c)) and then gives a gesture input to inform himself about his daily schedule by interacting with the schedule application while having breakfast (figure 6.2(d)). He gets ready, and before leaving the home he interacts with the reminder application at the Entrance SEO (figure 6.2(e)).

Figure 6.2: User’s activities based on weekday morning scenario

Weekday Evening

The user arrives home and interacts with the reminder application through speech input at the entrance (figure 6.3(a)). He moves to the Bookshelf SEO and interacts with the music


application (figure 6.3(b)). While having his fika, he gives a gesture input for interacting with the transport application (figure 6.3(c)). He then interacts with the menu application through speech input while preparing his dinner (figure 6.3(d)). Finally he goes to bed after having his dinner (figures 6.3(e) and 6.3(f)).

Figure 6.3: User’s activities based on weekday evening scenario


Weekend

After going through his basic morning activities, the user interacts with the shopping application at the Fridge SEO (figure 6.4(a)). He prepares his meal and at the same time interacts with the medicine application (figure 6.4(b)). While having his dinner he searches for cultural activities through gesture input (figure 6.4(c)) and then leaves the home (figure 6.4(d)).

Figure 6.4: User's activities based on the weekend scenario


6.1.3 Results

Different methods, like handwritten notes and log files (user's status/location, performed activities, date, time, interaction type: neutral, active or passive, etc.), were used for obtaining the precision and recall values with respect to the GBI (Gesture Based Interaction) and the SUI (Speech User Interface). In information retrieval terms, precision is the percentage of retrieved documents that are relevant to the query, while recall is the percentage of relevant documents that are retrieved. True positive, false positive and false negative scores were counted for calculating the desired precision and recall values, and noted based on the user's interaction with the mentioned SEOs while performing the scenarios consisting of the daily activities. A true positive was counted when a user was trying to interact with an application and the interaction was acknowledged as expected. A false positive was counted when a user was not trying to interact but the system nevertheless acknowledged an interaction. A false negative was counted when a user was trying to interact but the system did not acknowledge it. Initially the SUI's recognition accuracy was low, possibly due to background noise. This problem was handled by restricting the languages and by providing a confirmation mechanism (Confirm Yes/Confirm No) before the major speech based commands. This helped in situations where the system would otherwise exit without the user's intentional input. The following table (6.2) shows the quantitative evaluation results for the gesture based interaction and the speech user interface:

                            True      False     False     Precision  Recall   User performed
                            positive  positive  negative  (in %)     (in %)   errors (in %)
Gesture Based Interaction   732       87        62        89.38      92.19    6.48
Speech User Interface       776       103       85        88.28      90.13    5.30

Table 6.2: The average quantitative evaluation result for the GBI (Gesture Based Interac-tion) and the SUI (Speech User Interface) [41]
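The reported percentages follow directly from the raw counts, using precision = TP/(TP+FP) and recall = TP/(TP+FN); the short C# check below recomputes them.

using System;

static class PrecisionRecallCheck
{
    static void Report(string name, int tp, int fp, int fn)
    {
        double precision = 100.0 * tp / (tp + fp);
        double recall = 100.0 * tp / (tp + fn);
        Console.WriteLine("{0}: precision {1:F2} %, recall {2:F2} %", name, precision, recall);
    }

    static void Main()
    {
        Report("GBI", 732, 87, 62);    // prints 89.38 % and 92.19 %
        Report("SUI", 776, 103, 85);   // prints 88.28 % and 90.13 %
    }
}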

6.2 Discussion and limitations

The subjects were invited to comment on the quality of the entire system. Many subjects acknowledged the possible benefit and usefulness of this system for elderly people and people with disabilities (blindness, physical impairment, mild cognitive disabilities etc.). The subjects felt more comfortable with the SUI (Speech User Interface) than with the GBI (Gesture Based Interaction), even though the average scores for the GBI (shown in table 6.2) were slightly higher than for the SUI. Possible reasons could be: (a) the possible health hazard, especially in the hand joints, during GBI, and (b) not being used to the gesture inputs. The subjects also had some doubts and arguments regarding the customized database handled in the personal server. A few subjects had queries like: (a) "Can I have the real weather report?" (b) "Can I have a real shopping list offered by a shopping mall?" (c) "Can I have the doctor's medicine list/diet list?" These questions raised by the subjects could not be addressed, due


to the limitations of the database and the lack of time. The possible cost and the overall design factors (mainly the hardware) were another concern among the users; these could not be fully explained to the subjects, due to the prototypical nature of the study. Possible background noise during gesture recognition (unexpected or unrecognized gesture inputs while performing daily activities like shaking hands or lifting the hands while cooking) was handled by recognizing each gesture input at an interval of 0.5 seconds (500 milliseconds). The overall GBI recognition rate was found promising after introducing the proposed timer concept and keeping the gesture based commands simple. Regular conversation between users was another limitation during speech recognition, handled by introducing confirmation dialogs to avoid unnecessary or unexpected recognition. User customizable gesture or speech based languages were not considered during this work, due to the complexity of the design, and were left for future work. The mock-up applications running on the SEOs had some limitations in their detailed functionality and GUI appearance; for example, subjects felt that the icons presented through the application windows should have a better look and feel. Subjects provided inputs to the personal server, which decided the recognition result for the given inputs and presented the required information on the desired SEO, whether selected by the wizard or by the subjects. Another limitation was that the entire system was designed with a one-directional flow of data. The subjects were asked to wear the access point belt (for better network signal strength), the Bluetooth headset and the accelerometer wristbands during the experiment. The users did not feel comfortable with the hardware design, especially when carrying the access point belt, and pointed it out as one of the system's limitations. This limitation can be overcome by embedding the mentioned wearable objects and the personal server architecture, along with the personalized database, in portable devices like a PDA, mobile phone, wristwatch or mp3 player. This combined technology has been left for future work, to be improved from the dimensional (size) as well as the computational (processing capability) point of view.


Chapter 7

Conclusion

During this work, the core concept of DMI (Distributed Multimodal Interaction) was developed by exploring multimodal interaction in a distributed manner. Different parts of this thesis, like the functionalities related to the individual modalities (speech and gesture), were quantitatively evaluated by calculating precision and recall values. During the experiment, the GBI based interaction scored a precision value of 89.38% and a recall value of 92.19%, whereas the SUI based interaction scored a precision value of 88.28% and a recall value of 90.13% (shown in table 6.2). While calculating the precision and recall values, due to the prototypical nature of the study, user driven errors or mistakes were ignored. The results were found promising, with some limitations mentioned in the previous chapter. Those limitations were handled through the proposed techniques (confirmation dialogs for the SUI and the timer concept for the GBI), and other limitations were mitigated by restricting the GBI and SUI languages. The main motivation behind this prototype was to address the limitations of the traditional WIMP interaction paradigm with the proposed DMI.


Chapter 8

Acknowledgements

I would like to express my thanks to Dipak Surie for his guidance, his feedback on the design and his overall help during the experiment. I would also like to thank Florian Jakeal for designing the smart home environment. My special thanks to Thomas Pederson for his valuable advice.

This work would not have been possible without the support and presence of my wifeSnigdha. Her support was always there for me in every situation, always motivating andencouraging me.

Last but not least, it was the support and motivation of all my friends and family members that helped me reach this goal.


References

[1] M. Weiser. The world is not a desktop. Interactions, pages 7–8, 1994.

[2] M. Weiser. Some computer science issues in ubiquitous computing. SIGMOBILE Mob. Comput. Commun. Rev., 3:12, 1999.

[3] M. Weiser and John S. Brown. Designing calm technology. Powergrid Journal, December 21, 1995.

[4] M. Satyanarayanan. Pervasive computing: vision and challenges. IEEE Personal Communications, 8(4):10–17, 2001.

[5] M. Weiser. The computer for the twenty-first century. Scientific American, pages 94–104, 1991.

[6] C. Benoit, J.C. Martin, C. Pelachaud, L. Schomaker, and B. Suhm. Audio-visual and multimodal speech systems, 2000.

[7] Andreas Ratzka and Christian Wolff. A pattern-based methodology for multimodal interaction design. Technical report, Institute for Media, Information and Cultural Studies, University of Regensburg, Germany, 2006.

[8] S. Oviatt, Ph. Cohen, L. Wu, J. Vergo, L. Duncan, B. Suhm, J. Bers, T. Holzman, T. Winograd, J. Landay, J. Larson, and D. Ferro. Designing the user interface for multimodal speech and gesture applications: State-of-the-art systems and research directions for 2000 and beyond. Human Computer Interaction, 15(4):263–322, 2000.

[9] Horst Rossler, Jurgen Sienel, Wieslawa Wajda, Jens Hoffmann, and Michaela Kostrzewa. Multimodal interaction for mobile environments. International Workshop on Information Presentation, 2001.

[10] Saija Lemmela, Akos Vetek, Kaj Makela, and Dari Trendafilov. Designing and evaluating multimodal interaction for mobile contexts. In Proceedings of the 10th International Conference on Multimodal Interfaces, pages 265–272. ACM, 2008.

[11] Saija L. Selecting optimal modalities for multimodal interaction in mobile and pervasive environments. In Proceedings of the Pervasive Workshop, 2008.

[12] M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor. Match: An architecture for multimodal dialogue systems. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 376–383, July 2002.


[13] W3c speech recognition grammar specification, March 2004. http://www.w3.org/TR/speech-grammar, accessed 2009-12-01.

[14] http://www.lumenvox.com/products/speech_engine, accessed 2009-01-12.

[15] http://www.nuance.com/naturallyspeaking/products/default.asp, accessed2009-01-12.

[16] Embedded viavoice sdk functionalities. http://www-01.ibm.com/software/pervasive/embedded_viavoice, accessed 2009-14-01.

[17] SpeechStudio Suite speech development environment. http://www.speechstudio.com/suite.htm, accessed 2009-14-01.

[18] Dataglove model 2 operating manual, 1989.

[19] S.S. Fels and G.E. Hinton. A neural network interface between a data-glove and a speech synthesizer. IEEE Transactions on Neural Networks, pages 2–8, January 1993.

[20] http://www.geocities.com/mellott124/powerglove.htm, accessed 2008-10-12.

[21] http://www.armchairarcade.com/aamain/content.php?article.39, accessed 2009-20-01.

[22] Rekimoto J. GestureWrist and GesturePad: Unobtrusive wearable interaction devices. In Proceedings of the IEEE International Symposium on Wearable Computers, pages 21–27, 2001.

[23] Rekimoto J., Khotake N., and Anzai Y. InfoStick: an interaction device for inter-appliance computing. In Proceedings of the International Symposium on Handheld and Ubiquitous Computing, 1999.

[24] T. Kanade and J.M. Rehg. DigitEyes: Vision-based hand tracking for HCI. In Proceedings of the 1994 IEEE Workshop on Motion of Non-Rigid and Articulated Objects, pages 16–22, 11-12 November 1994.

[25] T. Darrell and A. Pentland. Space-time gestures. In Proceedings of the Computer Vision and Pattern Recognition Conference, 1993.

[26] J. Segen and S. Kumar. GestureVR: vision-based 3D hand interface for spatial interaction. In Proceedings of the 6th ACM International Conference on Multimedia, pages 455–465, 1998.

[27] M. Fukumoto, K. Mase, and Y. Suenaga. 'Finger-Pointer': A glove free interface. In Conference on Human Factors in Computing Systems, pages 62–62. ACM, 1992.

[28] N. Jafarinaimi, J. Forlizzi, A. Hurst, and J. Zimmerman. Breakaway: an ambient display designed to change human behavior. In Conference on Human Factors in Computing Systems, pages 1945–1948. ACM, 2005.

[29] H. Ishii, C. Wisneski, S. Brave, A. Dahley, M. Gorbet, B. Ullmer, and P. Yarin. AmbientROOM: Integrating ambient media with architectural space. In Conference Summary of CHI, pages 18–23. ACM, 1998.


[30] Jeremy M. Heiner, Scott E. Hudson, and Kenichiro Tanaka. The information percolator: ambient information display in a decorative object. In Proceedings of the 12th Annual ACM Symposium on User Interface Software and Technology, pages 141–148, 1999.

[31] Elizabeth D. Mynatt, Jim Rowan, Annie Jacobs, and Sarah Craighill. Digital family portraits: Supporting peace of mind for extended family members. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 330–340, 2001.

[32] Louise Barkhuus and Anind Dey. Is context-aware computing taking control away from the user? In Proceedings of UbiComp, pages 150–156, 2003.

[33] Jason Small, Asim Smailagic, and Daniel P. Siewiorek. Determining user location for context aware computing. Technical report, Institute for Complex Engineered Systems, December 2000.

[34] S. Moeller, J. Krebber, A. Raake, P. Smeele, M. Rajman, M. Melichar, V. Pallotta, G. Tsakou, B. Kladis, A. Vovos, J. Hoonhout, D. Schuchardt, N. Fakotakis, T. Ganchev, and I. Potamitis. Inspire: evaluation of a smart-home system for information management and device. In Proceedings of the LREC 2004 International Conference, pages 1603–1606, May 2004.

[35] Sukkyoung Kwon, Ahyoung Choi, and Woontack Woo. Context based personalized health monitoring system in a smart home environment. In The 4th International Symposium on Ubiquitous VR, pages 99–100, 2006.

[36] Yoosoo Oh, Seiie Jang, Choonsung Shin, and Woontack Woo. Introduction of 'UbiHome' testbed. In ubiCNS 2005, pages 215–218, 2005.

[37] Manish Anand, Jalal Al-Muhtadi, M. Dennis Mickunas, and Roy H. Campbell. Smart home: A peek in the future. Technical report, Department of Computer Science, University of Illinois, December 1999.

[38] Jini architecture. http://java.sun.com/developer/products/jini, accessed 2009-20-01.

[39] Want R., Pering T., Danneel G., Kumar M., Sundar M., and Light J. The personal server: Changing the way we think about ubiquitous computing. In UbiComp 2002: Ubiquitous Computing, volume 2498/2002, pages 223–230, 2002.

[40] Leah M. Reeves, Jennifer Lai, James A. Larson, Sharon Oviatt, T. S. Balaji, Stephanie Buisine, Penny Collings, Phil Cohen, Ben Kraal, Jean-Claude Martin, Michael McTear, TV Raman, Kay M. Stanney, Hui Su, and Qian Ying Wang. Guidelines for multimodal user interface design. Communications of the ACM, pages 57–59, 2004.

[41] Dipak Surie, Dilip Roy, and Thomas Pederson. Towards distributed multimodal ambient interaction with smart everyday objects. Submitted for the 12th IFIP Conference on Human-Computer Interaction, Interact 2009 (unpublished), August 24-28, 2009.

[42] Albrecht Schmidt. Implicit human computer interaction through context. Personal and Ubiquitous Computing, 4:191–199, June 2000.

[43] Florian Jackel. Managing interaction between users and applications using a situative space model. Master's thesis, Umea University, 2009.

[44] http://www.phidget.com, accessed 2009-04-12.