ORIGINAL ARTICLE
GeeAir: a universal multimodal remote control device for home appliances
Gang Pan · Jiahui Wu · Daqing Zhang ·
Zhaohui Wu · Yingchun Yang · Shijian Li
Received: 1 June 2009 / Accepted: 22 October 2009 / Published online: 10 March 2010
© Springer-Verlag London Limited 2010
Abstract In this paper, we present a handheld device
called GeeAir for remotely controlling home appliances via
a mixed modality of speech, gesture, joystick, button, and
light. This solution is superior to the existing universal
remote controllers in that it can be used by the users with
physical and vision impairments in a natural manner. By
combining diverse interaction techniques in a single
device, the GeeAir enables different user groups to control
home appliances effectively, satisfying even the unmet
needs of physically and vision-impaired users while
maintaining high usability and reliability. The experiments
demonstrate that the GeeAir prototype achieves prominent
performance through standardizing a small set of verbal
and gesture commands and introducing the feedback
mechanisms.
Keywords Universal remote controller · Gesture recognition · Speech recognition · Smart home
1 Introduction
Nowadays, it is almost impossible for home inhabitants to
go for a day without interacting with the home appliances.
Although remote controls for home appliances such as
TVs, DVD players, windows, and lights serve ordinary
people well in terms of physical and emotional comfort, they
can do more for the dignity, security, and well-being
of elderly or disabled people [1]. One can imagine a
situation where a person has lost some of his/her physical
dexterity or mobility. In the absence of suitable controls,
he/she would need a caregiver to assist with the operation
of home appliances, with the attendant expense and loss of
independence and privacy. But with adequate assistance,
this person might be able to live independently at his/her
home.
The current home appliances are often equipped with
remote controllers operating via infrared (IR) light signals.
Each household is likely to own several remote controllers,
which are often incompatible with each other and have
different layouts. In order to reduce the number of remote
controls, universal remote controllers (URCs) were intro-
duced to merge the functions of individual controllers into
one device [2–5]. A URC learns IR command sets from
each appliance and operates the appliance selected by a
user. There are two fundamental steps involved in the
control procedure of a URC: target object selection and
command issuing. To select a target object for operation, a
user might press a button, turn a rotary wheel, or touch an
icon depending on how the panel of the URC is designed.
To issue a command, a user needs to point the controller to
the target appliance and press a specific button on the
controller. Subsequently, the controller emits the infrared
signal to the selected appliance for the specified operation.
G. Pan (✉) · J. Wu · Z. Wu · Y. Yang · S. Li (✉)
Department of Computer Science, Zhejiang University, Zhejiang, China
D. Zhang
Handicom Lab, Institut TELECOM SudParis, Evry, France
Pers Ubiquit Comput (2010) 14:723–735
DOI 10.1007/s00779-010-0287-7
Although URCs combine the functions of remote con-
trollers into one device, elderly and disabled home users
may still have difficulties in using a URC due to a number
of reasons: First, a URC has too many buttons that need to
be remembered, and several button presses may be needed
to achieve a simple function. Second, the buttons on a URC
may be too small for the elderly, physically disabled and
vision-impaired people to use. Finally, button operation is
just one modality to interact with the home appliances,
which may not be the most natural and efficient means for
human–machine interaction.
Speech and gesture are two natural ways that people
interact with each other. Much research has been done to
use speech, gesture, or eye-gaze to control home appli-
ances. However, there is limited success reported in the
literature on the deployment of these modalities due to the
constraint of each single modality. Controlling through a
spoken language or oral command is indeed straightfor-
ward for expressing intentions, but the single modality of
speech has the following limitations in real implementa-
tion: First, the accurate extraction and recognition of control
commands from everyday continuous speech is still difficult
due to the ambiguities of natural languages, especially in
noisy environments. Second, speech is not instant: some
commands need complex phrases or sentences, which
may take a long time to process and act on.
Using the single modality of gesture to control home
appliances has also been explored. Since the computer
vision-based gesture and eye-gaze control is highly
dependent on the lighting condition and camera facing
angle, it turns out to be rather difficult to accurately rec-
ognize gestures under poor lighting condition using a
camera-based system. In addition, it is also uncomfortable
and inconvenient if the user is required to face the camera
directly to complete a gesture. Different from the vision-
based gesture recognition approach, the accelerometer-
based gesture interaction is an emerging technique that
exploits the acceleration data of hand motion for recogni-
tion and control. No camera is required, only a wearable or
portable accelerometer-equipped device used in daily life, such
as a watch, a smartphone, or an MP3 player. These wireless-
enabled portable/wearable devices provide new possibili-
ties for interacting with a wide range of home appliances
such as doors, window curtains, TVs, etc.
In this paper, we present a universal multimodal remote
control device which unifies several interaction modalities
such as speech, gesture, button, joystick, and light, so that
home inhabitants ranging from common users to elderly,
physically disabled, and vision-impaired people are all able
to interact with the home appliances in the way they feel
comfortable. Specifically, we develop a universal multi-
modal remote controller, called GeeAir, which not only
provides comfort and convenience for common users in
controlling home appliances, but also meets the special
needs of physically and vision-impaired people in operat-
ing the home appliances to live independently and enjoy a
better quality of life.
The paper is organized as follows. First, the related work
on universal remote controllers and multimodal control
systems is summarized in Sect. 2. Then an overview of the
GeeAir system architecture is presented in Sect. 3. In Sect.
4, the key techniques to select the desired target appliance
for operation are described, followed by the introduction of
feedback mechanisms ensuring the reliable confirmation.
Section 5 proposes a standard set of hand gestures for
operating different home appliances and a novel algorithm
for the accelerometer-based gesture recognition. Section 6
reports the implementation details and the experimental
results of the speech/gesture recognition algorithms com-
pared to other existing algorithms. An initial evaluation of
the GeeAir prototype with 10 users is also given in this
section. Finally, we provide our conclusions for the design
and test of GeeAir and highlight some future research
directions in Sect. 7.
2 Related work
In the consumer electronics market, several universal
remote control products can be found in the home elec-
tronics stores. These products can be roughly categorized
into two groups, according to how the target appliance is
selected: button-based URCs and screen-based URCs. The
former group allocates a few buttons in the control panel of
the URC for appliance selection, where one button corre-
sponds to one appliance. For example, the Philips 4-in-1
URC has four buttons reserved in the panel to control TV/
VCR/DVD/SAT, respectively. Users select one of the four
appliances by pressing the corresponding button [2]. Since
the number of buttons in a URC control panel is fixed, the
extensibility of the button-based URCs is limited. The
screen-based URCs overcome this limitation by putting a
built-in mini-screen and a navigation button in the control
panel of URCs. When users press the navigation button, the
mini-screen shows the selected home appliance one after
another. When the target appliance appears in the screen,
the user completes the device selection by releasing the
button [3–5]. Apparently, both kinds of URCs only support
button-pressing as the single input modality, thus people
with limited motor skills, finger dexterity, or weak vision
might not be able to use these remote controls.
In parallel to the efforts of developing universal remote
controllers by consumer electronics manufacturers, there
has been a lot of research on universal GUI to enable
mobile devices for home appliance control. Different
approaches have been proposed to generate the universal
graphical user interface in various mobile platforms [6, 7].
All those solutions assume that users can navigate the GUI
on the tiny screen of a mobile device with a pen or button.
Thus, they support only one single input modality and
consequently cannot meet the needs of elders and those
with certain physical or vision impairment.
Compared to the single modality solutions, multimodal
control systems combine the strengths of multiple modal-
ities, and thus increase the applicability and usability of
human–machine interaction. To meet the different
requirements of varied users and applications, various
combinations of input and output modalities have been
explored in previous projects. For example, the seminal
work by Bolt [8] created a Put-That-There system where
people can use pointing gesture to select an object from a
virtual diagram of a room which is shown in a large-screen
display and subsequently use speech to operate on the
selected object. The EU HOME-AOM project [9, 10]
applied the mixed modality of speech, gesture, and GUI for
the home appliance control for disabled people, in which
speech and gesture were used to assist in the navigation of
GUI commands. GWindows [11] operated Microsoft
Windows applications by using speech to
move/close/minimize/maximize/scroll and using motion gestures to
determine the movement distance. Krum et al. [12] implemented a
system that helps users navigate a whole-earth 3D
visualization environment at a distance from the display. It
employs the Gesture Pendant [13] to track simple hand
motions and uses speech for navigation commands.
Different from those projects, our work intends to provide a
single, multimodal control device for a wider range of
home users, including the elders and those with physical or
vision impairment besides ordinary users. Our solution
supports a mixed modality of speech, gesture, button,
joystick and light as input and output, adapting to different
needs and interaction preferences of various user groups. In
addition, we use an accelerometer-based gesture recogni-
tion approach instead of the camera-based one used pre-
viously, which allows users to move freely in a ubiquitous
home environment and control the home appliance in any
lighting condition.
The closest research to our work is by Kela et al. [14]
who used several modalities to interact with a design studio
environment. The modalities explored include speech input
and output, gesture input, RFID-tag, a laser-tracked pen
and a mobile device with touch screen. Our work differs
from theirs in the following aspects:
(1) Kela et al.'s work uses diverse modalities in a
studio environment but deploys multiple devices to
control multiple applications, whereas we focus on
building a handy, single multimodal device for
controlling multiple home appliances.
(2) Kela et al.'s work takes the design studio as the
application environment, the designers as the user
group, and convenience and comfort as the design
goal. Instead, our research aims at a different, actually
larger, user group. We not only provide ordinary
home inhabitants with convenience and comfort, but
also elders and those with physical and vision
impairment. For example, we provide joystick as
one input modality which is very useful for people
with hand disability.
(3) In order to ensure the reliability and robustness of the
multimodal remote controller for elders and disabled
people, we introduce voice and light as feedback, so
that the desired control object can be reliably
identified even if speech recognition is not 100%
accurate. In our GeeAir solution, users are allowed to
use speech or joystick to select a target appliance for
operation and use voice and light to get feedback.
Such a solution can satisfy the needs of user groups
with impairments in speech, hearing, vision, or
hand function.
(4) Although we also use an accelerometer-based approach
for gesture control, as Kela et al. did, we developed
a novel and very different algorithm [15] which is
more accurate than the algorithm used in Ref. [14].
While they adopted an HMM (hidden Markov
model)-based approach for gesture recognition and
processed the acceleration data in the time domain
without conducting feature extraction, we process
the data in the frequency domain with feature extraction
to reduce the noise and variation of gesture data,
thus significantly improving the recognition
performance.
3 GeeAir: an overview
The design goal of GeeAir is to become a single universal
remote controller which serves not only common users but
also those physically disabled and vision-impaired people.
In the home environment as illustrated in Fig. 1, GeeAir
takes the inputs from the users to select a target appliance
first and then recognizes the predefined hand gesture of
users to control the selected target appliance. As described
before, the mixed modalities of speech, joystick, light, and
button are used for selecting a desired target appliance. In
order to avoid any potential error during the selection, two
feedback mechanisms are introduced in GeeAir design:
lighting feedback and voice echo.
The look and feel of the GeeAir prototype is shown in
Fig. 2, which borrows the design from the Nintendo Nunchuk.
The key components of GeeAir and their functionalities are
described as follows:
(1) A three-axis built-in accelerometer: to capture users'
3-D hand gesture signals.
(2) An eight-orientation joystick: to select a target
appliance efficiently.
(3) A built-in microphone: to acquire users' speech
commands.
(4) A speaker: to provide users with voice feedback and
reminders.
(5) Buttons A and B: to mark the beginning and end
of speech and gesture commands. These two buttons
are designed in different sizes and shapes, in order to
help users differentiate them by touch.
(6) A built-in digital signal processing unit: to handle the
computation involved in the processing of multimo-
dality inputs and outputs.
(7) A built-in communication unit: to send and receive
wireless signals.
The workflow of using GeeAir consists of three main
stages: appliance selection, feedback and confirmation, and
operation command issuing, as shown in Fig. 3. At any
moment, GeeAir has a current appliance for operation. The
current appliance is indicated by the light signal or voice
reminder. If a user intends to control an appliance
other than the current one, he/she needs to select the
desired one via the joystick or by speaking the target appliance
name. If speech is used, GeeAir will obtain the name of the
target appliance with speech recognition. The feedback for
appliance selection has two options: light signal (a con-
trollable light attached to each appliance) and voice echo,
which help users correct occasional errors of speech rec-
ognition of the target appliance name. If the current
appliance is exactly the one that the user wants to operate,
the user can wave the GeeAir in the air for the follow-up
operations. Then the gesture will be recognized by GeeAir
and the corresponding command will be issued to the
current appliance wirelessly.
Fig. 1 Illustration of the GeeAir for remote control of home appliances
Fig. 2 Conceptual illustration of the GeeAir's components for multimodal control. a A three-axis accelerometer, joystick, microphone, speaker, and two buttons are built into GeeAir; b two buttons (Button A and Button B) in the front view of GeeAir
4 Multimodal selection of a target appliance
4.1 Selecting via speech commands
Speech is one of the most natural ways of interaction
between humans and machines. However, for home
appliance control, it is still a great challenge to robustly
extract and recognize the control commands in real-life
environments using user-independent large-vocabulary
continuous speech recognition technology. In contrast,
small vocabulary recognition of isolated words is quite
reliable and accurate, verified by many successful prac-
tical applications.
GeeAir provides the option of selecting a target appli-
ance via speech commands. GeeAir will record the user's
utterance through the equipped microphone and then rec-
ognize the appliance name. In this case, the vocabulary to
be recognized is small because the number of home
appliances is limited and their names are relatively fixed. In
order to avoid the segmentation of the appliance name from
the natural utterance, users are asked to press Button A on
GeeAir to start speaking the appliance name for object
selection, and release the button after speaking the appli-
ance name.
For isolated word recognition, the commonly used
techniques include VQ (Vector Quantization), DTW
(Dynamic Time Warping), and HMM (Hidden Markov
Model) [16, 17]. For GeeAir, we build an isolated word
recognition system based on continuous density hidden
Markov model (CDHMM) [18]. The whole recognition
process consists of the following steps:
(1) Defining the lexicon: recording the words to be
recognized by the system. Each word is repeatedly
recorded several times by each participant.
(2) Feature extraction: the MFCC (Mel Frequency Cep-
strum Coefficient) feature vectors [19] are computed,
together with their first derivatives.
(3) Modeling words: for each word in the lexicon, a left-
to-right CDHMM is built with a number of states.
Each state is characterized by a Gaussian mixture
model (GMM).
(4) Training the models: the parameters of the distributions
in the GMMs and the state transition probabilities
within the CDHMMs are estimated using the
Baum–Welch algorithm [17].
(5) Recognition of a word: first, we compute the observations
of the word (its feature vectors); then the probability
that these observations were generated by each
word's CDHMM is computed using the Viterbi algorithm.
The word is recognized as the one whose model
has the highest probability.
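The recognition step (5) can be illustrated with a minimal sketch. It is not the HTK-based implementation used for GeeAir: it assumes a single diagonal Gaussian per state instead of a GMM, one-dimensional toy features instead of MFCC vectors, and hand-picked model parameters instead of Baum–Welch training. The names `log_gauss`, `viterbi_log_score`, and `recognize` are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def viterbi_log_score(obs, means, variances, log_stay, log_next):
    """Best-path log-likelihood of obs under a left-to-right HMM.

    The path starts in state 0, may only stay or move one state to the
    right at each step, and must end in the last state.
    """
    n = len(means)
    neg_inf = float("-inf")
    score = [neg_inf] * n
    score[0] = log_gauss(obs[0], means[0], variances[0])
    for x in obs[1:]:
        new = [neg_inf] * n
        for j in range(n):
            best = score[j] + log_stay                        # stay in state j
            if j > 0:
                best = max(best, score[j - 1] + log_next)     # advance from j-1
            if best > neg_inf:
                new[j] = best + log_gauss(x, means[j], variances[j])
        score = new
    return score[-1]

def recognize(obs, models):
    """Return the word whose model gives the highest Viterbi score."""
    return max(models, key=lambda w: viterbi_log_score(obs, *models[w]))

# Toy 3-state word models (assumed values, for illustration only).
half = math.log(0.5)
models = {
    "lamp":  ([[0.0], [1.0], [2.0]], [[1.0]] * 3, half, half),
    "radio": ([[5.0], [4.0], [3.0]], [[1.0]] * 3, half, half),
}
print(recognize([[0.1], [0.2], [1.1], [0.9], [2.0]], models))
```

Working in the log domain avoids numerical underflow when many frame probabilities are multiplied, which is why the sketch adds log-probabilities instead of multiplying probabilities.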
4.2 Selecting via joystick
The second modality GeeAir provides to select a target
appliance is through the built-in joystick. The joystick is a
traditional input device for machine control, e.g. in trucks and CT
scanners, as well as in video games. It outperforms buttons in
navigation due to its continuity, fast reaction, and the near absence of
relative movement between the hand and the joystick during the
controlling process. Thus, the joystick is a good choice for
selecting objects that are arranged in a circle in physical
space.
The operation principle of the joystick is illustrated in
Fig. 4. The accessible area is octagonal. There are two
states defined for joystick operation: inactive and active.
The inactive state indicates that the joystick is not pushed and stays
in the middle of the octagon; the active state indicates that the
joystick is pushed to the edge of the octagon at any angle.
The eight valid joystick positions are: north, northeast, east,
southeast, south, southwest, west, and northwest. Each
position occupies 45 degrees.
Fig. 3 Workflow of GeeAir (flowchart: begin; select a target appliance by rotating the joystick or speaking its name; feedback by signal light or voice echo; if wrong, select again; if right, operate the current appliance with gesture and command issuing; continue operating the current appliance, yes or no)
A user can move the joystick along the octagon to select
appliances in the physical spatial space. Intuitively, an
octagonal joystick can be matched to eight appliances
statically. However, because the number of appliances differs
from household to household, GeeAir associates positions with
appliances dynamically and relatively. A valid position
is not necessarily associated with a fixed device. Instead,
when a user intends to select an appliance, the
position to which he/she first pushes the joystick is
dynamically associated with the currently selected appliance.
When the user rotates the joystick to a neighboring position,
the current appliance also shifts to its neighboring
appliance. Whether the nearest one on the left or on the right
is selected depends on the user's rotating direction, i.e.
counter-clockwise or clockwise. The dynamic association
ensures the flexibility when the number of appliances
varies. Thus, any number of appliances can be easily
navigated by using the joystick.
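The dynamic association can be sketched as follows. This is an illustration under stated assumptions, not the GeeAir firmware: the deflection threshold, the convention that clockwise rotation moves to the next appliance in the list, and the names `joystick_position` and `ApplianceSelector` are all hypothetical.

```python
import math

POSITIONS = ["north", "northeast", "east", "southeast",
             "south", "southwest", "west", "northwest"]

def joystick_position(x, y, threshold=0.8):
    """Map a deflection (x toward east, y toward north, each in [-1, 1])
    to an index into POSITIONS, or None when the joystick is inactive."""
    if math.hypot(x, y) < threshold:
        return None
    angle = math.degrees(math.atan2(x, y)) % 360   # 0 = north, clockwise
    return int(((angle + 22.5) % 360) // 45)       # each position spans 45 degrees

class ApplianceSelector:
    """Positions are associated with appliances relatively, not statically."""

    def __init__(self, appliances, current=0):
        self.appliances = list(appliances)
        self.current = current
        self.last_pos = None

    def update(self, pos):
        if pos is None:
            self.last_pos = None        # released: the next push re-anchors
        elif self.last_pos is None:
            self.last_pos = pos         # first push maps to the current appliance
        else:
            # Signed number of 45-degree steps, clockwise positive.
            step = (pos - self.last_pos + 4) % 8 - 4
            self.current = (self.current + step) % len(self.appliances)
            self.last_pos = pos
        return self.appliances[self.current]

selector = ApplianceSelector(["television", "lamp", "curtain"])
selector.update(joystick_position(1.0, 0.0))          # push east: still television
print(selector.update(joystick_position(1.0, -1.0)))  # rotate clockwise to southeast
```

Because only the change in position matters, the same code works for three appliances or thirty, which is the flexibility the dynamic association is meant to provide.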
4.3 Feedback mechanism
GeeAir has two kinds of feedback mechanisms available for
confirmation purpose: voice echo and signal light. GeeAir
has a built-in mini-speaker, which can replay the name of the
appliance when the appliance is selected by either speech or
joystick. Voice echo informs the user whether the object
recognized by the system is the one that he/she intends
to select. If a controllable LED light is attached to each
appliance, the lights can be used as feedback, i.e. the red
LED light of the selected appliance is turned on for user
confirmation while the other lights remain off.
For joystick-based appliance selection, the light feed-
back will immediately occur as soon as the joystick
changes a position, that is, when the joystick moves from
one position to another, the light signal will also shift from
one appliance to the next. The instant lighting during
joystick rotation is very helpful for users because of the
quick response of joystick operations. However, the voice
echo cannot occur for every position passed if the joystick
rotates too fast, because there is not enough time for the voice
echo. For this reason, GeeAir sets a movement speed limit of
one position per second for voice echo. If the joystick stays in
a position for less than 1 second, the voice echo of the
appliance associated dynamically with this position will be
suppressed. Any voice echo can be interrupted by rotating
joystick to the next position when users know that the
current one is not the desired one, which helps users to
speed up the selection process.
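The one-position-per-second rule can be sketched as an offline filter over a log of position changes. This is a simplification for illustration: the real device would start or suppress playback in real time, and the function name `voice_echoes`, the event format, and the `end_time` parameter are assumptions.

```python
def voice_echoes(events, end_time, min_dwell=1.0):
    """Return the appliances whose names are echoed by voice.

    events: list of (time_in_seconds, appliance), ordered by time, one entry
    per joystick position change. An appliance is echoed only if the joystick
    stays on its position for at least min_dwell seconds; the light feedback,
    by contrast, would switch on every event.
    """
    echoed = []
    for i, (start, appliance) in enumerate(events):
        leave = events[i + 1][0] if i + 1 < len(events) else end_time
        if leave - start >= min_dwell:
            echoed.append(appliance)
    return echoed

log = [(0.0, "television"), (0.3, "lamp"), (0.6, "curtain"), (2.0, "radio")]
print(voice_echoes(log, end_time=2.4))
```

In this example the user sweeps quickly past the television and the lamp, pauses on the curtain, and leaves the radio before one second has elapsed, so only the curtain's name is spoken.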
With the feedback mechanisms, if the user finds the
recognized object is not the desired one, he/she can correct
it immediately by repeating the appliance selection. Thus,
the command issuing for a wrong appliance could be
avoided. Any of the two feedback mechanisms can be
combined with one of the two selection schemes introduced
previously, i.e., there are four combinations available:
speech-voice, speech-light, joystick-voice, joystick-light.
Both feedback modalities, voice and light, are suitable
for motor-impaired people, and they also free users from
reading on-screen prompts. The voice-based feedback is
suitable for anyone with normal hearing. Although the
signal light requires users' vision, recognizing the binary
states of a light, ON and OFF, is less demanding than
reading the semantic information in text or pictures on a screen.
5 Operating an appliance via gesture
After the target appliance is selected, GeeAir uses gesture
commands to operate it. Gestures performed with GeeAir are
recognized based on acceleration data acquired by the built-
in three-axis accelerometer [15]. Compared to the camera-
based gesture recognition techniques [20], the accelerome-
ter-based gesture recognition does not rely on lighting con-
ditions and camera facing angle, and also does not require
any deployment of devices in the environment. Similar to
issuing speech commands, users begin a gesture by pressing
Button B and end it by releasing the button, which avoids the
accuracy degradation caused by gesture segmentation.
5.1 Gesture command definition
In order to enable effective gesture-based interaction,
several requirements must be met when designing a set of
gesture commands for home appliances: (1) the semantic
connection between gestures and commands should be
natural, so that the meaning of a gesture is easy to learn and
remember for users; (2) gestures should be simple and
terse, avoiding those that require high precision over a long
period of time. Moreover, they should be quick to perform
and repeat, without causing fatigue over time; (3) the
gesture commands for different appliances should be
consistent, i.e., similar operations of different appliances
should be defined as the same gesture to reduce the size of
the gesture vocabulary which the users have to learn.
Fig. 4 Octagonal accessible area of the joystick. Each position covers 45 degrees. The joystick can be rotated either clockwise or counter-clockwise to change the position
Usually there are two different ways employed in gesture
command definition: user-dependent and user-independent.
Previous work focuses more on user-dependent gesture
recognition [21–23], where each user is required to perform a
couple of gestures as training/template samples before using
the system. In this case, users are requested to personalize a
remote controller by mapping each operation to a certain
gesture they find suitable and comfortable. However, the
training process is still a burden for users, although some
work [23, 24] has been done on optimizing recognition
algorithms to reduce the size of training sample set. GeeAir
aims at user-independent gesture recognition and control.
Different users will share a common set of gesture com-
mands and do not need to train GeeAir from person to person.
In this paper, we define a nine-gesture vocabulary to
control the frequently used functions of seven categories of
home appliances, as listed in Table 1. The gesture of
Forward–Backward is performed in the XY plane, and the
other eight gestures are waved in the YZ plane.
(1) The gesture of Forward–Backward is performed as if
pushing an ON/OFF switch button on a control panel
of electronic appliances.
(2) The swinging gestures of Up and Down are very
natural to express the meaning of up and down, e.g.
volume up/down, temperature up/down.
(3) Similarly, the two gestures of Left and Right are also
natural to represent the meaning of previous and next.
(4) The gestures of Double-Left and Double-Right, denoting
a fast move toward the left or right, suggest
fast backward/fast forward.
(5) The gesture of the letter V, implying a tick or rising
up, suggests a Play operation. Additionally, we
follow the convention that most current players
use the same button for the Play and
Pause operations.
(6) The gesture of Inverted V implies a decreasing trend,
which we define as a Stop operation.
Note, however, that Up/Down and Double-Left/Double-Right
are continuous commands rather than instant
ones; for example, adjusting the volume or the
curtains is a continuous operation. In order to avoid
repeatedly performing the same gesture, when such commands
are recognized, GeeAir will continuously issue the
command at a certain interval until the user pushes Button B
or the setting reaches its maximum.
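The distinction between instant and continuous commands can be sketched as a lookup plus a repetition loop. The dictionary below copies only a small subset of Table 1, and the function `issue`, its parameters, and the tick-based simulation of the repeat interval are hypothetical; the actual interval and stopping logic of GeeAir are not specified here.

```python
# Gestures that GeeAir treats as continuous commands.
CONTINUOUS_GESTURES = {"up", "down", "double_left", "double_right"}

# Subset of Table 1, keyed by (appliance, gesture).
COMMANDS = {
    ("television", "forward_backward"): "ON/OFF",
    ("television", "up"): "Vol. up",
    ("television", "down"): "Vol. down",
    ("television", "left"): "Prev. channel",
    ("television", "right"): "Next channel",
    ("lamp", "forward_backward"): "ON/OFF",
    ("lamp", "up"): "Brtn. up",
    ("lamp", "down"): "Brtn. down",
}

def issue(appliance, gesture, steps_to_limit=10, stop_tick=None):
    """Return the list of commands sent for one recognized gesture.

    steps_to_limit: how many repetitions fit before the setting hits its limit.
    stop_tick: repetition index at which the user pushes Button B (None = never).
    """
    command = COMMANDS.get((appliance, gesture))
    if command is None:
        return []                      # gesture not defined for this appliance
    if gesture not in CONTINUOUS_GESTURES:
        return [command]               # instant command: sent exactly once
    sent = []
    for tick in range(steps_to_limit):
        if stop_tick is not None and tick >= stop_tick:
            break                      # user pushed Button B
        sent.append(command)           # one repetition per interval
    return sent

print(issue("television", "up", stop_tick=3))
```

A lookup keyed by appliance and gesture keeps the gesture vocabulary shared across appliances, which matches the consistency requirement stated above.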
5.2 Gesture recognition with FDSVM
GeeAir employs the algorithm FDSVM [15], proposed by
the authors, to recognize gesture commands from acceler-
ation data. FDSVM uses a frame-based descriptor to
compactly represent a gesture, which reduces noise and
variation in the gesture data, and thus improves the gesture
recognition performance significantly.
The FDSVM system has two main phases, training and
recognizing, and four components: acceleration data
acquisition, feature extraction, training SVM, and recognition
by SVM, as shown in Fig. 5. The former two components
are shared by the training and recognizing phases.
Table 1 Definition of gesture commands for appliances
Appliance       | Forward–backward | Up; down              | Left; right                 | Double-left; double-right | V; inverted-V
Television      | ON/OFF           | Vol. up; Vol. down    | Prev. channel; Next channel | -                         | -
DVD             | ON/OFF           | Vol. up; Vol. down    | Prev. track; Next track     | F Forward; F Backward     | Play/pause; Stop
Radio           | ON/OFF           | Vol. up; Vol. down    | Prev. channel; Next channel | -                         | -
Speaker         | ON/OFF           | Vol. up; Vol. down    | -                           | -                         | -
Air conditioner | ON/OFF           | Temp. up; Temp. down  | -                           | -                         | -
Lamp            | ON/OFF           | Brtn. up; Brtn. down  | -                           | -                         | -
Curtain         | Open/Close       | Curt. up; Curt. down  | -                           | -                         | -
Vol, Volume; F Forward, Fast Forward; F Backward, Fast Backward; Temp, Temperature; Brtn, Brightness; Curt, Curtain
5.2.1 Feature extraction: frame-based gesture descriptor
The three-axis accelerometer built into GeeAir senses discrete
acceleration samples of a gesture along three orthogonal spatial
axes. We denote a gesture command as:
$G = (a_x, a_y, a_z)$
where $a_x$, $a_y$, $a_z$ are the acceleration sequences from the three
axes. We divide a gesture into $N + 1$ segments of identical
length, and every two adjacent segments make up a
frame, so that consecutive frames overlap by one segment length, as illustrated in Fig. 6.
We employ five features in both the frequency and spatial
domains to characterize each frame:
In the frequency domain (discrete Fourier transform (DFT)
on each frame per axis):
(1) mean $\mu$: the DC component over the frame
(2) energy $\varepsilon$: the sum of the squared DFT component
magnitudes except the DC component, subsequently
divided by the number of components for
the purpose of normalization.
(3) entropy $\delta$: the normalized information entropy of the
DFT component magnitudes with the DC component
excluded.
In the spatial domain:
(4) standard deviation $\sigma$: indicates the amplitude variability
of a gesture
(5) correlation $\gamma$ among the axes: implies the strength of
a linear relationship between each pair of axes.
We combine all features extracted as described above to
form a feature vector $s$, which represents the gesture
command itself. Considering 5 features per frame per axis,
3 axes, and $N$ frames per gesture, the dimension of the
feature vector is $d = 5 \times 3 \times N = 15N$.
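The frame-based descriptor can be sketched in a few lines. This is not the authors' FDSVM code: the exact normalizations (dividing the DC term by the frame length, normalizing the entropy by the logarithm of the number of components) and the function names `dft`, `axis_features`, `correlation`, and `gesture_descriptor` are assumptions made for illustration. A naive DFT is used so the sketch needs only the standard library.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def axis_features(frame):
    """Mean, energy, entropy (frequency domain) and standard deviation of one axis."""
    n = len(frame)
    spectrum = dft(frame)
    mean = spectrum[0].real / n                      # DC component
    mags = [abs(c) for c in spectrum[1:]]            # DC component excluded
    energy = sum(m * m for m in mags) / len(mags)
    total = sum(mags)
    entropy = 0.0
    if total > 0:
        probs = [m / total for m in mags if m > 0]
        entropy = -sum(p * math.log(p) for p in probs) / math.log(len(mags))
    std = math.sqrt(sum((v - mean) ** 2 for v in frame) / n)
    return [mean, energy, entropy, std]

def correlation(a, b):
    """Pearson correlation between two axes (0.0 if either axis is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def gesture_descriptor(ax, ay, az, n_frames):
    """Split a gesture into n_frames + 1 segments; each frame is two adjacent
    segments. Returns a vector of 15 * n_frames features."""
    seg = len(ax) // (n_frames + 1)
    features = []
    for k in range(n_frames):
        frames = [axis[k * seg:(k + 2) * seg] for axis in (ax, ay, az)]
        for frame in frames:
            features.extend(axis_features(frame))    # 4 features x 3 axes
        for i, j in ((0, 1), (1, 2), (0, 2)):
            features.append(correlation(frames[i], frames[j]))  # 3 axis pairs
    return features

ax = [math.sin(0.3 * t) for t in range(40)]
ay = [math.cos(0.3 * t) for t in range(40)]
az = [0.1 * t for t in range(40)]
print(len(gesture_descriptor(ax, ay, az, n_frames=4)))
```

Counting the three pairwise correlations as one feature per axis gives 5 features for each of the 3 axes in every frame, which is how the sketch arrives at 15 values per frame.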
5.2.2 Gesture classification: multiclass SVM
Suppose there are two types of gestures, GTR1 and GTR2,
to be classified. We denote the training set with $n$
samples as
$\{(s_i, g_i)\}, \quad i = 1, \ldots, n$
where $s_i \in \mathbb{R}^d$ represents the feature vector of a gesture command and
$g_i = \begin{cases} +1, & \text{if } s_i \text{ belongs to GTR1} \\ -1, & \text{if } s_i \text{ belongs to GTR2} \end{cases}$
A separating plane can be written as
$w \cdot s + b = 0$
which can be obtained by solving a dual convex quadratic
programming problem [25].
The extension to multiple-gesture classification is
achieved by a multiclass SVM using the one-versus-one
strategy or the one-versus-all strategy. SVM is a method for
dealing with highly non-linear classification and regression
problems. Benefiting from the structural risk minimization
principle and the avoidance of over-fitting through its soft margin,
SVM usually outperforms traditional parameter estimation
methods, which are based on the Law of Large Numbers,
when only limited training data are available.
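The one-versus-one strategy can be sketched as majority voting over pairwise separating planes $w \cdot s + b = 0$. The planes below are hand-picked placeholders for a two-dimensional toy feature space, not trained SVMs, and the text does not state which of the two strategies GeeAir actually uses; the names `decision` and `predict_one_vs_one` are hypothetical.

```python
from collections import Counter

def decision(w, b, s):
    """Signed value of w . s + b for feature vector s."""
    return sum(wi * si for wi, si in zip(w, s)) + b

def predict_one_vs_one(s, classifiers):
    """classifiers maps (positive_class, negative_class) -> (w, b).

    Each pairwise classifier votes for one of its two classes; the class
    with the most votes wins.
    """
    votes = Counter()
    for (pos, neg), (w, b) in classifiers.items():
        votes[pos if decision(w, b, s) >= 0 else neg] += 1
    return votes.most_common(1)[0][0]

# Placeholder planes separating three toy gesture classes centred at
# up = (0, 1), down = (0, -1), left = (-1, 0).
classifiers = {
    ("up", "down"):   ((0.0, 1.0), 0.0),
    ("up", "left"):   ((1.0, 1.0), 0.0),
    ("down", "left"): ((1.0, -1.0), 0.0),
}
print(predict_one_vs_one((0.1, 0.9), classifiers))
```

With nine gestures, the one-versus-one strategy needs 36 pairwise classifiers, whereas one-versus-all needs nine; the voting logic above stays the same regardless of the number of classes.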
6 Evaluations
6.1 Implementation
We build a prototype of GeeAir, including hardware and
algorithm implementation, to verify the design and
performance. Currently, the GeeAir can acquire speech and
gesture commands with two buttons, and perform
joystick-based selection. The software, including algorithms for
speech recognition and gesture recognition, is still
implemented on a PC instead of GeeAir. We use Bluetooth to
connect the GeeAir and the PC.
Fig. 5 Block diagram of the FDSVM gesture recognition system (acceleration data acquisition; feature extraction, comprising frame segmentation and feature calculation; training SVM; recognition by SVM)
Fig. 6 Illustration of segments and frames for a gesture (Segment 0 to Segment N; Frame 0 to Frame N-1)
6.1.1 Hardware setup
The GeeAir prototype is built on the Nintendo Wiimote
for acceleration sensing and its expansion, the Nunchuk, for
joystick selection. It has a 3-D accelerometer, a joystick,
and two buttons: Button A and Button B (inspired by
Button C and Button Z of the Nunchuk). The built-in microphone
and speaker of GeeAir are simply replaced with a
Bluetooth wireless headphone connected to a laptop computer.
The Wiimote is also employed to establish communication
between the laptop computer and GeeAir.
GeeAir uses Bluetooth for non-directional wireless
communication. However, most current appliances
adopt infrared remote controllers and are therefore unable
to receive Bluetooth signals. We developed a Bluetooth-infrared
Adaptor (BI Adaptor) to convert the Bluetooth
signals to infrared signals; it will become unnecessary once
appliances are able to communicate via Bluetooth. The
signal light for the feedback mechanism is also embedded in
the BI Adaptor, shown in Fig. 7.
6.1.2 Algorithms implementation
For isolated word recognition in GeeAir, the lexicon has
12 words for seven categories of home appliances, as shown in
Table 2. The utterances are recorded at a 16 kHz sampling
frequency with 16-bit resolution. A 26-dimensional MFCC
feature vector (13 cepstral coefficients
and their first derivatives) is employed, computed
with a window size of 32 ms and a step size of 16 ms. Each
word is represented by a trained left-to-right CDHMM
with 3 states, implemented on the basis of
HTK (Hidden Markov Model Toolkit) [26]. The eight-dimension
mixture Gaussian distribution is used for modeling the states.
We use 6 Baum-Welch re-estimation iterations.
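The framing parameters translate into sample counts as sketched below. The 1.4 s utterance length is only an illustrative value, and the convention of discarding a trailing partial window is an assumption made here, not a statement about how HTK frames the signal.

```python
def frame_layout(sample_rate_hz, window_ms, step_ms, duration_ms):
    """Return (window length, step length, number of full windows), in samples."""
    window = sample_rate_hz * window_ms // 1000
    step = sample_rate_hz * step_ms // 1000
    total = sample_rate_hz * duration_ms // 1000
    # Assumed convention: a trailing partial window is discarded.
    n_frames = 0 if total < window else 1 + (total - window) // step
    return window, step, n_frames


window, step, n_frames = frame_layout(16000, 32, 16, 1400)
print(window, step, n_frames)  # 512 256 86
print(n_frames * 26)           # 2236 MFCC values for this utterance
```

At 16 kHz, a 32 ms window holds 512 samples and a 16 ms step advances 256 samples, so consecutive windows overlap by half.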
Gesture recognition with FDSVM for GeeAir uses the
open source FFTW package [27] for the discrete
Fourier transform. Then five features (mean, energy,
entropy, correlation, and standard deviation) are calculated
for each axis in each frame. The feature vector is
finally passed to a classifier, either to train an SVM
model or to retrieve the recognized gesture type. The SVM
component uses the SVMmulticlass package [28].
Further details can be found in [15].
Fig. 7 Components of the Bluetooth-infrared adaptor
Table 2 Speech vocabulary of twelve Chinese words for seven appliances
No.  Appliances       Chinese words
1    Television       Dian shi
                      Dian shi ji
2    DVD player       DVD
3    Radio            Shou yin
                      Shou yin ji
4    Speaker          Yin xiang
                      Yin xiang
5    Air conditioner  Kong tiao
6    Lamp             Dian deng
                      Tai deng
                      Ri guang deng
7    Curtain          Chuang lian
6.2 Data acquisition
To evaluate GeeAir's performance in spoken command
recognition and gesture recognition, we built a speech
database with 7 appliance names and a gesture acceleration
database with 9 gestures. Both databases were collected from 10
persons, 5 males and 5 females. The collection
procedure lasted 5 days.
The vocabulary of the speech database includes 12 Chinese
words for 7 appliances, listed in Table 2. Some of the
appliances have more than one name, depending on
users' habits. Each user was required to record each word
4 times per day. Thus, each user has 20 samples for each
Chinese word.
For the gesture acceleration database, each participant
was asked to perform each gesture for 6 repetitions per
day. Thus, there are 6 × 5 × 9 × 10 = 2,700 samples
(repetitions × days × gestures × participants).
The start and end of a gesture were labeled by pressing
Button B on the Wiimote during data acquisition.
Figure 8 illustrates the acquisition devices. We divided the 9
gestures into 3 groups, as listed in Table 3, in order to
evaluate usability for different potential appliances.
For example, Group 1 is for the speaker, air
conditioner, lamp, and curtain; Group 2 is for the television
and radio.
We employed leave-one-day-out cross-validation for
the user-dependent case and leave-one-person-out
cross-validation for the user-independent case in the speech and
gesture experiments. For leave-one-day-out cross-validation,
we divide all the samples into five partitions,
each holding one day's samples (namely 60 samples
per gesture per partition, and 40 samples per word per
partition). Each time, four of the five partitions are used for
training and the remaining partition for testing. We
repeat this five times and take the average recognition
rate. For leave-one-person-out cross-validation,
the data of nine participants (out of ten) are used as the
training set, and the data of the remaining participant are used as
the testing set.
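The two splitting schemes can be checked against the stated sample counts with a short sketch over the gesture database. The tuple layout (user, day, gesture, repetition) and the helper names are hypothetical, chosen here only to enumerate the 2,700 samples.

```python
from itertools import product

N_USERS, N_DAYS, N_GESTURES, N_REPS = 10, 5, 9, 6


def build_index():
    """One (user, day, gesture, repetition) tuple per recorded gesture sample."""
    return list(product(range(N_USERS), range(N_DAYS), range(N_GESTURES), range(N_REPS)))


def leave_one_out(samples, key):
    """Yield (held_out_value, train, test), holding out one key value per fold."""
    for value in sorted({key(s) for s in samples}):
        train = [s for s in samples if key(s) != value]
        test = [s for s in samples if key(s) == value]
        yield value, train, test


samples = build_index()
day_folds = list(leave_one_out(samples, key=lambda s: s[1]))     # user-dependent case
person_folds = list(leave_one_out(samples, key=lambda s: s[0]))  # user-independent case

print(len(samples))                                       # 2700
print(len(day_folds), len(day_folds[0][2]))               # 5 540
print(sum(1 for s in day_folds[0][2] if s[2] == 0))       # 60 samples per gesture per day
print(len(person_folds), len(person_folds[0][2]))         # 10 270
```

The reported recognition rate is then the average of the per-fold rates.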
6.3 Speech recognition accuracy
Using the 12-word speech data described previously, the
experimental results show that user-dependent speech
recognition achieves an accuracy of 98.21%, and
user-independent recognition achieves a recognition rate of
91.79%. Figure 9 illustrates the recognition performance
over time in the user-dependent case.
Fig. 8 Acquisition devices of gesture acceleration data
Table 3 The nine gestures are divided into three groups for the gesture recognition experiments
No.  Size  Gesture
1    3     Forward-backward, up, down
2    5     Forward-backward, up, down, left, right
3    9     Forward-backward, up, down, left, right, double-left, double-right, V, inverted-V
Fig. 9 User-dependent speech recognition result varying over time (x-axis: Day 1 to Day 5 and Average; y-axis: recognition rate, 50% to 100%)
6.4 Gesture recognition accuracy
6.4.1 Experiment 1: effect of frame number N
The purpose of analyzing a gesture in frames rather than as
a whole is to describe its local characteristics over
successive time spans. The frame number N determines how
finely a gesture is described. Intuitively, the more frames a
gesture is broken up into, the more details are known about
the gesture. However, too large a frame number N may lead
to over-fitting. It also increases the dimension of the
feature space, which increases computational complexity.
This experiment examines the effect of varying N.
Figure 10 shows the experimental results for varying
frame number N using the data set of Group 3. As can be
seen, the recognition rate is higher in the middle of both curves
and lower at both ends. This result supports our
assumption that the features convey little discriminative
information when N is too small, and that over-fitting
occurs when N is too large. The recognition
accuracy is clearly lower than the rest when N is 2. The
two curves are nearly flat when N is between 4 and 7. In the
following experiments, we choose N = 5.
6.4.2 Experiment 2: user-dependent gesture recognition
In this experiment, to demonstrate the performance of our
method, we compare it with four methods: the C4.5 decision
tree, Naive Bayes, DTW, and HMM. We
employed Quinlan's implementation of C4.5 [29] for
comparison purposes.
We carried out the experiments and comparison tests on
each of the 3 groups of the data set. The comparison
results are shown in Fig. 11. When recognizing the three
gestures of Group 1, all five approaches obtain a
recognition rate of more than 90%, with our proposed
FDSVM achieving 99.17% (slightly lower than DTW at
99.76%). When the number of gesture types increases, the
performance of HMM and DTW decreases significantly. In
contrast, our FDSVM method performs well even when
recognizing all 9 gestures, with a recognition rate of
96.40%.
6.4.3 Experiment 3: user-independent gesture recognition
In the user-independent case, the system is trained
before users start to use it. Such an implementation spares
users the effort of performing gestures to provide training data.
The results of the user-independent gesture recognition test
and comparison are shown in Fig. 12. As expected, the
recognition rate of user-independent gesture recognition is
lower than that of the user-dependent case. Our FDSVM has
stable recognition performance as the number of
gesture types increases. It achieves a recognition rate of
94.17% for the 3 gestures of Group 1 and 91.07% for the 9
gestures of Group 3. DTW achieves a recognition rate of
97.38% for Group 1 and 95.78% for Group 2, slightly
outperforming our method. However, our FDSVM
significantly outperforms DTW on the 9 gestures of Group 3. The
result reveals that our FDSVM generalizes well
with respect to the number of gesture types.
Fig. 10 Experimental result for various frame number N (x-axis: frame number, 2 to 19; y-axis: recognition rate, 50% to 100%; curves: user dependent, user independent)
Fig. 11 Experimental results for the user-dependent case (x-axis: Group No., 1 to 3; y-axis: recognition rate, 0% to 100%; methods: FDSVM, Naive Bayes, C4.5, DTW, HMM)
Fig. 12 Experimental result for the user-independent case (x-axis: Group No., 1 to 3; y-axis: recognition rate, 0% to 100%; methods: FDSVM, Naive Bayes, C4.5, DTW, HMM)
6.5 Response time test
We set up 8 home appliances as control objects in the
laboratory: a curtain, two lights, a TV, an air-conditioner, a
speaker, and a DVD player. We then recruited 10 graduate
students in the laboratory for the experiments, none of
whom had used the GeeAir before. The following series of
tasks was defined, and the users were tested one after
another:
1. Use speech to select a target appliance (one of eight).
After a red light feedback from the system for
confirmation, perform gestures to control the
appliance.
2. Use the joystick to repeat the same task as Step 1.
3. With the participant's eyes covered to simulate the
situation of a blind person, use speech to select a
target appliance (one of eight). After a voice feedback
from the system, perform gestures to control the
appliance.
4. Use the joystick to repeat the same task as Step 3.
Table 4 shows the average response time of the different
stages when the students use the GeeAir prototype. Selecting
a target is faster with the joystick than with speech,
because speaking an appliance name alone takes about 1.4 s.
The computational cost of recognition is less than 0.5 s for
both speech and gesture. For a
user, the response time of light feedback is nearly negligible
(only 43 ms). The gesture command procedure,
including gesture action and gesture recognition, takes
0.483 s on average.
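The stage times in Table 4 can be combined to estimate the duration of a full interaction for each of the four tasks. The sketch below assumes that the three stages run strictly one after another and that their average times simply add up; the paper reports only the per-stage averages, so the totals are a derived illustration rather than measured results.

```python
# Average stage times in milliseconds, taken from Table 4.
STAGE_MS = {
    "selection": {"joystick": 1266, "speech": 1397 + 406},  # speech: speaking + recognition
    "feedback": {"light": 43, "voice": 736},
    "gesture": 426 + 57,                                    # action + recognition
}


def total_ms(selection, feedback):
    """Sum of the three stage averages, assuming the stages run sequentially."""
    return (STAGE_MS["selection"][selection]
            + STAGE_MS["feedback"][feedback]
            + STAGE_MS["gesture"])


for task, (selection, feedback) in enumerate(
        [("speech", "light"), ("joystick", "light"),
         ("speech", "voice"), ("joystick", "voice")], start=1):
    print(task, selection, feedback, total_ms(selection, feedback))
# 1 speech light 2329
# 2 joystick light 1792
# 3 speech voice 3022
# 4 joystick voice 2485
```

Under this additive assumption, the choice of selection and feedback modality accounts for all of the difference between tasks, since the gesture stage is the same 483 ms in each.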
7 Conclusions
We have developed a handheld, universal multimodal
remote control device, called GeeAir, for controlling home
appliances via a mixed modality of speech,
gesture, joystick, button, and light. Compared to
existing universal remote controllers, GeeAir enables
even those with physical, hearing, and vision impairments to
control home appliances in a natural manner. Compared to
existing multimodal solutions for interacting with smart
environments, GeeAir provides a handy, single-device
solution, not only providing comfort and convenience for
ordinary users in controlling home appliances but also
meeting the special needs of physically and vision-impaired
people in operating them.
Each single modality, such as speech, gesture, joystick, button,
or light, has its own strengths and weaknesses. By
combining these diverse but complementary modalities
and integrating them into a single device, different
home user groups can always find a combination of
modalities with which they feel comfortable interacting with
the environment. GeeAir represents an interesting attempt toward
bringing multimodal interaction techniques closer to
the everyday life of home users, particularly those who
need assistance for independent living.
Speech and gesture are two of the most natural ways in which people
interact with each other. Even though continuous speech
and gesture recognition techniques are still not mature
enough to be deployed in real applications, we achieved very
good performance in our work by standardizing a small
set of easily learned verbal commands and gestures and by
introducing feedback mechanisms.
Multimodal interaction devices are necessary for mobile
and ubiquitous environments. The GeeAir prototype permits
us to begin developing the design space for mapping
interactions with multimodal commands. Such a space will
be necessary for optimally supporting different home users
in different contexts.
The initial test results show clear benefits of the multimodal
device GeeAir over universal remote controllers
and other single-modality solutions. In the future, we
plan to conduct a series of formal evaluations of GeeAir
with real home users, including elderly and disabled
inhabitants. Hopefully, the study will shed light on the
cognitive load of various combinations of modalities
(speech-gesture, joystick-gesture, speech-button, and
joystick-button), in order to further improve the future design
of GeeAir.
Acknowledgments The authors would like to thank the anonymous reviewers for their comments and suggestions. The laboratory
students' participation in the experiments is greatly appreciated. This
work is supported in part by the National High-Tech Research and
Development (863) Program of China (No. 2008AA01Z132,
2009AA011900), the Natural Science Fund of China (No. 60525202,
60533040), and the France ICT-Asia I-CROSS program. Dr. Shijian
Li is the corresponding author.
Table 4 Average response time of different stages (unit: millisecond)
Target selection                              Feedback    Gesture (action + recognition)
Joystick: 1266                                Light: 43   426 + 57
Speech (speaking + recognition): 1397 + 406   Voice: 736
References
1. Campbell LW (1997) A more universal remote control. http://web.media.mit.edu/~lieber/Teaching/Collab97/Collab-Projects/remote.html
2. http://www.consumer.philips.com/consumer/en/gb/consumer/cc/_categoryid_3000_SERIES_REMOTE_CONTROL_SU_GB_CONSUMER/[4-in-1TV/VCR/DVD/SAT]
3. http://www.oneforall.co.uk/en_UK/product/1/universal-remotes/3/advanced/25/digital-12
4. http://www.logitech.com/index.cfm/remotes/universal_remotes/devices/3898&cl=us,en
5. http://www.universalremote.com/product_detail.php?model=158
6. Lee L, Johnson T (2006) URCousin: universal remote control user interface. In: Proceedings of the Human Interface Technologies Conference, April 2006
7. Niezen G, Hancke GP (2008) Gesture recognition as ubiquitous input for mobile phones. International Workshop on Devices that Alter Perception (DAP'08), in conjunction with Ubicomp'08, 2008
8. Bolt RA (1980) Put-that-there: voice and gesture at the graphics interface. SIGGRAPH'80, pp 262-270
9. Machate J, Burmester M, Bekiaris E (1997) Towards an intelligent multimodal and multimedia user interface providing a new dimension of natural HMI in the teleoperation of all home appliances by E&D users. 6th International Conference Man-Machine Interactions Intelligent Systems in Business, Montpellier, May 1997, pp 226-229
10. Machate J (1999) Being natural: on the use of multimodal interaction concepts in smart homes. In: Proceedings of the HCI International '99, pp 937-941
11. Wilson A, Oliver N (2003) GWindows: robust stereo vision for gesture-based control of windows. In: Proceedings of the 5th international conference on multimodal interfaces, New York, NY, USA, pp 211-218
12. Krum DM, Omoteso O, Ribarsky W, Starner T, Hodges LF (2002) Speech and gesture multimodal control of a whole Earth 3D visualization environment. In: Proceedings of Symposium on Data Visualization, Barcelona, Spain, pp 195-200
13. Starner T, Auxier J, Ashbrook D, Gandy M (2000) The gesture pendant: a self-illuminating, wearable, infrared computer vision system for home automation control and medical monitoring. International Symposium on Wearable Computers (ISWC'00), pp 87-95
14. Kela J, Korpipaa P, Mantyjarvi J, Kallio S, Savino G, Jozzo L, Marca D (2006) Accelerometer-based gesture control for a design environment. Personal Ubiquitous Computing 10:285-299
15. Wu J, Pan G, Li S, Zhang D (2009) Gesture recognition with a 3D accelerometer. The Sixth International Conference on Ubiquitous Intelligence and Computing (UIC-09), Brisbane, Australia, 7-9 July, 2009
16. Rabiner L, Levinson L (1981) Isolated and connected word recognition: theory and selected applications. IEEE Trans Commun 29(5):621-659
17. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257-286
18. Lee C-H, Lin C-H, Juang B-H (1991) A study on speaker adaptation of the parameters of continuous density hidden Markov models. IEEE Trans Signal Process 39(4):806-814
19. Davis SB, Mermelstein P (1980) Comparison of parametric representation for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process 28:357-366
20. Mitra S, Acharya T (2007) Gesture recognition: a survey. IEEE Trans Syst Man Cybern Part C 37(3):311-324
21. Schlomer T, Poppinga B, Henze N, Boll S (2008) Gesture recognition with a Wii controller. International Conference on Tangible and Embedded Interaction (TEI'08), pp 11-14, Bonn, Germany, Feb. 18-20, 2008
22. Mantyla V-M (2001) Discrete hidden Markov models with application to isolated user-dependent hand gesture recognition. VTT publications
23. Liu J, Wang Z, Zhong L, Wickramasuriya J, Vasudevan V (2009) uWave: accelerometer-based personalized gesture recognition and its applications. IEEE PerCom'09, 2009
24. Mantyjarvi J, Kela J, Korpipaa P, Kallio S (2004) Enabling fast and effortless customization in accelerometer based gesture interaction. Proceedings of the 3rd International Conference on Mobile and Ubiquitous Multimedia (MUM'04), ACM Press, pp 25-31, October 27-29
25. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
26. HTK: http://htk.eng.cam.ac.uk/
27. Frigo M, Johnson SG (2005) The design and implementation of FFTW3. Proc IEEE 93(2)
28. Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods: support vector learning. MIT Press
29. Quinlan JR (1996) Improved use of continuous attributes in C4.5. J Artif Intell Res 4:77-90