

HUMAN-ROBOT INTERACTION WITH DEPTH-BASED GESTURE RECOGNITION

Gabriele Pozzato, Stefano Michieletto, Emanuele Menegatti
IAS-Lab, University of Padova, Padova, Italy
Email: [email protected], {michieletto, emg}@dei.unipd.it

Fabio Dominio, Giulio Marin, Ludovico Minto, Simone Milani, Pietro Zanuttigh
Multimedia Technology and Telecommunication Lab, University of Padova, Padova, Italy
Email: {dominiof, maringiu, mintolud, simone.milani, zanuttigh}@dei.unipd.it

Abstract—Human-robot interaction is a very heterogeneous research field and it is attracting growing interest. A key building block for a proper interaction between humans and robots is the automatic recognition and interpretation of gestures performed by the user. Consumer depth cameras (like MS Kinect) have made possible an accurate and reliable interpretation of human gestures. In this paper a novel framework for gesture-based human-robot interaction is proposed. Both hand gestures and full-body gestures are recognized through the use of depth information, and a human-robot interaction scheme based on these gestures is proposed. In order to assess the feasibility of the proposed scheme, the paper presents a simple application based on the well-known rock-paper-scissors game.

Keywords—Human-Robot Interaction, Gesture Recognition, Depth, Kinect, SVM

I. INTRODUCTION

Human-Robot Interaction (HRI) is a field of investigation that has recently been gaining particular interest, involving different research areas from engineering to psychology [1]. The development of robotics, artificial intelligence and human-computer interaction systems has brought to the attention of researchers new relevant issues, such as understanding how people perceive the interaction with robots [2], what robots should look like [3] and in what manners they could be useful [4].

In this work we focused on the communication interface between humans and robots. The aim is to apply novel approaches for gesture recognition based on the analysis of depth information in order to allow users to naturally interact with robots [5]. This is a significant and sometimes undervalued perspective for human-robot interaction.

Hand gesture recognition using vision-based approaches is an intriguing topic that has attracted large interest in recent years [6]. It can be applied in many different fields, including robotics and in particular human-robot interaction. A human-to-machine communication can be established by using a predefined set of poses or gestures recognized by a vision system. It is possible to associate a precise meaning (or a command the robot will perform) with each sign. This makes the interplay with robots more intuitive and enhances the human-robot interaction up to the level of a human-human one. In this case the robot also plays a different role: during social interactions it must have an active part in order to really engage the human user. For example, by recognizing body language, a robot can understand the mood of the human user and adapt its behavior to it [7].

Until a few years ago, most approaches were based on videos acquired by standard cameras [6], [8]; however, relying on a simple 2D representation is a great limitation, since that representation is not always sufficient to capture the complex movements and inter-occlusions of the hand shape.

The recent introduction of low-cost consumer depth cameras, e.g., Time-of-Flight (ToF) cameras and the MS Kinect™ [9], has opened the way to a new family of algorithms that exploit the depth information for hand gesture recognition. The improved accuracy in modelling three-dimensional body poses has made it possible to remove the need for physical devices in human-robot interaction, such as gloves, keyboards or other controllers.

HRI systems based on depth cameras apply machine-learning techniques to a set of relevant features extracted from the depth data. The approach in [10] computes a descriptor from silhouettes and cell occupancy features; this is then fed to a classifier based on action graphs. Both the approaches in [11] and [12] use volumetric shape descriptors processed by a classifier based on Support Vector Machines (SVM). Another possibility is to exploit the histograms of the distances of hand edge points from the hand center in order to recognize the gestures [13], [14], [15]. Different types of features can also be combined together [16].

The paper is organized in the following way. Section II presents the proposed hand gesture recognition strategy, while Section III describes the body-pose detection algorithm. These tools are integrated in a human-robot interaction scheme, which is described in Section IV. A case study referring to a simple rock-paper-scissors game is presented in Section V. Finally, Section VI draws the conclusions.

II. HAND GESTURE RECOGNITION FROM DEPTH DATA

Fig. 1 shows a general overview of the employed hand gesture recognition approach, which is based on the method proposed in [16].


Fig. 1. Pipeline of the proposed approach: depth and color data → extraction of the hand region → circle fitting on the palm → PCA → extraction of hand parts (palm, finger and wrist samples; the wrist is discarded) → distance and curvature feature extraction → multi-class SVM (trained on training data) → recognized gesture.

In the first step, the hand is extracted from the depth data acquired by the Kinect. Then, a set of relevant features is extracted from the hand shape. Finally, a multi-class Support Vector Machine classifier is applied to the extracted features in order to recognize the performed gesture.

A. Hand Recognition

In the first step the hand is extracted from the color and depth data provided by the Kinect. Calibration information is used to obtain a set of 3D points X_i from the depth data. The analysis starts by extracting the point in the depth map closest to the user (X_c). The points that have a depth value close to the one of X_c and a 3D distance from it within a threshold of about 20 cm are extracted and used as a first estimate of the candidate hand region. A further check on hand color and size is then performed to avoid recognizing possible objects positioned at the same distance as the hand. At this point the highest-density region is found by applying a Gaussian filter with a large standard deviation to the mask representing the extracted points. Then, starting from the highest-density point, a circle is fitted on the hand mask such that its area roughly approximates the area corresponding to the palm (see Fig. 2). Principal Component Analysis (PCA) is then applied to the detected points in order to obtain an estimate of the hand orientation. Finally, the hand points are divided into the palm region (set P, i.e., the points inside the circle), the fingers (set F, i.e., the points outside the circle in the direction of the axis pointing to the fingertips found by the PCA), and the wrist region (set W, i.e., the points outside the circle in the opposite direction w.r.t. the fingertips).
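For illustration, the segmentation step can be sketched as follows. This is a minimal NumPy/OpenCV outline, not the authors' implementation: only the 20 cm threshold comes from the text, while the circle-growing criterion, the 0.95 coverage ratio, the Gaussian standard deviation and helper names such as _circle_inside_ratio are assumptions introduced here.

import numpy as np
import cv2

def segment_hand(points_3d, depth_mask, hand_radius_m=0.20, sigma_px=25):
    """Rough hand segmentation following the steps described in the text.

    points_3d : (H, W, 3) array of 3D points from the calibrated depth map
    depth_mask: (H, W) boolean mask of valid depth samples
    Returns the candidate hand mask, the palm center (pixels) and the palm radius.
    """
    # 1) The point closest to the sensor is assumed to belong to the hand.
    depth = points_3d[..., 2].astype(np.float64).copy()
    depth[~depth_mask] = np.inf
    cy, cx = np.unravel_index(np.argmin(depth), depth.shape)
    Xc = points_3d[cy, cx]

    # 2) Keep points within ~20 cm of Xc as the candidate hand region.
    dist = np.linalg.norm(points_3d - Xc, axis=2)
    hand_mask = (dist < hand_radius_m) & depth_mask

    # 3) Gaussian filter with a large std. dev. -> highest-density point.
    density = cv2.GaussianBlur(hand_mask.astype(np.float32), (0, 0), sigma_px)
    py, px = np.unravel_index(np.argmax(density), density.shape)
    center = (int(px), int(py))

    # 4) Grow a circle around the density peak while it stays inside the hand,
    #    so that its area roughly approximates the palm (assumed criterion).
    radius = 5
    while _circle_inside_ratio(hand_mask, center, radius + 1) > 0.95:
        radius += 1
    return hand_mask, center, radius

def _circle_inside_ratio(mask, center, radius):
    """Fraction of the circle of given center/radius covered by the mask."""
    circle = np.zeros(mask.shape, dtype=np.uint8)
    cv2.circle(circle, center, radius, 1, thickness=-1)
    return mask[circle.astype(bool)].mean() if circle.any() else 0.0

The PCA-based orientation estimate and the palm/finger/wrist partition would then be applied to the points selected by this mask.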

B. Feature Extraction

Two different types of features are computed from the data extracted in the previous step. The first set of features consists of a histogram of the distances of the hand points from the hand center. Basically, the center of the circle is used as reference point and a histogram is built, as described in [11], by considering for each angular direction the maximum of the point distances from the center, i.e.:

L(\theta_q) = \max_{X_i \in I(\theta_q)} d_{X_i}    (1)

Fig. 2. Extraction of the hand: a) acquired color image; b) acquired depth map; c) extracted hand samples (the closest sample is depicted in green); d) output of the Gaussian filter applied on the mask corresponding to H, with the maximum highlighted in red; e) circle fitted on the hand; f) palm (blue), finger (red) and wrist (green) region subdivision.

where I(\theta_q) is the angular sector of the hand corresponding to the direction \theta_q and d_{X_i} is the distance between point X_i and the hand center. The computed histogram is compared with a set of reference histograms L^r_g(\theta), one for each gesture g. The maximum of the correlation between the current histogram L(\theta_q) and a shifted version of the reference histogram L^r_g(\theta) is used in order to compute the precise hand orientation and refine the results of the PCA. A set of angular regions I(\theta_{g,j}) associated with each finger j = 1, ..., 5 in the gesture g is defined on the histogram, and the maximum inside each region, normalized by the middle finger's length, is selected as feature value:

f^d_{g,j} = \max_{I(\theta_{g,j})} \frac{L_g(\theta)}{L_{max}}    (2)

where g = 1, ..., G for an alphabet of G gestures. Note that the computation is performed for each of the candidate gestures, thus obtaining a different feature value for each of them.
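A compact sketch of the distance-feature computation of Eqs. (1) and (2) might look as follows. It is a NumPy outline under the assumption that the palm center and the hand samples projected onto the palm plane are already available; the reference histograms L^r_g and the per-finger sectors I(\theta_{g,j}) are application data (here passed in as hypothetical bin-index pairs), not shown.

import numpy as np

def distance_histogram(hand_xy, center, n_bins=180):
    """Eq. (1): for each angular sector, the maximum distance from the center."""
    d = hand_xy - center                        # 2D offsets in the palm plane
    angles = np.arctan2(d[:, 1], d[:, 0])       # direction of each hand sample
    dists = np.linalg.norm(d, axis=1)
    bins = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    L = np.zeros(n_bins)
    np.maximum.at(L, bins, dists)               # max distance per angular bin
    return L

def align_to_reference(L, L_ref):
    """Circularly shift L to maximize correlation with the reference histogram."""
    scores = [np.dot(np.roll(L, s), L_ref) for s in range(len(L))]
    return np.roll(L, int(np.argmax(scores)))

def distance_features(L_aligned, finger_sectors, L_max):
    """Eq. (2): per-finger maxima normalized by the middle finger length L_max.
    finger_sectors is a list of (lo, hi) bin-index pairs, one per finger."""
    return np.array([L_aligned[lo:hi].max() / L_max for lo, hi in finger_sectors])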

The second feature set is based on the curvature of the hand contour. This descriptor is based on a multi-scale integral operator [17], [18] and is computed as described in [16] and [15]. The algorithm takes as input the edges of the palm and finger regions and the binary mask representing the hand region on the depth map. A set of circular masks with increasing radius centered on each edge sample is built (we used S = 25 masks with radius varying from 0.5 cm to 5 cm; notice that the radius corresponds to the scale level at which the computation is performed).

The ratio between the number of samples falling inside the hand shape and the size of the mask is computed for each circular mask. The values of the ratios represent the local curvature of the edge and range from 0 for a convex section of the edge to 1 for a concave one, with 0.5 corresponding to a straight edge. The [0, 1] interval is then quantized into N bins and the feature values f^c_{b,s} are computed by counting the number of edge samples having a curvature value inside bin b at scale level s.
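The curvature descriptor can be sketched along these lines. This is an illustrative (and deliberately unoptimized) NumPy version, assuming the hand mask and its edge samples in pixel coordinates are available and that the 0.5-5 cm radii have already been converted to pixels.

import numpy as np

def curvature_features(hand_mask, edge_points, radii_px, n_bins=8):
    """For each edge sample and each scale, the fraction of mask pixels inside a
    circular neighborhood measures local curvature (0 convex, 1 concave,
    0.5 straight edge). Ratios are histogrammed into n_bins per scale level."""
    H, W = hand_mask.shape
    ys, xs = np.mgrid[0:H, 0:W]
    features = np.zeros((n_bins, len(radii_px)))
    for s, r in enumerate(radii_px):                  # one scale level per radius
        for (ex, ey) in edge_points:
            circle = (xs - ex) ** 2 + (ys - ey) ** 2 <= r ** 2
            ratio = hand_mask[circle].mean()          # fraction inside the hand
            b = min(int(ratio * n_bins), n_bins - 1)  # quantize [0, 1] into bins
            features[b, s] += 1
    return (features / len(edge_points)).flatten()    # B x S entries, normalized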


Fig. 3. Feature vectors extracted from the two devices: distance features for gestures 1 to 4 and curvature features at scale levels 1 to 5.

The curvature values are finally normalized by the number of edge samples, and the feature vector F^c with B × S entries C_i, i = 1, ..., B × S, is built, where B is the number of bins and S is the number of employed scale levels.

C. Hand Gesture Classification

The feature extraction approach provides two feature vectors, each one describing relevant properties of the hand samples. In order to recognize the performed gestures, the two vectors are concatenated and sent to a multi-class Support Vector Machine classifier. The target is to assign the feature vectors to G classes (with G = 3 for the sample rock-paper-scissors game of the experimental results) corresponding to the various gestures of the considered database. A multi-class SVM classifier based on the one-against-one approach has been used, i.e., a set of G(G − 1)/2 binary SVM classifiers is used to test each class against each other, and each output is counted as a vote for a certain gesture. The gesture with the maximum number of votes is selected as the output of the classifier. In particular, we used the SVM implementation in the LIBSVM package [19], together with a non-linear Gaussian Radial Basis Function (RBF) kernel tuned by means of grid search and cross-validation on a sample training set.
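The paper uses LIBSVM directly; an equivalent sketch with scikit-learn (whose SVC class wraps LIBSVM and already implements the one-against-one scheme) could be the following. The parameter grids are illustrative, not the values used by the authors.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_gesture_classifier(distance_feats, curvature_feats, labels):
    """Concatenate the two feature vectors and train a one-vs-one RBF SVM,
    tuning C and gamma by grid search with cross-validation."""
    X = np.hstack([distance_feats, curvature_feats])   # one row per sample
    grid = GridSearchCV(
        SVC(kernel="rbf", decision_function_shape="ovo"),
        param_grid={"C": 10.0 ** np.arange(-1, 4),
                    "gamma": 10.0 ** np.arange(-4, 1)},
        cv=5)
    grid.fit(X, labels)
    return grid.best_estimator_

# Usage: clf = train_gesture_classifier(Xd, Xc, y); gesture = clf.predict(x_new)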

III. BODY GESTURE RECOGNITION FROM SKELETAL TRACKING

In this work skeletal tracking capabilities come from a software development kit released by OpenNI. In particular, the skeletal tracking algorithm is implemented in a freeware middleware called NITE, built on top of the OpenNI SDK.

NITE uses the information provided by an RGB-D sensor to estimate the position of several joints of the human body at a frame rate of 30 fps, which is a good trade-off between speed and precision. Differently from approaches based only on RGB images, these kinds of sensors, which also rely on depth data, allow the tracker to be more robust with respect to illumination changes.

The skeleton information provided by the NITE middleware consists of N = 15 joints from head to foot. Each joint is described by its position (a point in 3D space) and its orientation (expressed using quaternions). On these data we perform two kinds of normalization: first we scale the joint positions in order to normalize the skeleton size, thus achieving invariance among different people; then we normalize every feature to zero mean and unit variance. Starting from the normalized data, we extract three kinds of descriptors: a first skeleton descriptor (d_P) is made of the joint positions concatenated one after the other, the second one (d_O) contains the normalized joint orientations, and finally we also consider the concatenation of both the position and the orientation (d_{TOT}) of each joint:

d_P = [x_1 \; y_1 \; z_1 \; \ldots \; x_N \; y_N \; z_N],    (3)

d_O = [q^1_1 \; q^2_1 \; q^3_1 \; q^4_1 \; \ldots \; q^1_N \; q^2_N \; q^3_N \; q^4_N],    (4)

d_{TOT} = [d^1_P \; d^1_O \; \ldots \; d^N_P \; d^N_O].    (5)
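A minimal sketch of the descriptor construction of Eqs. (3)-(5) is given below, assuming the N = 15 joints arrive as arrays of positions and quaternions. The choice of the head-to-foot distance as scale reference and the application of the mean/variance normalization only to d_TOT are simplifying assumptions of this sketch; the statistics themselves would come from the training set.

import numpy as np

def skeleton_descriptors(positions, quaternions, mean=None, std=None):
    """positions: (N, 3) joint positions; quaternions: (N, 4) joint orientations.
    Returns the three descriptors d_P, d_O and d_TOT of Eqs. (3)-(5)."""
    # Scale normalization: divide by an assumed reference length (head-to-foot)
    # so that skeletons of different people become comparable.
    scale = np.linalg.norm(positions[0] - positions[-1]) + 1e-9
    pos = positions / scale

    d_P = pos.reshape(-1)                        # Eq. (3)
    d_O = quaternions.reshape(-1)                # Eq. (4)
    d_TOT = np.hstack([np.hstack([p, q])         # Eq. (5): position + orientation
                       for p, q in zip(pos, quaternions)])

    if mean is not None and std is not None:     # zero mean, unit variance
        d_TOT = (d_TOT - mean) / std
    return d_P, d_O, d_TOT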

These descriptors are provided as features for the classification process. A Gaussian Mixture Model (GMM)-based classifier has been used in order to compare the set of predefined body gestures to the ones performed by the user. The described technique could be extended by considering a sequence of descriptors equally spaced in time, in order to apply the resulting features to human activity recognition [20], [21] and broaden the interaction capabilities of our framework.
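The GMM-based classification could be realized, for instance, with one mixture per gesture class and a maximum-likelihood decision, as in the scikit-learn sketch below (an illustrative stand-in, not the authors' implementation; the number of components and covariance type are assumptions).

import numpy as np
from sklearn.mixture import GaussianMixture

class GMMGestureClassifier:
    """One Gaussian mixture per body gesture; classify by highest log-likelihood."""

    def __init__(self, n_components=3):
        self.n_components = n_components
        self.models = {}

    def fit(self, descriptors, labels):
        """descriptors: (M, D) array of skeleton descriptors, labels: (M,) array."""
        for g in np.unique(labels):
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type="diag")
            gmm.fit(descriptors[labels == g])
            self.models[g] = gmm

    def predict(self, descriptor):
        """Return the gesture whose mixture gives the highest log-likelihood."""
        scores = {g: m.score(descriptor.reshape(1, -1))
                  for g, m in self.models.items()}
        return max(scores, key=scores.get)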

IV. ROBOT CONTROL AND SYSTEM INTEGRATION

The information provided by both the hand and the skeletal gesture recognition systems is analyzed in order to obtain a proper robot motion. The robot behavior depends on the application, but the complete framework can be separated into different layers in order to easily extend the basic system to new tasks. We created three independent layers: gesture recognition, decision making and robot controller. These layers are connected to one another by means of a widely used robotic framework: the Robot Operating System.

Robot Operating System (ROS) [22] is an open-source, meta-operating system that provides, together with standard services peculiar to an operating system (e.g., hardware abstraction, low-level device control, message-passing between processes, package management), tools and libraries useful for typical robotics applications (e.g., navigation, motion planning, image and 3D data processing). The primary goal of ROS is to support the reuse of code in robotics research and, therefore, it presents a simplified but lightweight design with clean functional interfaces.

During a human-robot interaction task, the three layers play different roles. The gesture recognition layer deals with the acquisition of information from a camera in order to recognize gestures performed by users. The decision making layer is on a higher level of abstraction: information is processed through ROS and classified by an AI algorithm in order to make the robot take its decision. Finally, the robot controller layer is responsible for motion; this layer physically modifies the behavior of the robot by sending commands to the servomotors. Even though the three levels are strictly independent, also in terms of implementation, they can interact by sending or receiving messages through ROS, as sketched below. This means that we can use different robots, AI algorithms or vision systems while maintaining the same interface.
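As an illustration of how the decision making layer might exchange messages with the other two layers over ROS, consider the following rospy node. The topic names, node name and the placeholder policy are hypothetical and are not taken from the paper; the actual decision is made by the GMM-based predictor of [23].

#!/usr/bin/env python
# Decision making layer: subscribes to recognized gestures published by the
# gesture recognition layer and publishes high-level commands for the robot
# controller layer. Topic names are illustrative only.
import rospy
from std_msgs.msg import String

def choose_robot_move(predicted_human_move):
    # Placeholder policy: counter the move we expect the human to play.
    # The paper instead forecasts it with a GMM over the last three rounds [23].
    return {"rock": "paper", "paper": "scissors",
            "scissors": "rock"}.get(predicted_human_move, "rock")

def on_gesture(msg):
    # msg.data carries the recognized gesture as a plain string in this sketch.
    cmd_pub.publish(String(data=choose_robot_move(msg.data)))

if __name__ == "__main__":
    rospy.init_node("decision_making")
    cmd_pub = rospy.Publisher("/robot/command", String, queue_size=1)
    rospy.Subscriber("/gesture_recognition/result", String, on_gesture)
    rospy.spin()

Because the layers only share these message interfaces, swapping the LEGO NXT controller or the vision front-end does not require changes to the decision making node.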

V. CASE STUDY: A SIMPLE ROCK-PAPER-SCISSORS GAME

In order to test the developed framework, a very simple yet popular game has been used, namely rock-paper-scissors. Users are asked to play against a robotic opponent over several rounds. At each round, both players (human and robot) have to perform one of the 3 gestures depicted in Fig. 4. The winner is chosen according to the well-known rules of the game. Game settings and moves can be selected by human users without any remote controls or keyboards: only the gesture recognition system is involved, so that the interaction is as natural and user-friendly as possible.


Fig. 4. Sample depth images for the 3 gestures: rock, paper and scissors.

Fig. 5. Poses: a) corresponds to 3 games, b) to 2 games and c) to 1 game. Each game is composed of three rounds.


Users are able to select the number of games to play by assuming one of the poses shown in Fig. 5, once they are recognized by the system. Each pose corresponds to a number varying from 1 to 3, and each game is composed of three rounds. Once a round is started, the hand gesture recognition algorithm identifies the move played by the human. At the same time, the decision making system chooses the robot gesture by using an Artificial Intelligence (AI) algorithm we developed [23]. The algorithm is based on the use of a Gaussian Mixture Model (GMM): the idea behind this approach is to use a history of the three previous rounds in order to forecast the next human move. This is a very important aspect of the natural interaction between human and robot; in fact, people are stimulated by confronting a skilled opponent. Finally, the robot controller translates the high-level commands into motor movements in order to make a LEGO Mindstorms NXT robotic platform perform the proper gesture (Fig. 6).
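The round logic can be summarized as below. This is a plain-Python sketch in which the history-based predictor is reduced to a simple frequency count over the last three rounds, standing in for the GMM forecaster of [23]; function names are illustrative.

import random
from collections import Counter

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def predict_next_human_move(history):
    """Stand-in for the GMM predictor: guess the move the human played most
    often in the last three rounds (random choice if there is no history)."""
    last = history[-3:]
    return Counter(last).most_common(1)[0][0] if last else random.choice(list(BEATS))

def robot_move(history):
    """Play the move that beats the predicted human move."""
    predicted = predict_next_human_move(history)
    return next(m for m, beaten in BEATS.items() if beaten == predicted)

def round_winner(human, robot):
    """Apply the usual rock-paper-scissors rules to one round."""
    if human == robot:
        return "draw"
    return "human" if BEATS[human] == robot else "robot"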

Notice that the approach of Section II, which has been developed for more complex settings with a larger number of gestures, performs very well in this sample application. In order to evaluate its effectiveness, we asked 14 different people to perform the 3 gestures 10 times each, for a total of 14 × 10 × 3 = 420 different executions of the gestures. Then we performed a set of tests, each time training the system on 13 people and leaving out one person for the testing. In this way (similarly to a cross-validation approach) the performance is evaluated for each person without any training on gestures performed by the same tested user, as would happen in a real setting where a new user starts interacting with the robot. The proposed approach has achieved an average accuracy of 99.28% with this testing protocol.
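This testing protocol corresponds to a leave-one-subject-out scheme, which can be written compactly with scikit-learn's LeaveOneGroupOut as in the sketch below (an equivalent formulation, not the original evaluation scripts; hyperparameter tuning is omitted for brevity).

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

def leave_one_person_out_accuracy(X, y, person_ids):
    """X: (M, D) features, y: (M,) gesture labels, person_ids: (M,) subject ids.
    Train on 13 people, test on the left-out one, and average the accuracies."""
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=person_ids):
        clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(accuracies)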

VI. CONCLUSIONS

In this paper a framework for gesture-based human-robot interaction has been proposed. Two different approaches for the exploitation of depth data from low-cost cameras have been proposed for the recognition of both hand and full-body gestures.

Fig. 6. LEGO Mindstorms NXT performing the three gestures: a) rock, b) paper and c) scissors.

Depth-based gesture recognition schemes allow a more natural interaction with the robot, and a sample application based on the simple rock-paper-scissors game has been presented. Further research will be devoted to the application of the proposed framework in more complex scenarios.

ACKNOWLEDGEMENTS

Thanks to the Seed project Robotic 3D video (R3D) of the Department of Information Engineering and to the University project 3D Touch-less Interaction of the University of Padova for supporting this work.

REFERENCES

[1] Cory D. Kidd, "Human-robot interaction: Recent experiments and future work," in Invited paper and talk at the University of Pennsylvania Department of Communications Digital Media Conference, 2003, pp. 165–172.

[2] Cory D. Kidd and Cynthia Breazeal, "Effect of a robot on user perceptions," in Intelligent Robots and Systems, 2004 (IROS 2004), Proceedings of the 2004 IEEE/RSJ International Conference on. IEEE, 2004, vol. 4, pp. 3559–3564.

[3] Cynthia Breazeal and Brian Scassellati, "How to build robots that make friends and influence people," in Intelligent Robots and Systems, 1999 (IROS '99), Proceedings of the 1999 IEEE/RSJ International Conference on. IEEE, 1999, vol. 2, pp. 858–863.

[4] Rachel Gockley, Allison Bruce, Jodi Forlizzi, Marek Michalowski, Anne Mundell, Stephanie Rosenthal, Brennan Sellner, Reid Simmons, Kevin Snipes, Alan C. Schultz, et al., "Designing robots for long-term social interaction," in Intelligent Robots and Systems, 2005 (IROS 2005), 2005 IEEE/RSJ International Conference on. IEEE, 2005, pp. 1338–1343.

[5] Juan Pablo Wachs, Mathias Kolsch, Helman Stern, and Yael Edan, "Vision-based hand-gesture applications," Communications of the ACM, vol. 54, no. 2, pp. 60–71, 2011.

[6] Juan Pablo Wachs, Mathias Kolsch, Helman Stern, and Yael Edan, "Vision-based hand-gesture applications," Commun. ACM, vol. 54, no. 2, pp. 60–71, Feb. 2011.

[7] Derek McColl and Goldie Nejat, "Affect detection from body language during social HRI," in RO-MAN, 2012 IEEE. IEEE, 2012, pp. 1013–1018.

[8] D. I. Kosmopoulos, A. Doulamis, and N. Doulamis, "Gesture-based video summarization," in Image Processing, 2005 (ICIP 2005), IEEE International Conference on, 2005, vol. 3, pp. III-1220–3.

[9] C. Dal Mutto, P. Zanuttigh, and G. M. Cortelazzo, Time-of-Flight Cameras and Microsoft Kinect, SpringerBriefs in Electrical and Computer Engineering. Springer, 2012.

[10] Alexey Kurakin, Zhengyou Zhang, and Zicheng Liu, "A real-time system for dynamic hand gesture recognition with a depth sensor," in Proc. of EUSIPCO, 2012.

[11] P. Suryanarayan, A. Subramanian, and D. Mandalapu, "Dynamic hand pose recognition using depth data," in Proc. of ICPR, Aug. 2010, pp. 3105–3108.

[12] Jiang Wang, Zicheng Liu, Jan Chorowski, Zhuoyuan Chen, and Ying Wu, "Robust 3D action recognition with random occupancy patterns," in Proc. of ECCV, 2012.


[13] Zhou Ren, Junsong Yuan, and Zhengyou Zhang, "Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera," in Proc. of ACM Conference on Multimedia. ACM, 2011, pp. 1093–1096.

[14] Zhou Ren, Jingjing Meng, and Junsong Yuan, "Depth camera based hand gesture recognition and its applications in human-computer interaction," in Proc. of ICICS, 2011, pp. 1–5.

[15] Fabio Dominio, Mauro Donadeo, Giulio Marin, Pietro Zanuttigh, and Guido Maria Cortelazzo, "Hand gesture recognition with depth data," in Proceedings of the 4th ACM/IEEE International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Stream. ACM, 2013, pp. 9–16.

[16] Fabio Dominio, Mauro Donadeo, and Pietro Zanuttigh, "Combining multiple depth-based descriptors for hand gesture recognition," Pattern Recognition Letters, 2013.

[17] S. Manay, D. Cremers, Byung-Woo Hong, A. J. Yezzi, and S. Soatto, "Integral invariants for shape matching," IEEE Trans. on PAMI, vol. 28, no. 10, pp. 1602–1618, 2006.

[18] Neeraj Kumar, Peter N. Belhumeur, Arijit Biswas, David W. Jacobs, W. John Kress, Ida Lopez, and João V. B. Soares, "Leafsnap: A computer vision system for automatic plant species identification," in Proc. of ECCV, October 2012.

[19] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: A library for support vector machines," ACM Trans. on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011.

[20] Matteo Munaro, Gioia Ballin, Stefano Michieletto, and Emanuele Menegatti, "3D flow estimation for human action recognition from colored point clouds," Biologically Inspired Cognitive Architectures, 2013.

[21] Matteo Munaro, Stefano Michieletto, and Emanuele Menegatti, "An evaluation of 3D motion flow and 3D pose estimation for human action recognition," in RSS Workshops: RGB-D: Advanced Reasoning with Depth Cameras, 2013.

[22] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y. Ng, "ROS: an open-source robot operating system," in ICRA Workshop on Open Source Software, 2009, vol. 3.

[23] Gabriele Pozzato, Stefano Michieletto, and Emanuele Menegatti, "Towards smart robots: rock-paper-scissors gaming versus human players," in Proceedings of the Workshop Popularize Artificial Intelligence (PAI2013), 2013, pp. 89–95.