
A Real-time Hand Gesture Recognition System

Arun Ganesan, University of Michigan, Ann Arbor

Caoxie (Michael) Zhang, University of Michigan, Ann Arbor

Abstract

Hand gesture recognition can be used in many applications such as interactive data analysis or American Sign Language detection. Current systems are either expensive, unable to run in real time, or require the user to wear devices such as custom gloves. We propose an inexpensive solution for predicting hand gestures in real time using Microsoft's Kinect camera. Our system trains a random forest classifier on data labeled with a color glove and then predicts, at the pixel level, gestures from a bare hand. The system predicts all pixels at about 10 fps and is resilient to environment differences at prediction time. We also conduct extensive experiments studying the random forest classifier and reveal some interesting properties.

1 Introduction

Natural user interfaces (NUI) are a new way for humans to interact with machines. Among the numerous NUIs, which include multi-touch, eye tracking, and motion detection, hand gesture recognition is one promising candidate. In this paper, we design and evaluate a novel hand gesture recognition system to demonstrate that we are close to an actual production-level system. The reader should note that we do not claim hand gesture recognition is THE future of interfaces. In fact, there are some limitations to using hand gestures, such as fatigue over long-term usage (in the movie Minority Report, Tom Cruise has to take breaks due to fatigue). We focus on the scientific and engineering challenges in building such a system and leave usability studies to future research.

1.1 Design Goals

Our system is designed to maximize user experience. Moreover, our system differs from existing systems in the following ways.

Just hands. Many existing systems, such as [11, 2], require users to wear gloves or markers to be captured by the camera. We feel this limits the usability of the application and aim to keep the prediction side of the system entirely marker free.

Real-time. Our system should run smoothly on average modern computers with a dedicated graphics card. The system should also recognize hand gestures at a high frame rate, nearing real-time speeds. Our desired frame rate is 30 Hz; we achieved a frame rate of around 10 Hz. In our design and implementation, one driving goal is to squeeze out every millisecond possible.

No calibration. Once trained, our system should require no further calibration before working in a new environment.

Robust and accurate. Our system should produce an accurate estimate of where the user's hands are and what gestures they use, with a low false positive rate. Moreover, the system should be insensitive to background variation, the user's location, camera position, and other noise.

Arbitrary gestures. Our system should be able to easily incorporate new types of gestures. With sufficient training, our system should support many complex gestures such as those seen in American Sign Language.

1.2 Main Ideas

Our system would not be possible without the Microsoft Kinect for PC, which we were probably among the first to obtain when it was released in February 2012. The Kinect is a multi-purpose sensor that includes an RGB camera, a depth camera, and an audio sensor. The Kinect SDK offers skeletal recognition but lacks the granularity for individual fingers. The SDK also provides raw pixels for the RGB image and the depth image at a maximum frame rate of 30 Hz. We use the depth image for gesture recognition and both the RGB and depth images for generating training samples. The depth image is the key factor that distinguishes our system from most existing systems, which use only the RGB camera. The advantage of the depth image is that it offers an additional dimension, i.e., the depth of each pixel, that is not present in the RGB image. An illustrating example is when an object and its background have a similar color but the depths of the two are drastically different.

Our system adopts a data-driven approach: machine learning as opposed to hand-crafted rule-based systems. The adoption of machine learning shifts the human intellectual effort from designing rules and algorithms to designing informative features. By extracting these features from labeled data, machine learning algorithms allow computers to learn the rules automatically. The key advantages of using machine learning in our system are (1) it is easy to incorporate developer-defined gestures: developers just need to feed the system with the gesture images to be trained on rather than deriving new rules to match those gestures, and (2) it is robust to various environmental changes such as camera position, background, and varying hand sizes: developers just need to generate the gestures in various environments.

At a high level, the system is separated into two parts: training and real-time prediction. Only the real-time prediction component is seen by end users. In the training component, we use color gloves to generate a large amount of labeled depth-image data. Each pixel in the depth image is labeled as belonging to a gesture or to the background. A random forest classifier is trained to achieve both real-time performance and high accuracy. In the real-time prediction component, the GPU is used to predict the class of each pixel in the depth image, and the prediction output is pooled to propose the final position and type of a gesture. Notice that we do not use any temporal or kinematic information, as the current simple design suffices for hand gesture tracking.

1.3 Contributions

We summarize our contributions as follows:

A system for real-time hand gesture recognition. We design and implement a complete real-time hand gesture recognition system based on machine learning. The system broadcasts the location and gesture of the hand through a WebSocket server and can be used by other applications.

An inexpensive way to generate accurate labeled data. We use color gloves to easily generate labeled data using the aligned RGB and depth images. In [1], the authors use sophisticated computer graphics to generate training samples. We found this expensive; through the use of color gloves, developers can generate their customized gestures without difficulty. Another advantage of our approach is that the system uses actual raw depth images as training samples. These naturally capture realistic noise such as shadows and hardware noise. With computer-generated graphics, as done in [1], it is very difficult to simulate these noisy effects. Note that end users do not need to wear color gloves; the gloves are only used in training.

A computational insight about random forest and support vector machines (SVM). To the best of our knowledge, there seems to be no literature comparing SVM and random forest from a computational perspective. We provide an in-depth complexity analysis of the two methods rather than merely reporting experimental accuracy, as done in most machine learning literature.

Extensive experimental evaluation of the system. We conduct an extensive experimental evaluation of the effectiveness of the random forest classifier by systematically exploring a large space of parameters. Interesting results lead to a deeper understanding of random forest.

Our work is made public at: https://github.com/arunganesan/hand-gesture-recognition.

1.4 Outline

The rest of this paper is organized as follows. In Section 2 we present an overview of the system architecture. Three large components of the system are described in their own sections. Section 3.1 discusses our choice of random forest over SVM for classification and the features used in our algorithm. Section 4 explains the collection of training samples, and Section 5 discusses the pooling of the predicted pixels. Section 6 discusses some of the details of our implementation. Section 7 presents our experimental findings. Section 8 reflects on our experiences in building the system and working on this project. Section 9 presents related work, and Section 10 concludes.

2 System Overview

The system architecture is visualized in Figure 1. From the user's point of view, our system is divided into two components: a developer mode for training gestures and an end-user mode for real-time prediction. At a high level, developers use the developer mode to supply the system with data labeled using color gloves, and the system trains on this labeled data.


Figure 1: An overview of the system architecture

In the end-user mode, the camera captures raw depth images of the user's bare hands. The system then predicts every pixel using the GPU and pools the predicted pixels to estimate the location of the gesture. The two modes share one component: feature extraction, which obtains the features for each pixel in the depth image. The implementation of feature extraction differs between the two modes: the training component uses the CPU in an offline fashion, while the end-user prediction component uses the GPU in an on-demand fashion.

2.1 The Kinect Sensor

The Kinect sensor provides both a raw depth image and a color image at a maximum frame rate of 30 Hz and at 640 × 480 px resolution. Each pixel in the depth image represents the distance from the object to the camera. The depth error can be within several millimeters. However, objects can cast a shadow in which some pixels near the object have no depth value at all. This is an example of realistic noise that would be missing if the training data were generated using computer graphics methods.

Figure 2: Offset pairs for a given pixel. In the left figure, the reference pixel is the center of the palm. In the right figure, the reference pixel is the center of the palm of the standing person.

3 Feature Extraction and Per-pixel Classification

In this section, we describe the main classification algorithm used in the system. At each frame, the system predicts each pixel as belonging to a class (e.g., open hand, closed hand, or background). The prediction results for every pixel are fed into a pooling algorithm that proposes the final gesture location and type. The driving reason for choosing per-pixel classification is that it allows massive parallelism on the GPU: the prediction algorithm is identical for every pixel, so we can leverage the Single Instruction Multiple Data architecture of the GPU.

3.1 Feature Extraction

For each pixel x, we extract a set of features. Each feature corresponds to the depth difference between two offset points, {u, v}, from x, normalized by the depth at x. This is shown in equation (1), where d(x) is the depth at point x and θ refers to the pair of offsets u, v.

f_θ(x) = d(x + u/d(x)) − d(x + v/d(x))        (1)

This feature extraction method is also used in [1]. The offsets θ = {u, v} are generated randomly. In our design, we randomly sample the offsets from a bounded circular area; u and v are obtained as

(r cos β, r sin β),        (2)

where r and β are uniform random variables:

r = U[R_min, R_max]        (3)
β = U[0, 2π]        (4)


As we can see from Figure 2, the offset pairs are located within the neighborhood of the palm, and the normalization by d(x) keeps their physical extent roughly constant across depths.
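
To make equations (1)-(4) concrete, the following C sketch computes one feature for a pixel and samples one offset pair. It is a minimal sketch under our own assumptions, not the released code: a row-major 640 × 480 depth array in millimeters, border-clamped reads, and hypothetical helper names (depth_at, feature, random_offset).

    #include <stdlib.h>
    #include <math.h>

    #define WIDTH  640
    #define HEIGHT 480
    #define TWO_PI 6.28318531f

    /* One theta = {u, v}; offsets are stored in mm*px (the mmpx unit),
       so dividing by the depth in mm yields a pixel displacement. */
    typedef struct { float ux, uy, vx, vy; } Offset;

    static float depth_at(const float *depth, int x, int y)
    {
        /* Clamp to the image border so offsets falling outside the
           frame still read a defined value. */
        if (x < 0) x = 0; else if (x >= WIDTH)  x = WIDTH - 1;
        if (y < 0) y = 0; else if (y >= HEIGHT) y = HEIGHT - 1;
        return depth[y * WIDTH + x];
    }

    /* Equation (1): f_theta(x) = d(x + u/d(x)) - d(x + v/d(x)). */
    float feature(const float *depth, int px, int py, const Offset *t)
    {
        float d = depth_at(depth, px, py);
        if (d <= 0.0f)            /* shadow pixel with no depth reading */
            return 0.0f;
        return depth_at(depth, px + (int)(t->ux / d), py + (int)(t->uy / d))
             - depth_at(depth, px + (int)(t->vx / d), py + (int)(t->vy / d));
    }

    /* Equations (2)-(4): u and v sampled from a bounded circular area. */
    Offset random_offset(float r_min, float r_max)
    {
        float r1 = r_min + (r_max - r_min) * rand() / (float)RAND_MAX;
        float r2 = r_min + (r_max - r_min) * rand() / (float)RAND_MAX;
        float b1 = TWO_PI * rand() / (float)RAND_MAX;
        float b2 = TWO_PI * rand() / (float)RAND_MAX;
        Offset t = { r1 * cosf(b1), r1 * sinf(b1),
                     r2 * cosf(b2), r2 * sinf(b2) };
        return t;
    }

The offsets are sampled once, before training, and the same offset table is reused at prediction time.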

3.2 The Classifier: SVM or Random Forest?

In the training samples, each pixel is labeled (we address how to generate massive labeled samples in Section 4), so we use supervised learning methods. Two methods came into consideration: linear SVM and random forest. We use the following notation in our explanation: the number of training samples l, the number of features n, and the number of classes (types of gestures) c. Our main criteria for choosing the best algorithm are (1) prediction time and (2) accuracy.

Linear SVM. A linear SVM model consists of one weight vector of length n per class, for c classes. The prediction is based on the dot products between the trained weight vectors and the feature vector. Therefore, the run-time complexity for predicting each pixel is O(n × c).

Random forest. A random forest is an ensemble of decision trees trained on bootstrap¹ samples of the training set. Each decision tree in the forest is a binary tree. Each internal node holds one feature and a threshold, which determine which branch to take by comparing the feature value to the threshold. The leaves of the tree hold labels. The final prediction is made by majority rule over all decision trees in the forest. More details about the training and statistical analysis can be found in [4]. The depth of a decision tree is approximately O(log l), so O(log l) feature queries are made in a single decision tree when predicting. The run-time complexity for making a prediction with a random forest is then O(log l × n_tree). In fact, we can prune each decision tree to limit its depth. This refinement, described in Section 6.4, reduces the complexity to O(d_tree × n_tree) for a fixed depth d_tree. We present the random forest prediction algorithm in Algorithm 1.

Comparison. Let us do a simple calculation with a realistic parameter setting. Suppose we have 2000 features, 3 classes, and 3 trees with a maximum depth of 20: the random forest performs about 20 × 3 = 60 operations per pixel versus 2000 × 3 = 6000 for the linear SVM, making it 100 times faster! Moreover, in our experimental evaluation, random forest proved far superior to linear SVM in accuracy. Although linear SVM might use an advanced feature learning technique such as deep learning to achieve accuracy similar to random forest, SVM would still be too slow for us to adopt in the system.

¹Bootstrap means uniform sampling with replacement.

Algorithm 1 Prediction Algorithm in Random Forest

Input: a depth image and a pixel x to be predicted
Initialize predict_list
for each tree i in the random forest do
    node ← root of tree i
    while node is not a leaf do
        feature_index ← node.feature
        (u, v) ← get_offset_pair(x, feature_index)
        diff ← get_depth_difference(u, v)
        if diff > node.threshold then
            node ← node's right child
        else
            node ← node's left child
        end if
    end while
    predict_list[i] ← node's label
end for
return predict_list's majority label

Inheriting from decision trees, random forest allows the system to extract features on demand, which is crucial for a real-time application such as ours: only the features along the path actually taken through each tree need to be computed. We were genuinely surprised by this analysis, since we used to believe linear SVM was unbeatable in practice.

Training. Training a linear SVM is very fast: it has a run-time complexity of O(n × l), and there exist techniques to scale the training to distributed systems [12]. Training a random forest, however, does not have an optimization-based foundation. We use brute force to determine the right feature and the right threshold for each node in the forest, using grid search to determine the threshold. When training a random forest, it is highly recommended to keep the training data in main memory.

3.2.1 GPU For Real-time Prediction

There are 307,200 (640 × 480) pixels in a frame, and each pixel undergoes O(d_tree × n_tree) operations for prediction. We first tried to implement the random forest prediction on the CPU; it took about one minute to process a frame! The GPU presented an attractive alternative, as it supports massive parallelism. As can be seen in Figure 3, the GPU is more suitable for highly parallel, latency-tolerant jobs, while the CPU is better suited to low-parallelism, low-latency jobs.

In our system, each pixel fires a thread that makes a prediction using the random forest. In a typical GPU, thousands of threads run concurrently thanks to its special architecture. As a result, per-pixel classification can be done in about 40 ms.
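
The following OpenCL C kernel sketch illustrates this one-thread-per-pixel design. The flattened node layout and all buffer names are our illustrative assumptions, not the system's actual kernel; MAX_CLASSES is a build-time constant.

    #define MAX_CLASSES 8          /* illustrative upper bound on labels */

    typedef struct {
        int   feature;             /* offset-pair index; -1 marks a leaf   */
        float threshold;           /* split threshold on depth difference  */
        int   left, right;         /* child indices in the flat node array */
        int   label;               /* class label, meaningful at leaves    */
    } Node;

    float read_depth(__global const float *depth, int x, int y, int w, int h)
    {
        x = clamp(x, 0, w - 1);    /* border-clamp out-of-frame offsets */
        y = clamp(y, 0, h - 1);
        return depth[y * w + x];
    }

    __kernel void predict(__global const float  *depth,   /* w*h depths, mm */
                          __global const Node   *nodes,   /* all trees, flat */
                          __global const int    *roots,   /* root per tree   */
                          __global const float4 *offsets, /* ux,uy,vx,vy     */
                          const int num_trees,
                          __global uchar *out)            /* label per pixel */
    {
        const int x = get_global_id(0), y = get_global_id(1);
        const int w = get_global_size(0), h = get_global_size(1);

        int votes[MAX_CLASSES];
        for (int c = 0; c < MAX_CLASSES; c++) votes[c] = 0;

        /* Depth at this pixel is constant across trees; guard against
           shadow pixels with no reading. */
        float d = fmax(read_depth(depth, x, y, w, h), 1.0f);

        for (int t = 0; t < num_trees; t++) {
            int n = roots[t];
            while (nodes[n].feature >= 0) {          /* descend to a leaf */
                float4 o = offsets[nodes[n].feature];
                float f = read_depth(depth, x + (int)(o.x / d),
                                            y + (int)(o.y / d), w, h)
                        - read_depth(depth, x + (int)(o.z / d),
                                            y + (int)(o.w / d), w, h);
                n = (f > nodes[n].threshold) ? nodes[n].right : nodes[n].left;
            }
            votes[nodes[n].label]++;
        }

        int best = 0;              /* majority vote over trees (Algorithm 1) */
        for (int c = 1; c < MAX_CLASSES; c++)
            if (votes[c] > votes[best]) best = c;
        out[y * w + x] = (uchar)best;
    }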


Figure 3: A computational comparison between GPU and CPU. A simple experiment is conducted by varying the number of iterations in a for-loop; the GPU parallelizes the loop.

Figure 4: Two differently colored gloves we used in rapidly generating labeled training samples. For the multicolored glove, we had to ensure the colors were different enough to be easily recognized by the RGB camera.

4 Generating Training Samples

In this section we present our method for rapidly gener-ating labeled training samples.

To rapidly generate training samples, we use a color glove that has a different color for each area of interest. We present the glove to the camera, extract regions of a specific color using the RGB camera, and label the corresponding depth pixels appropriately. Examples of the gloves used in our experiments are shown in Figure 4.

Figure 5: An example of four gestures. Each gesture defines a label for every pixel that belongs to the glove.

We devised two different techniques for labeling each pixel. The first technique, shown in the left image of Figure 4, differentiates all pixels of the hand from the background. When training the system, we train one gesture at a time, classifying a pixel on the hand as belonging to that gesture and any pixel not on the hand as background. We trained four gestures using this approach, yielding a total of five labels; examples of the gestures can be seen in Figure 5. The second technique is shown in the right image of Figure 4. Here each finger is labeled separately, so each pixel can take one of seven labels: six for the hand and one for the background.

We also use other techniques to simplify labeling, such as cropping and ignoring far-away pixels via the depth information, as sketched below.
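
We assume here that the SDK's color-to-depth mapping has already produced, for each depth pixel, its matching RGB value; the glove-color test and all thresholds in this C sketch are illustrative placeholders, since in practice lighting variation made color detection the fragile step.

    #include <stdint.h>

    enum { LABEL_BACKGROUND = 0, LABEL_GESTURE = 1 };

    typedef struct { uint8_t r, g, b; } Rgb;

    /* Crude glove-color test; the single-color glove of Figure 4 is
       assumed here to be dominantly one channel (green). Real lighting
       variation required looser, hand-tuned bounds. */
    static int is_glove_color(Rgb c)
    {
        return c.g > 120 && c.g > c.r + 40 && c.g > c.b + 40;
    }

    /* rgb_at_depth: the RGB value the SDK mapped onto each depth pixel.
       depth: millimeters, 0 where the sensor has no reading. */
    void label_image(const Rgb *rgb_at_depth, const float *depth,
                     uint8_t *labels, int n, float max_depth_mm)
    {
        for (int i = 0; i < n; i++) {
            /* Ignore far-away and shadow pixels via the depth channel. */
            if (depth[i] <= 0.0f || depth[i] > max_depth_mm)
                labels[i] = LABEL_BACKGROUND;
            else
                labels[i] = is_glove_color(rgb_at_depth[i]) ? LABEL_GESTURE
                                                            : LABEL_BACKGROUND;
        }
    }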

5 Pooling

After training on the labeled data sets and making per-pixel predictions, our system takes each pixel's predicted class as input and proposes the final location and type of the gesture. We use clustering, i.e., unsupervised learning, to achieve this. The methodology is as follows: we treat all non-background pixels (after per-pixel classification) equally, without their labels, and run a clustering algorithm on them. We then use the largest cluster as the proposal for the hand and the majority of the predicted labels within it as the type of the gesture. The location of the gesture is the 2D median of the largest cluster, which adds resistance to outliers. We considered several candidate clustering algorithms. First we considered mean shift, an approach similar to that used in [1]. However, the complexity of mean shift is very high: O(N²), where N is the number of pixels (in our case N = 307,200). We then turned our attention to K-means and DBSCAN.

K-means. K-means is probably the most commonly used clustering algorithm. A parameter K, the number of clusters, has to be specified beforehand. K-means is fast, with a complexity of O(2 · N_non-background), where N_non-background is the number of non-background pixels. However, we found significant weaknesses in applying K-means to our system: (1) K-means assumes each cluster has roughly the same number of points; (2) K-means is not robust to outliers; and (3) K-means requires a fixed number of clusters. As a result, we found that K-means often divides a hand into multiple clusters, so we abandoned it.

Density-based clustering. Density-based clustering is more suitable for our application, as it clusters points whose density exceeds a pre-determined threshold. In addition, density-based clustering does not assume a number of clusters ahead of time. We customize the density-based clustering as in Algorithm 2. The complexity is O(N_non-background · ε²), where ε is the neighborhood radius used in Algorithm 2.
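
Once the largest cluster is found, the final proposal is simple; the C sketch below shows the majority-label and 2D-median computation described above, with illustrative names and a qsort-based median of our own choosing.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { int x, y; uint8_t label; } Pixel;

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    static int median_of(int *vals, int n)
    {
        qsort(vals, n, sizeof(int), cmp_int);
        return vals[n / 2];
    }

    /* cluster: the pixels of the largest cluster found by Algorithm 2.
       Outputs the proposed gesture position (per-axis 2D median, robust
       to outliers) and type (majority predicted label). */
    void propose(const Pixel *cluster, int n, int num_labels,
                 int *out_x, int *out_y, int *out_label)
    {
        int *xs = malloc(n * sizeof(int));
        int *ys = malloc(n * sizeof(int));
        int *votes = calloc(num_labels, sizeof(int));

        for (int i = 0; i < n; i++) {
            xs[i] = cluster[i].x;
            ys[i] = cluster[i].y;
            votes[cluster[i].label]++;
        }
        *out_x = median_of(xs, n);
        *out_y = median_of(ys, n);

        *out_label = 0;
        for (int l = 1; l < num_labels; l++)
            if (votes[l] > votes[*out_label]) *out_label = l;

        free(xs); free(ys); free(votes);
    }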

6 Implementation

There are several implementation details worth mentioning here.

6.1 Kinect

We purchased two Kinect sensors: one for Xbox, the other for Windows. The Kinect SDK is a Windows-based API that gives us access to the raw depth and RGB images from both versions of the Kinect. The Windows version offers a near mode that permits objects to be within 30 cm of the sensor. When labeling with the color gloves to generate training samples, we use the built-in function provided by the Kinect SDK to map each color pixel to a depth pixel. Color detection turned out not to work very well, since lighting and camera position introduce many variations in color; however, with cropping and depth-thresholding, we managed to label hundreds of images. We did not use the glove with individually colored fingers at this time, as we found that (1) it is difficult to produce a perfectly built color glove and (2) the low resolution of the Kinect sensor might not capture the details of fingers very well.

6.2 GPU Implementation

From a programmer's perspective, the GPU is seen as an I/O device: one has to move data from main memory to GPU memory and move the processed data back to main memory. There are several libraries for programming the GPU: Microsoft DirectX, Microsoft DirectCompute, OpenCL, and CUDA. We decided to use OpenCL, as it is a standard library for general-purpose computing on the GPU and can run on almost all GPUs (as opposed to CUDA, which runs only on NVIDIA cards). OpenCL uses C as the programming language and has an event-queue framework to process incoming data.

When optimizing the GPU algorithm, we noticed that the cache hit rate of the model stored in memory was low, so we explored reordering the data to increase it. Since we store each tree linearly in memory, we tested (1) pre-order (depth-first) traversal and (2) breadth-first traversal for the data structures. It turns out there is no significant difference between the two, so we use the pre-order layout, as sketched below.
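
The flattening can be sketched in C as follows; the node types and names are illustrative, not our actual serialization format. A convenient property of pre-order is that a node's left child immediately follows it in the array.

    #include <stdlib.h>

    typedef struct TreeNode {            /* in-memory tree from training */
        int feature; float threshold; int label;  /* feature < 0 => leaf */
        struct TreeNode *left, *right;
    } TreeNode;

    typedef struct {                     /* flat node sent to the GPU */
        int feature; float threshold; int label;
        int left, right;                 /* array indices, -1 if none */
    } FlatNode;

    /* Appends the subtree rooted at t to out[] in pre-order and returns
       its index; *count tracks the next free slot. In pre-order, the
       left child always sits at the next index, which gives the layout
       its locality on the common downward path. */
    int flatten_preorder(const TreeNode *t, FlatNode *out, int *count)
    {
        int idx = (*count)++;
        out[idx].feature   = t->feature;
        out[idx].threshold = t->threshold;
        out[idx].label     = t->label;
        out[idx].left  = t->left  ? flatten_preorder(t->left,  out, count) : -1;
        out[idx].right = t->right ? flatten_preorder(t->right, out, count) : -1;
        return idx;
    }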

6.3 Training Random Forest

The generated training data is big, usually more than 10 GB. We customized a random forest library [9], for example changing the double type to the float type to halve the memory footprint. We realized near the end of the project that optimizing the training algorithm, in addition to the prediction algorithm, could have had tremendous gains: training often took longer than 24 hours, with the longest model requiring over 48 hours. Parallelizing random forest training across multi-core and distributed systems is an interesting research topic.

We use Amazon Elastic Compute Cloud (EC2) to offload our massive training workloads to cloud servers.

6.4 Refinements

We found several refinements to be very important for our system.

Adding a virtual wall. Our system will be placed in a variety of environments, so the background will vary drastically. We decided to add a virtual wall at 1.5 m: every pixel farther than 1.5 m is treated as if it were at 1.5 m. We found this technique deals with much of the background variability and makes prediction more accurate in practice.
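
As a preprocessing pass over the depth frame, the virtual wall is a one-line clamp; treating missing readings as the wall distance is an assumption of this sketch, not necessarily the system's behavior.

    #define WALL_MM 1500.0f   /* the 1.5 m virtual wall */

    /* Clamp every depth reading beyond the wall (and, in this sketch,
       every missing reading) to the wall itself. */
    void apply_virtual_wall(float *depth, int n)
    {
        for (int i = 0; i < n; i++)
            if (depth[i] <= 0.0f || depth[i] > WALL_MM)
                depth[i] = WALL_MM;
    }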

6.5 Actual System

Our main application is written in C# on Windows, the training algorithm in C++, and the GPU prediction in C. The actual system shows three images: the color image, the depth image with per-pixel classification, and the pooled image with the largest cluster's position and type. The system also reports the real-time performance of each stage. An illustrating example is shown in Figure 6.

Figure 6: A snapshot of the actual system. Note that the prediction time shown is longer than on the machine used for the demo.

We built a demo application that maps two gestures onto mouse state. The location of the hand in the XY plane determines the XY location of the mouse; moving the hand toward and away from the camera moves the mouse wheel up and down; and closing and opening the hand presses down and releases the left mouse button, respectively. This simple mapping enabled us to navigate many applications. We successfully demonstrated it on the Google Earth application [10] by panning and zooming the map.

7 Experimental Results

In per-pixel classification, many parameters are involved in capturing the training images, extracting features, and training the random forest classifier. To study the contribution of these parameters, we systematically varied them and measured the accuracy of the resulting classifier. We set aside a collection of images as the test set and trained on a different set of images. We did not use cross-validation due to the time constraints in training a random forest; it would have taken considerably longer to compute cross-validation accuracy and would have prevented us from running as many tests as we did. We collected images of the four gestures shown in Figure 5, and we refer to each image we captured as a "training sample". Since there are over 300,000 pixels in each image, we sampled only 2000 of them so that we could train on more diverse images: 1000 background points and 1000 points from the gesture, to ensure we trained evenly on the different classes.

7.1 Varying Parameters

We varied three parameters in training the models: the number of trees, the number of features, and the number of training samples. We set aside 50 images as the test set, on which we evaluate the trained model to obtain test accuracy.

Number of trees. Figure 7a shows the results of varying the number of trees while fixing the remaining parameters. We observed that the number of trees makes no noticeable difference; in fact, even a forest with one tree has the same accuracy as one with multiple trees. This held for different fixed values of the other parameters. Adding trees to a random forest is intended to protect against overfitting; however, our test and training sets were sampled from the same pool of images, so we hypothesize that one tree is sufficient to model all the variability in the test set. In the actual system, we used 3 trees as a tradeoff between generality of the random forest and efficient performance.

Number of training samples. Figure 7b shows the effect of varying the number of training samples. There is a clear trend of smaller test error as we increase the number of training samples. At 689 training samples, the error is only 1.32%. Increasing the number of training samples had the biggest impact on reducing the error compared to the other factors.

Number of features. Figure 7c shows the effect of varying the number of features. Across most of our experiments we observed that using 2000 features outperforms 3000 features, which outperforms 1000 features. However, as seen in Figure 7c, the difference in error is small. This suggests that if we sampled 1000, 2000, and 3000 features again, we might observe a different trend.

Varying R_max. Finally, Figure 7d shows the results of varying R_max in Eq. (3), the radius of the offset features. For smaller training sample sizes, setting the radius to 40,000 mmpx (short for mm × pixel, the unit of R_max) gives the lowest-error classifier. However, as we increase the number of training samples, both 40,000 mmpx and 80,000 mmpx give similarly accurate classifiers. This means that when we have fewer training samples, setting the radius to 40,000 mmpx allows us to train a more accurate classifier, but when we have enough training samples, the radius is not so important. We notice that making the radius too small or too large gives poor accuracy. By qualitatively checking the offset images shown in Figure 2, we hypothesize that to train an accurate classifier, the offsets should cover most of the hand and some of the background. If the radius is too small, the offsets will be right next to each other, and the feature vectors for different gestures may be too similar.

7.2 Randomness

Due to the nature of the random forest (because of bootstrapping), the trained model is non-deterministic and may yield a different accuracy for the same training set of images. To study this randomness, we trained ten models on the same training set with 1000 features, 300 training images, and one tree. We obtained a mean error of 9.82% with a 0.94% standard deviation. We therefore concluded that the randomness introduced by the random forest is negligible.

7.3 SVM Comparison

We compared the random forest classifier with a linear SVM classifier [5]. We trained SVM classifiers with 2000 features and varying numbers of training images. Figure 7e compares the error rate of the linear SVM with that of the random forest classifier. The linear SVM only slightly outperforms a trivial classifier that always predicts a pixel as background, and as we increase the training sample size, the linear SVM improves much more slowly than the random forest.

7.4 Pruning

We studied the effect of pruning on prediction accuracy. To do so, we pruned the forests trained with 2000 features and three trees for different numbers of training images, and calculated the test set accuracy of the pruned models. It turned out that pruning showed almost no difference in accuracy compared to the full tree. We hypothesize that this is because, high up in the tree, the splitting nodes already confidently divide the training points into their classes; the random forest algorithm nevertheless continues to exhaust the features until the entire tree is constructed. By stopping prediction early in the tree, we lose no accuracy and improve prediction speed.

7.5 Overfitting

Finally, we studied how much the model overfits the training samples by comparing training error with test error. We fixed 2000 features and three trees and varied the number of training samples. Figure 7f shows evidence of extreme overfitting: the training error is consistently very low (mean 1.06%) and much lower than the test error.

8 Experience

We learned several lessons in building a practical machine learning system.

Late optimization. Although our system is very time-sensitive, we found it unnecessary to optimize one component while other components were not ready. It is best to develop quickly, then profile to discover the bottleneck and optimize it. This saved us a lot of time on unnecessary optimization.

Using pipelines rather than events. In testing our system, we incurred many concurrency bugs due to the event-driven programming framework. We solved this not with locks but by adding each processing unit to a pipeline, leaving us free of concurrency bugs.

Test accuracy is not enough. We found that test accuracy alone is not a good evaluation of the actual system. Sometimes the system performs very well on the test data set but poorly in practice. The reason is that the deployment environment changes all the time: camera position, backgrounds, people's clothes, etc. Therefore, we chose some important parameters based on our experience in the actual environment instead of mere test accuracy. For example, although the experiments indicated the number of trees made no difference, we found this not to be the case in actual tests with varying backgrounds, so we used a model trained with multiple trees rather than one.

Bundle the model with feature extraction. In our system, we separated feature extraction from the model. This turned out to be error-prone: sometimes the trained model was paired with the wrong feature extraction (e.g., incorrect offset pairs), and things got even worse when the real-time prediction system mistakenly used the wrong feature extraction method. Our immediate solution was to be extremely careful about this. However, if we were to build the system again, we would bundle the model with its feature extraction to avoid human error, as sketched below.
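
In data-layout terms, the fix amounts to serializing the feature-extraction parameters together with the trained forest; the C struct below is an illustrative sketch, not our actual file format.

    typedef struct {
        /* feature extraction parameters, exactly as used in training */
        int    num_features;
        float  r_min, r_max;     /* offset radius range, in mmpx */
        float *offsets;          /* 4 floats per feature: ux, uy, vx, vy */
        /* the classifier trained with those features */
        int    num_trees;
        int   *tree_roots;       /* root node index of each tree */
        void  *nodes;            /* flattened node arrays */
    } GestureModel;              /* serialized and loaded as a single unit */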

8.1 Limitations

Our current system is not without limitations. First, the real-time prediction component cannot achieve a frame rate of 30 Hz, only 10 Hz on average. Second, the random forest model takes a long time to train, especially with many training samples; training often required 24 hours or more. Third, the system must be retrained with more training samples every time the user wants to add a new gesture.

9 Related Work

There are two main techniques in hand gesture recognition: appearance based and model based. These are akin to discriminative and generative models of classification, respectively. Appearance-based approaches read the pixels from the camera and build classifiers to label them as belonging to a finite set of classes. The main limitation of this technique is that the set of labels is finite and fixed ahead of time; its advantage is that the implementation is often extremely fast and therefore suited to situations where live classification is important. Examples of appearance-based techniques can be found in [1, 2]. Model-based approaches start with a set of hypotheses about the final classification, based on rules about the object being classified; in the case of hand gestures, for instance, a hypothesis can be a particular orientation of the joints. An advantage here is that the hypotheses can be generated from an infinite space of possible classifications. The main disadvantage of model-based techniques is that they are often computationally expensive; they also tend to be very complex. An example of a model-based technique can be found in [3].

(a) Varying the number of trees. Number of features: 2000.
(b) Varying the number of training samples. Number of features: 2000; number of trees: 3.
(c) Varying the number of features. Number of trees: 3.
(d) Varying the radius of offsets. Number of features: 2000; number of trees: 3.
(e) Comparing random forest to linear SVM. Number of features: 2000; number of trees: 3.
(f) Comparing training error and testing error. Number of features: 2000; number of trees: 3.

Figure 7: Experimental results. Testing error with respect to the number of training samples.

As advanced sensor technology has become more accessible, many developers and researchers are studying natural interfaces. [7] surveys a variety of input devices for interfacing with 3D models, including mouse designs with six degrees of freedom, haptic devices for simulating realistic forces, and computer vision techniques for head and hand tracking. Most relevant to our project, [8] presents a method for identifying 25 gestures in 3D using the gyroscope and accelerometer in the Nintendo Wii remote; this falls into a more general category called "spatially convenient input devices".

The work most similar to ours is from Microsoft [1], which details the Kinect's appearance-based skeleton recognition algorithm. They classify each pixel as belonging to some portion of the body and then pool all pixels from each portion to determine the joint positions. Their training set is generated from synthetic 3D images of body orientations. The feature extraction technique and the random forest classification algorithm used in their work are borrowed from [4]. Our project differs in the method for generating training samples and in the scale of classification.

[2] presents another appearance-based method for detecting hand gestures using just an RGB camera. Their solution requires the user to wear a glove with a special pattern imprinted on it. The camera pictures the hand in different orientations, normalizes the image, and identifies the nearest neighbor in a database of common hand gestures. Unlike this approach, our technique uses a color glove only for training.

[3] presents a model-based approach using the Kinect. They first generate a set of hypotheses using knowledge of inverse kinematics and the approximate shapes of fingers and hands, then evaluate these hypotheses and pick the most likely one based on the depth image from the Kinect. This technique can detect hand gestures even in the presence of occlusions and is implemented efficiently, reaching up to 15 updates per second. However, being model based, it may be vulnerable to differences in hand shapes and still lags behind appearance-based techniques in speed.

10 Conclusion

We presented a real-time hand gesture recognition system built on top of Microsoft's Kinect SDK using its depth range camera. We achieve real-time prediction at about 10 fps through the efficient use of random forest on the GPU. We extensively studied the random forest classifier by varying its multitude of hyperparameters. We discovered that (1) increasing the number of training samples had the biggest impact on test set accuracy, (2) changing the number of trees had no noticeable effect, and (3) the random forest classifier significantly outperforms a linear SVM. We built a demo application that maps two gestures to mouse position, click status, and wheel movement; using this mapping in the Google Earth application, we were able to navigate with pan and zoom using just our gestures.

We decided to make our implementation open source in hopes of attracting other developers to test and extend our application. We found openness to be lacking in many other research projects in this area: they report results but do not make their code and test data available.

Our experiments were done with four gestures, and the demo with two. We would like to train our system with more gestures and more diverse human hands to discover its limitations. For instance, such a system would be useful in detecting American Sign Language alphabet gestures.

Algorithm 2 Density-based clustering

Input: a depth image with labeled pixels
for each non-background and unvisited pixel p do
    mark p as visited
    neighbor_list ← get_neighbors(p, ε)
    if len(neighbor_list) > ε² × density then
        add p to a new cluster
        create a queue Seed
        enqueue neighbor_list onto Seed
        while Seed is not empty do
            t ← Seed.Dequeue()
            neighbor_list ← get_neighbors(t, ε)
            if len(neighbor_list) > ε² × density then
                explore_neighbors(neighbor_list)
            end if
        end while
    end if
end for

function get_neighbors(q, ε)
    list ← empty
    for each pixel x within ε distance of q do
        if x is labeled non-background then
            add x to list
        end if
    end for
    return list
end function

function explore_neighbors(neighbor_list)
    for each pixel q in neighbor_list do
        if q is not visited then
            Seed.Enqueue(q)
            mark q as visited
            add q to the new cluster
        end if
    end for
end function


References

[1] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. CVPR, 2011.

[2] R. Wang and J. Popovic. Real-time hand-tracking with a color glove. In Proc. ACM SIGGRAPH, 2009.

[3] I. Oikonomidis, N. Kyriazis, and A. Argyros. Efficient model-based 3D tracking of hand articulations using Kinect. In BMVC, Aug 2011.

[4] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. In Proc. CVPR, pages 2:775-781, 2005.

[5] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9 (2008), 1871-1874.

[6] D. A. Keim. Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, vol. 8, no. 1, pp. 1-8, Jan 2002.

[7] W. Stuerzlinger and C. Wingrave. The value of constraints for 3D user interfaces. Dagstuhl Seminar on VR, 2010.

[8] M. Hoffman, P. Varcholik, and J. LaViola. Breaking the status quo: improving 3D gesture recognition with spatially convenient input devices. IEEE VR, 2010.

[9] V. Bystritsky. ALGLIB. 14 Aug 1999. Web. http://www.alglib.net.

[10] Google Inc. (2009). Google Earth (Version 6.2) [Software]. Available from http://www.google.com/earth/index.html.

[11] P. Mistry and P. Maes. SixthSense: a wearable gestural interface. In SIGGRAPH Asia '09, Emerging Technologies, Yokohama, Japan.

[12] C. Zhang, H. Lee, and K. G. Shin. Efficient distributed linear classification algorithms via the alternating direction method of multipliers. AISTATS, 2012.
