

Pattern Recognition 44 (2011) 443–453


A multi-view vision-based hand motion capturing system

Meng-Fen Ho a,b,*, Chuan-Yu Tseng a, Cheng-Chang Lien c, Chung-Lin Huang a

a Institute of Electrical Engineering, National Tsing Hua University, HsinChu, Taiwan, ROC
b Department of Electronic Engineering, Hsiuping Institute of Technology, Taichung, Taiwan, ROC
c Department of Computer Science and Information Engineering, Chung-Hua University, HsinChu, Taiwan, ROC

Article history: Received 15 January 2010; Received in revised form 12 June 2010; Accepted 7 August 2010

Keywords: Hand motion capturing; Separable state based particle filtering (SSBPF)


Abstract

Vision-based hand motion capturing approaches play a critical role in human-computer interfaces owing to their non-invasiveness, cost effectiveness, and user friendliness. This work presents a multi-view vision-based method to capture hand motion. A 3-D hand model with structural and kinematical constraints is developed to ensure that the proposed hand model behaves similarly to an ordinary human hand. Human hand motion in a high-degree-of-freedom space is estimated by developing a separable state based particle filtering (SSBPF) method to track the finger motion. By integrating different features, including silhouette, Chamfer distance, and depth map in different view angles, the proposed motion tracking system can capture the hand motion parameters effectively and solve the self-occlusion problem of the finger motion. Experimental results indicate that the hand joint angle estimation generates an average error of 11°.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Developing an intuitive, non-invasive and intelligent human-computer interaction (HCI) method has received increasing interest recently. HCI systems can be classified as either "sensor-based" or "vision-based". The former [1–3] detects the voice, position, or motion information of humans by using microphones, electromagnetic or fiber-optical sensors, while the latter [4–7] analyzes image or video signals to monitor human behavior, which is a non-intrusive, inexpensive, and promising alternative.

Most reliable hand motion capturing schemes are based on electro-mechanical or magnetic sensing devices (DataGlove) [8]. Such equipment can provide the most complete, accurate, and application-independent set of real-time measurements; however, it is intrusive, cumbersome, and expensive, and it hinders natural hand motion. The motions of finger joints are also captured using other electro-mechanical sensors [9,10]. Although capable of providing real-time measurements of hand motion, sensor-based systems are generally expensive, cumbersome and not user-friendly.

Some vision-based methods [11] use markers attached on the hand as the feature points. Nevertheless, most studies rely purely on vision without additional equipment. A "vision-based" system is classified as either a 2-D appearance-based approach [12–15] or a 3-D model-based approach [16–19]. The former approach uses the 2-D appearance and a variety of features extracted from the original hand image to estimate the hand state. Using a large amount of training images, such methods formulate a nonlinear mapping between features and hand states. Once the mapping is found in the image feature space, the hand configuration can be estimated efficiently. However, the mapping is highly nonlinear due to variation in hand appearance under different gestures and viewing angles. Additionally, it involves a complex learning problem, and collecting a large set of training data is relatively difficult. The latter approach recovers the hand motion parameters by aligning a projected 3-D model with the observed image features and minimizing the discrepancy between them. Various image observations have been studied to construct the correspondences between the model and the observed images. The motion state is recovered from the 3-D configuration with the maximum similarity.

This work presents a multi-view-based 3-D hand pose analysis system to track wrist rotation and finger bending movements and then reconstruct the corresponding 3-D hand in virtual 3-D space. The movements are estimated by a novel system that utilizes two view angles, i.e. frontal and side view, to overcome the self-occlusion problem. First, wrist motion is analyzed with motion history images and the variation of fitted palm ellipses to determine whether the hand is in frontal or side view. With a frontal view of the hand, previous studies [2–4,6,7,11–13] have tracked the parameters of hand motion. Once the hand is rotated to a side view, the occlusion among fingers makes finger movement analysis difficult. Therefore, this work attempts to resolve the occlusion problem using different features, including depth map, Chamfer distance and silhouette. Notably, the high-dimensional problem is of priority concern owing to the high degrees of freedom (DOFs) of a single hand. Therefore, this work also develops a separable state based particle filter (SSBPF) to reduce the computational complexity. Fig. 1 schematically depicts the system flow.

Fig. 1. System overview diagram.

Fig. 2. The constructed 3-D hand model.

2. Motion constraint on hand model

Instead of using quadrics as shape primitives, cylinders, spheres and rectangles are the main elements used to construct the proposed 3-D hand model. Sizes of the fingers and palm in the 3-D hand model are scaled proportionally to the size of a real human hand. The length and width of each patch should be calibrated for each individual person. Fig. 2 illustrates the constructed 3-D hand model and the model contour projected onto the real hand.

Because of the high degree of freedom of hand motion (27 DOFs), an effective 3-D hand gesture recognition system is developed by applying reasonable movement constraints to the 3-D hand model. Hand motion generally consists of wrist and finger movements. The wrist movement, described by the transformation matrix M_W, consists of the 3-D translation matrix T and the 3-D rotation matrix R, whereas the transformation matrix M_F refers to the finger movement. Notably, M_W has 6 degrees of freedom (DOFs), while M_F has 21 DOFs. To reduce the DOFs of the 3-D hand model, the joint angles θ_middle and θ_far (Fig. 3(a)) are estimated from the joint angle θ_near with linear scaling [15]. Fig. 3 also illustrates the opening angle θ_open between the fingers from two viewing angles. Additionally, M_W and M_F are expressed as M_W: {(T_i, R_i) | i = x, y, z} and M_F: {(θ^i_near, θ^i_middle, θ^i_far, θ^i_open) | i = 1...5}, respectively.

Fig. 3. (a) The joint angles θ_near, θ_middle and θ_far and (b) the finger open angle θ_open between the index finger and the middle finger.

Fig. 4. The measurement of the finger axis.

Fig. 5. Unrealizable kinematics constraint: (a) view 1, (b) view 2.

Based on the state of hand motion described with the parameters of the M_W space and M_F space, the 3-D hand model in the virtual environment can be generated and the 2-D projection image of the 3-D hand model obtained simultaneously. Here, OpenGL [20] is used to generate the 2-D projection hand image for measuring the similarity between the 2-D projection hand image and the observed hand image. To improve the system robustness and efficiency, the estimation error incurred by finger occlusions is reduced using some constraints on hand motions.
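To make the parameterization concrete, the following sketch (a hypothetical Python illustration, not the authors' code) shows one way to hold the 6-DOF wrist state and the per-finger state, with θ_middle and θ_far derived from θ_near by the linear scaling mentioned above; the scaling factors are assumed values for illustration only.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class WristState:            # M_W: 6 DOFs (translation plus rotation about x, y, z)
    t: np.ndarray = field(default_factory=lambda: np.zeros(3))  # T_x, T_y, T_z
    r: np.ndarray = field(default_factory=lambda: np.zeros(3))  # R_x, R_y, R_z (radians)

@dataclass
class FingerState:           # one finger: theta_near and theta_open are free parameters
    theta_near: float = 0.0
    theta_open: float = 0.0

    # theta_middle and theta_far are linearly scaled from theta_near [15];
    # the 2/3 and 1/2 factors are illustrative assumptions, not the paper's values.
    @property
    def theta_middle(self) -> float:
        return (2.0 / 3.0) * self.theta_near

    @property
    def theta_far(self) -> float:
        return 0.5 * self.theta_near

@dataclass
class HandState:             # M_W (6 DOFs) plus M_F (nominally 21 DOFs before reduction)
    wrist: WristState = field(default_factory=WristState)
    fingers: list = field(default_factory=lambda: [FingerState() for _ in range(5)])

    def as_vector(self) -> np.ndarray:
        """Flatten the reduced state (6 wrist + 2 per finger) for a particle filter."""
        parts = [self.wrist.t, self.wrist.r]
        parts += [np.array([f.theta_near, f.theta_open]) for f in self.fingers]
        return np.concatenate(parts)
```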

2.1. Structural constraint

Some combinations of finger motions are infeasible, because two fingers can never occupy the same physical 3-D space. To detect such cases, the distance between the axes of two fingers (Fig. 4) is used to detect finger collision, together with the abduction angle θ_open between the two fingers. The position of the ith finger can be estimated from F_i = {θ^i_near, θ^i_middle, θ^i_far, θ^i_open} and used to evaluate the possibility of finger collision. Here, the function c^s_{i,j} represents the probability of collision between the ith and jth fingers:

c^{s}_{i,j} = \begin{cases} 1 & \text{if } \| d(F_i, F_j) \| > \text{threshold} \\ 0 & \text{otherwise} \end{cases}    (1)

where d(·) denotes the distance between two fingers. Next, considering all of the finger pairs allows us to define a probability function of finger collision as

p_s(F) = \prod_{\forall x_i, x_j \in M_F} c^{s}_{i,j}    (2)

This constraint assigns zero probability to hand joint configurations that are physically infeasible due to finger collision.

2.2. Kinematics constraint

In addition to the structural constraint, some infeasible finger motions occur owing to limitations of finger movement. For instance, Fig. 5 illustrates hand gestures from two view angles: when the pinkie bends, the ring finger also moves due to muscle constriction. Notably, more than 30 000 sets of finger motion data obtained from a DataGlove are collected to train the kinematics constraint information for each pair of fingers, defined as

p(F_i | F_j) = p(\theta^{i}_{near} | \theta^{j}_{near})    (3)

where F_i and F_j are two different subsets of M_F, defined as F_i = {θ^i_near, θ^i_middle, θ^i_far}. Independence between the fingers is assumed here, and the kinematics constraint is described in terms of a collision probability function as

p_k(F) = \prod_{\forall X_i, X_j \in M_F} c^{k}_{i,j}    (4)

where c^k_{i,j} = 1 if p(F_i | F_j) ≥ threshold; otherwise c^k_{i,j} = 0. F = {F_i | i = 1...5} represents the five different fingers. The threshold is determined by fixing one finger sequentially and measuring the bending angles of the other fingers with the DataGlove. The kinematic constraint on finger movements can then be derived from the inter-relationships among the finger bending angles.

Eqs. (2) and (4), i.e. the above constraints, are applied to solve the frontal-view occlusion problem, which is introduced in Section 5.4.
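As a concrete illustration, the sketch below (hypothetical Python; the thresholds, distance function and learned conditional are made-up stand-ins, not the paper's trained values) evaluates the structural constraint of Eq. (2) and the kinematics constraint of Eq. (4) for a candidate finger configuration.

```python
import itertools
import numpy as np

# Hypothetical thresholds; the paper determines them from DataGlove data.
COLLISION_DIST_THRESHOLD = 1.0   # minimum allowed distance between finger axes
KINEMATIC_PROB_THRESHOLD = 0.05  # minimum allowed p(theta_near_i | theta_near_j)

def structural_constraint(fingers, axis_distance):
    """p_s(F) of Eq. (2): product of c^s_{i,j} over all finger pairs.
    c^s_{i,j} is 1 when the distance between the two finger axes exceeds the
    threshold (no collision), 0 otherwise."""
    p_s = 1
    for i, j in itertools.combinations(range(len(fingers)), 2):
        c_s = 1 if axis_distance(fingers[i], fingers[j]) > COLLISION_DIST_THRESHOLD else 0
        p_s *= c_s
    return p_s

def kinematics_constraint(fingers, joint_prob):
    """p_k(F) of Eq. (4): product of c^k_{i,j}, where c^k_{i,j} is 1 when the
    learned conditional p(theta_near_i | theta_near_j) is above threshold."""
    p_k = 1
    for i, j in itertools.permutations(range(len(fingers)), 2):
        c_k = 1 if joint_prob(fingers[i]["theta_near"], fingers[j]["theta_near"]) >= KINEMATIC_PROB_THRESHOLD else 0
        p_k *= c_k
    return p_k

# Toy usage: a fake distance and a fake learned conditional just to show the call pattern.
fingers = [{"theta_near": np.deg2rad(a), "theta_open": 0.1} for a in (10, 15, 20, 60, 65)]
fake_distance = lambda fi, fj: 2.0                           # always far apart
fake_joint_prob = lambda t_i, t_j: np.exp(-abs(t_i - t_j))   # stand-in for the trained model
print(structural_constraint(fingers, fake_distance), kinematics_constraint(fingers, fake_joint_prob))
```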

3. Separable state based particle filter

In a non-Gaussian state-space model, the state sequence X_t is assumed to be a hidden Markov process with initial distribution p(X_0) and transition distribution p(X_t | X_{t-1}). The observations Z_t, with observation likelihood p(Z_t | X_t), are conditionally independent given the state X_t. By using Bayesian sequential estimation, this distribution can be computed with the following recursions: (1) Prediction: p(X_t | Z_{0:t-1}) = \int p(X_t | X_{t-1}) p(X_{t-1} | Z_{0:t-1}) dX_{t-1}, and (2) Updating: p(X_t | Z_{0:t}) = p(Z_t | X_t) p(X_t | Z_{0:t-1}) / p(Z_t | Z_{0:t-1}). Particle filtering [21–23] attempts to approximate the posterior p(X_t | Z_t) by a weighted sample set S_t = {x^i_t, π^i_t}, i = 1...N, where π^i_t ∝ p(Z_t | X_t = x^i_t). The estimate of the object state at time t is the weighted mean over all sample states, \hat{X}_t = E(S_t) = \sum_{i=1}^{N} π^i_t x^i_t / \sum_{i=1}^{N} π^i_t.

Fig. 6. The depth images.

Fig. 7. 3-D hand model with the corresponding gray level indicating the depth.

Fig. 8. The finger occlusion inference: (a) the original image, (b) the depth image, (c) the edge image, and (d) the overlapped image.
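The following sketch (hypothetical Python, not the authors' implementation) shows one iteration of a standard sampling-importance-resampling particle filter of the kind described above; the transition model and the likelihood are placeholders supplied by the caller.

```python
import numpy as np

def particle_filter_step(particles, weights, transition, likelihood, observation, rng):
    """One prediction / update / resampling cycle for N particles.

    particles : (N, d) array of states x^i_{t-1}
    transition: function(x, rng) -> sample of X_t drawn from p(X_t | X_{t-1} = x)
    likelihood: function(observation, x) -> p(Z_t | X_t = x), up to a constant
    """
    n = len(particles)
    # Prediction: propagate each particle through the transition model.
    predicted = np.array([transition(x, rng) for x in particles])
    # Update: weight each particle by its observation likelihood.
    weights = np.array([likelihood(observation, x) for x in predicted])
    weights = weights / weights.sum()
    # State estimate: weighted mean over all samples (X_hat_t).
    estimate = (weights[:, None] * predicted).sum(axis=0)
    # Resampling: draw N particles with probability proportional to the weights.
    idx = rng.choice(n, size=n, p=weights)
    return predicted[idx], np.full(n, 1.0 / n), estimate
```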

This work presents the separable state based particle filter (SSBPF) [24], a modified version of the particle filter, to track the hand motion. Based on the divide-and-conquer concept, the high-dimensional state is separated into several independent sub-states, and each sub-state is then estimated separately. With the separated states, the total number of samples required for estimation is significantly reduced, and the processing time decreases as well. SSBPF determines how to divide the state variable X into two sub-variables, i.e. X = {X_a, X_b}, where X_a = {X_a1,...,X_an} and X_b = {X_b1,...,X_bm}, under the assumption that the estimation of X_a converges. Here, the gradient of convergence when estimating the separated variable X_a should be the same as for the variable X.

Dividing the time-varying variable X_t into two parts, X_t = {X_at, X_bt}, allows us to rewrite the weight of each sample i at time t (i.e. x^i_at, x^i_bt) as

\pi^i_t \propto p(Z_t | X_{at} = x^i_{at}, X_{bt} = x^i_{bt})    (5)

Assuming that X_bt is fixed (= X_b) allows us to estimate X_at via particle filtering with observation likelihood p(Z_t | X_at = x^i_at, X_b). If X_bt is the ground truth (= X_b,gt), the estimation of X_at by particle filtering will be accurate. Normally, X_bt is not the ground truth, so we have two different estimations of X_at based on two different observation probabilities, i.e. p(Z_t | X_a, X_b) and p(Z_t | X_a, X_b,gt). Here, the ratio of these two observation probabilities is defined as

g(X_a | X_b, X_{b,gt}) = \frac{p(Z_t | X_a, X_{b,gt})}{p(Z_t | X_a, X_b)}    (6)

Once g(X_a | X_b, X_b,gt) is known, Eq. (5) can be rewritten as

\pi^i_t \propto p(Z_t | x^i_{at}, x^i_{bt}) = g(X_a | X_b, X_{b,gt}) \, p(Z_t | x^i_{at}, X_b)    (7)

Consequently, the estimation \hat{X}_t becomes

\hat{X}_t = \frac{\sum_{i=1}^{N} \pi^i_t x^i_t}{\sum_{i=1}^{N} \pi^i_t} = \frac{\sum_{i=1}^{N} p(Z_t | x^i_{at}, x^i_{bt}) \, x^i_t}{\sum_{i=1}^{N} p(Z_t | x^i_{at}, x^i_{bt})} \approx \frac{\sum_{i=1}^{N} g(X_a | X_b, X_{b,gt}) \, p(Z_t | x^i_{at}, X_b) \, x^i_t}{\sum_{i=1}^{N} g(X_a | X_b, X_{b,gt}) \, p(Z_t | x^i_{at}, X_b)}    (8)

Estimating \hat{X}_t with the fixed guess of X_bt (= X_b) requires a fixed g(X_a | X_b, X_b,gt). If g(X_a | X_b, X_b,gt) is fixed, then the observation probabilities p(Z_t | X_a, X_b) and p(Z_t | X_a, X_b,gt) are correlated. By assuming that X_bt is fixed (= X_b), the estimation \hat{X}_at can be represented as

\hat{X}_{at} = \frac{\sum_{i=1}^{N} p(Z_t | X_{at} = x^i_{at}, X_b) \, x^i_{at}}{\sum_{i=1}^{N} p(Z_t | X_{at} = x^i_{at}, X_b)}    (9)

Comparing Eqs. (8) and (9) reveals that the estimation \hat{X}_at converges in the same gradient direction as \hat{X}_t if g(X_a | X_b, X_b,gt) is known. Next, V(X_a | Z_t, X_b), with X_a = {X_a1,...,X_an}, is used to describe the gradient of convergence of the estimation \hat{X}_at as

V(X_a | Z_t, X_b) = \left( \frac{\partial p(Z_t | X_a, X_b)}{\partial X_{a1}}, \ldots, \frac{\partial p(Z_t | X_a, X_b)}{\partial X_{an}} \right)    (10)

The similarity between the convergence directions of the two estimations of \hat{X}_at (i.e. under the two conditions X_bt = X_b,gt and X_bt = X_b) is

S(X_a | X_b, X_{b,gt}) = \frac{V(X_a | Z_t, X_b) \cdot V(X_a | Z_t, X_{b,gt})}{\| V(X_a | Z_t, X_b) \| \, \| V(X_a | Z_t, X_{b,gt}) \|}    (11)

In the ideal case (e.g., g(X_a | X_b, X_b,gt) = constant), the convergence similarity S equals 1. A larger S implies that the gradients of convergence, V(X_a | Z_t, X_b) and V(X_a | Z_t, X_b,gt), are closer to each other. If the estimation \hat{X} converges, then with fixed X_b the estimation \hat{X}_a also converges and the separation of the original variable X (i.e., X = {X_a, X_b}) is applicable. Next, the constraint for a feasible separation is defined as

\forall X_b, \quad \frac{\int_{X_{a1}} \cdots \int_{X_{an}} S(X_a | X_b, X_{b,gt}) \, p(Z_t | X_a, X_b) \, dX_{an} \cdots dX_{a1}}{\int_{X_{a1}} \cdots \int_{X_{an}} p(Z_t | X_a, X_b) \, dX_{an} \cdots dX_{a1}} > \epsilon    (12)

where ε is a threshold that is determined experimentally. The state variable X can be separated into X_a and X_b if the above constraint, Eq. (12), is satisfied.

Fig. 9. Flow diagram of SSBPF for a single iteration.
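A minimal sketch of how such a separated estimation could be organized is shown below (hypothetical Python; the sub-state grouping, particle counts and models are placeholders, and the correction factor g of Eq. (6) is omitted for brevity). It alternately fixes one sub-state at its current estimate and runs a small particle filter over the other, which is the alternation that Fig. 9 describes.

```python
import numpy as np

def estimate_substate(n_particles, prior, other_fixed, likelihood, observation, rng, noise=0.05):
    """Estimate one sub-state with a small particle filter while the other
    sub-state is held fixed (cf. Eq. (9))."""
    particles = prior + noise * rng.standard_normal((n_particles, prior.shape[0]))
    weights = np.array([likelihood(observation, x, other_fixed) for x in particles])
    weights /= weights.sum()
    return (weights[:, None] * particles).sum(axis=0)

def ssbpf_iteration(x_a, x_b, lik_a, lik_b, observation, rng, n_particles=250, n_iter=4):
    """Alternate between the two separated sub-states X_a and X_b for a few iterations.
    lik_a(observation, x_a_candidate, x_b_fixed) and lik_b(observation, x_b_candidate,
    x_a_fixed) are observation likelihoods supplied by the caller."""
    for _ in range(n_iter):
        x_a = estimate_substate(n_particles, x_a, x_b, lik_a, observation, rng)
        x_b = estimate_substate(n_particles, x_b, x_a, lik_b, observation, rng)
    return x_a, x_b
```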


4. Feature generation from multiple cameras

To capture the hand motion, we use a CCD camera in the frontal view and set up another CCD camera and a depth camera in the side view. The hand movement is tracked using different features for different viewing angles. In the frontal view, the silhouette is a dependable observation in the tracking process; in the side view, the depth map can be obtained and used to deal with occlusion. The silhouette feature is a highly effective means of analyzing hand gestures from the frontal view. However, finger occlusion may occur when the wrist rotates, and this problem is overcome here using the depth feature. To gather the depth information, the depth image is captured using the Bumblebee system [25]. The finite states of finger movements, with the grayscale distribution for each finger, are calculated from the depth images (Fig. 6); a brighter intensity indicates a finger closer to the camera than the others. This information allows us to develop two main functionalities for hand tracking, i.e. finger discrimination and finger occlusion inference.

With the depth image for each hand state, a virtual 3-D hand model can be constructed based on the segmented regions of the depth image. This virtual hand model provides useful information for finger discrimination. To discriminate each finger from the depth map, we (1) check each depth image of finger motion, (2) generate each depth image's gray-level distribution, (3) calculate the mean and variance of the gray-level statistical distribution of each finger, and (4) assign a specific gray level to each finger (Fig. 7). Next, hand shape matching between the 3-D hand model and the depth image is performed using the Chamfer distance to evaluate the observation likelihood function in SSBPF.

To solve the finger occlusion problem, we (1) use Canny edge detection to obtain the edge image, (2) gather the corresponding depth image, (3) threshold the depth image to segment the brighter region, which represents the parts closest to the camera, and (4) overlap the Canny edge image onto the segmented depth image. Fig. 8 shows the results of these steps. Overlapping the edge image (Fig. 8(c)) onto the depth image (Fig. 8(b)) allows us to segment the finger regions that are close to the camera. In Fig. 8(d), the brighter region on the right side results from the ring finger being occluded by the middle finger. This inference can therefore be used to overcome the finger occlusion problem, and it is integrated into the observation likelihood to perform the matching between the target and the depth image.
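The four steps above can be sketched roughly as follows (hypothetical Python/OpenCV; the file names, threshold value and blur settings are assumptions for illustration, not values from the paper).

```python
import cv2
import numpy as np

# Assumed inputs: a grayscale side-view image and its registered depth map.
gray = cv2.imread("hand_side_view.png", cv2.IMREAD_GRAYSCALE)
depth = cv2.imread("hand_depth.png", cv2.IMREAD_GRAYSCALE)  # brighter = closer

# (1) Edge image from the intensity image.
edges = cv2.Canny(cv2.GaussianBlur(gray, (5, 5), 0), 50, 150)

# (2)-(3) Threshold the depth map to keep only the region closest to the camera.
_, near_mask = cv2.threshold(depth, 180, 255, cv2.THRESH_BINARY)

# (4) Overlap the edge image onto the segmented depth region; edge pixels inside
# the near region hint at fingers lying in front of (occluding) other fingers.
overlap = cv2.bitwise_and(edges, near_mask)
occluding_edge_ratio = overlap.sum() / max(edges.sum(), 1)
print("fraction of edge pixels in the near region:", occluding_edge_ratio)
```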

Fig. 10. Pseudo-code of SSBPF.

5. Hand tracking with multiple-feature fusion

The wrist rotation and the motion parameters of each finger are tracked based on the depth, edge, and silhouette features of the hand. The occlusion problems that occur under different view angles are also resolved.

5.1. Estimating the wrist rotation angles

The hand rotation angles are detected by segmenting the palm, which is the rigid part of the hand. Here, the palm is separated from the hand region by using the segmentation method in [26]. Next, by applying ellipse fitting to the palm region, five parameters are derived: the location of the centroid, the lengths of the major and minor axes, and the inclination angle. The ellipse parameters change when the hand rotates. First, the hand rotation direction is estimated using the motion history image (MHI). The rotation angle θ is then determined as

\theta(F_t) = MHI(F_t) \cdot ERatio(F_t)    (13)

where F_t is the frame number index, MHI(F_t) denotes the hand rotation direction, and ERatio(F_t) represents the direction-less rotation angle (|θ|), which is determined by the ratio of the major and minor axes of the ellipse. Here, ERatio(F_t) is quantized into 4 levels as

ERatio(F_t) = \begin{cases} 0^{\circ} & \text{if } Ratio < 0.25 \\ 20^{\circ} & \text{if } 0.25 \leq Ratio < 0.50 \\ 40^{\circ} & \text{if } 0.50 \leq Ratio < 0.75 \\ 80^{\circ} & \text{if } 0.75 \leq Ratio \leq 1 \end{cases}    (14)

where Ratio = (length of the major axis)/(length of the minor axis).
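A direct transcription of Eqs. (13) and (14) might look like the following (hypothetical Python; cv2.fitEllipse is used here only as a convenient way to obtain the axis lengths and is not necessarily what the authors used). Note that the paper defines Ratio as major/minor, but since Eq. (14) quantizes values in [0, 1], this sketch assumes the minor/major ratio; that interpretation is an assumption.

```python
import cv2
import numpy as np

def eratio_level(palm_contour):
    """Quantize the palm-ellipse axis ratio into the four levels of Eq. (14)."""
    (_, _), (axis1, axis2), _ = cv2.fitEllipse(palm_contour)   # lengths of the two axes
    # Assumption: use minor/major so the value falls in [0, 1] as Eq. (14) expects.
    ratio = min(axis1, axis2) / max(max(axis1, axis2), 1e-6)
    if ratio < 0.25:
        return 0.0
    elif ratio < 0.50:
        return 20.0
    elif ratio < 0.75:
        return 40.0
    else:
        return 80.0

def wrist_rotation_angle(mhi_direction, palm_contour):
    """Eq. (13): signed rotation = MHI direction (+1 or -1) times the quantized magnitude."""
    return mhi_direction * eratio_level(palm_contour)
```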

5.2. Hand tracking with SSBPF

The hand motion is tracked by decomposing the hand motion parameter X into two parts, i.e. R_1,...,R_m and X_1,...,X_m. R_1,...,R_m, which include global information such as the translations and orientations of the entire hand, must be updated first. Let Z_t denote the observation from the original video data and X_t denote the states in the SSBPF. Fig. 9 shows the flow diagram for a single iteration. In each iteration, the state estimation is implemented by estimating both the sub-state R_p and the sub-state X_m. Fig. 10 summarizes the algorithm of the entire SSBPF process. More iterations generally imply more accurate results.

Fig. 11. Hand wrist rotation and the reconstructed model's edge overlaid on it: (a) frontal view, (b) and (c) rotate clockwise, (d) and (e) rotate counterclockwise.

Fig. 12. The angle estimation of hand wrist rotation. The green line represents the ratio of the major and minor axes; the red line represents the MHI direction of hand wrist rotation; the blue line represents the estimated hand wrist rotation angle. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 13. Reconstructed models for the cases of combining a rotating wrist with bent fingers.

Table 1. Comparison of the particle filter and SSBPF.

Method                                Particles × iterations    Average error (deg.)
Particle filter (CONDENSATION [21])   6000 × 1                  19.75317
                                      12 000 × 1                19.71322
                                      30 000 × 1                17.41982
SSBPF                                 1000 × 1                  12.72792
                                      250 × 4                   11.22609

Fig. 14. Error versus the number of iterations for a fixed processing time.

5.3. Observation model

The hand region is located based on two observations: the hand silhouette for the frontal view and the Chamfer distance for the side view. The hand silhouette is extracted from the frontal view by using the HSI color space to detect skin color. From the side view, the similarity between the constructed hand template and the observed hand image is determined by using the Chamfer distance [27]. The minimum distance between the hand image and the hand template is represented by a distance image, and the Chamfer score is derived by averaging the distances in the distance image.

The hand silhouettes from the frontal and side views are used for hand tracking. Based on the two observations from the two different views, two observation likelihood functions are obtained, i.e. p_frontal(Z_silh | X) and p_side(Z_DT | X), which are calculated by measuring the difference between the target image and the candidate model. The similarity between two binary hand silhouette images I_1 and I_2 with the same width and height is defined as

D_{silh}(I_1, I_2) = \sum_{x} \sum_{y} I_1(x, y) \oplus I_2(x, y)    (15)

where \oplus represents the XOR operation. The observation likelihood can then be defined as

p_{frontal}(Z_{silh} | X) = \frac{1}{\sqrt{2\pi}\,\sigma_{frontal}} \exp\left[ -\frac{1}{2} \left( \frac{D_{silh}(Z_{silh}, C(X))}{\sigma_{frontal}} \right)^{2} \right]    (16)

where Z_silh denotes the observed hand image and C(X) refers to the candidate hand model.

Similarly, the side likelihood function p_side(Z_DT | X) can be modeled by the Chamfer distance between the target image and the candidate model. Let the distance between the distance transform I_DT and the candidate's edge image I_edge be defined as

D_{DT}(I_{DT}, I_{edge}) = \sum_{x} \sum_{y} I_{DT}(x, y) \cdot I_{edge}(x, y)    (17)

The observation likelihood is then defined as

p_{side}(Z_{DT} | X) = \frac{1}{\sqrt{2\pi}\,\sigma_{DT}} \exp\left[ -\frac{1}{2} \left( \frac{D_{DT}(Z_{DT}, C(X))}{\sigma_{DT}} \right)^{2} \right]    (18)

where Z_DT denotes the observed hand image and C(X) is the candidate hand model for a given X.
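A possible numpy transcription of Eqs. (15)-(18) is sketched below (hypothetical; the σ values are placeholders, and the projected candidate silhouette and edge images are assumed to come from rendering the hand model, e.g. via OpenGL as the paper does).

```python
import numpy as np

SIGMA_FRONTAL = 500.0   # placeholder spread for the silhouette likelihood
SIGMA_DT = 300.0        # placeholder spread for the Chamfer likelihood

def silhouette_distance(obs_silh, model_silh):
    """Eq. (15): count of pixels where the two binary silhouettes disagree (XOR)."""
    return np.logical_xor(obs_silh > 0, model_silh > 0).sum()

def frontal_likelihood(obs_silh, model_silh, sigma=SIGMA_FRONTAL):
    """Eq. (16): Gaussian-shaped likelihood of the silhouette mismatch."""
    d = silhouette_distance(obs_silh, model_silh)
    return np.exp(-0.5 * (d / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def chamfer_distance(obs_dt, model_edge):
    """Eq. (17): observation distance-transform values summed over model edge pixels."""
    return (obs_dt * (model_edge > 0)).sum()

def side_likelihood(obs_dt, model_edge, sigma=SIGMA_DT):
    """Eq. (18): Gaussian-shaped likelihood of the Chamfer score."""
    d = chamfer_distance(obs_dt, model_edge)
    return np.exp(-0.5 * (d / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
```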

5.4. Finger occlusion

Finger occlusion is the most serious challenge in estimating the 3-D hand parameters. This work develops a joint angle correlation and a finger depth consistency measure to overcome the finger occlusion problem.

(1) Frontal-view-based observation probability function: with the structural and kinematics constraints of an ordinary human hand, the observation likelihood in the frontal view is written as

p_{frontal}(Z | X) = p_s(X_{silh}) \, p_k(X_{silh}) \, p_{frontal}(Z_{silh} | X)    (19)

where p_s(X_silh) and p_k(X_silh) are the two constraint probability functions described in Eqs. (2) and (4), respectively. Once occlusion occurs, two different hand postures may show the same silhouette. Therefore, the frontal-view occlusion problem is solved by using these two constraints.

(2) Side-view-based observation probability function: based on the depth information, the observation likelihood in the side view can be expressed as

p_{side}(Z | X) = p_{depth}(Z | X^{depth}_i, X^{depth}_j) \, p_{side}(Z_{DT} | X)    (20)

where the likelihood function p_depth(Z | X^depth_i, X^depth_j) is based on the correlation between two different finger joints (i.e., X^depth_i and X^depth_j), which is related to the overlapped area between the real hand image and the depth image:

p_{depth}(Z | X^{depth}_i, X^{depth}_j) \propto \exp\left( -\frac{A_{dep} \cap A_{ori}}{2\sigma^{2}} \right)    (21)

where A_dep represents the depth image and A_ori denotes the original image. The finger state in the frontal view is estimated using Eq. (19), while the finger state in the side view is estimated using Eq. (20).
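As a rough illustration of how Eqs. (19)-(21) combine the pieces defined earlier, the following hypothetical Python sketch fuses the constraint terms, the appearance likelihoods, and a depth-overlap term; the overlap sign convention follows Eq. (21) as printed, the σ value is a placeholder, and the likelihood callables are assumed to be functions of the kind sketched after Eq. (18).

```python
import numpy as np

SIGMA_DEPTH = 50.0  # placeholder spread for the depth-overlap term

def depth_correlation_likelihood(depth_mask, hand_mask, sigma=SIGMA_DEPTH):
    """Eq. (21): a term driven by the overlap between the thresholded depth
    region A_dep and the segmented hand region A_ori."""
    overlap = np.logical_and(depth_mask > 0, hand_mask > 0).sum()
    return np.exp(-overlap / (2.0 * sigma ** 2))

def frontal_observation(obs_silh, model_silh, p_s, p_k, silh_likelihood):
    """Eq. (19): structural and kinematic constraints gate the silhouette likelihood."""
    return p_s * p_k * silh_likelihood(obs_silh, model_silh)

def side_observation(obs_dt, model_edge, depth_mask, hand_mask, chamfer_likelihood):
    """Eq. (20): depth-correlation term times the Chamfer-based likelihood."""
    return depth_correlation_likelihood(depth_mask, hand_mask) * chamfer_likelihood(obs_dt, model_edge)
```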

6. Experimental results

The experiments involve a depth camera and two ordinary CCD cameras capturing the hand motion with the same initial gesture. Each video sequence consists of more than one thousand frames at a frame rate of 25 FPS. Hand articulations are tracked while the palm is treated as a rigid body. Cases of a skewed palm are not considered, since a skewed palm does not fit our 3-D hand model.

6.1. Estimation of hand wrist rotation

The reconstructed 3-D hand model image is overlapped with the original image to illustrate the results of our motion capturing system (Fig. 11). Fig. 12 shows the estimated hand wrist rotation angle. Fig. 13 indicates that the proposed system works well for cases that combine a rotating wrist with bent fingers.

Fig. 15. Five-finger bending trend; the green line shows the ground truth captured from the DataGlove and the blue line indicates the motion capturing results (normalized value: 1 = 90°, 0.8 = 72°, 0.6 = 54°, 0.4 = 36°, 0.2 = 18°). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

6.2. Comparison between particle filter and SSBPF

The particle filter (CONDENSATION [21]) and SSBPF are compared using the captured images of only one camera. The same 3-D hand model is used with a 21-DOF particle filter with different particle numbers. Table 1 reveals that, for the same processing time, the average error of the original particle filter is nearly twice that of the proposed method. Table 1 also summarizes the accuracy of the two methods: increasing the number of particles does not conspicuously enhance the accuracy of the original particle filter, whereas different settings of the proposed method at the same processing time can decrease the average error by 10%. The average error of the experiments is determined as

Error = \frac{\sum_{i=0}^{4} \left| Finger^{Est}_{i} - Finger^{GT}_{i} \right|}{5}    (22)

where Finger^Est represents the estimation results, Finger^GT represents the ground truth, and i indexes the different fingers.

Fig. 14 illustrates the accuracy after different numbers of iterations with the same processing time. The figure also indicates that the proposed system performs well, converging within five iterations.
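Eq. (22) is a per-frame mean absolute joint-angle error over the five fingers; a direct numpy transcription (with invented example angles in degrees) would be:

```python
import numpy as np

def average_finger_error(estimated_deg, ground_truth_deg):
    """Eq. (22): mean absolute error over the five finger joint angles (degrees)."""
    estimated_deg = np.asarray(estimated_deg, dtype=float)
    ground_truth_deg = np.asarray(ground_truth_deg, dtype=float)
    return np.abs(estimated_deg - ground_truth_deg).mean()

# Example with invented numbers: estimation vs. DataGlove ground truth for one frame.
print(average_finger_error([32, 58, 61, 45, 10], [40, 50, 70, 50, 5]))  # -> 7.0
```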

6.3. Comparison with the ground truth

This section compares the estimated finger movement parameters with the ground truth gathered by the DataGlove. Fig. 15 shows the difference between the ground truth and the motion capturing results: the green line represents the sequence of each finger's bending joint angle, while the blue line denotes the results of SSBPF. Even when the ground truth has substantial variations, the proposed system can still generate similar estimates effectively. Additionally, according to the experimental results, the error between the estimated hand joint angle and the ground truth is less than 19°. Two video sequences are evaluated, one with the DataGlove (Fig. 16) and one without (Figs. 17 and 18); the ground truth is collected using the DataGlove.

Fig. 16. The source and estimation results of the sequence with DataGlove: (a), (b) and (c) represent three different states.

Fig. 17. The source and estimation results of the frontal-view sequence without DataGlove. Left: the source image. Middle: the reconstructed hand model. Right: the silhouette overlapped with the estimation result.

Fig. 18. The source and estimation results of the side-view sequence without DataGlove. Left: the source image. Middle: the reconstructed hand model. Right: the silhouette overlapped with the estimation result.

Experimental results indicate that the proposed system functions properly on the frontal-view sequence. Compared with [19], their model computes more efficiently by using fewer particles than ours. However, their system only deals with the frontal-view sequence and does not address the occlusion problems in side-view images. Although [19] offered quantitative evaluations to prove the accuracy of their algorithm, their numerical results cannot be compared directly with ours: to compare with the ground truth, Ying et al. [19] performed validation experiments using an animated hand motion sequence, which differs from the real hand video sequences used in our experiments.

Fig. 18 demonstrates that the proposed method can overcome the finger occlusion problems that often occur in the side-view sequence. The fusion of different features allows us to track the hand articulations. Additionally, using the structural and kinematics constraints, as well as the depth image, allows us to resolve the occlusion problems under different view angles. Rather than using the traditional particle filter, the entire sample space is separated into several subsets to increase the tracking efficiency. Furthermore, when the palm does not face the camera, the hand posture under finger occlusion is still estimated successfully.

7. Conclusion and future work

This work presents a novel hand motion capturing system to track the rotating hand with finger movements. A novel method called separable state based particle filtering (SSBPF) is developed to reduce the high-dimensional state space when estimating the hand motion. Combining different features such as silhouette, depth and Chamfer distance allows us to track the hand motion efficiently and solve the finger occlusion problems. The proposed system is highly promising for many applications, including cooperative presentation and intelligent human-computer interface systems.

The proposed method assumes that the hand wrist rotates within restricted angles. Efforts are underway to increase the flexibility of the proposed method, handle more cluttered backgrounds and integrate a 2-D appearance-based method with the proposed method. Appearance-based methods can provide a better initial guess, as well as reduce the number of required samples without sacrificing accuracy.

References

[1] M. Bray, E. Koller-Meier, L. Van Gool, Smart particle filtering for 3D hand tracking, in: Proceedings of the IEEE FGR, 2004, pp. 675–680.
[2] M. Bray, E. Koller-Meier, L. Van Gool, Smart particle filtering for high-dimensional tracking, Computer Vision and Image Understanding 106 (2007) 116–129.
[3] D. Huan, E. Charbon, 3D hand model fitting for virtual keyboard system, in: IEEE Workshop on Applications of Computer Vision, 2007, p. 31.
[4] W. Ying, J.Y. Lin, T.S. Huang, Capturing natural hand articulation, ICCV 2 (2001) 426–432.
[5] A. Erol, G. Bebis, M. Nicolescu, R.D. Boyle, X. Twombly, A review on vision-based full DOF hand motion estimation, CVPR (2005) 75.
[6] A. El-Sawah, C. Joslin, N.D. Georganas, E.M. Petriu, A framework for 3D hand tracking and gesture recognition using elements of genetic programming, in: Proceedings of the IEEE CRV, 2007, pp. 495–502.
[7] A. Erol, G. Bebis, M. Nicolescu, R.D. Boyle, X. Twombly, Vision-based hand pose estimation: a review, Computer Vision and Image Understanding 108 (2007) 52–73.
[8] DataGlove, 5DT Fifth Dimension Technologies, http://www.5dt.com/products/pdataglove14.html.
[9] D.J. Sturman, D. Zeltzer, A survey of glove-based input, IEEE Transactions on Computer Graphics and Applications 14 (1994) 30–39.
[10] R.G. O'Hagan, A. Zelinsky, S. Rougeaux, Visual gesture interfaces for virtual environments, Interacting with Computers 14 (2002) 231–250.
[11] C.-C. Lien, A scalable model-based hand posture analysis system, Machine Vision and Applications 16 (2005) 157–169.
[12] W. Ying, T.S. Huang, View-independent recognition of hand postures, CVPR 2 (2000) 88–94.
[13] V. Athitsos, S. Sclaroff, Estimating 3D hand pose from a cluttered image, CVPR 2 (2003) I-432–9.
[14] V. Athitsos, S. Sclaroff, An appearance-based framework for 3D hand shape classification and camera viewpoint estimation, in: Proceedings of the IEEE FGR, 2002, pp. 40–45.
[15] A. Imai, N. Shimada, Y. Shirai, 3-D hand posture recognition by training contour variation, in: Proceedings of the IEEE FGR, 2004, pp. 895–900.
[16] C. Wen-Yan, C. Chu-Song, H. Yi-Ping, Appearance-guided particle filtering for articulated hand tracking, CVPR 1 (2005) 235–242.
[17] M. Bray, E. Koller-Meier, N.N. Schraudolph, L. Van Gool, Fast stochastic optimization for articulated structure tracking, Image and Vision Computing 25 (2007) 352–364.
[18] E.B. Sudderth, M.I. Mandel, W.T. Freeman, A.S. Willsky, Visual hand tracking using nonparametric belief propagation, CVPR (2004) 189.
[19] W. Ying, J. Lin, T.S. Huang, Analyzing and capturing articulated hand motion in image sequences, IEEE Transactions on PAMI 27 (2005) 1910–1922.
[20] OpenGL, http://www.opengl.org.
[21] M. Isard, A. Blake, CONDENSATION—conditional density propagation for visual tracking, International Journal of Computer Vision 29 (1998) 5–28.
[22] M. Isard, A. Blake, ICONDENSATION: unifying low-level and high-level tracking in a stochastic framework, in: Proceedings of the ECCV, 1998, pp. 893–908.
[23] K. Nummiaro, E. Koller-Meier, L. Van Gool, An adaptive color-based particle filter, Image and Vision Computing 21 (2003) 99–110.
[24] Ching-Yu Weng, A vision-based hand motion parameter capturing system, Master's Thesis, Institute of Electrical Engineering, National Tsing Hua University, July 2007.
[25] Bumblebee2, Point Grey Research, http://www.ptgrey.com/products/stereo.asp.
[26] G. Amayeh, G. Bebis, A. Erol, M. Nicolescu, A component-based approach to hand verification, CVPR (2007) 1–8.
[27] B. Stenger, A. Thayananthan, P.H.S. Torr, R. Cipolla, Estimating 3D hand pose using hierarchical multi-label classification, Image and Vision Computing 25 (2007) 1885–1894.

Meng-Fen Ho received her B.S. degree in Electrical Engineering from National Tsing-Hua University, Hsin-Chu, Taiwan, ROC, in 1992, and her M.S. degree in Electrical Engineering from Iowa State University, USA, in 1994. Currently, she is a lecturer in the Department of Electronic Engineering, Hsiuping Institute of Technology, Taichung, Taiwan. She is also working toward her Ph.D. degree in Electrical Engineering at National Tsing-Hua University, Hsin-Chu, Taiwan. Her research interests are in the areas of image processing and computer vision.

Chuan-Yu Tseng received his B.S. degree in Electrical Engineering from Chung Yuan Christian University, Chung-Li, Taiwan, ROC, in 2006, and his M.S. degree in Electrical Engineering from National Tsing-Hua University, Hsin-Chu, Taiwan, in 2008. Currently, he works for Formosa Plastics Group, Taiwan, as an engineer. His research interests are in the areas of image processing and computer vision.

Cheng-Chang Lien received the B.S. degree from the Electrical Engineering Department, Chung Yuan University, Tauyuan, Taiwan, in 1987. He received the M.S. and Ph.D. degrees from the Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, in June 1992 and September 1997, respectively. Currently, he is an associate professor in the Department of Computer Science and Information Engineering, Chung-Hua University, Hsinchu, Taiwan. His research interests are in the areas of multimedia signal processing, pattern recognition and computer vision.

Chung-Lin Huang received his B.S. degree in Nuclear Engineering from National Tsing-Hua University, Hsin-Chu, Taiwan, ROC, in 1977, and his M.S. degree in Electrical Engineering from National Taiwan University, Taipei, Taiwan, ROC, in 1979. He obtained his Ph.D. degree in Electrical Engineering from the University of Florida, Gainesville, FL, USA, in 1987. From 1987 to 1988, he worked for the Unisys Co., Orange County, CA, USA, as a project engineer. Since August 1988 he has been with the Electrical Engineering Department, National Tsing-Hua University, Hsin-Chu, Taiwan, ROC, where he is currently a professor. In 1993 and 1994, he received the Distinguished Research Awards from the National Science Council, Taiwan, ROC. In November 1993, he received the Best Paper Award from the ACCV, Osaka, Japan, and in August 1996, he received the Best Paper Award from the CVGIP Society, Taiwan, ROC. In December 1997, he received the best paper award from the IEEE ISMIP Conference held at Academia Sinica, Taipei. In 2002, he received the Best Paper Annual Award from the Journal of Information Science and Engineering, Academia Sinica, Taiwan. His research interests are in the areas of image processing, computer vision, and visual communication. He is a senior member of the IEEE.