Pattern Recognition 44 (2011) 443–453
doi:10.1016/j.patcog.2010.08.012
A multi-view vision-based hand motion capturing system
Meng-Fen Ho a,b,n, Chuan-Yu Tseng a, Cheng-Chang Lien c, Chung-Lin Huang a
a Institute of Electrical Engineering, National Tsing Hua University, HsinChu, Taiwan, ROC
b Department of Electronic Engineering, Hsiuping Institute of Technology, Taichung, Taiwan, ROC
c Department of Computer Science and Information Engineering, Chung-Hua University, HsinChu, Taiwan, ROC
Article info
Article history:
Received 15 January 2010
Received in revised form
12 June 2010
Accepted 7 August 2010
Keywords:
Hand motion capturing
Separable state based particle filtering (SSBPF)
Abstract
Vision-based hand motion capturing approaches play a critical role in human–computer interfaces owing to their non-invasiveness, cost effectiveness, and user friendliness. This work presents a multi-view vision-based method to capture hand motion. A 3-D hand model with structural and kinematic constraints is developed to ensure that the proposed hand model behaves like an ordinary human hand. Human hand motion in a high-degree-of-freedom space is estimated by developing a separable state based particle filtering (SSBPF) method to track the finger motion. By integrating different features, including silhouette, Chamfer distance, and depth maps from different view angles, the proposed motion tracking system can capture the hand motion parameters effectively and solve the self-occlusion problem of finger motion. Experimental results indicate that the hand joint angle estimation has an average error of 11°.
© 2010 Elsevier Ltd. All rights reserved.
1. Introduction
Developing an intuitive, non-invasive and intelligent human–computer interaction (HCI) method has received increasing interest recently. HCI systems can be classified as either "sensor-based" or "vision-based". The former [1–3] detects the voice, position, or motion information of humans by using microphones, electromagnetic or fiber-optical sensors, while the latter [4–7] analyzes image or video signals to monitor human behavior, which is a non-intrusive, inexpensive, and promising alternative.
Most reliable hand motion capturing schemes are based on electro-mechanical or magnetic sensing devices (DataGlove) [8]. Such equipment can provide the most complete, accurate, and application-independent set of real-time measurements; however, it is intrusive, cumbersome, and expensive, and hinders natural hand motion. The motions of finger joints are captured using other electro-mechanical sensors [9,10]. Although capable of providing real-time measurements of hand motion, sensor-based systems are generally expensive, cumbersome and not user-friendly.
Vision-based methods [11] use markers attached to the hand as feature points. Nevertheless, most studies rely purely on vision without additional equipment. A "vision-based"
system is classified as either a 2-D appearance-based approach [12–15] or a 3-D model-based approach [16–19]. The former uses the 2-D appearance and various features extracted from the original hand image to estimate the hand state. Using a large number of training images, such methods formulate a nonlinear mapping between features and hand states. Once the mapping is found in the image feature space, the hand configuration can be estimated efficiently. However, the mapping is highly nonlinear due to variation in hand appearance under different gestures and viewing angles. Additionally, it involves a complex learning problem, and collecting a large set of training data is relatively difficult. The latter approach recovers the hand motion parameters by aligning a projected 3-D model with the observed image features, minimizing the discrepancy between them. Various image observations have been studied to construct the correspondences between the model and the observed images. The motion state is recovered from the 3-D configuration with the maximum similarity.
This work presents a multi-view-based 3-D hand pose analysis system to track wrist rotation and finger bending movements and then reconstruct the corresponding 3-D hand in virtual 3-D space. The movements are estimated by a novel system that utilizes two view angles, i.e. frontal and side views, to overcome the self-occlusion problem. First, wrist motion is analyzed with motion history images and the variation of fitted palm ellipses to determine whether the hand is in frontal or side view. With a frontal view of the hand, previous studies [2–4,6,7,11–13] have tracked the parameters of hand motion. Once the hand rotates to a side view, occlusion among the fingers makes finger movement analysis difficult. Therefore, this work
Fig. 1. System overview diagram.
Fig. 2. The constructed 3-D hand model.
attempts to resolve the occlusion problem using different features, including the depth map, Chamfer distance and silhouette. Notably, the high-dimensional problem is of priority concern owing to the high degrees of freedom (DOFs) of a single hand. Therefore, this work also develops a separable state based particle filter (SSBPF) to reduce the computational complexity. Fig. 1 schematically depicts the system flow.
2. Motion constraint on hand model
Instead of using quadrics as shape primitives, cylinders, spheres and rectangles are the main elements used to construct the proposed 3-D hand model. Sizes of the fingers and palm in the 3-D hand model are scaled proportionally to the size of a real human hand. The length and width of each patch should be calibrated for each individual person. Fig. 2 illustrates the constructed 3-D hand model and the model contour projected onto the real hand.
Because of the high degree of freedom of hand motion (27 DOFs), an effective 3-D hand gesture recognition system is developed by applying reasonable motion constraints to the 3-D hand model. Hand motion generally consists of wrist and finger movements. The wrist movement, described with the transformation matrix MW, consists of the 3-D translation matrix T and 3-D rotation matrix R, whereas the transformation matrix MF refers to the finger movement. Notably, MW has 6 degrees of freedom (DOFs), while MF has 21 DOFs. To reduce the DOFs of the 3-D hand model, the joint angles θ_middle and θ_far (Fig. 3(a)) are
Fig. 3. (a) The joint angles θ_near, θ_middle and θ_far and (b) the finger open angle θ_open between the index finger and the middle finger.
Fig. 4. The measurement of the finger axis.
Fig. 5. Unrealizable kinematics constraint: (a) view 1, (b) view 2.
estimated based on the joint angle θ_near with linear scaling [15]. Fig. 3 also illustrates the opening angles θ_open between the fingers from two viewing angles. Additionally, MW and MF are depicted as MW: {(T_i, R_i) | i = x, y, z} and MF: {(θ_near^i, θ_middle^i, θ_far^i, θ_open^i) | i = 1…5}, respectively.

Based on the state of hand motion described with the parameters of the MW space and MF space, the 3-D hand model in the virtual environment can be generated and the 2-D projection image of the 3-D hand model obtained simultaneously. Here, OpenGL [20] is used to generate the 2-D projection hand image for measuring the similarity between the 2-D projection hand image and the observed hand image. To improve the system robustness and efficiency, the estimation error incurred by finger occlusions is reduced using some constraints on hand motions.
2.1. Structural constraint
Some combinations of finger motions are infeasible, because two fingers never occupy the same physical 3-D space. To detect such cases, the distance between the axes of two fingers (Fig. 4) is used to detect finger collision, using the angle θ_open abducted between the two fingers. The position of the i-th finger can be estimated using F_i = {θ_near^i, θ_middle^i, θ_far^i, θ_open^i} and applied to calculate the possibility of finger collision. Here, the function cs_i,j represents the probability of collision between the i-th and j-th fingers:

    cs_i,j = 1 if ||d(F_i, F_j)|| > threshold, and cs_i,j = 0 otherwise    (1)

where d(·) denotes the distance between two fingers. Next, considering all of the finger pairs allows us to define a probability function of finger collision as

    p_s(F) = ∏_{∀X_i, X_j ∈ MF} cs_i,j    (2)

This constraint assigns zero probability to hand joint configurations that are physically infeasible because of finger collision.
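Eqs. (1) and (2) amount to a hard feasibility check over all finger pairs. The sketch below illustrates this, assuming a hypothetical dictionary of pairwise inter-finger distances stands in for the paper's axis-distance function d(·); all names and thresholds are illustrative, not the authors' implementation.

```python
from itertools import combinations

def collision_indicator(d_ij, threshold):
    """Eq. (1): cs_ij = 1 when the inter-finger distance exceeds the
    collision threshold (no collision), 0 otherwise."""
    return 1 if d_ij > threshold else 0

def structural_constraint(finger_distances, threshold):
    """Eq. (2): product of cs_ij over all finger pairs; the pose gets
    probability 0 as soon as any pair of fingers collides.
    finger_distances maps a pair (i, j) to the distance d(F_i, F_j)."""
    p = 1
    for _, d_ij in finger_distances.items():
        p *= collision_indicator(d_ij, threshold)
    return p

# Hypothetical distances between the axes of the 5 fingers (arbitrary units).
dists = {(i, j): 2.0 for i, j in combinations(range(5), 2)}
print(structural_constraint(dists, threshold=1.0))  # feasible pose -> 1
dists[(0, 1)] = 0.3   # thumb and index too close: collision
print(structural_constraint(dists, threshold=1.0))  # infeasible -> 0
```

Because the product zeroes out on any colliding pair, the constraint acts as a rejection test inside the particle weighting rather than a graded probability.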
2.2. Kinematics constraint
In addition to the structural constraint, some infeasible finger motions arise from limitations of finger movement. For instance, Fig. 5 illustrates the hand gestures from two view angles: when the pinkie bends, the ring finger also moves owing to muscle constriction. Notably, more than 30,000 sets of finger motion data obtained from the DataGlove are collected to train the kinematics constraint information for each pair of fingers, defined as

    p(F_i | F_j) = p(θ_near^i | θ_near^j)    (3)

where F_i and F_j are two different subsets of MF, defined as F_i = {θ_near^i, θ_middle^i, θ_far^i}. Independence between the fingers is assumed here, and the kinematics constraint is described in terms of a collision probability function as

    p_k(F) = ∏_{∀X_i, X_j ∈ MF} ck_i,j    (4)

where ck_i,j = 1 if p(F_i | F_j) ≥ threshold; otherwise ck_i,j = 0. F = {F_i | i = 1…5} represents the 5 different fingers. The threshold is determined by fixing one finger sequentially and measuring the bending angles of the other fingers with the DataGlove. The kinematic constraint on finger movements can then be derived from the inter-relationships among the finger bending angles.
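A minimal sketch of Eqs. (3) and (4), assuming a tiny hypothetical table of learned conditional probabilities in place of the DataGlove-trained model; the numbers below are invented for illustration only.

```python
def kinematic_indicator(p_cond, threshold):
    """ck_ij = 1 if p(F_i | F_j) >= threshold, else 0 (Eq. 4)."""
    return 1 if p_cond >= threshold else 0

def kinematic_constraint(cond_probs, threshold):
    """Eq. (4): product of ck_ij over the finger pairs present.
    cond_probs maps (i, j) to p(theta_near^i | theta_near^j), which the
    paper estimates from >30,000 DataGlove samples."""
    p = 1
    for _, pc in cond_probs.items():
        p *= kinematic_indicator(pc, threshold)
    return p

# Hypothetical learned probabilities: bending the pinkie (finger 4) while
# the ring finger (3) stays fully extended is almost never observed.
probs = {(3, 4): 0.02, (4, 3): 0.02, (1, 2): 0.6, (2, 1): 0.6}
print(kinematic_constraint(probs, threshold=0.05))  # -> 0, pose rejected
```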
Eqs. (2) and (4), i.e. the above constraints, are applied to solve thefrontal-view occlusion problem, which is introduced in Section 5.4.
3. Separable state based particle filter
In a non-Gaussian state-space model, the state sequence X_t is assumed to be a hidden Markov process with initial distribution p(X_0) and transition distribution p(X_t | X_{t−1}). The observations Z_t with observation likelihood p(Z_t | X_t) are conditionally independent given the state X_t. By Bayesian sequential estimation, the posterior distribution can be computed using the following recursions: (1) Prediction:
Fig. 8. The finger occlusion inference: (a) the original image, (b) the depth image, (c) the edge image, and (d) the overlapped image.
Fig. 7. 3-D hand model whose corresponding gray levels indicate the depth.
Fig. 6. The depth images.
p(X_t | Z_{0:t−1}) = ∫ p(X_t | X_{t−1}) p(X_{t−1} | Z_{0:t−1}) dX_{t−1}, and (2) Updating: p(X_t | Z_{0:t}) = p(Z_t | X_t) p(X_t | Z_{0:t−1}) / p(Z_t | Z_{0:t−1}). Particle filtering [21–23] attempts to approximate the posterior p(X_t | Z_t) by a weighted sample set S_t = {x_t^i, π_t^i}, i = 1…N, where π_t^i ∝ p(Z_t | X_t = x_t^i). The estimate of the object state at time t is the weighted mean over all sample states, X̂_t = E(S_t) = Σ_{i=1}^N π_t^i x_t^i / Σ_{i=1}^N π_t^i.
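The predict/update recursion and the weighted-mean estimate above can be sketched as a generic SIR-style particle filter step on a 1-D toy state; the Gaussian motion and observation models and all parameter values here are illustrative assumptions, not the paper's hand-tracking likelihood.

```python
import random, math

def particle_filter_step(particles, z, motion_std=0.1, obs_std=0.2):
    """One predict/update cycle of a generic SIR particle filter.
    Prediction: sample from p(X_t | X_{t-1}) (here, Gaussian drift).
    Update: weight pi_t^i proportional to p(Z_t | X_t = x_t^i)."""
    # Prediction step
    predicted = [x + random.gauss(0.0, motion_std) for x in particles]
    # Update step: Gaussian observation likelihood
    weights = [math.exp(-0.5 * ((z - x) / obs_std) ** 2) for x in predicted]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # State estimate: weighted mean over all samples
    estimate = sum(w * x for w, x in zip(weights, predicted))
    # Resampling keeps the sample set focused on high-likelihood states
    resampled = random.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate

random.seed(0)
particles = [random.uniform(-1, 1) for _ in range(500)]
for z in [0.5, 0.5, 0.5]:          # repeated observation near 0.5
    particles, est = particle_filter_step(particles, z)
print(round(est, 2))               # estimate close to 0.5
```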
This work presents the separable state based particle filter (SSBPF) [24], a novel method that modifies the particle filter to track the hand motion. Based on the divide-and-conquer concept, the high-dimensional state is separated into several independent sub-states, and each independent sub-state is estimated separately. With the separated states, the total number of samples required for estimation is significantly reduced, and the processing time decreases as well. SSBPF determines how to divide the state variable X into two sub-variables, i.e. X = {X_a, X_b}, where X_a = {X_a1, …, X_an} and X_b = {X_b1, …, X_bm}, under the assumption that the estimation of X_a converges. Here, the gradient of convergence of estimating the separated variable X_a is found to be the same as for the variable X.
Dividing the time-varying variable X_t into two parts, X_t = {X_at, X_bt}, allows us to rewrite the weight of each sample i at time t (i.e. x_at^i, x_bt^i) as

    π_t^i ∝ p(Z_t | X_at = x_at^i, X_bt = x_bt^i)    (5)
Assuming that X_bt is fixed (= X_b) allows us to estimate X_at via particle filtering with observation likelihood p(Z_t | X_at = x_at^i, X_b). If X_bt is the ground truth (= X_b,gt), the estimation of X_at by particle filtering will be accurate. Normally, X_bt is not the ground truth, so we have two different estimations of X_at based on two different observation probabilities, i.e. p(Z_t | X_a, X_b) and
p(Z_t | X_a, X_b,gt). Here, the ratio of these two observation probabilities is defined as

    g(X_a | X_b, X_b,gt) = p(Z_t | X_a, X_b,gt) / p(Z_t | X_a, X_b)    (6)

Once g(X_a | X_b, X_b,gt) is known, Eq. (5) can be rewritten as

    π_t^i ∝ p(Z_t | x_at^i, x_bt^i) = g(X_a | X_b, X_b,gt) p(Z_t | x_at^i, X_b)    (7)
Consequently, the estimation X̂_t becomes

    X̂_t = Σ_{i=1}^N π_t^i x_t^i / Σ_{i=1}^N π_t^i
        = Σ_{i=1}^N p(Z_t | x_at^i, x_bt^i) x_t^i / Σ_{i=1}^N p(Z_t | x_at^i, x_bt^i)
        ≅ Σ_{i=1}^N g(X_a | X_b, X_b,gt) p(Z_t | x_at^i, X_b) x_t^i / Σ_{i=1}^N g(X_a | X_b, X_b,gt) p(Z_t | x_at^i, X_b)    (8)
Estimating X̂_t with the fixed guess of X_bt (= X_b) requires a fixed g(X_a | X_b, X_b,gt). If g(X_a | X_b, X_b,gt) is fixed, then the observation probabilities p(Z_t | X_a, X_b) and p(Z_t | X_a, X_b,gt) are correlated. By assuming that X_bt is fixed (= X_b), the estimation X̂_at can be represented as
    X̂_at = Σ_{i=1}^N p(Z_t | X_at = x_at^i, X_b) x_at^i / Σ_{i=1}^N p(Z_t | X_at = x_at^i, X_b)    (9)
Comparing Eqs. (8) and (9) reveals that the estimation X̂_at converges in the same gradient direction as X̂_t if g(X_a | X_b, X_b,gt) is known. Next, V(X_a | Z_t, X_b), with X_a = {X_a1 … X_an}, is used to
Fig. 9. Flow diagram of SSBPF for a single iteration.
describe the gradient of the convergence of the estimation X̂_at as

    V(X_a | Z_t, X_b) = ( ∂p(Z_t | X_a, X_b)/∂X_a1, …, ∂p(Z_t | X_a, X_b)/∂X_an )    (10)
The similarity between the convergence directions of the two estimations of X̂_at (i.e., under the two conditions X_bt = X_b,gt and X_bt = X_b) is

    S(X_a | X_b, X_b,gt) = V(X_a | Z_t, X_b) · V(X_a | Z_t, X_b,gt) / ( ||V(X_a | Z_t, X_b)|| ||V(X_a | Z_t, X_b,gt)|| )    (11)
In the ideal case (e.g., g(X_a | X_b, X_b,gt) = constant), the convergence similarity S equals 1. A larger S implies that the gradients of convergence, V(X_a | Z_t, X_b) and V(X_a | Z_t, X_b,gt), are closer to each other. If the estimation X̂ converges, then with fixed X_b the estimation X̂_a also converges, and the separation of the original variable X (i.e., X = {X_a, X_b}) is applicable. Next, the constraint for a feasible separation is defined as

    ∀X_b:  ∫_{X_a1} … ∫_{X_an} S(X_a | X_b,gt, X_b) p(Z_t | X_a, X_b) dX_an … dX_a1 / ∫_{X_a1} … ∫_{X_an} p(Z_t | X_a, X_b) dX_an … dX_a1 > ε    (12)

where ε is a threshold that is experimentally determined. The state variable X can be separated into X_a and X_b if the above constraint, Eq. (12), is satisfied.
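The separation idea of Eqs. (5)–(9) can be sketched as alternating sub-state estimation on a toy 2-D problem: fix X_b, estimate X_a as a likelihood-weighted mean of candidates (Eq. 9), then fix the new X_a and estimate X_b. The sharp separable likelihood and all values below are illustrative assumptions, not the paper's hand model.

```python
import math

def likelihood(z, xa, xb):
    """Toy observation likelihood p(Z | Xa, Xb) peaking at Z = (xa, xb)."""
    return math.exp(-10.0 * ((z[0] - xa) ** 2 + (z[1] - xb) ** 2))

def estimate_substate(z, candidates, fixed, which):
    """Particle-style estimate of one substate with the other fixed (Eq. 9):
    weighted mean of candidate values under the observation likelihood."""
    if which == "a":
        ws = [likelihood(z, c, fixed) for c in candidates]
    else:
        ws = [likelihood(z, fixed, c) for c in candidates]
    total = sum(ws)
    return sum(w * c for w, c in zip(ws, candidates)) / total

z = (0.7, -0.3)                         # hypothetical observation
grid = [i / 10 for i in range(-10, 11)]  # candidate values per substate
xa, xb = 0.0, 0.0                        # initial guesses
for _ in range(3):                       # a few SSBPF-style iterations
    xa = estimate_substate(z, grid, xb, "a")   # fix Xb, refine Xa
    xb = estimate_substate(z, grid, xa, "b")   # fix Xa, refine Xb
print(round(xa, 1), round(xb, 1))        # converges near (0.7, -0.3)
```

With 21 candidates per substate, the two sub-estimates together evaluate 42 likelihoods per iteration instead of the 441 a joint grid would need, which is the sample-count saving the separation is after.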
4. Feature generation from multiple cameras
To capture the hand motion, we use a CCD camera in the frontal view and set up another CCD camera and a depth camera in the side view. The hand movement is tracked using different features for different viewing angles. In the frontal view, the silhouette is a dependable observation in the tracking process; in the side view, the depth map can be obtained and used to deal with occlusion. The silhouette feature is a highly effective means of analyzing hand gestures from the frontal view. However, finger occlusion may occur when the wrist rotates. Here, this problem is overcome using the depth feature. To gather the depth information, the depth image is captured using the Bumblebee system [25]. The finite states of finger movements with the gray-level distribution of each finger are calculated from the depth images (Fig. 6): the brighter the intensity, the closer the finger is to the camera. This information allows us to develop two main functionalities for hand tracking, i.e. finger discrimination and finger occlusion inference.
With the depth image for each hand state, a virtual 3-D hand model can be constructed based on the segmented regions of the depth image. This virtual hand model provides useful information for finger discrimination. To discriminate each finger from the depth map, we (1) check each depth image of finger motion, (2) generate each depth image's gray-level distribution, (3) calculate the mean and variance of the gray-level statistical distribution of each finger, and (4) assign a specific gray level to each finger (Fig. 7). Next, hand shape matching between the 3-D hand model and the depth image is performed using the Chamfer distance to evaluate the observation likelihood function in SSBPF.
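Steps (1)–(4) of the finger-discrimination procedure can be sketched as follows, assuming per-finger segmentation masks are already available; the tiny depth image and masks below are hypothetical placeholders for real Bumblebee data.

```python
import numpy as np

def finger_gray_levels(depth_image, finger_masks):
    """Per-finger gray-level statistics on a depth image: compute the
    mean and variance of each finger's depth gray levels and assign the
    mean as its representative gray level (brighter = closer)."""
    stats = {}
    for name, mask in finger_masks.items():
        vals = depth_image[mask]
        stats[name] = {"mean": float(vals.mean()),
                       "var": float(vals.var()),
                       "assigned_gray": int(round(vals.mean()))}
    return stats

# Hypothetical 4x4 depth image and two finger masks.
depth = np.array([[200, 200, 50, 50],
                  [200, 200, 50, 50],
                  [210, 210, 60, 60],
                  [210, 210, 60, 60]], dtype=np.uint8)
masks = {"index": depth > 100, "middle": depth <= 100}
stats = finger_gray_levels(depth, masks)
print(stats["index"]["assigned_gray"])   # 205 -> the index finger is closer
```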
To solve the finger occlusion problem, we (1) use Canny edge detection to obtain the edge image, (2) gather the corresponding depth image, (3) threshold the depth image to segment the brighter region, which represents distances close to the camera, and (4) overlap the Canny edge image onto the segmented depth image. Fig. 8 shows the results of these steps. Overlapping the edge image (Fig. 8(c)) onto the depth image (Fig. 8(b)) allows us to segment the finger regions that are close to the camera. In Fig. 8(d), the brighter region on the right side arises because the ring finger is occluded by the middle finger. Therefore, this inference can be used to overcome the finger occlusion problem. The finger occlusion inference is integrated into the observation likelihood to perform the mapping between the target and the depth image.
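The four occlusion-inference steps can be sketched with numpy alone; a simple gradient magnitude stands in for the Canny detector used in the paper (an assumption made so the sketch stays dependency-free), and the images are synthetic.

```python
import numpy as np

def occlusion_overlay(gray, depth, depth_thresh):
    """Occlusion-inference sketch: (1) build an edge map (gradient
    magnitude here, standing in for Canny), (2)-(3) threshold the depth
    image to keep the bright region close to the camera, and (4) overlap
    the edge map onto that region."""
    gy, gx = np.gradient(gray.astype(float))
    edges = np.hypot(gx, gy) > 0          # crude edge map
    near = depth > depth_thresh           # brighter = closer to the camera
    return edges & near                   # edge pixels on the near finger only

gray = np.zeros((5, 5)); gray[:, 2] = 255        # one vertical edge
depth = np.zeros((5, 5)); depth[:, :3] = 200     # left half is near
overlay = occlusion_overlay(gray, depth, depth_thresh=100)
print(int(overlay.sum()))   # only the near-side edge column survives
```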
Fig. 10. Pseudo-code of SSBPF.
5. Hand tracking with multiple features fusion
The hand wrist rotation and the motion parameters of each finger can be tracked based on the depth, edge, and silhouette features of the hand. The occlusion problems that occur under different view angles can also be resolved.
5.1. Estimating the wrist rotation angles
The hand rotation angles are detected by segmenting the palm, which is a rigid part of the hand. Here, the palm is separated from the hand region by using the segmentation method of [26]. Next, by applying ellipse fitting to the palm region, five parameters are derived, i.e. the location of the centroid, the lengths of the major and minor axes, and the inclination angle. The ellipse parameters change when the hand rotates. First, the hand rotation direction is estimated using the motion history intensity (MHI). The rotation angle θ is determined as

    θ(F_t) = MHI(F_t) · ERatio(F_t)    (13)

where F_t is the frame number index, MHI(F_t) denotes the hand rotation direction, and ERatio(F_t) represents the directionless rotation angle (|θ|), which is determined by the ratio of the major and minor axes of the ellipse. Here, ERatio(F_t) is quantized into 4 levels as
    ERatio(F_t) = 0°   if Ratio < 0.25
                = 20°  if 0.25 ≤ Ratio < 0.50
                = 40°  if 0.50 ≤ Ratio < 0.75
                = 80°  if 0.75 ≤ Ratio ≤ 1    (14)

where Ratio = {the length of the major axis}/{the length of the minor axis}.
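Eqs. (13) and (14) can be sketched directly. Note that the quantization thresholds imply a ratio in [0, 1] (i.e. minor/major), so the sketch assumes that normalization; the function names and the ±1 encoding of the MHI direction are illustrative.

```python
def eratio(ratio):
    """Eq. (14): quantize the palm-ellipse axis ratio into four rotation
    angle levels (degrees). The thresholds imply a ratio in [0, 1]."""
    if ratio < 0.25:
        return 0
    elif ratio < 0.50:
        return 20
    elif ratio < 0.75:
        return 40
    else:
        return 80

def rotation_angle(mhi_direction, ratio):
    """Eq. (13): signed rotation angle = MHI direction (+1/-1) * ERatio."""
    return mhi_direction * eratio(ratio)

print(rotation_angle(+1, 0.6))   # -> 40
print(rotation_angle(-1, 0.9))   # -> -80
```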
5.2. Hand tracking with SSBPF
The hand motion is tracked by decomposing the hand motion parameter X into two parts, i.e. R_1,…,R_m and X_1,…,X_m. R_1,…,R_m, which include global information such as the translations and orientations of the entire hand, must be updated first. Let Z_t denote the observation from the original video data
and X_t denote the states in the SSBPF. Fig. 9 shows the flow diagram of a single iteration. For each iteration, the state estimation is implemented by estimating both substate R_p
and substate X_m. Fig. 10 summarizes the algorithm of the entire SSBPF process. More iterations generally imply more accurate results.

Fig. 11. Hand wrist rotation and the reconstructed model's edges overlaid: (a) frontal view, (b) and (c) rotating clockwise, (d) and (e) rotating counterclockwise.

Fig. 12. The angle estimation of hand wrist rotation. The green line represents the ratio of the major and minor axes; the red line represents the MHI direction of hand wrist rotation; the blue line represents the estimated hand wrist rotation angle. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 13. Reconstructed models for cases combining a rotating wrist with bent fingers.
Table 1. Comparison of particle filter and SSBPF.

Method                               Particles × Iterations   Average error (deg.)
Particle filter (CONDENSATION [21])  6000 × 1                 19.75317
                                     12 000 × 1               19.71322
                                     30 000 × 1               17.41982
SSBPF                                1000 × 1                 12.72792
                                     250 × 4                  11.22609

Fig. 14. Error versus the number of iterations for fixed processing time.
5.3. Observation model
The hand region is located based on two observations, i.e., the hand silhouette for the frontal view and the Chamfer distance for the side view. The hand silhouette is extracted from the frontal view by using the HSI color space to detect skin color. From the side view, the similarity between the constructed hand template and the observed hand image is determined using the Chamfer distance [27]. The minimum distance between the hand image and the hand template is represented by a distance image, and the Chamfer score is derived by averaging the distances in the distance image.
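A minimal Chamfer-matching sketch: build a distance image (distance transform) of the observed edges and average its values over the template's edge pixels. The brute-force transform below is for illustration on tiny arrays only; a real system would use an optimized two-pass or library distance transform.

```python
import numpy as np

def distance_transform(binary):
    """Brute-force Euclidean distance transform: for every pixel, the
    distance to the nearest foreground (edge) pixel."""
    ys, xs = np.nonzero(binary)
    pts = np.stack([ys, xs], axis=1)
    h, w = binary.shape
    grid = np.indices((h, w)).reshape(2, -1).T            # all pixel coords
    d = np.linalg.norm(grid[:, None, :] - pts[None, :, :], axis=2)
    return d.min(axis=1).reshape(h, w)

def chamfer_score(template_edges, image_edges):
    """Average of the image's distance-transform values sampled at the
    template's edge pixels; a low score means good alignment."""
    dt = distance_transform(image_edges)
    return float(dt[template_edges].mean())

img = np.zeros((6, 6), bool); img[:, 2] = True          # observed edge
tmpl_aligned = img.copy()
tmpl_shifted = np.zeros((6, 6), bool); tmpl_shifted[:, 4] = True
print(chamfer_score(tmpl_aligned, img))   # -> 0.0 (perfect match)
print(chamfer_score(tmpl_shifted, img))   # -> 2.0 (two pixels off)
```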
The hand silhouettes from the frontal and the side views are used for hand tracking. Based on the two observations from two different views, two observation likelihood functions are obtained, i.e. p_frontal(Z_silh | X) and p_side(Z_DT | X), which are calculated by measuring the difference between the target image and the candidate model. The similarity between two binary hand silhouette images I_1 and I_2 with the same width and height is defined as

    D_silh(I_1, I_2) = Σ_x Σ_y I_1(x, y) ⊕ I_2(x, y)    (15)
where ⊕ represents the XOR operation. The observation likelihood can then be defined as

    p_frontal(Z_silh | X) = (1 / (√(2π) σ_frontal)) exp[ −(1/2) (D_silh(Z_silh, C(X)) / σ_frontal)² ]    (16)

where Z_silh denotes the observed hand image, and C(X) refers to the candidate hand model.
Similarly, the side likelihood function p_side(Z_DT | X) can be modeled by the Chamfer distance between the target image and the candidate model. Let the distance between the distance transform I_DT and the candidate's edge image I_edge be defined as

    D_DT(I_DT, I_edge) = Σ_x Σ_y I_DT(x, y) · I_edge(x, y)    (17)
Therefore, the observation likelihood is defined as

    p_side(Z_DT | X) = (1 / (√(2π) σ_DT)) exp[ −(1/2) (D_DT(Z_DT, C(X)) / σ_DT)² ]    (18)

where Z_DT denotes the observed hand image and C(X) is the candidate hand model with a given X.
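Eqs. (15) and (16) can be sketched directly on binary arrays; the spread σ_frontal and the synthetic silhouettes below are hypothetical values, since the paper does not report its σ settings.

```python
import math
import numpy as np

def silhouette_distance(i1, i2):
    """Eq. (15): D_silh = number of pixels where the two binary
    silhouettes disagree (pixel-wise XOR, then sum)."""
    return int(np.logical_xor(i1, i2).sum())

def frontal_likelihood(z_silh, candidate, sigma=10.0):
    """Eq. (16): Gaussian likelihood of the silhouette distance between
    the observed silhouette Z_silh and the projected candidate model
    C(X). sigma is a hypothetical spread parameter."""
    d = silhouette_distance(z_silh, candidate)
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
           math.exp(-0.5 * (d / sigma) ** 2)

obs = np.zeros((8, 8), bool); obs[2:6, 2:6] = True     # observed hand blob
good = obs.copy()                                       # matching candidate
bad = np.zeros((8, 8), bool); bad[0:4, 0:4] = True      # misaligned candidate
print(frontal_likelihood(obs, good) > frontal_likelihood(obs, bad))  # True
```

The same Gaussian form gives Eq. (18) for the side view once D_silh is swapped for the Chamfer distance D_DT.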
5.4. Finger occlusion
Finger occlusion is the most serious challenge in estimating the 3-D hand parameters. This work develops a joint angle correlation and a finger depth consistency measure to overcome the finger occlusion problem.
(1) Frontal-view-based observation probability function: With the structural and kinematics constraints of an ordinary human hand, the observation likelihood in the frontal view is written as

    p_frontal(Z | X) = p_s(X_silh) p_k(X_silh) p_frontal(Z_silh | X)    (19)

where p_s(X_silh) and p_k(X_silh) are the two constraint probability functions described in Eqs. (2) and (4), respectively. Once occlusion occurs, two different hand postures may show the same silhouette. Therefore, the frontal-view occlusion problem is solved by using these two constraints.
(2) Side-view-based observation probability function: Based on the depth information, the observation likelihood in the side view can be expressed as

    p_side(Z | X) = p_depth(Z | X_i^depth, X_j^depth) p_side(Z_DT | X)    (20)

where the likelihood function p_depth(Z | X_i^depth, X_j^depth) is based on the correlation between two different finger joints (i.e., X_i^depth and X_j^depth), which is proportional to the overlapped area between the real hand image and the depth image,

    p_depth(Z | X_i^depth, X_j^depth) ∝ exp( −(A_dep ∩ A_ori) / (2σ²) )    (21)

where A_dep represents the depth image and A_ori denotes the original image. The finger state in the frontal view is estimated using Eq. (19), while the finger state in the side view is estimated using Eq. (20).
6. Experimental results
The experiments involve the use of a depth camera and two ordinary CCD cameras to capture the hand motion with the same initial gesture. Each video sequence consists of more than one thousand frames at a frame rate of 25 FPS. Hand articulations are tracked while the palm is treated as a rigid body. Cases of a skewed palm are not considered, since a skewed palm does not fit our 3-D hand model.
6.1. Estimation of hand wrist rotation
The reconstructed 3-D hand model image is overlapped with the original image to illustrate the results of our motion capturing system (Fig. 11). Fig. 12 shows the estimated hand wrist rotation
Fig. 15. Five-finger bending trends. The green line shows the ground truth captured from the DataGlove and the blue line indicates the motion capturing results (normalized value: 1 = 90°, 0.8 = 72°, 0.6 = 54°, 0.4 = 36°, 0.2 = 18°). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 16. The source and estimation results of the sequence with DataGlove: (a), (b) and (c) represent three different states.
angle. Fig. 13 indicates that the proposed system works well for cases combining a rotating wrist with bent fingers.
6.2. Comparison between particle filter and SSBPF
The particle filter (CONDENSATION [21]) and SSBPF are compared by considering the captured images of only one camera. The same 3-D hand model with a 21-DOF particle filter is used with different particle numbers. Table 1 reveals that, with the same processing time, the average error of the original particle filter is nearly twice that of the proposed method. Table 1 summarizes the accuracy of the two methods. According to these results, increasing the number of particles does not conspicuously enhance the accuracy of the original particle filter. Moreover, different settings with the same processing time in the proposed method may decrease the average error by 10%. The average error of the experiments is determined as

    Error = (1/5) Σ_{i=0}^{4} |Finger_i^Est − Finger_i^GT|    (22)

where Finger^Est represents the estimation results, Finger^GT represents the ground truth, and i indexes the fingers. Fig. 14 illustrates the accuracy after different numbers of iterations with the same processing time. This figure also indicates that the proposed system performs excellently, converging within five iterations.
6.3. Comparison with the ground truth
This section presents the estimated finger movement parameters and the ground truth gathered by the DataGlove. Fig. 15 indicates the difference between the ground truth and the results of motion
Fig. 18. The source and estimation results of the side-view sequence without DataGlove. Left: the source image. Middle: the reconstructed hand model. Right: the silhouette overlapped with the estimation result.
Fig. 17. The source and estimation results of the frontal-view sequence without DataGlove. Left: the source image. Middle: the reconstructed hand model. Right: the silhouette overlapped with the estimation result.
capturing. The green line represents the sequence of each finger's bending joint angle, while the blue line denotes the results of SSBPF. Even when the ground truth has substantial variations, the proposed system can still estimate and generate similar results effectively. Additionally, according to the experimental results, the error between the estimated hand joint angle and the ground truth is less than 19°. Two video sequences are used: one with the DataGlove (Fig. 16) and one without (Figs. 17 and 18). The ground truth is collected using the DataGlove.
Experimental results indicate that the proposed system functions properly on the frontal-view sequence. Compared with our method, the model in [19] computes more efficiently, using fewer particles than ours. However, their system only deals with frontal-view sequences and does not address the occlusion problems in side-view images. Although [19] offered quantitative evaluations to demonstrate the accuracy of their algorithm, their numerical results cannot be directly compared with ours: to compare with the ground truth, Ying et al. [19] performed validation experiments using an animated hand motion sequence, which differs from the real hand video sequences used in our experiments.
Fig. 18 demonstrates that the proposed method can overcome the finger occlusion problems that often occur in side-view sequences. The fusion of different features allows us to track the hand articulations. Additionally, using the structural and kinematics constraints, as well as the depth image, allows us to resolve the occlusion problems under different view angles. Unlike the traditional particle filter, the entire sample space is separated into several subsets to increase the tracking efficiency. Furthermore, when the palm does not face the camera, the hand posture under finger occlusion is still estimated successfully.
7. Conclusion and future work
This work presents a novel hand motion capturing system to track a rotating hand with finger movements. A novel method called separable state based particle filtering (SSBPF) is developed to reduce the high-dimensional state space when estimating the hand motion. Combining different features such as silhouette, depth and Chamfer distance allows us to track the hand motion efficiently and solve the finger occlusion problems. The proposed system is highly promising for many applications, including cooperative presentation and intelligent human–computer interface systems.
The proposed method assumes that the hand wrist rotates within restricted angles. Efforts are underway to increase the flexibility of the proposed method, handle more cluttered backgrounds, and integrate a 2-D appearance-based method with the proposed method. Appearance-based methods can provide a better initial guess, as well as reduce the number of required samples without sacrificing accuracy.
References

[1] M. Bray, E. Koller-Meier, L. Van Gool, Smart particle filtering for 3D hand tracking, in: Proceedings of the IEEE FGR, 2004, pp. 675–680.
[2] M. Bray, E. Koller-Meier, L. Van Gool, Smart particle filtering for high-dimensional tracking, Computer Vision and Image Understanding 106 (2007) 116–129.
[3] D. Huan, E. Charbon, 3D hand model fitting for virtual keyboard system, in: IEEE Workshop on Applications of Computer Vision, 2007, p. 31.
[4] W. Ying, J.Y. Lin, T.S. Huang, Capturing natural hand articulation, ICCV 2 (2001) 426–432.
[5] A. Erol, G. Bebis, M. Nicolescu, R.D. Boyle, X. Twombly, A review on vision-based full DOF hand motion estimation, CVPR (2005) 75.
[6] A. El-Sawah, C. Joslin, N.D. Georganas, E.M. Petriu, A framework for 3D hand tracking and gesture recognition using elements of genetic programming, in: Proceedings of the IEEE CRV, 2007, pp. 495–502.
[7] A. Erol, G. Bebis, M. Nicolescu, R.D. Boyle, X. Twombly, Vision-based hand pose estimation: a review, Computer Vision and Image Understanding 108 (2007) 52–73.
[8] DataGlove, 5DT Fifth Dimension Technologies, <http://www.5dt.com/products/pdataglove14.html>.
[9] D.J. Sturman, D. Zeltzer, A survey of glove-based input, IEEE Computer Graphics and Applications 14 (1994) 30–39.
[10] R.G. O'Hagan, A. Zelinsky, S. Rougeaux, Visual gesture interfaces for virtual environments, Interacting with Computers 14 (2002) 231–250.
[11] C.-C. Lien, A scalable model-based hand posture analysis system, Machine Vision and Applications 16 (2005) 157–169.
[12] W. Ying, T.S. Huang, View-independent recognition of hand postures, CVPR 2 (2000) 88–94.
[13] V. Athitsos, S. Sclaroff, Estimating 3D hand pose from a cluttered image, CVPR 2 (2003) I-432–9.
[14] V. Athitsos, S. Sclaroff, An appearance-based framework for 3D hand shape classification and camera viewpoint estimation, in: Proceedings of the IEEE FGR, 2002, pp. 40–45.
[15] A. Imai, N. Shimada, Y. Shirai, 3-D hand posture recognition by training contour variation, in: Proceedings of the IEEE FGR, 2004, pp. 895–900.
[16] C. Wen-Yan, C. Chu-Song, H. Yi-Ping, Appearance-guided particle filtering for articulated hand tracking, CVPR 1 (2005) 235–242.
[17] M. Bray, E. Koller-Meier, N.N. Schraudolph, L. Van Gool, Fast stochastic optimization for articulated structure tracking, Image and Vision Computing 25 (2007) 352–364.
[18] E.B. Sudderth, M.I. Mandel, W.T. Freeman, A.S. Willsky, Visual hand tracking using nonparametric belief propagation, CVPR (2004) 189.
[19] W. Ying, J. Lin, T.S. Huang, Analyzing and capturing articulated hand motion in image sequences, IEEE Transactions on PAMI 27 (2005) 1910–1922.
[20] OpenGL. Available: <http://www.opengl.org>.
[21] M. Isard, A. Blake, CONDENSATION—conditional density propagation for visual tracking, International Journal of Computer Vision 29 (1998) 5–28.
[22] M. Isard, A. Blake, ICONDENSATION: unifying low-level and high-level tracking in a stochastic framework, in: Proceedings of the ECCV, 1998, pp. 893–908.
[23] K. Nummiaro, E. Koller-Meier, L. Van Gool, An adaptive color-based particle filter, Image and Vision Computing 21 (2003) 99–110.
[24] Ching-Yu Weng, A vision-based hand motion parameter capturing system, Master's Thesis, Institute of Electrical Engineering, National Tsing Hua University, July 2007.
[25] Bumblebee2, Point Grey Research, <http://www.ptgrey.com/products/stereo.asp>.
[26] G. Amayeh, G. Bebis, A. Erol, M. Nicolescu, A component-based approach to hand verification, CVPR (2007) 1–8.
[27] B. Stenger, A. Thayananthan, P.H.S. Torr, R. Cipolla, Estimating 3D hand pose using hierarchical multi-label classification, Image and Vision Computing 25 (2007) 1885–1894.
Meng-Fen Ho received her B.S. degree in Electrical Engineering from National Tsing-Hua University, Hsin-Chu, Taiwan, ROC, in 1992, and her M.S. degree in Electrical Engineering from Iowa State University, USA, in 1994. Currently, she is a lecturer in the Department of Electronic Engineering, Hsiuping Institute of Technology, Taichung, Taiwan. She is also working toward her Ph.D. degree in Electrical Engineering at National Tsing-Hua University, Hsin-Chu, Taiwan. Her research interests are in the areas of image processing and computer vision.
Chuan-Yu Tseng received his B.S. degree in Electrical Engineering from Chung Yuan Christian University, Chung-Li, Taiwan, ROC, in 2006, and his M.S. degree in Electrical Engineering from National Tsing-Hua University, Hsin-Chu, Taiwan, in 2008. Currently, he works for Formosa Plastics Group, Taiwan, as an engineer. His research interests are in the areas of image processing and computer vision.
Cheng-Chang Lien received the B.S. degree from the Electrical Engineering Department, Chung Yuan University, Tauyuan, Taiwan, in 1987. He received the M.S. and Ph.D. degrees from the Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, in June 1992 and September 1997, respectively. Currently, he is an associate professor in the Department of Computer Science and Information Engineering, Chung-Hua University, Hsinchu, Taiwan. His research interests are in the areas of multimedia signal processing, pattern recognition, and computer vision.
Chung-Lin Huang received his B.S. degree in Nuclear Engineering from National Tsing-Hua University, Hsin-Chu, Taiwan, ROC, in 1977, and his M.S. degree in Electrical Engineering from National Taiwan University, Taipei, Taiwan, ROC, in 1979. He obtained his Ph.D. degree in Electrical Engineering from the University of Florida, Gainesville, FL, USA, in 1987. From 1987 to 1988, he worked for the Unisys Co., Orange County, CA, USA, as a project engineer. Since August 1988 he has been with the Electrical Engineering Department, National Tsing-Hua University, Hsin-Chu, Taiwan, ROC, where he is currently a professor. In 1993 and 1994, he received the Distinguished Research Awards from the National Science Council, Taiwan, ROC. In November 1993, he received the Best Paper Award from the ACCV, Osaka, Japan, and in August 1996, he received the Best Paper Award from the CVGIP Society, Taiwan, ROC. In December 1997, he received the Best Paper Award from the IEEE ISMIP Conference held in Academia Sinica, Taipei. In 2002, he received the Best Paper Annual Award from the Journal of Information Science and Engineering, Academia Sinica, Taiwan. His research interests are in the areas of image processing, computer vision, and visual communication. He is a senior member of IEEE.