Automatic skin segmentation and tracking in sign language recognition




Published in IET Computer Vision, 2009, Vol. 3, Iss. 1, pp. 24–35. Received on 30th January 2008; revised on 22nd July 2008. doi: 10.1049/iet-cvi:20080006. © The Institution of Engineering and Technology 2009

ISSN 1751-9632

Automatic skin segmentation and tracking in sign language recognition
J. Han 1, G. Awad 2, A. Sutherland 3

1 School of Computing, University of Dundee, Dundee, DD1 4HN, UK
2 Information Technology Laboratory, National Institute for Standards & Technology, Gaithersburg, MD 20899-8940, USA
3 School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland
E-mail: [email protected]

Abstract: Skin segmentation and tracking play an important role in sign language recognition. A framework for segmenting and tracking skin objects from signing videos is described. It mainly consists of two parts: a skin colour model and a skin object tracking system. The skin colour model is first built based on the combination of support vector machine active learning and region segmentation. Then, the obtained skin colour model is integrated with motion and position information to perform segmentation and tracking. The tracking system is able to predict occlusions among any of the skin objects using a Kalman filter (KF). Moreover, the skin colour model can be updated with the help of tracking to handle illumination variation. Experimental evaluations using real-world gesture videos and comparison with other existing algorithms demonstrate the effectiveness of the proposed work.

1 Introduction

As a kind of gesture, sign language (SL) is the primary communication medium for hearing-impaired people. In recent years, SL recognition research has gained a lot of attention [1–5]. The first step of an SL recognition system is skin segmentation and tracking (SST), which is to acquire and locate the hands and face across a video sequence. According to the means of performing SST, SL recognition systems can be classified into two groups: glove-based [1] and vision-based [2, 3, 5] systems. The former group requires users to wear data gloves or colour gloves. The glove enables the system to avoid or simplify the segmentation and tracking task. However, users have to carry a hardware device, which sometimes prevents them from performing gestures accurately. Moreover, the glove-based methods may lose the facial expression information, which is important for SL recognition. In contrast, the vision-based methods rely on computer vision techniques without requiring gloves, which is easier for users. However, one difficulty is how to accurately segment and track the hands and face.

In order to produce high-quality SST, two techniques should be developed: a powerful skin colour model and a robust tracker. The skin colour model offers an effective way to detect and segment skin pixels. It should be able to handle illumination and human skin variations. The tracker is responsible for locating skin objects. It should be capable of predicting occlusions, which usually happen in real-world SL conversations.

This paper aims to solve SST issues for SL applications. To achieve precise skin segmentation, we introduce a novel skin colour model integrating support vector machine (SVM) active learning and region segmentation. This model consists of two stages: the training stage and the segmentation stage. In the training stage, first, for the given gesture video, a generic skin colour model is applied to several frames, which yields the initial skin areas. Afterwards, a binary classifier based on SVM active learning is trained using the obtained initial skin pixels. In the segmentation stage, the SVM classifier is incorporated with region information to yield the final skin colour pixels. The contribution of the proposed skin colour model is 2-fold. First, the SVM classifier is trained using data automatically collected from every signing video, which makes it adaptive to different human skin colours and lighting conditions. The model can also be updated with the help of the tracker to deal with illumination variation.


Secondly, active learning is employed to select the most informative training subset for the SVM, which leads to fast convergence and better performance. Moreover, region information is adopted to reduce the effect of noise and illumination variation.

As for the tracker, we extend the previous work [6] of our group in three ways. First, the work in [6] used colour gloves to avoid the segmentation issue, whereas we are more interested in improving SL recognition in natural conversation. Three features – skin colour, motion and position – are taken into account to perform accurate skin object segmentation. Additionally, the work in [6] tracked only two hands wearing colour gloves, whereas the proposed work can segment and track the two hands and the face; the obtained face information could definitely facilitate recognition. Secondly, we apply a Kalman filter (KF) to predict occlusions, following [6]. Nevertheless, our KF is based on skin colour, instead of colour gloves. Thirdly, in the proposed work, tracking and segmentation are approached as one unified problem, in which tracking helps to reduce the search space used in segmentation, and good segmentation helps to enhance the tracking performance.

The rest of the paper is organised as follows. In Section 2, we discuss some related work. A novel skin colour model is presented in Section 3. In Section 4, an SST system is introduced. The experimental results are shown in Section 5. Finally, conclusions are given in Section 6.

2 Related work

2.1 Related work in skin colour model

The principal idea of most existing skin colour models is based on the assumption that skin colour is quite different from the colours of other objects and that its distribution forms a cluster in some specific colour spaces. Basically, research activities in skin segmentation have progressed in two directions: generic colour models [7, 8] and statistical colour models [9–12]. The so-called generic colour model defines a fixed colour range to separate skin from non-skin pixels. Generally, the fixed colour range is derived empirically using some collected training instances. The use of generic colour models alone cannot handle illumination and human skin variations.

To reduce these limitations, the statistical colour model has been proposed to classify skin pixels. First, the skin colour distribution is estimated by a Gaussian or a histogram-based technique using a plethora of training data. Then, a Bayesian classifier or other learning algorithm is applied to classify the skin and non-skin pixels. The statistical colour model is trained on the basis of a huge training data set that covers different human skin appearances. Consequently, it can work well for average users. However, one drawback is that it often works under the premise that the skin colour distribution is a single Gaussian or a mixture of Gaussians.


Unfortunately, this assumption is not satisfied in many cases. The other disadvantage is that it requires a large amount of training data, and it is very expensive to collect so many training instances. Recently, adaptive skin colour models [13–16] have attracted increasing research attention. Technically, the adaptive model is a kind of statistical colour model, but it emphasises how to model skin colour under conditions of varying illumination. McKenna et al. [13] and Wu and Huang [14] first formulated the skin colour distribution as a Gaussian mixture model (GMM). Then, the parameters of the GMM were dynamically updated over time by combining prediction from the previously tracked skin objects with learning algorithms. Soriano et al. [15] applied adaptive histogram backprojection to update the skin colour model. The work of [16] made use of a dynamic Markov model, whose parameters could be estimated by maximum likelihood (ML) over time, to predict the evolution of the skin colour histogram. Another interesting work [17] proposed a colour correction scheme to accommodate illumination variations. Instead of adapting the colour model, it learns a mapping from image colours under unknown illumination to a reference colour model under a known illumination using neural networks. In this approach, the reference colour model is static but the mapping needs to be updated across frames. In [18], a physical model of skin reflectance was presented, based on the observation that the main variation in skin colour occurs in intensity rather than in chromaticity space. This model tends to be more robust to varying illumination. However, its performance also relies on some prior knowledge, such as camera and illumination parameters.

2.2 Related work in hand tracking

In recent years, vision-based hand tracking has drawn a lot of attention. Yang et al. [3] implemented hand tracking based on region matching using affine transformations. The work of [5] combined multiple features to locate the hands; however, one shortcoming is that, judging from their experiments, it works well only for a single hand. Sherrah and Gong [19] developed a system to track the face and hands using Bayesian inference. Its probabilistic reasoning model fused multiple cues derived from high-level contextual knowledge and sensor-level features. Although its tracker is effective, this work does not address accurate hand segmentation. Some related works [20–22] tried to use the KF to track hands. The CONDENSATION algorithm proposed by Isard and Blake [23] introduced a robust Bayesian framework to track curves in clutter. Mammen et al. [24] extended CONDENSATION to track both hands simultaneously. In fact, the original part of the CONDENSATION algorithm is the particle filtering technique, which implements a recursive Bayesian filter by Monte Carlo simulation. Recently, Shan et al. [25] reported a hand tracker combining mean shift and particle filtering. Although computationally expensive, these approaches can achieve accurate tracking results. Another interesting research stream, called model-based hand tracking, was reported in [26, 27]. Starner [2] and Wren et al.


[27] developed the Pfinder system to track people and hands using 2D models. The 2D models were built based on prior knowledge about the colour and shape of the human body. The DigitEyes system [26] and the work of [28] tracked hands with a 3D hand model. Lu et al. [29] introduced a deformable model for hand tracking. Many model-based approaches are effective under the assumption of a known hand shape. Additionally, model-based approaches usually tend to be more time-consuming and require more features.

3 Skin colour model

3.1 Initial segmentation by the generic skin model

The purpose of the initial segmentation is to collect training data for the SVM classifier, which is implemented by a generic skin model defining a fixed colour range in one colour space. Technically, any generic skin model could be used in our work. Here, we adopt the model presented in [7].

3.2 SVM active learning for skin segmentation

SVMs were invented by Vapnik [30]. For SVMs in a binary classification setting, given a linearly separable training set {x_1, x_2, ..., x_i, ..., x_n} with labels {y_1, y_2, ..., y_i, ..., y_n}, y_i ∈ {−1, 1}, an optimal hyperplane is obtained by solving the optimisation problem

Minimise: Φ(w) = (1/2) ‖w‖²
Subject to: y_i (w · x_i + b) ≥ 1    (1)

Here

w = Σ_i α_i y_i x_i    (2)

When a new example is input for classification, a label is issued according to its position relative to the hyperplane, that is

f(x) = sign( Σ_i α_i y_i (x_i · x) + b )    (3)

For the case where the data are not linearly separable, the decision boundary is modified by a Mercer kernel function K

f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )    (4)

The SVM can be easily applied to the skin colour model. Given a gesture video, the generic skin model is performed on the first several frames so that a training set containing skin and non-skin data can be obtained. Afterwards, the SVM classifier is constructed using the obtained training set and used to segment the future frames one by one.

he Institution of Engineering and Technology 2009

In practice, one problem is the imbalance in the training data: the number of negative examples (non-skin pixels) is far larger than the number of positive examples (skin pixels). Fig. 1 illustrates this. The left-hand picture is the original image and the right-hand one is the segmented result. In the right-hand image, the parts shaded black are the skin pixels and the other parts are the non-skin pixels. The imbalance in training examples may make learning less reliable; moreover, it results in a long learning time. A feasible way to reduce this limitation is to use active learning. Active learning is named in contrast to traditional passive learning. Most machine learning approaches are a form of passive learning because they are usually based on the entire training set or randomly selected data [31]. In contrast, active learning tries to find the most informative data to train the classifier. Its goal is to achieve better performance and faster convergence with fewer training examples. Lately, active learning has been successfully introduced to document classification [31] and image retrieval [32], whereas, to the best of our knowledge, very little work has been done in the skin segmentation field.

The key idea of active learning is to extract the most informative samples from all the available training data. Tong and Chang [32] explained active learning from the viewpoint of version space theory. The version space was defined as the set of all hyperplanes that classify the given training data correctly. For a newly labelled sample, all hyperplanes in the version space are used to classify the new data again; those hyperplanes that make the wrong classification are removed from the version space, so that its size is reduced. The informative samples are those able to reduce the size of the version space as much as possible. Based on this idea, Tong and Chang [32] found that informative samples always lie near the hyperplane.

In our application, we attempt to find a small but informative subset of negative examples with a similar size to the training set of positive examples. The instances closer to the SVM hyperplane generally have a larger influence on the learning, so they are more informative than other instances. This motivates us to design a similarity-based sampling strategy to select more informative negative examples for our specific application.

Let F be the training set, and let F^+ and F^- be the positive (skin pixel) and negative (non-skin pixel) example sets, respectively.

Figure 1 Imbalance of training examples


Then F = F^+ ∪ F^- and F^+ ∩ F^- = ∅. We hope to obtain a small subset of negative examples F^-_Active by active learning, where F^-_Active ⊂ F^-. First, a region segmentation scheme, JSEG [33], which grows regions based on colour-texture homogeneity, is employed to segment F^- into different regions R^-_1, R^-_2, ..., R^-_N, with ∪_{i=1}^{N} R^-_i = F^-. Secondly, the similarities between each R^-_i and F^+ are described by colour histogram-based distances. More specifically

D(R^-_i, F^+) = ‖H(R^-_i) − H(F^+)‖    (5)

where H(·) is the colour histogram vector. The smaller the distance between R^-_i and F^+, the more similar they are. In the feature space, the skin pixels F^+ generally form a cluster. If a negative instance is closer to F^+, it is also closer to the SVM hyperplane; therefore it is more likely to be an informative example. Finally, we sample the negative examples according to a principle called 'most similar highest priority'. To be specific, more negative instances are extracted from the regions R^-_i with smaller distances to F^+, and fewer negative instances are selected from the regions R^-_j with larger distances to F^+. The sampled examples construct F^-_Active, and its size is approximately equal to the size of F^+. In general, any similarity-based sampling scheme can be employed. Here, we designed a simple yet effective way to do the sampling. Let md be the minimum colour histogram distance such that

md = min_i (d_i), where d_i = D(R^-_i, F^+)    (6)

We pick from each region R^-_i a ratio of pixels inversely proportional to its distance to F^+. This ratio r_i is defined as

r_i = md / d_i    (7)

The advantage of the proposed sampling strategy is that not only can it obtain more informative examples, but the obtained F^-_Active also covers all kinds of negative examples from different regions.
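The sampling of (5)–(7) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the region arrays stand in for JSEG output, and the histogram binning (8 bins per channel) and random seed are our own assumptions.

```python
import numpy as np

def sample_negatives(pos_pixels, neg_regions, bins=8, rng=None):
    """'Most similar highest priority' sampling, eqs. (5)-(7).

    pos_pixels  : (P, 3) array of skin-pixel colours (F+).
    neg_regions : list of (M_i, 3) arrays, one per segmented region of F-.
    Returns the sampled negative set F-_Active as one array.
    """
    rng = rng or np.random.default_rng(0)

    def hist(pixels):  # normalised colour histogram H(.)
        h, _ = np.histogramdd(pixels, bins=bins, range=[(0, 256)] * 3)
        return h.ravel() / max(len(pixels), 1)

    h_pos = hist(pos_pixels)
    d = np.array([np.linalg.norm(hist(r) - h_pos) for r in neg_regions])  # eq. (5)
    d = np.maximum(d, 1e-12)          # guard against zero distance
    md = d.min()                      # eq. (6)
    ratios = md / d                   # eq. (7): closer regions contribute more samples
    picked = []
    for region, ratio in zip(neg_regions, ratios):
        k = max(1, int(round(ratio * len(region))))
        idx = rng.choice(len(region), size=min(k, len(region)), replace=False)
        picked.append(region[idx])
    return np.vstack(picked)
```

Note that every region contributes at least one sample, so F^-_Active still covers all kinds of negative examples, as the text requires.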

In summary, the skin colour model using SVM active learning consists of the following steps.

1. Apply the generic skin model to the first several frames to obtain F^+ and F^-.

2. Segment F^- into different regions R^-_1, R^-_2, ..., R^-_N, and compute the distances between each R^-_i and F^+ in the colour feature space.

3. Construct F^-_Active from F^- in accordance with the similarity-based sampling scheme.

4. Train the binary SVM classifier using F^+ and F^-_Active.

5. Classify every current frame into skin and non-skin pixels by the trained SVM.


In Section 4, the operation over the whole frame is changed to operate over search windows.
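The training and classification steps above can be illustrated with a small stand-in classifier. The sketch below trains a linear SVM by a Pegasos-style sub-gradient method on colour features instead of the kernel SVM of (4); the regularisation constant, epoch count and [0, 1] feature scaling are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style sub-gradient training of a linear SVM (hinge loss).
    X: (n, d) colour features scaled to [0, 1]; y: (n,) labels in {-1, +1}.
    A linear stand-in for the kernel classifier of eq. (4)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            margin = y[i] * (X[i] @ w + b)
            w *= (1.0 - eta * lam)             # regularisation shrink
            if margin < 1:                     # margin violated: hinge sub-gradient
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b

def classify(w, b, X):
    """Eq. (3): label by the side of the hyperplane."""
    return np.sign(X @ w + b)
```

With well-separated skin and non-skin colour clusters, the learned hyperplane separates the two classes after a few hundred passes.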

3.3 Combining SVM active learning and region information

Although the performance of SVM active learning is outstanding, it cannot produce perfect skin segmentation results because of noise and illumination variation. Region information, however, is considerably robust to noise and illumination variation. Hence, this paper incorporates region information to further refine the segmentation result. First, the JSEG algorithm [33] is adopted to parse the frame into regions. Then, if the percentage of skin pixels in a region is over a threshold, the whole region is declared to be a skin area. To be exact, a region R_i satisfying

N_S(R_i) / N_T(R_i) > η    (8)

is decided to be a skin area. Here, N_S(R_i) denotes the number of skin pixels in the region R_i, N_T(R_i) refers to the total number of pixels in R_i, and η is an empirically defined constant.
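The refinement rule of (8) amounts to a per-region vote. A minimal sketch, where the label map stands in for JSEG output and η = 0.5 is an assumed value for the empirical threshold:

```python
import numpy as np

def refine_with_regions(skin_mask, region_labels, eta=0.5):
    """Eq. (8): declare a whole region skin if its skin-pixel
    fraction N_S(R_i)/N_T(R_i) exceeds the threshold eta."""
    refined = np.zeros(skin_mask.shape, dtype=bool)
    for label in np.unique(region_labels):
        region = region_labels == label
        if skin_mask[region].mean() > eta:   # fraction of skin pixels in R_i
            refined |= region                # accept or reject the whole region
    return refined
```

Because whole regions are accepted or rejected, isolated misclassified pixels inside a region are smoothed away.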

4 SST system

The block diagram of the system is shown in Fig. 2. As described in our previous paper [34], we segment and track three objects: the face and the two hands. Two main components form our system. The skin segmentation component is responsible for segmenting skin objects by combining three types of information: colour, motion and position. The task of the second component, object tracking, is 2-fold: one is to match the resulting skin blobs of the

Figure 2 SST system architecture


segmentation component to the previous frame blobs. The other is to keep track of the occlusion status of the three skin objects.

4.1 Skin object segmentation

4.1.1 Colour information: We apply the proposed skin colour model to small search windows around the predicted positions of the face and hand objects and return decision values representing how likely each pixel is to be skin. To be specific, we first train a skin colour model by the proposed SVM active learning. For each pixel x located at (x, y) in the search window, we can obtain a binary SVM output according to (4). If we ignore the sign(·) function, we obtain an SVM output indicating the distance between x and the trained hyperplane, f(x) = Σ_i α_i y_i K(x_i, x) + b. Then, we apply a logistic sigmoid function over the SVM output to transform this output into a probability. This paper adopts this probability as the skin colour probability P_c(x, y)

P_c(x, y) = P(skin | x) = 1 / (1 + exp(Q f(x) + U))    (9)

where Q and U can be estimated by optimising a negative log-likelihood function [35].
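The sigmoid mapping of (9), together with a simple way to fit Q and U, can be sketched as below. The gradient-descent fit is our own minimal stand-in for the likelihood optimisation of [35]; the step size and iteration count are illustrative.

```python
import numpy as np

def skin_probability(f, Q, U):
    """Eq. (9): map the unsigned SVM output f(x) to P(skin | x)."""
    return 1.0 / (1.0 + np.exp(Q * f + U))

def fit_sigmoid(f, y, lr=0.1, steps=2000):
    """Fit Q, U by gradient descent on the negative log-likelihood.
    f: SVM outputs; y: 1 for skin pixels, 0 for non-skin pixels."""
    Q, U = 0.0, 0.0
    for _ in range(steps):
        p = skin_probability(f, Q, U)
        g = y - p                    # d(NLL)/d(Qf + U), by the logistic identity
        Q -= lr * np.mean(g * f)
        U -= lr * np.mean(g)
    return Q, U
```

With skin pixels on the positive side of the hyperplane, the fit drives Q negative, so larger margins map to probabilities near 1.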

4.1.2 Motion information: Finding the movement information takes two steps: first, motion detection; then, finding candidate foreground pixels. The first step examines the local grey-level changes between successive frames by frame differencing

LM_i(x, y) = |I_i(x, y) − I_{i−1}(x, y)|,  (x, y) ∈ W_i    (10)

where W_i is the search window in the ith frame, I_i(x, y) is the intensity value of pixel (x, y) in the ith frame and LM_i is the absolute difference image. It is worth noting that the absolute difference is only calculated for pixels located in W_i. The determination of W_i depends on the tracking output in the previous frame; its detailed description can be found in Section 4.2.3. We then normalise LM_i(x, y) by LM'_i(x, y) = LM_i(x, y) / max_{x,y} LM_i(x, y). For the special case where all LM_i(x, y) = 0, we set LM'_i(x, y) = 0. The second step assigns a probability value P_m(x, y) to each pixel in the search window, representing how likely this pixel belongs to a skin object. In practice, the same skin object in two consecutive frames will have a large overlap because hands generally move slowly within a short time. Therefore, even though pixels have a high intensity change across two frames, if they were skin object pixels in the previous frame, then they are probably no longer skin pixels in the current frame, and vice versa. This case is handled by looking backward to the last segmented skin object binary image in the previous frame search window, OBJ_{i−1}, and


applying the following model to the pixels in LM'_i

P_m(x, y) = { 1 − LM'_i(x, y)   if OBJ_{i−1}(x, y) = 1
            { LM'_i(x, y)       otherwise              (11)

In this way, small values (stationary pixels) in LM'_i that were previously segmented as object pixels will be assigned high probability values, as they represent skin pixels that have not moved, and new background pixels (that were previously skin pixels) with high LM'_i will be assigned small probability values. Hence, this model gives high probability values to candidate skin pixels and low values to candidate background pixels.
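Eqs. (10) and (11) over a single search window are a direct per-pixel computation; the toy window in the test below is our own example.

```python
import numpy as np

def motion_probability(I_cur, I_prev, obj_prev):
    """Eqs. (10)-(11): frame difference inside the search window,
    normalised, then flipped for pixels that were object pixels before.
    I_cur, I_prev: grey-level windows; obj_prev: previous binary object mask."""
    lm = np.abs(I_cur.astype(float) - I_prev.astype(float))   # eq. (10)
    peak = lm.max()
    lm_n = lm / peak if peak > 0 else lm                      # LM' (zero if no motion)
    return np.where(obj_prev.astype(bool), 1.0 - lm_n, lm_n)  # eq. (11)
```

A stationary pixel that belonged to the object keeps probability 1, while a background pixel showing a large change (a hand moving in) also receives a high value.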

4.1.3 Position information: To capture the dynamics of the skin objects, we assume that the movement is sufficiently small between successive frames. Accordingly, a KF model can be used to describe the x and y coordinates of the centre of each skin object with a state vector S_t that comprises the position and velocity. The model can be described as [36]

S_{t+1} = A_t S_t + G_t
Z_t = H S_t + V_t    (12)

where A_t is a constant velocity model, G_t and V_t represent the state and measurement white noise, respectively, Z_t is the observation, and H is the noiseless connection between the observation and the state vector S. This model is used to keep track of the position of the skin objects and predict the new position in the next frame. Given that the search window surrounds the predicted centre, we translate a binary mask of the object from the previous frame to be centred on the new predicted centre. Then the distance transform is computed between each pixel in the search window and its corresponding nearest pixel in the mask, using the Euclidean distance as the metric. The inverse of this distance value assigns a high value to a pixel that belongs to or is near the mask and a low value to a far pixel. The distance values are then converted to probabilities P_p by normalisation.
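The distance-transform step can be sketched with a brute-force Euclidean distance, which is adequate for the small search windows involved. The inverse map 1/(1 + d) and the sum-normalisation below are our assumed reading of "the inverse of this distance value" and the final normalisation.

```python
import numpy as np

def position_probability(mask):
    """Distance transform to the translated object mask, inverted and
    normalised into the position probability map P_p over the window."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    pts = np.stack([ys, xs], axis=1).astype(float)         # mask pixel coordinates
    gy, gx = np.mgrid[0:h, 0:w]
    grid = np.stack([gy.ravel(), gx.ravel()], axis=1).astype(float)
    # Euclidean distance from every window pixel to its nearest mask pixel
    d = np.sqrt(((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1)).min(axis=1)
    p = 1.0 / (1.0 + d.reshape(h, w))                      # high on/near the mask
    return p / p.sum()                                     # normalise
```

Pixels on the translated mask receive the highest values, and the cue decays smoothly with distance from the predicted object position.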

4.1.4 Information fusion: Each piece of information is not strong enough by itself to provide good segmentation results. Colour information can help to obtain all objects with skin colour; however, some of them are not the skin objects that we are really interested in. Motion information can detect foreground objects. Position information is estimated from temporal features, which can reduce the search space. The benefit of using motion and position information is that they help us to remove false objects and focus on the required skin objects. This paper combines these three pieces of information logically, using an abstract fusion formula to obtain a binary decision image F_i(x, y)

F_i(x, y) = { 1   if (P_c(x, y) > γ) AND (P_m(x, y) > ν) AND (P_p(x, y) > σ)
            { 0   otherwise                                                   (13)

γ, ν and σ are the thresholds, where σ is determined adaptively by the following formula

σ = size((P_m(x, y) > ν) AND (P_p(x, y) = 1)) / size(P_m(x, y) > ν)    (14)

The threshold σ determines the margin that we search around the predicted object position. In (14), this is formulated by finding the overlap between the predicted object position and the foreground pixels above a certain threshold value. The other threshold values are determined empirically.
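The fusion rule of (13) with the adaptive σ of (14) can be sketched as below; here P_p is assumed to equal 1 exactly on the translated object mask, and γ and ν are illustrative values for the empirically determined thresholds.

```python
import numpy as np

def fuse(Pc, Pm, Pp, gamma=0.5, nu=0.3):
    """Eqs. (13)-(14): AND-fusion of colour, motion and position cues.
    Pp is assumed to be exactly 1 on the predicted (translated) object mask."""
    moving = Pm > nu
    n_moving = moving.sum()
    overlap = (moving & (Pp == 1.0)).sum()             # predicted object vs foreground
    sigma = overlap / n_moving if n_moving else 0.0    # eq. (14)
    return ((Pc > gamma) & moving & (Pp > sigma)).astype(np.uint8)  # eq. (13)
```

When the moving pixels mostly fall inside the predicted object, σ is high and the search margin shrinks; when they spill outside it, σ drops and a wider margin around the prediction is accepted.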

4.2 Skin object tracking

As indicated before, our tracker implements two tasks. First, this section introduces how to keep track of the occlusion status of the skin objects. Then, the details of matching the segmented skin blobs of the new frame to those of the previous frame are described. Finally, an important issue involved in our SST system, the determination of search windows, is discussed.

4.2.1 Occlusion detection: A bounding box is first formed around each of the face and two hands. Then, each bounding box is modelled by a KF. To be specific, we model each side of the bounding boxes by its position, velocity and acceleration as follows

s_{j,t+1} = s_{j,t} + h ṡ_{j,t} + (1/2) h² s̈_{j,t}
ṡ_{j,t+1} = ṡ_{j,t} + h s̈_{j,t}    (15)

where s is the position, ṡ the velocity, s̈ the acceleration, h > 0 the sampling time, j the bounding box side index and t the time. Combining (12) and (15), we can obtain

[ s_{t+1} ]   [ 1  h  h²/2 ] [ s_t ]
[ ṡ_{t+1} ] = [ 0  1  h    ] [ ṡ_t ] + G_t
[ s̈_{t+1} ]   [ 0  0  1    ] [ s̈_t ]

Z_t = [ 1  0  0 ] [ s_t  ṡ_t  s̈_t ]^T + V_t    (16)

Note that we use [1 0 0] for the matrix H, as the position is the only observable feature for the bounding box sides. Applying (16) to every bounding box side can predict the position of


the bounding boxes in the next frame. We check to see if there is any overlap between any of the bounding boxes in the next frame. If there is an overlap, we raise an occlusion alarm corresponding to the two bounding boxes that will overlap. If, in the next frame, the number of detected skin objects is less than the number of objects in the current frame and an occlusion alarm was raised previously, we conclude that occlusion happened. These occlusions can be called 'partial' occlusions.

The other kind of occlusion is 'complete' occlusion, which may happen in three cases. In the first case, the occlusion begins as a partial occlusion and then one skin object completely hides behind the other skin objects. For this case, the occlusion alarm has already been raised and we know that the bounding box represents an occluded object; our algorithm hence deals with this case in a similar manner to the partial occlusion, and both the bounding box and the search window vary depending on the connected skin blob. In the second case, the objects completely hide behind non-skin objects (e.g. the body), and in the third case, the skin objects disappear from the view of the camera. The last two cases can be detected by a decrease in the number of skin objects with no previous occlusion alarm. In these cases, the skin segmentation result over the search window of the hidden skin object is blank. We then fix the search window there and perform the skin segmentation at every frame until the hidden object appears again. Here, the assumption is that the object is more likely to reappear in or near the area where it was hidden.
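The per-side constant-acceleration KF of (15)-(16) can be sketched as follows; the noise covariances Q and R used in the test are illustrative assumptions, and h = 1 frame.

```python
import numpy as np

H_STEP = 1.0  # sampling time h (one frame)

# State transition and observation matrices of eq. (16) for one
# bounding-box side, with state x = [s, s_dot, s_ddot].
A = np.array([[1.0, H_STEP, 0.5 * H_STEP ** 2],
              [0.0, 1.0,    H_STEP],
              [0.0, 0.0,    1.0]])
H = np.array([[1.0, 0.0, 0.0]])   # only the position is observed

def kf_predict(x, P, Q):
    """Time update: project the side's state and covariance one frame ahead."""
    return A @ x, A @ P @ A.T + Q

def kf_update(x, P, z, R):
    """Measurement update with the observed side position z."""
    S = H @ P @ H.T + R                      # innovation covariance, (1, 1)
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain, (3, 1)
    x = x + K @ (np.atleast_1d(z) - H @ x)
    P = (np.eye(3) - K @ H) @ P
    return x, P
```

Running `kf_predict` on all eight sides (four per potentially overlapping pair of boxes) gives the next-frame boxes whose overlap triggers the occlusion alarm.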

4.2.2 Skin blob matching: As shown in Fig. 2, the tracking process takes place by first constructing search windows around each of the objects we are tracking. When two or more objects are occluded, they are treated as one object and one search window is constructed around their position. After the search windows are constructed, we segment the skin objects. Next, connected regions are labelled after removing noisy small regions. The final step is blob matching. Given that we have concluded which objects are present in the segmented frame and their occlusion status, we perform the matching between the previous frame skin objects and the new frame skin blobs. The matching is done using the distance between the objects in sequential frames; we use the Euclidean distance between the centres of the objects to match the corresponding objects.
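The centre-distance matching of Section 4.2.2 can be sketched greedily; the "closest pair first" ordering below is our own tie-breaking assumption, as the paper does not specify one.

```python
import numpy as np

def match_blobs(prev_centres, new_centres):
    """Match previous-frame objects to new-frame blobs by the Euclidean
    distance between their centres. Greedy one-to-one matching, resolving
    objects with the closest candidate blob first; returns {prev: new}."""
    prev = np.asarray(prev_centres, dtype=float)
    new = np.asarray(new_centres, dtype=float)
    d = np.linalg.norm(prev[:, None, :] - new[None, :, :], axis=2)
    matches, used = {}, set()
    for i in np.argsort(d.min(axis=1)):       # most confident objects first
        for j in np.argsort(d[i]):            # nearest unclaimed blob
            if int(j) not in used:
                matches[int(i)] = int(j)
                used.add(int(j))
                break
    return matches
```

With only three tracked objects (face and two hands), this exhaustive pairing is cheap and stable across frames.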

4.2.3 Search window determination: In the proposed work, search windows roughly locate the positions of skin objects. Tracking and segmentation are performed within search windows, which plays an important role in reducing the processing space and improving the computation speed. The size and location of the search window for a skin object are not fixed. In the first frame, the search window is the whole frame. In the following frames, for each current frame, we know the predicted centre of the object from the output of the KF. We use this centre as the centre of the search window. From the output of the tracking, we can obtain the

& The Institution of Engineering and Technology 2009

www.ietdl.org

Figure 3 Experimental samples with and without active learning

bounding box of the object in the previous frame. We define a margin surrounding this bounding box of 20 pixels in every direction (top, bottom, left and right). The bounding box size plus the margin is used as the size of the search window in the current frame. Thus, the size of the search window varies depending on the previous bounding box, that is, the previous hand shape. The idea of using bounding boxes in the previous frame to approximate search windows is also employed in other tracking approaches. In our work, the small margin is determined under the assumption that the hand shape rarely changes sharply between two consecutive frames in SL.
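A minimal sketch of this search-window construction (Python; the clipping to the frame borders is our own assumption, as the paper does not discuss it):

```python
MARGIN = 20  # pixels added on every side of the previous bounding box

def search_window(pred_centre, prev_box, frame_size):
    """Centre the search window on the Kalman-predicted position and size it
    from the previous (x, y, w, h) bounding box plus a fixed margin,
    clipped to the frame."""
    cx, cy = pred_centre
    _, _, w, h = prev_box
    w, h = w + 2 * MARGIN, h + 2 * MARGIN
    fw, fh = frame_size
    x = min(max(cx - w // 2, 0), max(fw - w, 0))
    y = min(max(cy - h // 2, 0), max(fh - h, 0))
    return (x, y, min(w, fw), min(h, fh))
```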

4.3 Skin colour model adaptation under varying illumination

One of the challenges of SST is that lighting conditions might change over time within a video sequence, so that the skin colour distribution is not stationary. The skin colour model proposed in Section 3 is static, and on its own it is incapable of handling the illumination change problem. We have to update the model frequently to adapt to illumination change. This paper proposes to incorporate useful temporal information from tracking to address the problem, which extends our previous work [34].

The basic idea for adapting the skin colour model is to collect new training data for re-training the SVM classifier at every frame. Given two consecutive frames f_{i-1} and f_i, we assume that their skin colour distributions are different because of lighting change. Accordingly, the SVM classifier needs to be updated by training with new skin samples. Furthermore, the generic skin model cannot reliably supply high-quality samples on its own. The following three major steps are used to collect new skin samples for updating the SVM. First, for the current frame f_i, we obtain the search windows that were already constructed around the predicted skin object locations by the tracker. We use the generic skin colour model as a filter, which selects a preliminary set of skin colour pixels. Second, a selected skin pixel (x, y) is taken as a new skin sample provided that both its motion probability Pm(x, y) and position probability Pp(x, y) are large enough, that is, over an empirical threshold. The rest of the search window pixels are considered non-skin pixels. Finally, having new positive and negative examples, we train the SVM classifier for frame f_i and then apply it to classify the pixels of the search window again. This new classifier returns a skin colour probability Pc(x, y), which can be combined with Pm(x, y) and Pp(x, y) to continue the tracking.


The advantage of the proposed adaptation algorithm is that we incorporate temporal features from the tracker to reduce the inaccuracy of skin colour models caused by varying illumination.

5 Experimental results

Three experiments were constructed to evaluate the proposed SST framework. In Section 5.1, we test the performance of the skin colour model. In Section 5.2, we examine the tracking system. In Section 5.3, we evaluate the proposed work using a real-world SL recognition system.

5.1 Evaluation of the skin colour model

We tested the proposed skin colour model with eight gesture video sequences from the ECHO database (ECHO is a European sign language database, available at http://www.let.ru.nl/sign-lang/echo/) and two self-captured signing videos. They were captured with different signers and under different lighting conditions. Most test videos are over 5 min long. To quantitatively evaluate our work, we randomly picked 240 frames from these test sequences; two students were then invited to manually segment skin pixels to construct the ground truth. The SVM classifier was trained using samples accumulated over three frames. The structure of the SVM is as follows: the SVM type is C-SVC with the RBF kernel, and the parameter C is set to 50. As in [10], three metrics, correct detection rate (CDR), false detection rate (FDR) and overall classification rate (CR), were employed to measure the performance of the techniques. They are described as follows

CDR: percentage of correctly classified skin pixels

FDR: percentage of wrongly classified non-skin pixels

CR = (N_S / max(N_S^A, N_S^G)) × 100%    (17)

Table 1 Statistical precision and training time comparisons with and without active learning

                               CDR, %   FDR, %   CR, %   Training time, s
SVM without active learning    85.12    2.43     61.97   121.14
SVM with active learning       82.83    1.39     67.60   7.33

IET Comput. Vis., 2009, Vol. 3, Iss. 1, pp. 24–35, doi: 10.1049/iet-cvi:20080006


Figure 4 Comparison results with and without region information

where N_S is the number of skin pixels detected by both the algorithm and the ground truth, N_S^A the number of skin pixels detected by the algorithm and N_S^G the number of skin pixels in the ground truth.
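These three metrics can be computed directly from a pair of boolean masks; the following sketch (Python, with our own normalisations for CDR and FDR, which eq. (17) leaves implicit) illustrates the idea:

```python
def skin_metrics(pred, truth):
    """Compute CDR, FDR and CR (eq. 17) from per-pixel boolean masks.

    pred, truth : equal-length sequences of booleans (True = skin pixel)
    """
    n_s  = sum(p and t for p, t in zip(pred, truth))  # detected by both
    n_as = sum(pred)                                  # detected by the algorithm
    n_gs = sum(truth)                                 # skin pixels in ground truth
    n_nonskin = len(truth) - n_gs
    cdr = 100.0 * n_s / n_gs if n_gs else 0.0                     # assumed normalisation
    fdr = 100.0 * (n_as - n_s) / n_nonskin if n_nonskin else 0.0  # assumed normalisation
    cr  = 100.0 * n_s / max(n_as, n_gs) if max(n_as, n_gs) else 0.0
    return cdr, fdr, cr
```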

Four experiments were constructed to evaluate our skin colour model. First, we test the performance of active learning. Then, we examine the effect of combining region information. Next, comparisons with some traditional skin segmentation techniques are reported. Finally, the adaptation of the skin colour model to varying illumination is evaluated.

5.1.1 Performance test of the active learning: To evaluate SVM active learning, we compare the SVM classifiers with and without active learning using our test data. Fig. 3 shows one set of sample results. The first, second and third images display the original frame, the result of the SVM without active learning and the result of the SVM with active learning, respectively. Table 1 lists the statistical results, including precision and training time. As can be seen from the experimental results, SVM active learning is superior in both accuracy and computational complexity. It enhances the overall accuracy by almost 6% and decreases the average training time by 114 s.

5.1.2 Evaluation of combining region information: This experiment evaluates the segmentation results with and without region information. The threshold h was fixed at 0.4 for all video sequences. Fig. 4 shows some sample results, and Table 2 lists the statistical precision comparisons. In Fig. 4, the first column shows the original frames, the second column the segmentation results without region information and the third column the results with region information. Clearly, the algorithm with region information is better: it reduces noise and refines the segmentation results. Incorporating region information enhanced the overall accuracy by 9%.

5.1.3 Comparisons with traditional skin segmentation techniques: To demonstrate the effectiveness of the proposed work, we compare it with two existing skin segmentation algorithms: the generic skin model [7, 8] and the Gaussian model [9, 10]. The Gaussian models [10] can be described as follows. They employ the Bayesian decision rule

p(c|skin) / p(c|nonskin) ≥ τ    (18)

to classify skin and non-skin pixels. Here, p(c|skin) and p(c|nonskin) refer to the probability density functions (pdfs) of skin and non-skin colour, respectively, and τ is a threshold. The colour pdf can be modelled as a single Gaussian

G(c) = (2π)^(−n/2) |Σ|^(−1/2) exp(−(1/2)(c − μ)^T Σ^(−1) (c − μ))    (19)

or a Gaussian mixture

GM(c) = Σ_{i=1}^{k} ω_i G_i(c),    Σ_{i=1}^{k} ω_i = 1    (20)

where μ is the mean vector, Σ the covariance matrix, k the number of mixture components and n the dimension of the feature vector. Phung et al. [10] proposed two strategies: modelling only skin pixels as a Gaussian (called one-Gaussian in this paper) and modelling both skin and non-skin pixels as Gaussians (called two-Gaussian in this

Table 2 Statistical precision comparisons with and without region information

                                       CDR, %   FDR, %   CR, %
the proposed method without region     82.83    1.39     67.60
the proposed method with region        86.34    0.96     76.77


Figure 5 Sample results of generic skin model, one-Gaussian model, two-Gaussian model and the proposed model

paper). In this paper, we implement these two strategies. Notice that we use a Gaussian mixture to model the pdfs. Fig. 5 shows some results. The segmentation results by the generic model, one-Gaussian, two-Gaussian and the proposed approach are displayed in the first, second, third and fourth columns, respectively. Table 3 lists the statistical accuracy comparisons. As can be seen from the comparison results, the proposed model has the highest overall accuracy with the second lowest false detection rate. Although the two-Gaussian model has the best correct detection rate, its false detection rate is the worst.
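For illustration, the Bayesian decision rule of eq. (18) with the Gaussian densities of eqs. (19) and (20) can be sketched as follows (Python; diagonal covariances are assumed for brevity, and this is not the authors' implementation):

```python
import math

def gaussian_pdf(c, mean, cov_diag):
    """Diagonal-covariance Gaussian density (eq. 19), for illustration."""
    n = len(c)
    det = 1.0
    expo = 0.0
    for ci, mi, vi in zip(c, mean, cov_diag):
        det *= vi
        expo += (ci - mi) ** 2 / vi
    return (2 * math.pi) ** (-n / 2) * det ** -0.5 * math.exp(-0.5 * expo)

def mixture_pdf(c, weights, means, covs):
    """Gaussian mixture density (eq. 20); weights must sum to 1."""
    return sum(w * gaussian_pdf(c, m, v) for w, m, v in zip(weights, means, covs))

def is_skin(c, skin_model, nonskin_model, tau):
    """Bayesian decision rule (eq. 18): classify as skin when the
    likelihood ratio exceeds the threshold."""
    num = mixture_pdf(c, *skin_model)
    den = mixture_pdf(c, *nonskin_model)
    return den == 0 or num / den >= tau
```

A one-Gaussian variant would simply drop the non-skin model and threshold the skin density directly.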


Table 3 Statistical accuracy comparisons of existing models and the proposed model

                         CDR, %   FDR, %   CR, %
the generic skin model   71.51    0.79     65.10
one-Gaussian model       72.74    1.04     66.85
two-Gaussian model       90.88    4.41     57.06
the proposed model       86.34    0.96     76.77

Figure 6 Some segmentation results under time-varying illumination


5.1.4 Evaluation of adapting the skin colour model: We tested skin colour model adaptation with a number of gesture videos under time-varying illumination. We used a light to simulate illumination change while capturing videos. We controlled the light intensity by varying the distance of the light from the human body and by turning the light on or off. Fig. 6 shows some skin segmentation results produced by the updated skin colour model. The visually acceptable results demonstrate the effectiveness of the proposed work.

5.2 Tracking performance test

We tested the proposed tracking system on a number of ECHO and self-captured signing videos for different SL signers under different lighting conditions and with different occlusion conditions. Fig. 7 illustrates several examples of the tracked images. We use a bounding box with a different colour to


represent the tracking of each different object. If some objects were occluded, their bounding boxes were merged into one bounding box. To quantitatively evaluate the performance, we manually labelled 600 frames to construct the ground truth of the bounding boxes of the skin objects. Out of 600 frames, 237 frames included occlusions. As in [21], we measure the error in the position (x, y) of the centre of the bounding box. Table 4 shows the average error in the x and y directions, respectively, and the average tracking error rate, that is, the percentage of frames in which a skin object is incorrectly identified. From the results, we can conclude that the tracking is good and very robust to occlusions: even though about 40% of the frames contained occlusions, the error percentage was only about 6.5%.

The proposed tracking algorithm has two free parameters that need to be specified. The first is the threshold g associated with the colour information. It determines which

Figure 7 Some samples of the proposed tracking system


pixels have skin colour. In principle, this threshold has to be re-adjusted when a new SVM is trained for new users. In practice, however, it is not very sensitive to different persons because many humans have nearly the same skin colour. Accordingly, once the colour threshold is empirically decided, it remains reliable for different users with similar skin colour under nearly the same illumination conditions. The second parameter, n, is used to detect motion information. In this paper, we used the same value for all testing sequences, which suggests that this threshold is fairly robust across different users. For all test videos, we used a fixed set of parameters: g = 0.8 and n = 0.7.

5.3 Segmentation and tracking test for SL recognition

The proposed work was applied to a real-world SL recognition system [37]. Given a signer in front of a video camera, the SL recognition system can automatically translate recognised gestures into English. The feature vector consisted of the hand shape, described by a principal component analysis-based approach, and the hand position, obtained from the output of the proposed tracker. A hidden Markov model was implemented for recognition; refer to [37] for details of the recogniser. The vocabulary of our test set contained 28 static and 17 dynamic gestures. Twenty samples of each isolated gesture were recorded from two different persons over four different days. The samples were divided into training and test sets by random sampling. The average recognition accuracy with five training samples per gesture reaches around 94%; with ten training samples, it reaches around 98%. The good performance of SL recognition demonstrates the effectiveness of the proposed work.

6 Conclusions

In this paper, an automatic framework for SST has been proposed for SL applications. The contribution of the proposed work is two-fold: (1) a skin colour model using SVM and active learning has been introduced; it is able to handle time-varying illumination and achieves better performance than many traditional skin models. (2) Tracking and segmentation have been approached as one unified problem, where tracking helps to reduce the search space used in segmentation, and accurate segmentation helps to enhance the tracking performance. More importantly, the tracking system can predict the occlusion status to maintain a high-level understanding of skin object movement. The proposed work

Table 4 Statistical accuracy of the proposed tracking system

                              Face   Right hand   Left hand
error in X direction, pixel   1.72   1.52         4.78
error in Y direction, pixel   2.80   2.27         6.24
tracking error, %             6.20   6.50         6.50


is easy to implement and may be efficiently incorporated in a gesture recognition system.

In future work, we may improve the tracking performance by using more robust techniques such as particle filters. Another extension is to avoid retraining the SVM after every frame under varying illumination by incorporating a lighting change detector, which can substantially reduce the computation cost.

7 Acknowledgments

This work was performed at Dublin City University, Ireland. It was partially supported by the EU Marie Curie Incoming International Fellowship, project 509477.

8 References

[1] GAO W., MA J., WU J., WANG C.: 'Large vocabulary sign language recognition based on HMM/ANN/DP', Int. J. Pattern Recognit. Artif. Intell., 2000, 14, (5), pp. 587–602

[2] STARNER T., WEAVER J., PENTLAND A.: 'Real-time American sign language recognition using desk and wearable computer based video', IEEE Trans. Pattern Anal. Mach. Intell., 1998, 20, (12), pp. 1371–1375

[3] YANG M.-H., AHUJA N., TABB M.: 'Extraction of 2D trajectories and its application to hand gesture recognition', IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24, (8), pp. 1061–1074

[4] ONG S., RANGANATH S.: 'Automatic sign language analysis: a survey and the future beyond lexical meaning', IEEE Trans. Pattern Anal. Mach. Intell., 2005, 27, (6), pp. 873–891

[5] CHEN F.-S., FU C.-M., HUANG C.-L.: 'Hand gesture recognition using a real-time tracking method and hidden Markov models', Image Vision Comput., 2003, 21, pp. 745–758

[6] SHAMAIE A., SUTHERLAND A.: 'Hand tracking in bimanual movements', Image Vision Comput., 2005, 23, pp. 1131–1149

[7] KOVAC J., PEER P., SOLINA F.: 'Human skin colour clustering for face detection'. Proc. EUROCON 2003, Ljubljana, Slovenia, September 2003, pp. 144–148

[8] CHAI D., NGAN K.N.: 'Face segmentation using skin-colour map in videophone applications', IEEE Trans. Circuits Syst. Video Technol., 1999, 9, pp. 551–564

[9] ZHU Q., WU C.T., CHENG K.T., WU Y.L.: 'An adaptive skin model and its application to objectionable image filtering'. Proc. ACM Multimedia, New York, USA, October 2004, pp. 56–63

[10] PHUNG S.L., BOUZERDOUM A., CHAI D.: 'Skin segmentation using colour pixel classification: analysis and comparison', IEEE Trans. Pattern Anal. Mach. Intell., 2005, 21, pp. 148–154


[11] JONES M.J., REHG J.M.: 'Statistical colour models with application to skin detection', Int. J. Comput. Vision, 2002, 46, pp. 81–96

[12] VEZHNEVETS V., SAZONOV V., ANDREEVA A.: 'A survey on pixel-based skin color detection techniques'. Proc. Graphicon, Moscow, Russia, September 2003, pp. 85–92

[13] MCKENNA S.J., RAJA Y., GONG S.: 'Tracking colour objects using adaptive mixture models', Image Vision Comput., 1999, 17, pp. 225–231

[14] WU Y., HUANG T.S.: 'Colour tracking by transductive learning'. Proc. IEEE Conf. Computer Vision Pattern Recognition, SC, USA, June 2000, pp. 33–138

[15] SORIANO M., MARTINKAUPPI B., HUOVINEN S., LAAKSONEN M.: 'Adaptive skin color modeling using the skin locus for selecting training pixels', Pattern Recognit., 2003, 36, pp. 681–690

[16] SIGAL L., SCLAROFF S., ATHITSOS V.: 'Skin colour-based video segmentation under time-varying illumination', IEEE Trans. Pattern Anal. Mach. Intell., 2004, 26, (7), pp. 862–877

[17] NAYAK A., CHAUDHURI S.: 'Automatic illumination correction for scene enhancement and object tracking', Image Vision Comput., 2006, 24, pp. 949–959

[18] STORRING M., ANDERSEN H., GRANUM E.: 'Physics-based modelling of human skin colour under mixed illuminants', Robot. Auton. Syst., 2001, 35, pp. 131–142

[19] SHERRAH J., GONG S.: 'Tracking discontinuous motion using Bayesian inference'. Proc. European Conf. Computer Vision, Dublin, Ireland, June 2000, pp. 150–166

[20] IMAGAWA K., LU S., IGI S.: 'Colour-based hands tracking system for sign language recognition'. Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, Nara, Japan, April 1998, pp. 462–467

[21] MARTIN J., DEVIN V., CROWLEY J.: 'Active hand tracking'. Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, Nara, Japan, April 1998, pp. 573–578

[22] MCALLISTER G., MCKENNA S.J., PICKETTS I.W.: 'Hand tracking for behaviour understanding', Image Vision Comput., 2002, 20, pp. 827–840

[23] ISARD M., BLAKE A.: 'CONDENSATION - conditional density propagation for visual tracking', Int. J. Comput. Vision, 1998, 29, pp. 5–28

[24] MAMMEN J., CHAUDHURI S., AGRAWAL T.: 'Simultaneous tracking of both hands by estimation of erroneous observations'. Proc. British Machine Vision Conf., Manchester, UK, September 2001, pp. 83–92

[25] SHAN C., TAN T., WEI Y.: 'Real-time hand tracking using a mean shift embedded particle filter', Pattern Recognit., 2007, 40, pp. 1958–1972

[26] REHG J., KANADE T.: 'Visual tracking of high DOF articulated structures: an application to human hand tracking'. Proc. 3rd European Conf. Computer Vision, Stockholm, Sweden, May 1994, pp. 35–46

[27] WREN C., AZARBAYEJANI A., DARRELL T., PENTLAND A.: 'Pfinder: real-time tracking of the human body', IEEE Trans. Pattern Anal. Mach. Intell., 1997, 19, pp. 780–785

[28] STENGER B., MENDONCA P.R.S., CIPOLLA R.: 'Model-based 3D tracking of an articulated hand'. Proc. Conf. Computer Vision and Pattern Recognition, Hawaii, USA, December 2001, pp. 310–315

[29] LU S., METAXAS D., SAMARAS D., OLIENSIS J.: 'Using multiple cues for hand tracking and model refinement'. Proc. Conf. Computer Vision and Pattern Recognition, Wisconsin, USA, June 2003, pp. 443–450

[30] VAPNIK V.: 'The nature of statistical learning theory' (Springer, New York, 1995)

[31] SCHOHN G., COHN D.: 'Less is more: active learning with support vector machines'. Proc. Int. Conf. Machine Learning, Stanford, USA, June 2000, pp. 839–846

[32] TONG S., CHANG E.: 'Support vector machine active learning for image retrieval'. Proc. ACM Multimedia, Ottawa, Canada, September 2001, pp. 107–118

[33] DENG Y., MANJUNATH B.S.: 'Unsupervised segmentation of colour-texture regions in images and video', IEEE Trans. Pattern Anal. Mach. Intell., 2001, 23, pp. 800–810

[34] AWAD G., HAN J., SUTHERLAND A.: 'A unified system for segmentation and tracking of face and hands in sign language recognition'. Proc. Int. Conf. Pattern Recognition, Hong Kong, China, August 2006, pp. 239–242

[35] WU T., LIN C., WENG R.: 'Probability estimates for multi-class classification by pairwise coupling', J. Mach. Learn. Res., 2004, pp. 975–1005

[36] CHUI C.K., CHEN G.: 'Kalman filtering with real-time applications' (Springer, Berlin Heidelberg, 1999)

[37] COOGAN T., AWAD G., HAN J., SUTHERLAND A.: 'Real time hand gesture recognition including hand segmentation and tracking'. Proc. Int. Symp. Visual Computing, Nevada, USA, November 2006, pp. 495–504

35

& The Institution of Engineering and Technology 2009