Video face recognition via combination of real-time local features and temporal–spatial cues



Published in IET Computer Vision. Received on 26th January 2013; Revised on 7th November 2013; Accepted on 24th November 2013. doi: 10.1049/iet-cvi.2013.0025


ISSN 1751-9632

Gaopeng Gou, Di Huang, Yunhong Wang

State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, People's Republic of China

E-mail: [email protected]

Abstract: Video-based face recognition has attracted much attention and made great progress in the past decade. However, it still encounters two main problems: efficiently representing faces in frames and sufficiently exploiting the temporal–spatial constraints between frames. The authors investigate existing real-time features for face description and compare their performance. Moreover, a novel approach is proposed to model temporal–spatial information, which is then combined with the real-time features to further enforce the consistency constraints between frames and improve recognition performance. The approach is validated on three video face databases, and the results demonstrate that temporal–spatial cues combined with the most powerful real-time features largely improve the recognition rate.

1 Introduction

Nowadays, identifying a person by his/her face has become one of the most important research topics in the areas of pattern recognition and computer vision. In general, according to the type of data used, face-recognition methods can be roughly divided into three sub-branches: still image based, video based and three-dimensional (3D) model based. Among these techniques, video-based ones make use of more information than still image-based ones while avoiding the high computational cost of 3D model-based ones, and hence they have attracted increasing attention in recent years, especially in security surveillance applications such as access control systems.

Similar to still image-based face-recognition methods, video-based ones are affected by factors as diverse as facial image quality, pose, illumination and facial expression variations, partial occlusions and so on. Besides, there are two additional critical issues to address: (i) how to extract powerful and efficient features for face representation, and (ii) how to sufficiently make use of temporal–spatial information to further improve performance.

1.1 Related work

To deal with the two issues mentioned above, many efforts have been made. In this section, we first briefly review existing methods for real-time local-feature extraction and for the utilisation of temporal–spatial information.

1.1.1 Real-time local-feature extraction: Many methods of facial-feature extraction have been proposed in the past decade. These features can be mainly categorised into two classes based on their properties: holistic ones and local ones. Face recognition based on holistic features generally requires accurate normalisation according to pose, illumination, scale and so on, and variations in these factors can largely degrade the performance of holistic feature-based face recognition. In contrast, local features are extracted from different facial regions and are thus more robust to pose and illumination changes [1], as well as to occlusions [2]. According to the matching strategies, local features can be further classified as dense local features and sparse local features.

Specifically, the former finds matches for all points in the face image, and statistical measurements are often used to compute similarities between different faces. Local binary patterns (LBP) [3] is an efficient dense texture operator which labels the pixels of an image by thresholding the neighbourhood of each pixel and considering the result as a binary number. Owing to its high discriminative power, its tolerance to monotonic illumination changes and its computational simplicity, it has been widely investigated for face recognition since it was proposed, achieving state-of-the-art accuracies. There also exist attempts to improve LBP in different aspects, such as the number of sampling points, the robustness to noise (LTP [4]) and the shape of the neighbourhood (three-patch LBP (TPLBP), four-patch LBP (FPLBP) [5] and multi-block LBP (MB-LBP) [6]), to boost results in the domain of face recognition. See [7] for a comprehensive survey on LBP and its variants. Gabor [8] is another typical dense descriptor successfully used in face recognition. Even though it achieves excellent performance owing to its multi-scale and multi-orientation description, its computational cost is high, and the high-dimensional feature vector can lead to the curse of dimensionality in the following classification step.

Ozuysal et al. [9] proposed a fast keypoint recognition approach based on random ferns, which is very fast in computation but still results in a high burden in classification.

Different from dense local features, sparse local features establish a set of robust matches between the features extracted from the interest points of an image pair. This generally contains two key steps [10]: (i) detecting interest points in the facial image and (ii) describing the local features of the interest points. The most representative example is the scale-invariant feature transform (SIFT) [11]. It adopts the difference of Gaussians (DoG) to localise keypoints and computes gradient information quantised over different orientations within a pre-defined neighbourhood, which also achieves benchmark results in face recognition; however, its computational cost is always a problem for real-time video face recognition. To solve this problem, SURF [12] appeared as a substitute: it uses the Hessian matrix to detect keypoints and utilises the neighbourhood information of the keypoints to describe them. To inherit the efficiency of SURF and enhance the representativeness and robustness of the localised interest points, we also proposed colour-hSURF and gradient-hSURF in [10]; these two features exploit histograms of colour and gradient variations to detect the interest points and describe them with the SURF descriptor. The gradient location and orientation histogram (GLOH) [13] is another extension of the SIFT descriptor and presents good performance, but it also suffers from high time consumption.

1.1.2 Utilisation of temporal–spatial information: According to whether temporal–spatial information is used in the testing and training sets, video-based face-recognition algorithms can be divided into three categories: still-to-still, video-to-still and video-to-video methods. Still-to-still methods process face sequences as individual frames without consecutive dynamic information. Liu et al. [14] proposed a still-to-still algorithm based on iterative updating of a principal component analysis face model, and it only uses the current face frame in recognition. Temporal–spatial information is utilised in video-to-still and video-to-video methods; video-to-still methods apply sequence importance sampling to extract keyframes and compare them with the image template, whereas video-to-video methods construct spatial and temporal propagation models to recognise faces in videos. In this section, we introduce recent work on the utilisation of temporal–spatial information in video face recognition.

Zhou et al. [15] proposed a method that tracks uncertain faces and recognises them simultaneously in a probabilistic framework. Aggarwal et al. [16] constructed auto-regressive and moving average models for each input sequence and video template and then calculated their similarities in the subspace. Tang and Li [17] proposed a video frame temporal synchronisation scheme to align frames of similar images across video sequences; however, only selected video frames are extracted for recognition, so the face samples in videos cannot be fully and efficiently utilised. In [18], a sum-rule decision fusion scheme was used to combine all frames in a video sequence to predict the identity of a subject, and the weight of the contribution of each individual frame could be calculated by three different measures. Although this method makes use of all the samples, its recognition performance is easily affected by the weight of each individual frame.


1.2 Contributions

In this work, we first evaluate the performance of popular local descriptors on different video databases, and then propose a method to extract temporal–spatial information to improve the accuracies of these local descriptors. Some preliminary results can be found in our previous work [10, 19]. The main contributions of this paper can be summarised in three aspects:

1. We evaluate several representative real-time local features which have been applied, or could potentially be applied, to video-based face recognition, and analyse the recognition results of these features under various requirements (different video face databases present different application requirements).
2. We discuss the relationship between the number of interest points detected in face images and the final face-recognition performance in sparse local-feature matching.
3. We propose a novel method to extract temporal–spatial information that is then combined with real-time local features to further improve the system accuracy.

The remainder of the paper is organised as follows. In Section 2, we introduce some efficient dense matching and sparse matching local features that could be used in video-based face recognition. The proposed method that utilises temporal–spatial cues in face recognition is described in Sections 3 and 4. In Section 5, the experimental results of the local features on some typical video face databases are shown and discussed, and the performance of the local features combined with temporal–spatial information is also displayed. Finally, we give the conclusion of this paper in Section 6.

2 Dense matching and sparse matching features

LBP is one of the most representative dense matching features, and its extensions TPLBP and FPLBP have shown their effectiveness for face recognition in the wild [5]. Therefore, we consider LBP, TPLBP and FPLBP for dense matching in this paper.

2.1 Dense matching features (LBP family)

LBP, a non-parametric algorithm [20], was originally proposed to describe local variations of 2D texture images. Owing to its high discriminative power, tolerance to monotonic illumination changes and computational simplicity, it has been extensively adopted for 2D face recognition in the last several years [3]. Specifically, the basic LBP operator labels each pixel of an image by thresholding its 3 × 3 neighbourhood: if the value of a neighbouring pixel is not lower than that of the central pixel, its corresponding binary bit is set to 1; otherwise it is set to 0. A binary number is thus formed by concatenating the eight binary bits, and the resulting decimal value is employed for labelling. Fig. 1 gives an example of the process. Formally, given a pixel at (x_c, y_c), the derived LBP decimal value can be calculated by (1).

\mathrm{LBP}(x_c, y_c) = \sum_{n=0}^{7} s(i_n - i_c)\, 2^n, \qquad s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}    (1)


Fig. 1 LBP, TPLBP and FPLBP codes [5]

a LBP and TPLBP; b FPLBP

www.ietdl.org

where n covers the eight neighbours of the central pixel, and i_c and i_n are the grey-level values of the central pixel and its surrounding pixels, respectively.

One limitation of the basic LBP operator is that its small 3 × 3 neighbourhood cannot capture dominant features with large-scale structures. To deal with texture at different scales, the operator was later generalised to use neighbourhoods of different sizes [20]. A local neighbourhood is defined as a set of sampling points evenly spaced on a circle centred at the pixel to be labelled, and sampling points that do not fall within the pixels are interpolated using bilinear interpolation, thus allowing any radius and any number of sampling points in the neighbourhood.

As its name implies, the TPLBP descriptor compares the values of three patches to produce a single bit in the code assigned to each pixel. For each pixel in the image, a w × w patch centred on the pixel is considered, and S additional patches are distributed uniformly on a ring of radius r around it, as shown in Fig. 1. The parameter α indicates that patches α apart along the circle are compared with the central patch. Specifically, TPLBP is produced by applying (2) to each pixel. Similarly, FPLBP considers two rings of radii r1 and r2 centred on the pixel and compares two centre-symmetric patches in the inner ring with two centre-symmetric patches in the outer ring positioned α patches away along the circle, as shown in Fig. 1. Finally, as variants of LBP, TPLBP and FPLBP inherit the property of fast processing, and we thus investigate their performance for real-time face recognition in videos.

\mathrm{TPLBP}_{r,S,w,\alpha}(p) = \sum_{i=1}^{S} f\big( d(C_i, C_p) - d(C_{(i+\alpha) \bmod S}, C_p) \big)\, 2^i    (2)
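As a rough illustration of the basic operator in (1) and of the histogram representation used later for dense matching, the following Python sketch (our own minimal version, assuming a grey-scale face image stored as a NumPy array; not the implementation used in the experiments) labels the interior pixels and builds a normalised code histogram:

import numpy as np

def basic_lbp(image):
    # Label every interior pixel of a grey-scale image with its 3 x 3 LBP code (1).
    img = image.astype(np.int32)
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    centre = img[1:-1, 1:-1]
    # Offsets of the eight neighbours of the central pixel.
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for n, (dy, dx) in enumerate(neighbours):
        shifted = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (shifted >= centre).astype(np.int32) << n   # s(i_n - i_c) * 2^n
    return codes

def lbp_histogram(image, bins=256):
    # Normalised histogram of LBP codes, a simple dense descriptor for one face block.
    codes = basic_lbp(image)
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / max(codes.size, 1)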


2.2 Sparse matching family

Different from dense matching features, the sparse matching ones involve the steps of interest point detection and local image description. In this part, we introduce the colour-hSURF and gradient-hSURF proposed in our previous work [10].

2.2.1 SURF: SURF is chosen because it has proven more efficient than other invariant local descriptors such as SIFT [11] and GLOH [13], and it can achieve nearly real-time results after acceleration. Moreover, SURF is also robust to variations of scale, rotation and pose [12].

Interest point detection in SURF is based on the Hessian matrix because of its good performance in computation time and accuracy. Given a point X = (x, y) in a face image, the Hessian matrix H(X, σ) at scale σ is defined by (3), where L_{xx}(X, σ) is the convolution of the Gaussian second-order derivative (∂²/∂x²)g(σ) with the face image at point X, and L_{xy}(X, σ) and L_{yy}(X, σ) are generated in the same way.

H(X, \sigma) = \begin{bmatrix} L_{xx}(X, \sigma) & L_{xy}(X, \sigma) \\ L_{xy}(X, \sigma) & L_{yy}(X, \sigma) \end{bmatrix}    (3)

SURF approximates the second-order Gaussian derivatives with box filters, in the same spirit as approximating the Laplacian of Gaussian (LoG) with the DoG. The sizes of the box filters are 9 × 9, 15 × 15, 21 × 21 and 27 × 27. The location and scale of the interest points are selected by relying on the determinant of the Hessian matrix. By applying non-maximum suppression in a 3 × 3 × 3 neighbourhood, the interest points are detected in scale and image space.

In the descriptor generation step, a square area around each interest point is selected as the interest region, whose size is 20 times the scale of the interest point.


Fig. 2 Generation of descriptors [12]


This interest region is then split equally into 16 square subregions with 5 × 5 regularly spaced sample points in each subregion. Haar wavelet responses dx and dy in the x- and y-directions are calculated in every subregion. The generation of the descriptors is shown in Fig. 2. The responses in each subregion are weighted by a Gaussian function and then summed up in the x- and y-directions, respectively, to generate the feature descriptor (Σdx, Σdy), which is a 32-dimensional vector (4 × 4 × 2). When the absolute responses are also considered, the descriptor becomes a 64-dimensional vector (Σdx, Σdy, Σ|dx|, Σ|dy|). Furthermore, when Σdx and Σ|dx| are computed separately according to the sign of dy, and similarly Σdy and Σ|dy| according to the sign of dx, the descriptor becomes a 128-dimensional vector.
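For reference, keypoint detection and 64-dimensional description of this kind can be obtained with OpenCV as sketched below; this assumes an OpenCV build that includes the non-free xfeatures2d module and is only a stand-in for the accelerated implementation discussed above:

import cv2

def surf_keypoints_and_descriptors(gray_face, hessian_threshold=400.0):
    # Detect Hessian-based interest points and compute SURF descriptors.
    # extended=False keeps the 64-dimensional descriptor discussed above;
    # extended=True would produce the 128-dimensional variant.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold,
                                       extended=False, upright=False)
    keypoints, descriptors = surf.detectAndCompute(gray_face, None)
    return keypoints, descriptors

Here keypoints is a list of cv2.KeyPoint objects and descriptors an M × 64 array (or None when no point passes the Hessian threshold).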

2.2.2 Colour-hSURF and gradient-hSURF: Histogram-based interest point (HIP) detection [21] is employed to extract robust interest points in colour-hSURF and gradient-hSURF. The face image is first divided into different patches, and then a colour histogram or oriented gradient histogram [22] with L bins is built for each patch, based on which a weighted histogram h(x, y) is constructed from the pixels in the neighbourhood Ω(x, y) of each pixel location (x, y). The kth bin of h(x, y), denoted by h_k(x, y), is computed using (4).

h_k(x, y) = \frac{1}{Z} \sum_{\substack{(x_i, y_i) \in \Omega(x, y) \\ b(x_i, y_i) = k}} w(x_i - x,\, y_i - y)    (4)

where Z is a normalising parameter, w(x_i − x, y_i − y) is a Gaussian weighting function of the form w(x, y) = e^{-(x^2 + y^2)/\sigma^2}, and b(x_i, y_i) is the discrete quantity derived from the colour or oriented gradient of the image.

To estimate whether a point (x, y) is an interest point, the Bhattacharyya coefficient ρ [23] is used to evaluate the similarity between h(x, y) and the weighted histogram h(x + Δx, y + Δy) of its shifted pixel. The smaller ρ is, the more different h_k(x, y) and h_k(x + Δx, y + Δy) are. Therefore, the pixel locations that correspond to the local minima of ρ are selected as the interest points in a patch.

Colour histograms and oriented gradient histograms are used to generate the HIPs. In the colour histogram generation step, we quantise each colour channel (assuming 256 levels) of the colour face image into eight bins, and a colour histogram with 8^3 = 512 bins is obtained.


The quantisation function is given by

b(x, y) = \lfloor R_{x,y}/32 \rfloor \times 8^2 + \lfloor G_{x,y}/32 \rfloor \times 8 + \lfloor B_{x,y}/32 \rfloor + 1    (5)

where 32 is the 256 levels divided by the 8 bins, and R_{x,y}, G_{x,y}, B_{x,y} are the RGB values of pixel (x, y) in the face image. Then, b(x, y) is substituted into the constraint condition of (4), where b(x_i, y_i) = k over the set Ω(x, y).

In the oriented gradient histogram generation step, we utilise intensity gradients to construct the histograms for face images. We quantise the orientation of the gradient into 8 bins, each covering a 45° angle (360° ÷ 8). The magnitude of the gradient is also divided into eight bins, so the resulting histogram contains 64 bins. The magnitude of the gradient provides useful information for interest point detection, and (4) can accordingly be changed to (6), where ||g(x_i, y_i)|| is the magnitude of the gradient at pixel (x_i, y_i) and α is a scaling parameter.

h_k(x, y) = \frac{1}{Z} \sum_{\substack{(x_i, y_i) \in \Omega(x, y) \\ b(x_i, y_i) = k}} w(x_i - x,\, y_i - y)\, \lVert g(x_i, y_i) \rVert^{\alpha}    (6)

Once the interest points are detected, colour-hSURF and gradient-hSURF describe them with the SURF descriptor introduced in Section 2.2.1.
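The following sketch illustrates the colour quantisation of (5), the weighted histogram of (4) and the Bhattacharyya test used for interest-point selection; the neighbourhood radius, σ and the shift (Δx, Δy) are illustrative assumptions rather than the exact settings of [10, 21], and the input is assumed to be an RGB image stored as an H × W × 3 array:

import numpy as np

def colour_bin(pixel):
    # Quantisation function (5): map an (R, G, B) pixel with 256 levels per channel
    # to one of 8^3 = 512 bins (32 = 256 levels / 8 bins).
    r, g, b = (int(c) // 32 for c in pixel)
    return r * 64 + g * 8 + b + 1

def weighted_histogram(image, x, y, radius=4, sigma=2.0, n_bins=512):
    # Weighted colour histogram h(x, y) of (4) over the neighbourhood Omega(x, y).
    h = np.zeros(n_bins + 1)
    for yi in range(max(0, y - radius), min(image.shape[0], y + radius + 1)):
        for xi in range(max(0, x - radius), min(image.shape[1], x + radius + 1)):
            w = np.exp(-((xi - x) ** 2 + (yi - y) ** 2) / sigma ** 2)  # Gaussian weight
            h[colour_bin(image[yi, xi])] += w
    return h / max(h.sum(), 1e-12)  # the 1/Z normalisation

def shifted_bhattacharyya(image, x, y, dx=2, dy=2):
    # Bhattacharyya coefficient between h(x, y) and the histogram of a shifted pixel;
    # pixels whose coefficient is a local minimum within a patch are kept as interest points.
    h1 = weighted_histogram(image, x, y)
    h2 = weighted_histogram(image, x + dx, y + dy)
    return float(np.sum(np.sqrt(h1 * h2)))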

3 Facial image similarity measurement

A face video sequence F = {f_1, …, f_t, …, f_N} of a single subject is regarded as the probe sample, where N is the total number of face frames in the sequence F and f_t is the tth face frame in F. There are K different enrolled subjects in the gallery, and one face image of each subject constitutes the gallery set S = {s_1, …, s_k, …, s_K}.

After feature extraction, the facial image x_t detected from a frame f_t of a probe sequence F is represented either by dense local features or by sparse local features, based on which its similarity with the enrolled images in the gallery set S is calculated for identification. Different similarity measures are adopted for dense and for sparse local features; we introduce them in detail below.

3.1 Face similarity measure based on dense local features

Specifically, for dense local features, x_t is described as an LBP, TPLBP or FPLBP vector, and its similarity with the samples in S is measured by the Euclidean distance in (7)

D_i(x_t, s_i) = \sqrt{\lVert x_t - s_i \rVert^2}, \qquad i = 1, 2, \ldots, K    (7)

where a smaller D_i indicates a higher similarity. We then normalise it and change its polarity by (8)

\bar{D}_i = \log\left( 1 + \frac{\max(D_i) - D_i}{\max(D_i) - \min(D_i)} \right)    (8)


where max(D_i) and min(D_i) are the largest and smallest distances, respectively, between x_t and the K enrolled faces in the gallery set, and a bigger \bar{D}_i indicates a higher similarity.
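A minimal sketch of the dense-feature similarity of (7) and (8), assuming that the probe and gallery faces are represented by feature vectors of equal length (the names and shapes below are our own choices):

import numpy as np

def dense_similarities(probe_vec, gallery_vecs):
    # Convert Euclidean distances (7) to normalised similarity scores (8).
    # probe_vec:    1-D feature vector of the probe face x_t.
    # gallery_vecs: K x d matrix, one row per enrolled face s_i.
    d = np.linalg.norm(gallery_vecs - probe_vec, axis=1)        # D_i in (7)
    d_max, d_min = d.max(), d.min()
    if d_max == d_min:                                          # degenerate case: all distances equal
        return np.zeros_like(d)
    return np.log(1.0 + (d_max - d) / (d_max - d_min))          # bar D_i in (8): larger = more similar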

3.2 Face similarity measure based on sparse local features

For sparse local features, given the probe facial image x_t and the gallery facial image s_k, we generate their descriptor sets I^x = {i^x_1, …, i^x_M} and I^s = {i^s_1, …, i^s_N}, where M and N are the numbers of interest points detected in x_t and s_k, respectively, and i is the SURF, colour-hSURF or gradient-hSURF feature descriptor extracted at these points.

We associate the corresponding descriptors of x_t and s_k according to the strategy proposed in [11, 12]. For each descriptor i^x_m of x_t, a similarity score d_{m,n} is computed between it and each descriptor i^s_n of s_k using (9)

d_{m,n} = \left[ i^x_m - i^s_n \right] \left[ i^x_m - i^s_n \right]^{\mathrm{T}}    (9)

The smaller d_{m,n} is, the more similar i^x_m and i^s_n are. After all the similarity scores {d_{m,1}, …, d_{m,N}} have been calculated for a descriptor i^x_m, the ratio between the smallest score d_{m,p} and the second smallest score d_{m,q} is obtained. If the ratio is less than a pre-defined threshold (0.8 in our case), i^x_m and i^s_p are considered a match; otherwise they are considered non-matching and we set d_{m,p} = 1. The more descriptor pairs are matched, the more similar the two faces are. Therefore, the similarity between x_t and s_k is defined by the product rule

D_{x,s} = \prod_{p=1}^{M} \prod_{q=1}^{N} d_{p,q}    (10)

Fig. 3 Process of temporal–spatial face recognition


where the smaller D_{x,s} is, the more similar x_t and s_k are. Since D_{x,s} may be extremely small, we take the negative logarithm of D_{x,s} to measure the similarity of x_t and s_k, that is

\bar{D}_{x,s} = -\log(D_{x,s})    (11)

where a larger \bar{D}_{x,s} indicates that the face pair is more similar.
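The matching and scoring of (9)–(11) can be sketched as follows; the 0.8 threshold follows [11, 12], and, as a simplifying assumption, only the best score of each probe descriptor enters the product, with non-matching descriptors contributing a factor of 1:

import numpy as np

def sparse_similarity(probe_desc, gallery_desc, ratio=0.8):
    # Ratio-test matching (9) followed by the product-rule score (10)-(11).
    # probe_desc:   M x d array of descriptors of the probe face x_t.
    # gallery_desc: N x d array of descriptors of a gallery face s_k.
    # Returns bar D_{x,s} = -log(product of matched scores); larger means more similar.
    if probe_desc is None or gallery_desc is None or len(gallery_desc) < 2:
        return 0.0
    log_product = 0.0
    for ix in probe_desc:
        diff = gallery_desc - ix
        scores = np.einsum('nd,nd->n', diff, diff)       # d_{m,n} of (9)
        order = np.argsort(scores)
        best, second = scores[order[0]], scores[order[1]]
        if second > 0 and best / second < ratio:         # Lowe-style ratio test
            log_product += np.log(max(best, 1e-12))      # matched pair keeps its score
        # a non-matching best pair contributes d_{m,p} = 1, i.e. log 1 = 0
    return -log_product                                  # (11)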

4 Temporal–spatial description in video face recognition

Video-based face recognition involves more information than still image-based recognition; the main difference lies in the temporal information of adjacent frames and the spatial similarity of adjacently detected faces. To utilise the temporal–spatial information of a face video sequence, we propose a novel method to extract temporal–spatial information for video-based face recognition using a Markov process [19]. Fig. 3 illustrates the general process of our temporal–spatial face-recognition framework based on first-order Markov processing.

Specifically, as mentioned in Section 3, we can generate the probability that the identity of x_t is assigned to s_k in the gallery using the distance-based similarity measurement, so that it can be used in the proposed temporal–spatial framework as

P(x_t = s_k \mid f_t) = \frac{\bar{D}_k(x_t, s_k)}{\sum_{l=1,\ldots,K} \bar{D}_l(x_t, s_l)}    (12)

Equation (12) is used for dense local matching, whereas (13) is used for sparse local matching.


Table 1 Database introduction

Database               Subjects   Variations
UCSD/Honda             20         pose
VidTIMIT (pose)        43         pose
VidTIMIT (expression)  43         expression variation
ChokePoint             29         surveillance



P(x_t = s_k \mid f_t) = \frac{\bar{D}_{x,k}}{\sum_{l=1,\ldots,K} \bar{D}_{x,l}}    (13)

It should be noted that if a \bar{D} value is 0, we set it to the second smallest value so that the single-frame probability is not 0. In our temporal–spatial framework, we compute the probability that the identity of x_t is assigned to s_k in the gallery as follows

P(x_t = s_k \mid f_{1:t}) = P(x_t = s_k \mid f_t)\, P(x_t = s_k \mid f_{1:t-1})    (14)

At initialisation, P(x_t = s_k \mid f_{1:1}) = P(x_t = s_k \mid f_1). For the face area detected in the tth frame, we compute its probabilities with respect to all the subjects in the gallery and assign the probe the identity of the enrolled subject with the largest probability, that is

F_{x_t} = \arg\max_{k=1,\ldots,K} P(x_t = s_k \mid f_{1:t})    (15)
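A sketch of the recursive identity propagation of (12)–(15) is given below; each element of frame_scores_sequence is assumed to hold the K non-negative similarity scores of Section 3 for one frame, and the zero-score safeguard mentioned above is included:

import numpy as np

def temporal_spatial_identify(frame_scores_sequence):
    # Propagate per-frame identity probabilities along a face sequence.
    # frame_scores_sequence: iterable of length-K arrays holding the similarity
    # scores (bar D values) of the detected face against the K gallery subjects.
    # Yields the identity decision (15) after every frame.
    posterior = None
    for scores in frame_scores_sequence:
        s = np.asarray(scores, dtype=np.float64).copy()
        if not (s > 0).any():
            s[:] = 1.0                          # uninformative frame: uniform scores
        elif (s == 0).any():
            s[s == 0] = s[s > 0].min()          # avoid a zero single-frame probability
        per_frame = s / s.sum()                 # (12) or (13)
        posterior = per_frame if posterior is None else per_frame * posterior  # (14)
        posterior /= posterior.sum()            # keep sum_l P(x_t = s_l | f_1:t) = 1
        yield int(np.argmax(posterior))         # (15)

In a live system, the frames would be fed in one at a time and the iteration stopped once the leading posterior approaches 1, as illustrated in Fig. 8.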

5 Empirical evaluation

In this part, we evaluated the performance of different real-time local features on three video face databases, in the presence of head pose variations, expression changes and real-world surveillance conditions. In our experiments, we set the number of sampling points to 8 and the radius to 1 (an empirical setting obtained during training). We divided the face images into blocks of 12 × 10 pixels to extract the LBP, TPLBP and FPLBP features. For the parameters of TPLBP and FPLBP, we used the values given in [5]. For SURF, colour-hSURF and gradient-hSURF, we set the matching parameter to 0.8 as in [11, 12]. The dimension of the SURF descriptor was set to 64 in order to improve processing efficiency.
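The block-wise dense representation can be sketched as follows (using the LBP implementation of scikit-image as a stand-in, with 8 sampling points and radius 1; the handling of border pixels left over when a 100 × 100 face is tiled into 12 × 10-pixel blocks is our own assumption):

import numpy as np
from skimage.feature import local_binary_pattern

def blockwise_lbp_descriptor(face, block_h=12, block_w=10, n_points=8, radius=1):
    # Concatenate per-block LBP histograms into one dense face descriptor;
    # border pixels that do not fill a complete block are ignored in this sketch.
    codes = local_binary_pattern(face, n_points, radius, method='default')
    hists = []
    h, w = face.shape
    for y in range(0, h - block_h + 1, block_h):
        for x in range(0, w - block_w + 1, block_w):
            block = codes[y:y + block_h, x:x + block_w]
            hist, _ = np.histogram(block, bins=256, range=(0, 256))
            hists.append(hist / block.size)
    return np.concatenate(hists)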

5.1 Video face database introduction

In this paper, three databases were used to test the performance of the different local features with regard to pose variations, expression changes and the surveillance video situation. Face sequences were automatically detected by three detectors: one was directly borrowed from OpenCV [24] to detect faces at poses of less than 25°, and the other two were trained by ourselves to detect faces at 30–75° and 75–90°, respectively. To produce these two detectors, for each of them we collected around 10 K positive samples from the Internet, which were then normalised to 24 × 24, while 1 million non-face images were used as negative samples; the AdaBoost [25] algorithm was used for training. Additionally, we designed a hierarchical framework to combine the three face detectors so that system efficiency is not largely degraded. It starts with the frontal detector (for less than 25°); if there is no face candidate, the second detector (30–75°) is launched, and if there are still no candidates, the last one (75–90°) is applied. When all the detectors fail to detect any face, we discard the frame (less than 3% of frames). The detected faces were scaled to 100 × 100 pixels.

We selected 50 frames sequentially from the first sequence of a subject for testing, and five frames sequentially from the second sequence of the same subject for training. To focus only on comparing the discriminative power of the different local descriptors, we adopted the nearest neighbour (NN) rule as the final classifier. In the recognition process, we discard frames in which no face area is located. The total number of frames with detected faces in the test video sequences is denoted as N_total, and the number of faces that are correctly recognised is denoted as N_corr. The final recognition rate is thus computed as N_corr/N_total.
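The hierarchical detection and the recognition-rate computation described above could be organised as in the sketch below; the frontal model ships with OpenCV, whereas the two non-frontal cascade files stand for the self-trained detectors and are placeholder names only:

import cv2

# The frontal model is distributed with OpenCV; the two pose-specific cascade files
# stand for the self-trained AdaBoost detectors (30-75 and 75-90 degrees) and are
# placeholder names, not files shipped with any library.
CASCADE_FILES = [
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml',  # < 25 degrees
    'cascade_pose_30_75.xml',                                       # 30-75 degrees (assumed)
    'cascade_pose_75_90.xml',                                       # 75-90 degrees (assumed)
]

def detect_face_hierarchical(gray_frame):
    # Try the detectors in order and stop at the first one that returns a candidate;
    # the detected face is scaled to 100 x 100 pixels, None means the frame is discarded.
    for path in CASCADE_FILES:
        faces = cv2.CascadeClassifier(path).detectMultiScale(
            gray_frame, scaleFactor=1.1, minNeighbors=3)
        if len(faces) > 0:
            x, y, w, h = faces[0]
            return cv2.resize(gray_frame[y:y + h, x:x + w], (100, 100))
    return None

def recognition_rate(n_corr, n_total):
    # Final recognition rate N_corr / N_total over the frames with a detected face.
    return n_corr / n_total if n_total else 0.0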

The first database is the Honda/UCSD video database [26], containing 20 subjects, each of which possesses a training set and a test set. Each subject shows significant 2D (in-plane) and 3D (out-of-plane) head rotations. The face detection rate is 85%, and the detected faces contain large pose variations; we therefore use this database to test our work under pose changes.

The second database is the multi-modal VidTIMIT database [27], which was collected in an indoor environment. It contains 43 subjects, each with 13 sequences, in which the subject rotates the head and recites short sentences with expressions. The videos can thus be divided into two parts: head pose changes and expression variations. The face detection rate is 90% on the pose part, and the detected faces contain large pose variations, so we use this part to test our work under pose changes. The face detection rate is nearly 100% on the expression part, so we use it to test our work under expression changes.

The third database is a surveillance video database named ChokePoint [28], which can be used for face recognition under real-world surveillance conditions. The dataset consists of 29 subjects; videos of each subject entering or leaving a portal were captured by three cameras at a frame rate of 30 fps and an image resolution of 800 × 600 pixels. The ChokePoint database supplies face detection results, so no detection is needed. Faces in the recorded videos vary in illumination conditions, pose and sharpness owing to the three-camera configuration. On this database, we test the performance of the different features in a nearly practical application situation.

To test the performance of the different local features and the effectiveness of the temporal–spatial cues, we divided the three databases into five parts. Table 1 gives an introduction to the testing databases in detail. The UCSD-Honda database and the pose part of VidTIMIT are used to test the performance of the features under head pose variations in video sequences. The expression part of VidTIMIT is used to test the performance of the features under expression changes in video sequences. The ChokePoint database is used to test the performance of the features under real-world surveillance conditions.

5.2 Performance of real-time local features

In this part, the performance of LBP, TPLBP, FPLBP, SURF, colour-hSURF and gradient-hSURF was tested on the Honda/UCSD, VidTIMIT and ChokePoint video databases without using temporal–spatial information, observing their robustness to pose changes, expression variations and the surveillance condition.


Fig. 4 Face recognition results of LBP, TPLBP, FPLBP, SURF, colour-hSURF and gradient-hSURF

Table 2 Repeatability rate of interest points detected by local sparse features

Feature             Colour-hSURF   Gradient-hSURF   SURF
repeatability rate  44.88%         82.96%           76.02%



5.2.1 Recognition results of local features: From Fig. 4, we can see that among the dense matching features, LBP achieves better performance than TPLBP and FPLBP on the tested databases in most situations, indicating that the simpler feature can be more suitable for video-based face recognition. Among the sparse matching features, gradient-hSURF outperforms SURF and colour-hSURF; since these three sparse local features use the same local descriptor, this shows that gradient-hSURF localises more useful interest points that are robust to pose, expression and surveillance conditions. In general, it is difficult to conclude which feature performs best overall, and we therefore discuss them in terms of the different application situations.

Under pose changes, none of the evaluated real-time features reaches a recognition rate of 80%, showing that pose variations greatly degrade the performance of real-time feature-based face-recognition systems in videos. Among these features, LBP performs slightly better than the others.

Regarding expression variations in videos, all the local features achieve acceptable results (above 80%). TPLBP performs the best, while LBP and gradient-hSURF also exceed 90%.

In the surveillance environment, the recognition rates decrease largely compared with those under pose variations and expression changes, showing that the surveillance application is the most challenging for video face recognition. From this experiment, we can also see that the sparse matching features present an obvious advantage over the others, indicating that gradient-hSURF is more robust in surveillance conditions.



5.2.2 Performance of sparse matching features with sample discard: Recall that sparse matching involves two key steps, keypoint detection and description, and in this experiment we continue to investigate their impact on the final performance. Since SURF, colour-hSURF and gradient-hSURF use the same SURF descriptor, the main difference among them is caused by their different keypoint detectors. The repeatability of keypoints is an important criterion for measuring the efficiency of local sparse matching features [11, 29]: if repeatable interest points can be detected in sufficient numbers from face images, satisfying results can be expected. Table 2 gives the repeatability rate of interest points detected in 21 aligned faces from the FRGC v2.0 dataset [30] by the different local sparse features, and we can see that gradient-hSURF is the most stable feature, leading to better performance in the face-recognition application.

Fig. 5 shows the scattering of the interest points in these aligned faces. We can see that most of the interest points extracted by SURF, colour-hSURF and gradient-hSURF are located around the eyes, nose and mouth. On the other hand, for a face image, the most important interest points are generally around the left inner eye corner, left outer eye corner, right inner eye corner, right outer eye corner, the left and right sides of the nose, the nose tip, the left and right sides of the mouth and the mouth centre [31].


Fig. 5 Scattering of interest points detected by SURF, colour-hSURF, gradient-hSURF, respectively

a SURF; b Colour-hSURF; c Gradient-hSURF



Table 3 Kept rate of samples after interest-point filtering

Database               Colour-hSURF, %   Gradient-hSURF, %   SURF, %
UCSD-Honda             99.10             95.90               72.40
VidTIMIT (pose)        80.70             90.56               85.49
VidTIMIT (expression)  94.47             92.65               95.67
ChokePoint             98.41             78.55               50.83

Fig. 6 Face recognition results of SURF, colour-hSURF and gradient-hSURF on the original databases and on the databases with samples discarded


If very few interest points are detected by the local sparse features in a face sample, matching cannot be well established between faces, so we can improve the efficiency of the recognition system by discarding samples with too few detected interest points. Since the overall repeatability rate of the points is about 80%, in order to decrease the impact of wrongly located interest points, we threshold the number of detected interest points: if fewer interest points than the threshold are detected in an image, we discard that face image. In Table 3, we list the kept rate of the different sparse matching features on the different video databases after the inferior face samples are discarded.
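A sketch of the sample-discard strategy follows; the threshold value is an assumption for illustration, not the one used to produce Table 3:

def filter_faces_by_keypoints(faces, detect_keypoints, min_keypoints=10):
    # Keep only the face images with enough detected interest points.
    # faces:            list of face images from a probe sequence.
    # detect_keypoints: function returning the interest points of one image
    #                   (for example the SURF or HIP detectors of Section 2.2).
    # Returns the kept faces and the kept rate reported in Table 3.
    kept = [face for face in faces if len(detect_keypoints(face)) >= min_keypoints]
    kept_rate = len(kept) / len(faces) if faces else 0.0
    return kept, kept_rate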


Table 4 The representative features and their recognition results on the Honda/UCSD, VidTIMIT and ChokePoint video databases

Database               LBP, %   TPLBP, %   Gradient-hSURF, %
UCSD-Honda             78.70    71.20      71.70
VidTIMIT (pose)        67.58    60.98      57.77
VidTIMIT (expression)  92.84    95.95      92.00
ChokePoint             43.38    46.28      59.17


In Fig. 6, the tags SURF, colour-hSURF and gradient-hSURF denote the recognition rates of the corresponding features on the video databases, whereas the tags SURF-discard, colour-hSURF-discard and gradient-hSURF-discard denote the results based on the sample-discard strategy of Table 3. From Fig. 6, first, we can see that the recognition results of SURF with samples discarded are largely improved compared with the original SURF on the video databases. Secondly, nearly 20% of the face samples in the pose part of the VidTIMIT database are discarded by colour-hSURF, as shown in Table 3; however, the recognition results of colour-hSURF with samples discarded barely change compared with those of the original colour-hSURF. Moreover, the repeatability of the interest points of colour-hSURF is lower than that of SURF and gradient-hSURF, which gives colour-hSURF worse recognition results under surveillance and expression changes compared with gradient-hSURF, as in Fig. 4. Finally, the recognition results of gradient-hSURF with samples discarded decrease considerably compared with those of the original gradient-hSURF on the ChokePoint database, and decrease slightly on the UCSD/Honda and VidTIMIT databases. On the other hand, Table 3 shows that gradient-hSURF discards about 20% of the samples on the ChokePoint database but less than 10% on the UCSD/Honda and VidTIMIT databases. This suggests that the samples discarded by gradient-hSURF contain few but discriminative interest points for matching. Overall, gradient-hSURF is a robust local feature and can achieve better performance under surveillance conditions.

Fig. 7 Result comparison of the benchmark features and the combination of features with temporal–spatial information


5.3 Effectiveness of temporal–spatial information

In this part, we choose the representative features that achieve excellent results on these databases to highlight the effectiveness of the temporal–spatial information.

5.3.1 Best results of local features on the Honda/UCSD, VidTIMIT and ChokePoint video databases: The recognition performance of the local features is shown in Figs. 4 and 6. We can see that LBP and TPLBP achieve better results than FPLBP for dense matching, whereas gradient-hSURF achieves the best and most stable performance for sparse matching. Although the performance of SURF is largely improved by the sample-discard strategy, its results are still slightly inferior to those of gradient-hSURF. Therefore, we choose LBP, TPLBP and gradient-hSURF as the representative dense and sparse matching features and use their recognition results as the benchmark for testing the performance when combined with the temporal–spatial information. The accuracies of these features on the Honda/UCSD, VidTIMIT and ChokePoint video databases are shown in Table 4.

5.3.2 Combination of local features with temporal–spatial information: The comparison between the benchmark features and the same features combined with temporal–spatial cues is displayed in Fig. 7. We can see that the recognition rate is greatly improved via the utilisation of temporal–spatial information: utilising the temporal–spatial cues in the proposed way boosts the recognition results in video-based face recognition. The procedure and details of the temporal–spatial information for subjects in the different video databases are shown in Fig. 8.

In Fig. 8, we randomly choose some subjects from the UCSD-Honda, VidTIMIT and ChokePoint databases and show the maximum posterior probabilities of LBP, TPLBP and gradient-hSURF, respectively (from left to right), on UCSD-Honda, the pose part of VidTIMIT, the expression part of VidTIMIT and ChokePoint (from top to bottom).


Fig. 8 Changes of P(x_t = s_k | f_{1:t}) as the number of iterations increases, obtained by combining temporal–spatial information with the local features on the respective video databases


We can see that the maximum posterior probabilities P(x_t = s_k | f_{1:t}) change across the frames of a sequence through the temporal information of the different features, while the spatial information of the individual frames is contained in the posterior probabilities P(x_t = s_k | f_{1:t}). We can also see that the maximum posterior probabilities of most subjects change as the number of iterations increases on these testing databases.

Table 5 Average processing time of different methods in face recognition

Local feature  LBP      TPLBP    FPLBP    SURF     Colour-hSURF   Gradient-hSURF
avg. (s)       0.0015   0.0058   0.0067   0.0148   0.0372         0.0252


In the beginning, the maximum posterior probabilities P(x_t = s_k | f_{1:t}) are very small, but after only a few iterations they increase to values close to 1, which terminates the iterative recognition process.



Note that for the sequence of a subject s_k, when P(x_t = s_k | f_{1:t}) increases, the posterior P(x_t = s_j | f_{1:t}) of any other subject s_j (j ≠ k) in the test video database decreases, because \sum_{l=1,\ldots,K} P(x_t = s_l \mid f_{1:t}) = 1. We do not plot the posterior probabilities P(x_t = s_j | f_{1:t}) (j ≠ k) in the figures for brevity.

5.4 Average processing time consumption

We evaluated the computational cost of LBP, TPLBP, FPLBP, SURF, colour-hSURF and gradient-hSURF on an Intel Core(TM) i5 2.53 GHz processor using the Visual Studio platform for all the experiments. Table 5 gives the average feature extraction and matching time of each feature. We can see that even the slowest of these features can be processed about 27 times within one second. Standard video normally has 24 frames per second; therefore, we achieve real-time video face processing by utilising these features in video face-recognition systems.
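Average per-frame processing time can be measured with a loop of the following kind (a generic sketch; extract_and_match stands for whichever feature pipeline is being timed):

import time

def average_processing_time(frames, extract_and_match):
    # Average feature extraction plus matching time per frame, as reported in Table 5.
    start = time.perf_counter()
    for frame in frames:
        extract_and_match(frame)
    return (time.perf_counter() - start) / max(len(frames), 1)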

6 Discussion and conclusion

In this paper, we evaluated two types of real-time local features, dense matching and sparse matching ones, including LBP, TPLBP, FPLBP, SURF, colour-hSURF and gradient-hSURF, on different video face databases. From the experimental results, the dense matching features achieve better performance under head pose variations and expression changes, but in the surveillance and low-resolution situation, the representative of the sparse matching family, gradient-hSURF, performs steadily better. Meanwhile, the recognition results of SURF can be improved by discarding weak samples in which few interest points are detected.

When combining the temporal–spatial cues extracted by the proposed method, the TPLBP and gradient-hSURF features, which have small within-class variations and large between-class variations, achieve the greatest gains in performance, indicating that temporal–spatial information also plays an important role in video-based face recognition, especially for features such as TPLBP and gradient-hSURF.

In the future, we will investigate more powerful classifiers together with real-time local features such as TPLBP and gradient-hSURF to improve the performance of video-based face-recognition systems.

7 Acknowledgment

This work was supported in part by the National Basic Research Program of China (2010CB327902); the National Natural Science Foundation of China (NSFC) under grants No. 61202237, No. 61273263 and No. 61061130560; the Specialised Research Fund for the Doctoral Program of Higher Education (No. 20121102120016); and the Fundamental Research Funds for the Central Universities.

8 References

1 Ahonen, T., Hadid, A., Pietikainen, M.: 'Face description with local binary patterns: application to face recognition', IEEE Trans. Pattern Anal. Mach. Intell., 2006, 28, (12), pp. 297–301

2 Sande, K., Gevers, T., Snoek, C.: 'Evaluating color descriptors for object and scene recognition', IEEE Trans. Pattern Anal. Mach. Intell., 2009, 32, (9), pp. 1582–1596

3 Ahonen, T., Hadid, A., Pietikainen, M.: 'Face recognition with local binary patterns'. Proc. European Conf. Computer Vision, 2004, pp. 469–481


4 Tan, X., Triggs, B.: 'Enhanced local texture feature sets for face recognition under difficult lighting conditions', IEEE Trans. Image Process., 2010, 19, (6), pp. 1635–1650

5 Wolf, L., Hassner, T., Taigman, Y.: 'Descriptor based methods in the wild'. Proc. European Conf. Computer Vision, 2008, pp. 1–14

6 Zhang, L., Chu, R., Xiang, S., Liao, S., Li, S.Z.: 'Face detection based on multi-block LBP representation'. Proc. Int. Conf. Biometrics, 2007, pp. 11–18

7 Huang, D., Shan, C., Ardabilian, M., Wang, Y., Chen, L.: 'Local binary patterns and its application to facial image analysis: a survey', IEEE Trans. Syst. Man Cybern., Pt. C, 2011, 41, (6), pp. 765–781

8 Liu, C., Wechsler, H.: 'A Gabor feature classifier for face recognition'. Proc. Int. Conf. Computer Vision, 2001, pp. 270–275

9 Ozuysal, M., Calonder, M., Lepetit, V., Fua, P.: 'Fast keypoint recognition using random ferns', IEEE Trans. Pattern Anal. Mach. Intell., 2010, 32, (3), pp. 448–461

10 Gou, G., Huang, D., Wang, Y.: 'A hybrid local feature for face recognition'. Proc. Pacific Rim Int. Conf. Artificial Intelligence, 2012, pp. 64–75

11 Lowe, D.G.: 'Distinctive image features from scale-invariant keypoints', Int. J. Comput. Vis., 2004, 60, pp. 91–110

12 Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: 'SURF: speeded up robust features', Comput. Vis. Image Understand., 2008, 110, (3), pp. 346–359

13 Mikolajczyk, K., Schmid, C.: 'A performance evaluation of local descriptors'. Proc. IEEE Int. Conf. Computer Vision, 2003, pp. 257–263

14 Liu, X., Chen, T., Thornton, S.M.: 'Eigenspace updating for non-stationary process and its application to face recognition', Pattern Recognit., 2003, 36, (9), pp. 1945–1959

15 Zhou, S., Krueger, V., Chellappa, R.: 'Face recognition from video: a condensation approach'. Proc. Int. Conf. Automatic Face and Gesture Recognition, 2002, pp. 221–226

16 Aggarwal, G., Chowdhury, A.K.R., Chellappa, R.: 'A system identification approach for video-based face recognition'. Proc. Int. Conf. Pattern Recognition, 2004, pp. 175–178

17 Tang, X., Li, Z.: 'Audio-guided video-based face recognition', IEEE Trans. Circuits Syst. Video Technol., 2009, 19, (7), pp. 955–964

18 Stallkamp, J., Ekenel, H., Stiefelhagen, R.: 'Video-based face recognition on real-world data'. Proc. Int. Conf. Computer Vision, 2007, pp. 1–8

19 Gou, G., Shen, R., Wang, Y., Basu, A.: 'Temporal-spatial face recognition using multi-atlas and Markov process model'. Proc. Int. Conf. Multimedia and Expo, 2011, pp. 1–4

20 Timo, O., Pietikainen, M., Maenpaa, T.: 'Multiresolution gray-scale and rotation invariant texture classification with local binary patterns', IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24, (7), pp. 971–987

21 Lee, W., Chen, H.: 'Histogram-based interest point detectors'. Proc. Conf. Computer Vision and Pattern Recognition, 2009, pp. 1590–1596

22 Dalal, N., Triggs, B.: 'Histograms of oriented gradients for human detection'. Proc. Conf. Computer Vision and Pattern Recognition, 2005, pp. 886–893

23 Comaniciu, D., Ramesh, V., Meer, P.: 'Real-time tracking of non-rigid objects using mean shift'. Proc. Conf. Computer Vision and Pattern Recognition, 2000, pp. 142–149

24 Othman, H., Aboulnasr, T.: 'A separable low complexity 2D HMM with application to face recognition', IEEE Trans. Pattern Anal. Mach. Intell., 2003, 25, (10), pp. 1229–1238

25 Viola, P., Jones, M.: 'Robust real-time face detection'. Proc. Int. Conf. Computer Vision, 2001, pp. 590–595

26 Lee, K.C., Ho, J., Yang, M.H., Kriegman, D.: 'Visual tracking and recognition using probabilistic appearance manifolds', Comput. Vis. Image Understand., 2005, 99, (3), pp. 303–331

27 Sanderson, C., Paliwal, K.K.: 'Polynomial features for robust face authentication'. Proc. Int. Conf. Image Processing, 2002, pp. 997–1000

28 Wong, Y., Chen, S., Mau, S., Sanderson, C., Lovell, B.C.: 'Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition'. Proc. Computer Vision and Pattern Recognition Workshops, 2011, pp. 74–81

29 Mian, A., Bennamoun, M., Owens, R.: 'On the repeatability and quality of keypoints for local feature-based 3D object retrieval from cluttered scenes', Int. J. Comput. Vis., 2010, 89, (2), pp. 348–361

30 Phillips, P.J., Flynn, T., Scruggs, T., et al.: 'Overview of the face recognition grand challenge'. Proc. Computer Vision and Pattern Recognition, 2005, pp. 947–954

31 Biswas, S., Aggarwal, G., Flynn, P.J.: 'Face recognition in low-resolution videos using learning-based likelihood measurement model'. Proc. Int. Joint Conf. Biometrics, 2011, pp. 1–7
