
Image and Vision Computing 29 (2011) 607–619


Facial expression recognition from near-infrared videos☆

Guoying Zhao a,⁎, Xiaohua Huang a,c, Matti Taini a, Stan Z. Li b, Matti Pietikäinen a

a Machine Vision Group, Infotech Oulu and Department of Computer Science and Engineering, P. O. Box 4500, FI-90014 University of Oulu, Finland
b Institute of Automation, Chinese Academy of Sciences, P. O. Box 95 Zhongguancun Donglu, Beijing 100080, China
c Research Center for Learning Science, Southeast University, Nanjing, 210096, China

☆ This paper has been recommended for acceptance by Ioannis A. Kakadiaris.
⁎ Corresponding author. Tel.: +358 8 553 7564.

E-mail addresses: [email protected] (G. Zhao), [email protected] (X. Huang), [email protected] (S.Z. Li), [email protected] (M. Pietikäinen).

0262-8856/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.imavis.2011.07.002

Article info

Article history: Received 22 October 2010; Received in revised form 14 June 2011; Accepted 1 July 2011

Keywords: Facial expression recognition; Spatiotemporal descriptors; Near-infrared (NIR); Visible light (VIS); Component-based facial features

Abstract

Facial expression recognition aims to determine the emotional state of the face regardless of its identity. Most of the existing datasets for facial expressions are captured in the visible light spectrum. However, visible light (VIS) can change with time and location, causing significant variations in appearance and texture. In this paper, we present novel research on dynamic facial expression recognition using near-infrared (NIR) video sequences and LBP-TOP (local binary patterns from three orthogonal planes) feature descriptors. NIR imaging combined with LBP-TOP features provides an illumination invariant description of face video sequences. Appearance and motion features in slices are used for expression classification, and for this, discriminative weights are learned from training examples. Furthermore, component-based facial features are presented to combine geometric and appearance information, providing an effective way of representing facial expressions. Experimental results of facial expression recognition using a novel Oulu-CASIA NIR&VIS facial expression database, a support vector machine and sparse representation classifiers show good and robust results against illumination variations. This provides a baseline for future research on NIR-based facial expression recognition.


1. Introduction

Mehrabian indicated that the facial expression of the speaker contributes 55% to the effect of the spoken message, which is more than the verbal part (7%) and the vocal part (38%) [21]. This means that facial expressions play a key role in verbal and non-verbal communication. Automated and real-time facial expression recognition would be useful in many kinds of computer vision applications, e.g. human–computer interaction and biometrics. In many cases, facial expression recognition research is based on static images [4,6,24,26], unlike dynamic facial expression recognition, in which the appearance and motion information of the whole video sequence can be utilized as well. There have been significant efforts on developing methods for facial expression recognition. Most of the facial expression data sets currently in use are captured in the visible light spectrum [41]. In the real world, different environments and times can cause significant variations between images. Adini et al. explored the effect of illumination changes on face recognition [1]. Lighting conditions, and light angles in particular, change the appearance of the face in a significant way. An intra-personal change under different illumination conditions can be larger than an extra-personal change under similar lighting conditions when unprocessed images are compared. The influence of illumination can also be seen in the Face Recognition Vendor Test [7]. Uncontrolled environmental illumination is an important issue to be solved for reliable facial expression recognition.

A facial expression recognition system should adapt to the environment, not vice versa. However, uncontrolled visible (VIS) light (380–750 nm) in ambient conditions can change with location and time, which can cause significant variations in image appearance and texture. The facial expression recognition methods developed thus far perform well under controlled circumstances, but changes in illumination or light angle cause problems for the recognition systems [1]. To meet the requirements of real-world applications, facial expression recognition should be possible in varying illumination conditions and even in near darkness. There is quite a lot of work on effective pre-processing algorithms to deal with illumination changes. Unfortunately, such algorithms are complicated and not very reliable; e.g. for different lighting directions the same preprocessing cannot give satisfying results, and for good lighting conditions such preprocessing loses useful information. Little work has been done on facial expression recognition using images or videos beyond the visible spectrum, such as the near-infrared (NIR) band.

Active NIR imaging (780–1100 nm) is robust to illumination variations, and it has been used successfully for illumination invariant face recognition [18]. The advantages of active NIR imaging over a visible light imaging system were studied by Li et al. [18]. While VIS images of the same subject under different lighting directions can be negatively correlated, active NIR images of the same person under diverse VIS conditions are closely correlated. Because of changes in the lighting intensity, NIR images are subject only to a monotonic transform.

Fig. 1. Block diagram for the proposed methods.


Li et al. used static NIR images and local binary pattern (LBP) features for illumination invariant face recognition [18]. They compensated for the monotonic transform by applying the LBP operator to NIR images, because LBP is invariant with respect to monotonic gray-scale changes. The robustness properties of NIR imaging provide a good basis for facial expression recognition regardless of variations in VIS lighting.

In this paper, we propose a novel method that combines NIR video sequences and LBP-TOP features [37] for illumination invariant facial expression recognition. Although there are already some works about object tracking and classification beyond the visible spectrum, as shown in the Joint IEEE International Workshop on Object Tracking and Classification in and Beyond the Visible Spectrum, to our knowledge this is the first time that NIR video sequences are utilized for facial expression recognition.

Our contributions in this paper are:

1) Combination of an NIR imaging system and LBP-TOP features to generate an illumination invariant system. NIR imaging reduces the illumination variation so that images are subject only to a monotonic transform. LBP-TOP describes appearance and motion and is invariant to monotonic gray-scale changes [37]. Combining them leads to an illumination invariant system for video-based facial expression recognition.

2) Weight assignment for slices. With the LBP-TOP operator, it is furthermore possible to divide each face block into three types of planes or slices, and to set individual weights for each type of slice inside the block volume. As far as we know, this constitutes novel research on setting weights for the slices. In addition to the location information, the slice-based approach also captures the feature type: appearance, horizontal motion or vertical motion, which makes the features more adaptive for dynamic facial expression recognition. Weights can be learned separately for every expression pair. This means that the weighted features are more related to the intra- and extra-class variations of two specific expressions.

3) Component-based representation. Methods combining geometric and appearance features have been considered earlier [41]. For example, Tian et al. [33] proposed the use of facial component shapes and transient features like crow's-feet wrinkles and nasolabial furrows. Zhang and Ji [40] used 26 facial points around the eyes, eyebrows and mouth, and the transient features proposed by Tian et al. In [19,28], several facial regions were first cropped and then PHOG or primitive surface feature distributions were extracted. But these methods used complicated geometric information and the features were only extracted from static images. We propose a simple component-based representation method to combine spatiotemporal geometric and appearance information, providing an effective way of representing facial expressions from videos. Classifier fusion is used for combining the features from different components. The most effective slices in each component are learned to exploit the discriminative information in each expression pair.

4) Novel NIR and VIS facial expression database and baseline results. We have collected a novel Oulu-CASIA NIR&VIS facial expression database containing both NIR and VIS video sequences. It consists of six expressions from 80 people. Experimental results and analysis are included for performance evaluation. The baseline results are provided for future research on NIR-based facial expression recognition. Preliminary results of this research were presented in [29,30].

Fig. 1 shows the block diagram for the proposed methods.

2. NIR imaging for facial expression recognition

Environmental illumination changes have an effect on face recognition in VIS images. Three significant conclusions [1] were drawn from an in-depth study on the influence of illumination changes on face recognition. 1) Lighting conditions, and especially the lighting direction, significantly change the appearance of a face. 2) The changes between the images of one person under different illumination conditions are larger than those between the images of two people under the same illumination, when unprocessed images are compared. 3) All of the studied local filters are by themselves inadequate to overcome variations in the environmental lighting due to changes in the illumination direction. The conclusions above are relevant also for facial expression recognition, because variations in the illumination condition change the appearance of the face. Recognition in uncontrolled indoor illumination conditions is one of the most important problems for practical facial expression recognition systems. It is possible to address this problem with two different approaches. The first tries to normalize illumination variations, and the second takes advantage of different imaging systems, such as the NIR imaging system.

Much work has been done to model and correct illumination changes on faces in VIS images [8,9,11,31]. The idea in illumination normalization is to try to remove unwanted illumination effects from the image, such as non-uniform illumination, highlights, shadowing, aliasing, noise and blurring. One of the disadvantages of illumination normalization is that when effects caused by varying illumination are removed, some useful information, such as facial features, ridges, wrinkles, skin details, local shadowing and shading, will also vanish. Illumination normalization methods are not very reliable, because important information for recognition is lost when removing effects caused by illumination variations. Illumination normalization methods have been shown to improve recognition performance when there are illumination variations in the faces, but they have not led to an illumination invariant face representation due to significant difficulties, especially uncontrolled illumination directions [1,8,9,31]. Thus, this challenge remains an unsolved problem.

Some research has been carried out recently on face imaging beyond the visible spectrum. Electromagnetic spectral bands below the visible spectrum, such as X-rays and ultraviolet radiation, are harmful to the human body and cannot be used in face analysis applications. Thermal infrared imaging is a useful technique for identifying faces under uncontrolled illumination conditions [16,25]. Instability due to environmental temperature is one of the disadvantages of thermal infrared imaging. Experiments showed that a thermal infrared system does not work in practice as well as a system based on VIS images, because elapsed time causes significant changes in the thermal patterns of the same subject [2]. Face recognition performance is enhanced when VIS imagery and thermal infrared imagery are fused [2,27]. This requires co-registration of VIS images and thermal infrared images. As a result of the fusion, the imaging setup becomes more complicated and the computational requirements are higher. While thermal infrared imaging reflects heat radiation, NIR imaging behaves more like normal VIS imaging, although NIR is invisible to the naked eye. NIR imaging brings a new dimension to illumination invariant face representation and recognition [36,42]. An NIR system is more suitable for face recognition than a VIS system when there are changes in the environmental illumination conditions, and especially in the light angle. In VIS images, an intra-personal change due to different lighting directions can be larger than an extra-personal change under similar lighting conditions, while the influence of environmental lighting is reduced considerably with the present NIR imaging system [18]. When the face matching scores under similar lighting directions for intra-personal pairs are lower than those for extra-personal pairs, reliable recognition cannot be achieved in a VIS imaging system, even when an advanced matching engine is used [18]. These results show that a special NIR imaging system is needed to handle uncontrolled and varying environmental lighting.

NIR imaging is robust to illumination variations, and facial expression recognition is possible even in very low lighting. An NIR imaging system is also easier and faster to use than complicated illumination normalization algorithms.

The NIR imaging system used in this paper consists of an NIR camera, a color camera, a camera box and 18 NIR light-emitting diodes (LEDs) mounted on the camera box. Fig. 2 shows the design of the device. The NIR imaging system was used to collect a new facial expression database with both NIR and VIS images.

In the NIR imaging system, two methods are used to control the light direction: 1) active lights are mounted on the camera to provide frontal lighting and 2) environmental lighting is minimized. NIR LEDs are used as active lights. A reasonable wavelength for the active lights is 850 nm, which is in the NIR spectrum (780–1100 nm).

It is shown in [18] that NIR imaging is only subject to a monotonic transform due to changes in the distance between the face and the LED lights of the camera, which makes the system robust to illumination variations. If features invariant to a monotonic transform can be found, an illumination invariant facial expression recognition system can be obtained.

Fig. 2. An active NIR imaging system.

3. Spatiotemporal local binary patterns

The LBP operator [23] describes a local texture pattern with a binary code, which is obtained by thresholding a neighborhood of pixels with the gray value of its center pixel. LBP is a gray-scale invariant texture measure and it is computationally very simple, which makes it attractive for many kinds of applications. The LBP operator was extended to a dynamic texture operator [37], the LBP from three orthogonal planes (LBP-TOP) of a space–time volume. The LBP-TOP description is formed by calculating the LBP features from the planes and concatenating the histograms.

The original LBP operator was based on a circular sampling pattern, but other neighborhoods can be used as well. Elliptic sampling was proposed for the XT and YT planes:

$$\mathrm{LBP}_{P,R} = \sum_{p=0}^{P-1} s\!\left(g_p - g_c\right) 2^p, \qquad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0, \end{cases}$$

where $g_c$ is the gray value of the center pixel $(x_c, y_c)$ and $g_p$ are the gray values at the $P$ sampling points: $(x_c - R_X \sin(2\pi p / P_{XT}),\; y_c,\; t_c - R_T \cos(2\pi p / P_{XT}))$ for the XT plane and similarly $(x_c,\; y_c - R_Y \cos(2\pi p / P_{YT}),\; t_c - R_T \sin(2\pi p / P_{YT}))$ for the YT plane. $R_d$ is the radius of the ellipse in the direction of axis $d$ ($x$, $y$ or $t$). As the XY plane encodes only the appearance, i.e. both axes have the same meaning, circular sampling is suitable there. The values of $g_p$ for points that do not fall exactly on pixels are estimated using bilinear interpolation. By considering only the signs of the differences between the neighborhood values and the center pixel value instead of their exact values, LBP achieves invariance with respect to the scaling of the gray scale. For each pixel, a binary code is formed by thresholding its elliptic neighborhood with the center pixel value. The LBP code is computed for all pixels in the XY, XT and YT planes separately. LBP histograms are computed for all three planes in order to collect the occurrences of the different binary patterns. Finally, the histograms are concatenated into one feature histogram [37].

“Uniform patterns” [23] are usually used to shorten the feature vector of LBP. Here, a pattern is considered uniform if it contains at most two bitwise transitions from 0 to 1 or vice versa when the bit pattern is read circularly (e.g. 11110011 is uniform, whereas 1000100 is not, since it contains four bitwise transitions). When using uniform patterns, all non-uniform LBP patterns are collected into a single bin during the histogram computation.
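As a small illustration of this uniformity test (a sketch only, assuming 8-bit LBP codes stored as plain Python integers; the function names are ours, not from the paper), the following helper counts circular bit transitions and builds the uniform-pattern bin mapping used to shorten the histograms:

import numpy as np

def transitions(code, bits=8):
    """Number of 0/1 transitions in a circularly read bit pattern."""
    return sum(((code >> i) & 1) != ((code >> ((i + 1) % bits)) & 1)
               for i in range(bits))

def uniform_mapping(bits=8):
    """Map every LBP code to a histogram bin: each uniform pattern keeps its
    own bin and all non-uniform patterns share one extra bin (59 bins for
    8-bit codes)."""
    table, next_bin = {}, 0
    for code in range(2 ** bits):
        if transitions(code, bits) <= 2:
            table[code] = next_bin
            next_bin += 1
        else:
            table[code] = -1          # mark non-uniform codes for the shared bin
    return {c: (b if b >= 0 else next_bin) for c, b in table.items()}

table = uniform_mapping()
print(len(set(table.values())))                                   # 59
print(transitions(0b11110011), transitions(0b1000100, bits=7))    # 2 and 4, as in the text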

Local binary patterns from three orthogonal planes (LBP-TOP) are appropriate for describing and recognizing dynamic textures. LBP-TOP features have been used successfully for facial expression recognition [37] and visual speech recognition [38]. LBP-TOP features effectively describe the appearance, horizontal motion and vertical motion of a video sequence. A face image can be divided into overlapping blocks. A block-based approach combines pixel-, region- and volume-level features to handle unusual dynamic textures in which local information and its spatial location need to be considered. LBP histograms for each block volume in the three planes are formed and concatenated into one histogram. The features extracted from each block volume are connected to represent the appearance and motion of the video sequence [37], as shown in Fig. 3.
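The sketch below illustrates the idea on a grey-level video volume. It is deliberately simplified: it uses nearest-neighbour circular sampling with radius 3 and 8 points on all three planes (rather than the elliptic sampling and bilinear interpolation described above), extracts histograms only from the central XY, XT and YT slices of the whole volume instead of from every block, and keeps full 256-bin histograms rather than the uniform-pattern mapping; all function names are ours.

import numpy as np

def lbp_code(plane, x, y, radius=3, points=8):
    """Circular-sampling LBP code for one pixel of a 2-D slice
    (nearest-neighbour sampling for brevity)."""
    center = plane[y, x]
    code = 0
    for p in range(points):
        angle = 2.0 * np.pi * p / points
        xp = int(round(x + radius * np.cos(angle)))
        yp = int(round(y - radius * np.sin(angle)))
        if plane[yp, xp] >= center:
            code |= 1 << p
    return code

def lbp_histogram(plane, radius=3, points=8):
    """Normalised LBP histogram of one slice, ignoring a border of `radius` pixels."""
    h, w = plane.shape
    hist = np.zeros(2 ** points)
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            hist[lbp_code(plane, x, y, radius, points)] += 1
    return hist / max(hist.sum(), 1.0)

def lbp_top_descriptor(volume, radius=3, points=8):
    """Concatenate LBP histograms from the central XY (appearance), XT
    (horizontal motion) and YT (vertical motion) slices of a (T, H, W) volume."""
    T, H, W = volume.shape
    xy = volume[T // 2, :, :]
    xt = volume[:, H // 2, :]
    yt = volume[:, :, W // 2]
    return np.concatenate([lbp_histogram(s, radius, points) for s in (xy, xt, yt)])

video = np.random.randint(0, 256, size=(15, 64, 64))   # toy 15-frame sequence
print(lbp_top_descriptor(video).shape)                  # (768,) = 3 planes x 256 bins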

NIR images are subject to a monotonic transform, and LBP-TOP features are robust to monotonic gray-scale changes. An illumination invariant representation of facial expressions can therefore be generated when an NIR imaging system and LBP-TOP features are used together [29]. Compared with other static and dynamic features, LBP-TOP features work very well and achieve very good performance [37] on the Cohn–Kanade facial expression database [13], which is a typical database collected in a VIS environment.

4. Weight assignment

Different regions of the face contribute differently to facial expression recognition performance. Therefore, it makes sense to assign different weights to different face regions when measuring the dissimilarity between expressions. In this section, methods for weight assignment [30] are examined in order to improve facial expression recognition performance.

Fig. 3. (a) Three planes of a dynamic texture. (b) LBP histograms from each plane. (c) Concatenated histogram.

In this paper, a face image is divided into overlapping blocks and each block contains three types of slices: XY (appearance), XT (horizontal motion) and YT (vertical motion). Different weights are set for each type of slice, based on its importance. In some block-based approaches, weights are set only according to the location of the block. However, different kinds of features do not contribute equally in the same location. In the LBP-TOP representation, the LBP code is extracted from three orthogonal planes, describing appearance in the XY plane and temporal motion in the XT and YT planes. The use of LBP-TOP features enables us to set different weights for each type of plane or slice inside the block volume. In addition to the location information, the slice-based approach also captures the feature type: appearance, horizontal motion or vertical motion, which makes the features more suitable and adaptive for classification.

In many cases, weights are designed empirically, based on observation [5,20,26]. Here, the Fisher separation criterion is used to learn suitable weights from the training data [3].

For a C-class problem, let the similarities of different samples of the same expression compose the intra-class similarity, and those of samples from different expressions compose the extra-class similarity. The mean $m_{I,b}$ and the variance $s_{I,b}^2$ of the intra-class similarities for each slice can be computed as follows:

$$m_{I,b} = \frac{1}{C} \sum_{i=1}^{C} \frac{2}{N_i (N_i - 1)} \sum_{k=2}^{N_i} \sum_{j=1}^{k-1} \chi^2\!\left(S_b^{(i,j)}, M_b^{(i,k)}\right),$$

$$s_{I,b}^2 = \sum_{i=1}^{C} \sum_{k=2}^{N_i} \sum_{j=1}^{k-1} \left(\chi^2\!\left(S_b^{(i,j)}, M_b^{(i,k)}\right) - m_{I,b}\right)^2,$$

where $S_b^{(i,j)}$ denotes the histogram extracted from the $j$-th sample and $M_b^{(i,k)}$ the histogram extracted from the $k$-th sample of the $i$-th class, $N_i$ is the number of samples of the $i$-th class in the training set, and the subsidiary index $b$ denotes the $b$-th slice. In the same way, the mean $m_{E,b}$ and the variance $s_{E,b}^2$ of the extra-class similarities for each slice can be computed as follows:

$$m_{E,b} = \frac{2}{C(C-1)} \sum_{i=1}^{C-1} \sum_{j=i+1}^{C} \frac{1}{N_i N_j} \sum_{k=1}^{N_i} \sum_{l=1}^{N_j} \chi^2\!\left(S_b^{(i,k)}, M_b^{(j,l)}\right),$$

$$s_{E,b}^2 = \sum_{i=1}^{C-1} \sum_{j=i+1}^{C} \sum_{k=1}^{N_i} \sum_{l=1}^{N_j} \left(\chi^2\!\left(S_b^{(i,k)}, M_b^{(j,l)}\right) - m_{E,b}\right)^2.$$

The Chi-square statistic is used as the dissimilarity measure between two histograms:

$$\chi^2(S, M) = \sum_{i=1}^{L} \frac{(S_i - M_i)^2}{S_i + M_i},$$

where $S$ and $M$ are two LBP-TOP histograms of a slice, and $L$ is the number of bins in the histogram. It is commonly used for measuring differences between binned distributions [34].

Finally, the weight for each slice can be computed by

$$w_b = \frac{\left(m_{I,b} - m_{E,b}\right)^2}{s_{I,b}^2 + s_{E,b}^2}.$$

The local histogram features are discriminative if the means of the intra and extra classes are far apart and the variances are small. In that case, a large weight will be assigned to the corresponding slice; otherwise, the weight will be small. A higher value of $w_b$ represents better class separability of the region. In this way, different weights can be set based on the importance of the appearance, horizontal motion and vertical motion features.

In the above weight computation, the similarities of different samples of the same expression composed the intra-class similarity, and those of samples from different expressions composed the extra-class similarity. In this way, similar weights are used for all expressions and there is no specificity for discriminating two particular expressions. To deal with this problem, expression-pair learning is utilized. This means that the weights are learned separately for every expression pair, so the extra-class similarity can be considered as a similarity between two different expressions.
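A minimal sketch of this slice-weight computation is given below. It assumes one LBP-TOP histogram per training video for the slice in question and pools all intra-class and extra-class pairs directly, rather than averaging class by class exactly as in the formulas above; restricting the labels to the two expressions of a pair yields the pair-specific weights. The function names and the small epsilon are ours.

import numpy as np

def chi2(s, m, eps=1e-10):
    """Chi-square dissimilarity between two normalised histograms
    (eps avoids division by zero for empty bins)."""
    return np.sum((s - m) ** 2 / (s + m + eps))

def slice_weight(histograms, labels):
    """Fisher-style weight for one slice: rows of `histograms` are the
    LBP-TOP histograms of that slice for each training video, `labels`
    holds the expression label of each video."""
    labels = np.asarray(labels)
    intra, extra = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            d = chi2(histograms[i], histograms[j])
            (intra if labels[i] == labels[j] else extra).append(d)
    m_i, v_i = np.mean(intra), np.var(intra)
    m_e, v_e = np.mean(extra), np.var(extra)
    return (m_i - m_e) ** 2 / (v_i + v_e)

# toy example: 20 videos, 2 expressions, 59-bin histograms for one slice
H = np.random.dirichlet(np.ones(59), size=20)
y = np.repeat([0, 1], 10)
print(slice_weight(H, y))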

5. Component-based features and features combination

5.1. Component-based features

In the methods presented above, the face images were divided into several non-overlapping or overlapping blocks, and the features extracted from each block were then concatenated into an expression descriptor. This approach takes the whole face into account, and thus may include some useless information. By introducing geometric information into each face image, more effective and discriminative components can be constructed corresponding to the important facial areas, such as the eyes and mouth.

Regarding pose variation and partial occlusion, component-based features are more effective for representing facial expressions. In our method, six facial components are considered, as shown in Fig. 4(b), since most of the facial features are concentrated around the mouth, cheeks, eyes and forehead [32]. To avoid labeling facial points or cropping those parts manually, and to extract those facial features accurately, geometric information from the first frame is obtained by detecting facial points with the Extended Active Shape Model (STASM) [22], as shown in Fig. 4(a).

Usually, the samples for training an ASM model are obtained using VIS imaging. Unfortunately, it is difficult to detect facial points from NIR images when using an ASM model trained on VIS images. For this reason, we used an NIR image database taken under normal illumination to train a new geometric model. This model was utilized for both NIR and VIS images taken in different illumination conditions.

Fig. 4. (a) 38 facial points tracked by STASM. (b) Six facial components cropped from a face image using the 38 facial points.

In our experiments, we found that this approach worked well for NIR and VIS images in all illuminations. The detection rates were: for NIR images, 98.93% in normal and weak illumination and 98.73% in dark illumination; for VIS images, 97.07% in normal, 98.95% in weak and 95.79% in dark illumination. Some misalignment occurred in a few images, but this error was accepted in our experiments, and we did not do any further manual processing to remove it. Unlike other methods, which fuse geometric features and appearance features for training and testing, our method crops six areas, including the forehead, two eyes, two cheeks and the mouth, as shown in Fig. 4(b). Then the LBP-TOP histograms for each component are computed.
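A sketch of this cropping step is shown below. The grouping of the 38 STASM point indices into components is purely illustrative (the paper does not list the indices), and the margin value is our assumption; the rectangles found on the first frame would then be applied to every frame of the sequence before LBP-TOP extraction.

import numpy as np

# Hypothetical grouping of landmark indices into the six components;
# the actual indices of the 38-point model are not given in the paper.
COMPONENT_POINTS = {
    "forehead":    [0, 1, 2, 3, 4, 5],
    "right_eye":   [6, 7, 8, 9],
    "left_eye":    [10, 11, 12, 13],
    "right_cheek": [14, 15, 16],
    "left_cheek":  [17, 18, 19],
    "mouth":       [20, 21, 22, 23, 24, 25],
}

def crop_components(frame, points, margin=5):
    """Cut a rectangle around each component's landmarks in the first frame.
    `points` is a (38, 2) array of (x, y) coordinates from STASM."""
    h, w = frame.shape[:2]
    crops = {}
    for name, idx in COMPONENT_POINTS.items():
        pts = points[idx]
        x0 = max(int(pts[:, 0].min()) - margin, 0)
        y0 = max(int(pts[:, 1].min()) - margin, 0)
        x1 = min(int(pts[:, 0].max()) + margin, w)
        y1 = min(int(pts[:, 1].max()) + margin, h)
        crops[name] = frame[y0:y1, x0:x1]
    return crops

# toy usage with random landmark positions
frame = np.zeros((240, 320), dtype=np.uint8)
pts = np.column_stack([np.random.randint(0, 320, 38), np.random.randint(0, 240, 38)])
print({k: v.shape for k, v in crop_components(frame, pts).items()})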

The size of each facial component is so large that more than one block is needed to describe its local spatiotemporal information. Moreover, as can be observed from Fig. 4(b), the areas of the different parts differ; for example, both cheek areas are much smaller than the forehead and mouth areas. This means that using the same number of blocks for all components is not reasonable [37]. Thus, a different number of blocks was used for the different components in our experiments.

5.2. Fusion of component-based facial features

After extracting the facial features for the different components, the problem of feature fusion must be addressed. Extensive studies on feature combination are presented in Refs. [14] and [10]. The dimensions of our component-based features are not equal, because a different number of blocks is used for different facial components. This makes it difficult to use a direct combination or a parallel combination, since a direct combination leads to a higher dimensionality, while a parallel combination requires the same dimensionality. Fortunately, several studies on decision combination have been conducted [10,12,15]. In this section, we consider the use of classifier combination for our problem.

Consider a facial expression recognition problem where the pattern X is to be assigned to one of m expression classes ω1, ⋯, ωm (m = 6). In component-based spatiotemporal LBP, the components from the six areas of the face image, shown in Fig. 4(b), are denoted as F1, ⋯, Fn (n = 6), respectively.

Given n components F1, ⋯, Fn extracted by LBP-TOP, the pattern X is modeled by the posterior class probabilities P(ωk|Fi), where all components are assumed statistically independent. According to Bayesian theory, the pattern X should be assigned to the class ωj for which the posterior probability is maximum, as in Eq. (1):



Assign $X \rightarrow \omega_j$ if

$$P\!\left(\omega_j \mid F_1, \ldots, F_n\right)
= \max_{\omega_k \in \{\omega_1, \ldots, \omega_m\}} P\!\left(\omega_k \mid F_1, \ldots, F_n\right)
= \max_{\omega_k \in \{\omega_1, \ldots, \omega_m\}} \otimes\!\left\{ P\!\left(\omega_k \mid F_1\right), \ldots, P\!\left(\omega_k \mid F_n\right) \right\}
= \max_{\omega_k \in \{\omega_1, \ldots, \omega_m\}} \otimes\!\left\{ P^1_{\omega_k}, \ldots, P^n_{\omega_k} \right\}, \qquad (1)$$

where ⊗ represents one of the combination rules (mean, median, product) shown in Fig. 5.

As shown in Fig. 5, the i-th individual classifier $D_i$, such as an SVM, outputs a voting vector containing the voting numbers for each class. Here, the voting number of the $\omega_k$-th class from the i-th classifier $D_i$ is denoted as $V^i_{\omega_k}$ ($i = 1, \ldots, n$). Since $P(\omega_k \mid F_i)$ can be approximated by the estimate of the classifier $D_i$, in our study we use these scores to represent the posterior class probability of the i-th component. For the purpose of outputting conditional probabilities, the scores $V^i_{\omega_k}$ ($i = 1, \ldots, n$) are converted to probabilities $P^i_{\omega_k}$ ($i = 1, \ldots, n$) by the softmax function [12].

To exploit the complementary information among all components, several combination rules (the median, mean and product rules) are investigated due to their good performance in earlier studies [15].
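The sketch below shows the corresponding decision step: per-component voting scores are turned into probabilities with a softmax and combined with the product, mean or median rule before taking the arg max over classes. The function names and the toy scores are ours; how each component classifier produces its scores is not shown here.

import numpy as np

def softmax(scores):
    """Convert one component classifier's voting scores into class probabilities."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def fuse(component_scores, rule="product"):
    """Combine an (n_components, n_classes) array of scores with the chosen
    rule and return the index of the winning expression class."""
    probs = np.array([softmax(s) for s in component_scores])
    if rule == "product":
        combined = probs.prod(axis=0)
    elif rule == "mean":
        combined = probs.mean(axis=0)
    else:                              # median rule
        combined = np.median(probs, axis=0)
    return int(np.argmax(combined))

# six components voting over six expressions (random toy scores)
scores = np.random.rand(6, 6)
print(fuse(scores, rule="product"))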

5.3. Feature selection for component-based facial features

Some small areas within one component usually contain more discriminative information than others. It is not necessary to use all the information available in the image, but only the most important areas for distinguishing between subjects or events. AdaBoost is a machine learning technique which can be adapted to select the most discriminative features for recognition. AdaBoost has been used successfully to improve the facial expression recognition performance of different classifiers by learning the most discriminative LBP features for expression recognition.

In our approach, the AdaBoost method is exploited to boost the LBP-TOP features from each component, extracting the most discriminative features to improve facial expression recognition performance. It is used to learn the dissimilarity of each facial expression class against the other expression classes by selecting the principal appearance and motion slices [39] and discarding those features which may hinder recognition. The AdaBoost feature selection technique reduces the dimensionality of the features in each area.
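A compact sketch of such a selection loop is given below: a discrete AdaBoost round picks, at each iteration, the slice-level feature whose decision stump best separates one expression from another under the current sample weights. It is a simplified stand-in for the slice selection of [39] (the stump threshold is fixed at the feature median, labels are assumed to be ±1, and all names are ours).

import numpy as np

def adaboost_select(X, y, n_rounds=30):
    """Return the indices of slice features picked by a simple discrete
    AdaBoost loop with one decision stump per feature. X is (n_samples,
    n_features); y holds labels in {-1, +1} for one expression pair."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    selected = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            thr = np.median(X[:, j])
            for sign in (1, -1):
                pred = np.where(sign * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, pred)
        err, j, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)      # weight of the chosen stump
        w *= np.exp(-alpha * y * pred)             # re-weight the samples
        w /= w.sum()
        selected.append(j)
    return selected

# toy data: 40 videos, 120 slice features, binary expression-pair labels
X = np.random.rand(40, 120)
y = np.where(np.arange(40) < 20, 1, -1)
print(adaboost_select(X, y, n_rounds=5))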

Our method differs from previous component-based algorithms in the following ways: 1) We only use the component regions and extract appearance features from these regions, and do not use any complicated geometric information such as shapes and transient features [33,40], which are not easy to detect accurately from low-resolution images; 2) The features in the earlier methods were only extracted from static images [19,28,33,40], although some of them [28,40] used Hidden Markov Models (HMMs) or Dynamic Bayesian Networks (DBNs) to integrate the static information with its development over time. Instead, our features are directly extracted from video sequences.

Fig. 5. The structure of classifier combination.

6. Facial expression database

Our database, the Oulu-CASIA NIR&VIS facial expression database, consists of six expressions (surprise, happiness, sadness, anger, fear and disgust) from 80 people between 23 and 58 years old; 73.8% of the subjects are male. The database has two parts: one was captured in February 2008 in Oulu by the Machine Vision Group of the University of Oulu and consists of 50 subjects, most of whom are Finnish; the other was captured in April 2009 in Beijing by the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, and consists of 30 subjects, all of them Chinese. The subjects were asked to sit on a chair in the observation room facing the camera; the camera–face distance is about 60 cm. Subjects were asked to make a facial expression according to an expression example shown in picture sequences. A USB 2.0 PC Camera (SN9C201 & 202) includes an NIR camera and a VIS camera which capture the same facial expression, as shown in Fig. 6. The imaging hardware works at a rate of 25 frames per second and the image resolution is 320×240 pixels. The whole database is now available online (http://www.cse.oulu.fi/MVG/Downloads).

All expressions are captured in three different illumination conditions: normal, weak and dark. Normal illumination means that good lighting is used. Weak illumination means that only the computer display is on while the subject sits in front of the computer. Dark illumination means near darkness.

Example images in the different illuminations are shown in Fig. 7. We can see that the influence of environmental illumination is minimized in the NIR images, which look similar regardless of the surrounding illumination. VIS images in different illuminations are not directly comparable, because a change in the environmental illumination can significantly affect the appearance of VIS images.

Fig. 8 demonstrates how well facial features, such as wrinkles and furrows, can be seen in NIR and VIS images. The images show the same frame from the image sequence. The contours of the facial features are shown clearly in the NIR image. There are no shadowed regions in the NIR image, but some dark areas caused by self-occlusion can be found in the VIS image.


Fig. 6. Capture environment.

Fig. 8. The appearance of a face in an NIR image (left) and a VIS image (right).


The number of video sequences is 480 (80 subjects by six expressions) for each illumination and imaging system pair; thus, in total there are 2880 (480×6) video sequences in the database.

7. Experiments

Fig. 7. VIS (top row) and NIR images (bottom row). Columns from left to right: normal, weak and dark illumination.

2472 video sequences from the database were used to recognize the six typical expressions: anger, disgust, fear, happiness, sadness and surprise. The video sequences came from the 80 subjects, with two to six expressions per subject. In our database, people did not move their heads much, so the positions of the eyes in the first frame were detected automatically and these positions were used to determine the facial area for the whole sequence. In a real environment, however, people can move their heads freely. In that case, eye detection should be done for each frame, and the face images cropped using the detected eye positions in a video clip should then be normalized. A robust and fast tracking algorithm would also be helpful in this case.

We use two sorts of classifiers, a support vector machine and a sparse representation classifier [35], in all experiments. The support vector machine (SVM) is a traditional and popular classifier; in our application, the six-expression classification problem is divided into 15 two-class problems, and a voting scheme is then used to perform the recognition. The sparse representation classifier (SRC) is a recently proposed classifier which selects the representation that most compactly expresses the testing sample. Due to the high computational cost, principal component analysis is used to reduce the feature dimension before using SRC. In our experiments, the energy ratio (the sum of the selected eigenvalues divided by the sum of all eigenvalues) is set to 98%. Block-based LBP-TOP descriptors as presented in Section 3, combining local information from the pixel, region and volume levels, are used to represent facial expressions. We use 9×8 blocks. Eight neighboring points and radius three are the LBP-TOP parameters in all planes. The uniform pattern representation [37] is used in the recognition. The size of each block depends on the following parameters: the number of blocks, the overlap ratio and the size of the images. We do not need to normalize different videos to the same image size, but the size of each frame within a specific video is the same.


Table 1
Accuracies (%) of different expressions using SVM and SRC. (Numbers inside brackets represent the results using SRC. NIR_N, NIR_W and NIR_D represent normal, weak and dark illumination in the NIR imaging system, respectively; likewise for VIS_N, VIS_W and VIS_D.)

        Anger    Disgust  Fear     Happiness  Sadness  Surprise  Total
NIR_N   77.61    71.70    60.61    83.75      56.06    78.75     72.09
        (88.06)  (91.13)  (65.15)  (90.00)    (57.58)  (86.25)   (78.64)
NIR_W   70.15    58.49    51.52    83.75      46.97    82.50     66.99
        (68.66)  (71.70)  (57.58)  (88.75)    (56.06)  (90.00)   (73.30)
NIR_D   71.64    62.26    57.58    78.75      60.61    80.00     69.42
        (58.21)  (77.36)  (54.55)  (86.25)    (54.55)  (86.25)   (70.39)
VIS_N   76.12    64.15    60.61    83.75      62.12    87.50     73.54
        (73.13)  (66.04)  (62.12)  (91.25)    (65.15)  (91.25)   (76.21)
VIS_W   62.69    45.28    46.97    72.50      40.91    83.75     60.44
        (62.69)  (58.49)  (48.48)  (81.25)    (45.45)  (86.25)   (65.29)
VIS_D   49.25    43.40    45.45    68.75      43.94    78.75     56.55
        (61.19)  (47.17)  (43.94)  (58.75)    (36.36)  (82.50)   (56.31)


Therefore, the block size is different from video to video. In our experiments, we used 9×8 blocks with an overlap ratio of 43%, which provided the best results. For example, the block size is about 16×18 for a face image of 106×86 pixels, as shown in Fig. 9. The details of the calculation can be found in [37]. Fig. 9 also shows the first two overlapping blocks.
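As a side note on the dimensionality reduction mentioned above for SRC, the sketch below picks the smallest number of principal components whose eigenvalues account for 98% of the total energy; the data here are random toy descriptors and the function name is ours.

import numpy as np

def pca_by_energy(X, energy=0.98):
    """Project the rows of X onto the fewest principal components whose
    eigenvalues sum to at least `energy` of the total eigenvalue sum."""
    Xc = X - X.mean(axis=0)
    # singular values squared of the centred data are the PCA eigenvalues
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eig = s ** 2
    k = int(np.searchsorted(np.cumsum(eig) / eig.sum(), energy)) + 1
    return Xc @ Vt[:k].T, Vt[:k]

X = np.random.rand(400, 1000)           # 400 toy LBP-TOP descriptors
Z, components = pca_by_energy(X)
print(Z.shape)                           # (400, k), with k chosen by the 98% criterion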

The subjects were separated into ten groups of roughly equal size. After that, a "leave one group out" cross-validation, which can also be called a "10-fold cross-validation" test scheme, was used for all the evaluations. Therefore, the testing was done with "novel faces" and was completely subject-independent. We designed five experiments to evaluate the proposed method. In Section 7.1, we compare the experimental results of NIR and VIS to study whether NIR handles illumination variation better than VIS. In Section 7.2, illumination normalization is utilized to test how the NIR system compares with normalized VIS images. In Section 7.3, weights for each slice are learned and tested on NIR facial expression sequences. In Section 7.4, component-based features and a classifier fusion strategy are evaluated for improving the recognition performance. In Section 7.5, an experiment on people from different countries is done to preliminarily investigate cultural effects on facial expression recognition.

7.1. NIR vs. VIS

Table 1 shows the 10-fold cross-validation accuracies for each facial expression using the SVM and SRC classifiers. The numbers in parentheses represent the results for SRC. The overall recognition accuracies in NIR images with SVM are 72.09% in normal, 66.99% in weak and 69.42% in dark illumination. When using SRC, they are 78.64% in normal, 73.30% in weak and 70.39% in dark illumination, respectively. The accuracies for NIR images are quite consistent across the two classifiers in all three illumination conditions, showing that NIR images are robust to illumination variations and suitable for dynamic facial expression recognition even in near darkness.

Fig. 9. Overlapping blocks.

Illumination cross-validation experiments are carried out separately for NIR and VIS images. One group from the total of ten groups of normal illumination images is used for training, and the other nine groups from normal, weak or dark illumination images are used for testing. This way, there is no overlap in illumination or subjects between the training and test sets. The experiments are subject-independent, which is more challenging than subject-dependent evaluation but more reasonable for real applications. The illumination cross-validation results using SVM and SRC, shown in Table 2, demonstrate that a different illumination between training and testing videos does not much affect the overall recognition results in NIR images. The results in VIS images are poor because of significant illumination variations. In addition, the overall result for weak illumination is slightly lower than that for dark illumination in the NIR imaging environment. This is because the NIR imaging aims to control the lighting direction, i.e. from the front, such that the illumination condition is controlled (normalized) from the front. In the dark illumination of our database collection, no light was on other than the controlled LED light, but in weak illumination the computer display in front of the subjects was on, which may disturb the frontal lighting direction.

7.2. Experiments with illumination normalization

There are some studies that normalize VIS images from different illumination conditions to reduce the influence of lighting changes. In this subsection, the illumination normalization method of Tan and Triggs [31], which has given state-of-the-art face recognition performance under different illumination conditions on several major databases, is compared with NIR imaging for dealing with illumination variations. The illumination normalization method [31] uses an efficient preprocessing chain that eliminates most of the effects of illumination variations while still preserving the essential elements of visual appearance needed for recognition. This method, illustrated in Fig. 10, is a four-step process consisting of gamma correction, Difference of Gaussian (DoG) filtering, masking and contrast equalization [31].

Fig. 11 visualizes the effect of the illumination normalization method. Original images from all illumination conditions are shown in the top row. The corresponding normalized images are shown in the lower rows of the same column as the original image. It can be seen that illumination variations are reduced as a result of illumination normalization.

Table 2
Illumination cross-validation results (%) using SVM and SRC (numbers inside brackets represent the results using SRC).

Training   NIR_N    NIR_N    NIR_N    VIS_N    VIS_N    VIS_N
Testing    NIR_N    NIR_W    NIR_D    VIS_N    VIS_W    VIS_D
Results    72.09    66.02    69.90    73.54    35.44    29.13
           (78.64)  (69.42)  (72.33)  (76.21)  (34.22)  (28.88)

Fig. 10. Illumination normalization process.

Table 3
Illumination cross-validation results (%) with illumination normalization (numbers inside brackets represent the results using SRC).

Training     NIR_N    NIR_N    NIR_N    VIS_N    VIS_N    VIS_N
Testing      NIR_N    NIR_W    NIR_D    VIS_N    VIS_W    VIS_D
No preproc   72.09    66.02    69.90    73.54    35.44    29.13
             (78.64)  (69.42)  (72.33)  (76.21)  (34.22)  (28.88)
Preproc      70.87    66.26    64.56    72.57    52.43    42.48
             (72.33)  (63.83)  (67.23)  (73.54)  (46.12)  (34.22)

Table 4
Cross imaging system recognition results (%) with illumination normalization (numbers inside brackets represent the results using SRC).

Training     NIR_N    NIR_N    NIR_N    VIS_N    VIS_N    VIS_N
Testing      VIS_N    VIS_W    VIS_D    NIR_N    NIR_W    NIR_D
No preproc   47.33    43.20    30.34    47.57    41.26    43.69
             (47.33)  (48.06)  (32.52)  (42.23)  (37.38)  (37.14)
Preproc      68.20    55.10    35.68    59.95    54.61    54.13
             (64.08)  (50.73)  (36.65)  (53.88)  (51.94)  (50.73)


Table 3 lists the illumination cross-validation results when the illumination normalization method is used. The parameters [31] we use are: gamma correction 0.2, DoG inner filter 1, DoG outer filter 4, and contrast equalization a = 0.1 and δ = 5. In the last three columns of Table 3, when weak illumination images are used for testing, the recognition performance improves from 35.44% to 52.43% (SVM) compared to the case without preprocessing. In dark illumination, the improvement is from 29.13% to 42.48%, respectively. This shows that preprocessing indeed improves the results for VIS images in poor illumination conditions. However, under normal illumination the recognition rate is reduced from 73.54% to 72.57%, because the original normal illumination images are of good quality and some useful information for expression recognition may be destroyed during the illumination normalization process.
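For reference, a compact version of such a preprocessing chain is sketched below in the usual Tan–Triggs form: gamma correction, DoG filtering and two-stage contrast equalization followed by a tanh squashing step. The masking step is omitted, and the squashing threshold tau is our assumption since the paper reports its contrast-equalization parameters only partially; the function name is ours.

import numpy as np
from scipy.ndimage import gaussian_filter

def illumination_normalize(img, gamma=0.2, sigma_inner=1.0, sigma_outer=4.0,
                           a=0.1, tau=10.0):
    """Gamma correction, DoG filtering and contrast equalization for a
    grey-level image with values in [0, 1]."""
    x = np.power(img.astype(np.float64) + 1e-6, gamma)                       # gamma correction
    x = gaussian_filter(x, sigma_inner) - gaussian_filter(x, sigma_outer)    # DoG filtering
    x = x / (np.mean(np.abs(x) ** a) ** (1.0 / a) + 1e-6)                    # equalization, stage 1
    x = x / (np.mean(np.minimum(np.abs(x), tau) ** a) ** (1.0 / a) + 1e-6)   # equalization, stage 2
    return tau * np.tanh(x / tau)                                            # compress extreme values

face = np.random.rand(106, 86)           # toy grey-level face image
print(illumination_normalize(face).shape)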

Generally speaking, illumination normalization enhances VIS images, but preprocessing alone is not sufficient for dealing with illumination variations. When analyzing the illumination cross-validation results in Table 3, it can be noticed that, compared to the illumination-normalized VIS images, the recognition accuracies for the NIR images without preprocessing are similar in normal illumination (72.09% vs. 72.57%), 13.59% (66.02% vs. 52.43%) higher in weak and 27.42% (69.90% vs. 42.48%) higher in dark illumination. These results indicate that NIR images, which are robust to illumination changes, work even better than VIS images preprocessed with one of the best-performing illumination normalization methods.

Fig. 11. Original images in the top row and the corresponding normalized images in the lower rows of the same column.

Even though illumination normalization does not help for NIR images, it is helpful for cross imaging system recognition. As shown in Table 4, when we use data from normal illumination in the NIR (VIS) environment for training, and the three illumination conditions from the other imaging system, i.e. VIS (NIR), for testing, the original results (the third row) are quite poor. But illumination normalization helps to improve the performance in all cases; for example, when training with NIR_N and testing on VIS_N, the increase is about 20.87% (from 47.33% to 68.20%).


Table 5
Illumination cross-validation results (%) using SVM.

Training        NIR_N   NIR_N   NIR_N
Testing         NIR_N   NIR_W   NIR_D
No weights      72.09   66.02   69.90
Slice weights   76.21   68.20   70.39



For the following experiments, because SVM and SRC give consistent results, we only list the SVM results for conciseness.

7.3. Using weights

Every expression pair has different and specific features which are of great importance when expression classification is performed on expression pairs. The ten most important blocks and 30 most important slices for the expression pairs anger–sadness and fear–happiness are illustrated in Fig. 12. The symbol "/" denotes appearance features from the XY slice, the symbol "-" indicates horizontal motion features from the XT slice and the symbol "|" indicates vertical motion features from the YT slice. The most important appearance features for anger–sadness appear in the forehead and eyes, while for fear–happiness the corners of the mouth are more important. Compared with the block features, the slice features show that the horizontal and vertical motion in the forehead and eye areas is more important than the appearance features for discriminating anger and sadness. The motion of the forehead and the appearance of the mouth corners contribute more to classifying fear and happiness. This demonstrates well that different features do not contribute equally and that a slice-based approach is a powerful technique when weights are set for different sub-regions.

Table 5 shows subject-independent illumination cross-validation results using the weights learned for each slice. Normal illumination images are used for training, and normal, weak or dark illumination images are used for testing. The use of weighted slices improves the performance on the NIR images using SVM.

7.4. Experiments with component-based facial features and classifier fusion

Each facial component has discriminative information for classification. Since the areas of the different facial components are not equal, we studied how many blocks are needed for each component to achieve the best recognition accuracy. The highest performance is achieved with 5×4 blocks for the right eye, 7×7 for the left eye, 5×4 for the right cheek, 4×3 for the left cheek, 8×7 for the mouth and 5×5 blocks for the forehead, respectively.

Fig. 12. Ten most important block features (left) and 30 slice features (right) for anger–sadness (first row) and fear–happiness (second row). Red "/" denotes the XY slice, purple "-" the XT slice and blue "|" the YT slice.

After comparison with the median and mean rules for classifier fusion, the product rule works best. Table 6 shows the results for the different illuminations in NIR images using the product rule. Compared to Table 1, classifier combination improves the performance in all cases. The increases are 1.70% in normal, 3.64% in weak and 0.24% in dark illumination for NIR images.

In our experiments, we select different slice sizes corresponding to the facial components. The slice sizes are 45, 30, 30, 60, 90 and 30 for the right cheek, left cheek, right eye, left eye, mouth and forehead, respectively.

Table 7 shows the recognition rates obtained by using the selected slices for each component in the NIR imaging system. Compared to Table 6, the performance for all illuminations is improved by using feature selection.

7.5. Experiments on different ethnic groups

Although facial expressions are widely considered to be the universal language of emotion, some negative facial expressions consistently elicit lower recognition levels among Eastern compared to Western groups.

Our database mainly consists of Finnish and Chinese people. When we collected it, we noticed that there are differences between Finnish and Chinese people in making expressions.


Table 7
Accuracies (%) of different expressions using feature selection in the NIR system.

        Anger   Disgust  Fear    Happiness  Sadness  Surprise  Total
NIR_N   68.66   60.38    69.70   83.75      75.76    85.00     75.00
NIR_W   73.13   60.38    66.67   82.50      66.67    86.25     73.79
NIR_D   65.67   56.60    68.18   78.75      71.21    86.25     72.33

Fig. 14. Comparison of expression recognition using SVM for Finnish and Chinese people.

Table 8
Cross-ethnic groups evaluation (%) using SVM.

Test      NIR_N    NIR_W    NIR_D    VIS_N    VIS_W    VIS_D
Finnish   65.32    53.23    60.48    58.06    46.37    32.66
          (74.60)  (63.71)  (72.18)  (75.40)  (64.92)  (59.27)
Chinese   57.32    53.05    48.17    51.22    32.32    37.80
          (62.20)  (60.37)  (64.02)  (69.51)  (49.39)  (48.17)

Table 6
Accuracies (%) of different expressions using the product method in NIR imaging systems.

        Anger   Disgust  Fear    Happiness  Sadness  Surprise  Total
NIR_N   83.58   64.15    71.21   83.75      62.12    73.75     73.79
NIR_W   79.10   67.92    62.12   83.75      48.48    77.50     70.63
NIR_D   76.12   71.70    56.06   77.50      57.58    76.25     69.66


Finnish people can easily use the eyes and mouth to express emotion, while Chinese people focus on the eyes, which makes it a bit hard for them to present obvious expressions other than happiness. As Fig. 13 shows, the expressions of sadness, anger and disgust are quite similar to each other, and it is also difficult to discriminate what the fourth expression is. We carried out the experiments on two groups, one including the 46 Finnish people and the other the 34 Chinese people. The recognition results are shown in Fig. 14. The accuracy for the Finnish people in all six environments is much higher than that for the Chinese people, which is consistent with our observation. It also demonstrates the difficulty of recognizing spontaneous expressions, since they are affected by the subject's background, culture, characteristics, and so on.

Table 8 lists the results of the cross-ethnic group evaluation, i.e. testing on Finnish/Chinese people while training on Chinese/Finnish people. The numbers inside the brackets are the results obtained using data from the same ethnic group for training and testing. From these results, we can see that in all the evaluations the accuracy of the cross-ethnic group evaluation is much lower than that of the same ethnic group. Combined with the above-mentioned analysis, facial expressions can be affected by culture and background, so if we know the query person's ethnic group in advance, it is helpful for recognition to use training data from the same ethnic group.

8. Discussion and conclusions

Illumination variations can drastically affect current facial expression recognition systems. Much work has been done to normalize illumination variations in VIS images, while little effort has been made to try other imaging systems. In this paper, that problem is addressed by combining an NIR imaging system with the LBP-TOP descriptor. An illumination invariant representation of faces was thus obtained by extracting LBP-TOP features from NIR images. In this way, a novel approach to illumination invariant facial expression recognition was presented to overcome the problem of varying illumination.

Some local facial regions are known to contain more discriminative information for facial expression classification than others; thus, weight assignment was examined first in order to improve facial expression recognition performance.

Fig. 13. Sadness, anger, disgust, fear and surprise.

Assigning different weights to the slices was introduced, allowing us to set weights for the most important facial regions. In this approach, different weights can be set not only for the location, as in the block-based approach, but also for the appearance, horizontal motion and vertical motion. When weighted slices were used, the recognition accuracies on the NIR images in the cross-illumination evaluation were better than without weights.
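As an illustration of how such slice weights can enter the classification stage, the sketch below computes a weighted sum of per-slice chi-square distances between two samples; this distance can then be used in a nearest-neighbour rule or turned into an SVM kernel. The weights are assumed to be learned beforehand, and the function names are ours, not from the original implementation.

import numpy as np

def chi_square(h1, h2, eps=1e-10):
    # Chi-square distance between two normalised histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def weighted_slice_distance(slices_a, slices_b, weights):
    # slices_a, slices_b : lists of histograms, one per (block, plane) slice
    # weights            : one learned weight per slice, e.g. larger weights
    #                      on mouth/eye appearance and motion slices
    return sum(w * chi_square(ha, hb)
               for w, ha, hb in zip(weights, slices_a, slices_b))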

Considering robustness to face rotation and partial occlusion, component-based facial features were introduced. In this approach, the active shape model is used to detect facial points in NIR images and to locate the key facial components. Through feature selection, discriminative information from each facial component is extracted to improve the performance on NIR images.
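A minimal sketch of the component-based representation is given below: fixed-size boxes are cropped around detected landmark coordinates (assumed to be provided by the active shape model, which is not implemented here) and an LBP-TOP histogram is computed per component, reusing the lbp_top_histogram helper from the earlier sketch. The component list, landmark indices and box sizes are placeholders, not the exact configuration used in the paper.

import numpy as np
# lbp_top_histogram is the helper defined in the earlier LBP-TOP sketch.

# Hypothetical component boxes: (landmark index, half-height, half-width).
COMPONENTS = {"left_eye": (0, 16, 24), "right_eye": (1, 16, 24),
              "nose": (2, 20, 20), "mouth": (3, 20, 32)}

def component_descriptor(volume, landmarks):
    # volume    : T x H x W grayscale sequence
    # landmarks : dict mapping landmark index -> (y, x) in the first frame
    T, H, W = volume.shape
    feats = []
    for name, (idx, hh, hw) in COMPONENTS.items():
        y, x = landmarks[idx]
        y0, y1 = max(0, y - hh), min(H, y + hh)
        x0, x1 = max(0, x - hw), min(W, x + hw)
        crop = volume[:, y0:y1, x0:x1]          # same box over all frames
        feats.append(lbp_top_histogram(crop))   # per-component descriptor
    return np.concatenate(feats)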

As a part of this research we collected a novel NIR&VIS facial expression database, which includes facial expression video sequences from both NIR and VIS imaging systems. We carried out extensive experiments and performance evaluation using two classifiers (SVM and SRC). Results on the new database were robust against illumination variations, showing that our approach provides promising performance for real applications (Section 7.1). A comparison between a state-of-the-art illumination normalization method and the NIR imaging system clearly showed that more robust results against illumination variations can be obtained by using an NIR imaging system (Section 7.2).
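For completeness, the sparse representation classifier (SRC) used alongside the SVM can be sketched as follows, with scikit-learn's Lasso standing in for the l1 solver: the query descriptor is coded over the dictionary of training descriptors and assigned to the class whose samples give the smallest reconstruction residual. The alpha value and the normalisation details are illustrative choices, not those of the original experiments.

import numpy as np
from sklearn.linear_model import Lasso

def src_classify(train_feats, train_labels, test_feat, alpha=0.01):
    # train_feats  : (n_train, d) array of descriptors (e.g. LBP-TOP histograms)
    # train_labels : (n_train,) array of expression labels
    # test_feat    : (d,) descriptor of the query sequence
    D = np.asarray(train_feats, dtype=float).T              # d x n dictionary
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12   # unit-norm atoms
    y = np.asarray(test_feat, dtype=float)
    coefs = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000).fit(D, y).coef_
    labels = np.asarray(train_labels)
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        residuals[c] = np.linalg.norm(y - D[:, mask] @ coefs[mask])
    return min(residuals, key=residuals.get)                # smallest residual wins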

The experiments on using weights show the effectiveness of learning weights for each slice (Section 7.3). The component-based method and classifier fusion provide improved performance (Section 7.4) and are believed to be more useful than region-based methods for dealing with occlusions and head rotations.
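The classifier fusion referred to above (the product method of Table 6) can be illustrated with a few lines of Python: the per-class scores of several component classifiers are multiplied and the class with the largest product is chosen, in the spirit of the product rule of [15]. The array shapes and the epsilon below are assumptions made for the sketch.

import numpy as np

def product_fusion(prob_matrices):
    # prob_matrices : list of (n_samples, n_classes) arrays, one per classifier
    #                 (e.g. one SVM per facial component); each row sums to 1.
    fused = np.ones_like(prob_matrices[0])
    for p in prob_matrices:
        fused *= p + 1e-12          # small epsilon avoids zeroing a class out
    return np.argmax(fused, axis=1) # fused class index per sample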


Moreover, we analyzed the effects of race on facial expression recognition (Section 7.5). To the best of our knowledge, there are very few earlier works discussing the effects of race. Our results on the cross-race evaluation show that facial expression recognition can be influenced by culture, and that prior knowledge of the race could be helpful for improving the recognition accuracy.

Through extensive experiments, we first tested our proposed method and showed its effectiveness. Second, we provided baseline results that can be used for convenient comparison in future research. Furthermore, we used two effective classifiers (SVM and SRC) for testing on NIR and VIS image sequences, which provided consistent results.

However, the best result on our dataset is still below 80%. This is because, firstly, some expressions share common action units, e.g. anger, sadness and fear share action unit 4, and surprise and fear share action unit 2; secondly, some expressions, like fear, anger and disgust, are expressed very differently by different people, as shown in Fig. 15. Our subjects also come from different countries and have clearly different expression styles. Therefore, the recognition results for happiness and surprise are high, but those for the other expressions are only moderate. How to robustly recognize spontaneous facial expressions, especially for people with background and cultural differences, remains a major challenge.

The use of NIR imaging provides a solution to varying illumination, but it also has some limitations for applications in uncontrolled environments.

Fig. 15. Fear (first row), disgust (second row), anger (third row) and sadness (last row) made by different people.

Firstly, NIR works well in dealing with indoor illumination variations, but it is not suitable for outdoor use due to the strong NIR component in sunlight. Secondly, the working distance of NIR is limited, which makes it unsuitable for video surveillance at a far distance. For expression recognition, however, this is not a problem, since expression recognition usually needs to be done at a near or middle distance. One feasible application would be analyzing customers' emotions while they watch advertisements. Another potential application is emotion recognition for affective human–robot interaction.

For facial expressions, there are still open problems to be solved. In many applications of human–computer interaction, it is important to be able to detect the emotional state of a person in a natural situation. Measuring the intensity of spontaneous facial expressions is more difficult than measuring acted facial expressions due to the complexity, subtlety and variability of natural expressions. Furthermore, partial occlusion and pose variation are among the challenges of facial expression recognition in natural environments [17].

Acknowledgments

The financial support provided by the European Regional Development Fund, the Finnish Funding Agency for Technology and Innovation, the Academy of Finland, and the TABULA RASA project (http://www.tabularasa-euproject.org) under the Seventh Framework Programme for research and technological development (FP7) of the European Union (EU), grant agreement #257289, is gratefully acknowledged. Xiaohua Huang is funded by the China Scholarship Council of the Chinese government. Stan Z. Li would like to acknowledge the funding support from the Chinese National Natural Science Foundation Project #61070146, the National Science and Technology Support Program Project #2009BAK43B26 and the TABULA RASA project.

References

[1] Y. Adini, Y. Moses, S. Ullman, Face recognition: the problem of compensating for changes in illumination direction, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 721–732.
[2] X. Chen, P.J. Flynn, K.W. Bowyer, IR and visible light face recognition, Computer Vision and Image Understanding 99 (2005) 332–358.
[3] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley & Sons, New York, 2001.
[4] B. Fasel, J. Luettin, Automatic facial expression analysis: a survey, Pattern Recognition 36 (2003) 259–275.
[5] X. Feng, A. Hadid, M. Pietikäinen, A coarse-to-fine classification scheme for facial expression recognition, Image Analysis and Recognition, ICIAR 2004 Proceedings, Lecture Notes in Computer Science, 3212, Springer, 2004, pp. 668–675.
[6] X. Feng, M. Pietikäinen, A. Hadid, Facial expression recognition with local binary patterns and linear programming, Pattern Recognition and Image Analysis (2) (2005) 550–552.
[7] Face Recognition Vendor Test (FRVT), National Institute of Standards and Technology, http://www.frvt.org, 2006.
[8] A.S. Georghiades, P.N. Belhumeur, D.J. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 643–660.
[9] R. Gross, V. Brajovic, An image preprocessing algorithm for illumination invariant face recognition, International Conference on Audio- and Video-Based Biometric Person Authentication, 2003, pp. 10–18.
[10] B. Heisele, T. Koshizen, Components for face recognition, Int'l Conf. Face and Gesture Recognition, 2004.
[11] J. Holappa, T. Ahonen, M. Pietikäinen, An optimized illumination normalization method for face recognition, IEEE International Conference on Biometrics: Theory, Applications and Systems, 2008.
[12] Y. Ivanov, B. Heisele, T. Serre, Using component features for face recognition, Int'l Conference on Automatic Face and Gesture Recognition, 2004, pp. 421–426.
[13] T. Kanade, J.F. Cohn, Y. Tian, Comprehensive database for facial expression analysis, Int'l Conf. Face and Gesture Recognition, 2000, pp. 46–53.
[14] C. Kim, J. Oh, C.H. Choi, Combined subspace method using global and local features for face recognition, International Joint Conference on Neural Networks, 2005, pp. 2030–2035.
[15] J. Kittler, M. Hatef, R.P.W. Duin, J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3) (1998) 226–239.
[16] S.G. Kong, J. Heo, B. Abidi, J. Paik, M. Abidi, Recent advances in visual and infrared face recognition — a review, Computer Vision and Image Understanding 97 (1) (2005) 103–135.
[17] S. Kumano, K. Otsuka, J. Yamato, E. Maeda, Y. Sato, Pose-invariant facial expression recognition using variable-intensity templates, International Journal of Computer Vision 83 (2009) 178–194.
[18] S.Z. Li, R. Chu, S. Liao, L. Zhang, Illumination invariant face recognition using near-infrared images, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (4) (2007) 627–639.
[19] Z. Li, J. Imai, M. Kaneko, Facial-component-based bag of words and PHOG descriptor for facial expression recognition, IEEE International Conference on Systems, Man, and Cybernetics, 2009, pp. 1353–1358.
[20] S. Liao, W. Fan, A.C.S. Chung, D.-Y. Yeung, Facial expression recognition using advanced local binary patterns, Tsallis entropies and global appearance features, International Conference on Image Processing, 2006, pp. 665–668.
[21] A. Mehrabian, Communication without words, Psychology Today 2 (4) (1968) 53–56.
[22] S. Milborrow, F. Nicolls, Locating facial features with an extended active shape model, European Conference on Computer Vision, 2008, pp. 504–513.
[23] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7) (2002) 971–987.
[24] M. Pantic, L.J.M. Rothkrantz, Automatic analysis of facial expressions: the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1424–1455.
[25] D.A. Socolinsky, A. Selinger, A comparative analysis of face recognition performance with visible and thermal infrared imagery, International Conference on Pattern Recognition, 2002, pp. 217–222.
[26] C. Shan, S. Gong, P.W. McOwan, Facial expression recognition based on local binary patterns: a comprehensive study, Image and Vision Computing 27 (6) (2009) 803–816.
[27] D.A. Socolinsky, A. Selinger, J.D. Neuheisel, Face recognition with visible and thermal infrared imagery, Computer Vision and Image Understanding 91 (1) (2004) 72–114.
[28] Y. Sun, L. Yin, Evaluation of spatio-temporal regional features for 3D face analysis, IEEE Computer Vision and Pattern Recognition Workshop, 2009, pp. 13–19.
[29] M. Taini, G. Zhao, S.Z. Li, M. Pietikäinen, Facial expression recognition from near-infrared video sequences, International Conference on Pattern Recognition, 2008.
[30] M. Taini, G. Zhao, M. Pietikäinen, Weight-based facial expression recognition from near-infrared video sequences, Image Analysis, SCIA 2009 Proceedings, Lecture Notes in Computer Science, 5575, 2009, pp. 239–248.
[31] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, International Workshop on Analysis and Modeling of Faces and Gestures, 2007, pp. 168–182.
[32] Y. Tian, T. Kanade, J. Cohn, Recognizing action units for facial expression analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2) (2001) 97–115.
[33] Y.L. Tian, T. Kanade, J.F. Cohn, Facial expression analysis, in: S.Z. Li, A.K. Jain (Eds.), Handbook of Face Recognition, Springer, 2005, pp. 247–276.
[34] R. von Mises, Mathematical Theory of Probability and Statistics, Academic Press, New York, 1964.
[35] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 210–227.
[36] S.Y. Zhao, R.R. Grigat, An automatic face recognition system in the near infrared spectrum, International Conference on Machine Learning and Data Mining in Pattern Recognition, 2005, pp. 437–444.
[37] G. Zhao, M. Pietikäinen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6) (2007) 915–928.
[38] G. Zhao, M. Barnard, M. Pietikäinen, Lipreading with local spatiotemporal descriptors, IEEE Transactions on Multimedia 11 (7) (2009) 1254–1263.
[39] G. Zhao, M. Pietikäinen, Boosted multi-resolution spatiotemporal descriptors for facial expression recognition, Pattern Recognition Letters 30 (12) (2009) 1117–1127.
[40] Y. Zhang, Q. Ji, Active and dynamic information fusion for facial expression understanding from image sequences, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (5) (2005) 699–714.
[41] Z. Zeng, M. Pantic, G.I. Roisman, T.S. Huang, Survey of affect recognition methods: audio, visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (1) (2009) 39–58.
[42] X. Zou, J. Kittler, K. Messer, Face recognition using active near-IR illumination, British Machine Vision Conference, 2005, pp. 209–219.