
sensors

Article

Facial Expression Recognition with Fusion Features Extracted from Salient Facial Areas

Yanpeng Liu, Yibin Li, Xin Ma and Rui Song *

School of Control Science and Engineering, Shandong University, Jinan 250061, China; [email protected] (Y.L.); [email protected] (Y.L.); [email protected] (X.M.)
* Correspondence: [email protected]

Academic Editors: Xue-Bo Jin, Shuli Sun, Hong Wei and Feng-Bao Yang
Received: 23 January 2017; Accepted: 24 March 2017; Published: 29 March 2017

Abstract: In the pattern recognition domain, deep architectures are currently widely used and have achieved fine results. However, these deep architectures make particular demands, especially in terms of big datasets and GPUs. Aiming to gain better results without deep networks, we propose a simplified algorithm framework using fusion features extracted from the salient areas of faces; the proposed algorithm achieves better results than some deep architectures. To extract more effective features, this paper first defines the salient areas on the faces. Salient areas at the same location on different faces are normalized to the same size, so more similar features can be extracted from different subjects. LBP and HOG features are extracted from the salient areas, the dimensions of the fusion features are reduced by Principal Component Analysis (PCA), and several classifiers are applied to classify the six basic expressions at once. This paper proposes a salient areas definitude method which compares peak expression frames with neutral faces. It also proposes and applies the idea of normalizing the salient areas to align the specific areas which express the different expressions, so the salient areas found in different subjects have the same size. In addition, the gamma correction method is applied to LBP features for the first time in our algorithm framework, which improves our recognition rates significantly. By applying this framework, our research has gained state-of-the-art performance on the CK+ and JAFFE databases.

Keywords: facial expression recognition; fusion features; salient facial areas; hand-crafted features; feature correction

1. Introduction

Facial expression plays an important role in our daily communication with other people. For the development of intelligent robots, especially indoor mobile robots, emotional interaction between robots and humans is a foundational function. With automated facial expression recognition technology, home service robots can talk to children and take care of older people; the technology can also help doctors to monitor patients, saving hospitals much time and money. In addition, facial expression technology can be applied in a car to identify whether the driver is fatigued, which can save many lives. Facial expression recognition is worth researching because many situations need this technology. Much research has been done in the literature; the universal expressions mentioned in papers are usually anger, disgust, fear, happiness, sadness and surprise [1–3], while some researchers add neutral and contempt [4,5]. Different sensors are used to capture data of these expressions, and researchers recognize these basic expressions on both two-dimensional (2D) and 3D [6,7] faces. While different methods are applied to recognize the basic expressions in 2D and 3D spaces, landmark localization processes are used for both kinds of data. Vezzetti et al. [8,9] extracted many landmarks from multiexpression faces relying on facial geometrical properties, which makes it easy to localize these parts on 3D faces. Many application scenarios, such as service robots, apply 2D images to detect and recognize faces, so our research focuses on recognizing expressions from 2D static images.

Different styles of data and various frameworks are applied to 2D facial expression recognition. Like other recognition research, facial expression recognition uses data from videos, image sequences [10] and static images [3,11]. Research that uses videos and image sequences exploits the whole movement process of an expression. Research using static images only uses the peak frames, because they contain sufficient information about the specific expressions; that is also the reason why this paper chose to use the peak frames. Two main kinds of algorithm frameworks are applied in facial expression recognition work. Algorithms of the first kind use mature descriptors such as the Histogram of Oriented Gradient (HOG) [12] and Local Binary Patterns (LBP) [4] to extract features from the images and then send the features to classifiers. The performance of this kind of framework relies on the effectiveness of these descriptors. In order to fuse more effective descriptors, researchers extract different kinds of features and fuse them together [13]. Although fusion features behave better than a single kind of feature, their distinguishing power has not been fully used; a feature correction method is applied to the features in our paper and this significantly improves the recognition rate. Deep networks are another popular framework in the facial expression recognition domain. AU-inspired Deep Networks (AUDN) [2], Deep Belief Networks (DBN) [14] and the Convolutional Neural Network (CNN) [10,15] have been used in facial expression recognition work. Apart from the higher recognition rate, these algorithms need more computing resources and data. For these reasons, the former framework is applied in our research.

Face alignment is applied to help researchers extract more effective features from static images [4,16]. Automated facial landmark detection is the first step of this work; after finding the landmarks on the faces, researchers can align the faces and extract features from them. In the early days, owing to the limitations of face alignment technology, researchers used fewer landmarks to align the faces and separated the faces into several small patches for feature extraction [13]. This can roughly align the faces, while more landmarks can improve the alignment precision. There are many methods to detect landmarks on faces. Tzimiropoulos et al. [17] proposed a Fast-SIC method for fitting AAMs to detect marks on the faces. Zhu et al. [18] use a model based on a mixture of trees with a shared pool of parts which marks 68 landmarks on the face. This method is applied in our algorithm and the 68 landmarks are used to align the salient areas. These landmarks mark the shape of the eyebrows, eyes, nose, mouth and the whole face, which helps researchers to cut out the salient patches. Although aligning faces can help to extract more effective features, some areas on the faces do not align well during this process. In this paper, the idea of normalizing the salient areas is proposed for the first time to improve the effectiveness of feature extraction.

In order to reduce the feature dimensions and extract more effective features, different salient areas definitude methods have been proposed in the literature. Zhong et al. [3] explained that discovering common patches across all the expressions is actually equivalent to learning the shared discriminative patches for all the expressions. They transferred the problem into a multi-task sparse learning (MTSL) problem and obtained a good result with this method. Happy et al. [13] applied the areas found in that paper and also gained a decent result. Liu et al. [2] used Micro-Action-Pattern Representation in the AU-inspired deep networks (AUDN) and built four criteria to construct the receptive fields. This gained a better result and accomplished feature extraction at the same time. In order to define the salient areas more accurately, our research compares neutral faces with the peak frames of these expressions. The Karl Pearson correlation coefficient [19] is applied to evaluate the correlation between the neutral faces and the faces expressing different expressions. To find the precise locations of the salient areas, the faces are separated into several small patches. After comparing the small patches to their neutral-face counterparts, the patches with weaker correlation coefficients are found, and these are the areas expressing the specific expression. By using this method, the salient areas of the six fundamental expressions are found and, after fusing these areas, the salient areas common to the six basic expressions are found too. Landmarks of the faces are used to locate these salient areas, and the different sizes of salient areas are normalized in our research framework.

Different kinds of descriptors are applied in facial expression recognition research. Regarding the scale of the feature extraction areas, hand-crafted features were previously extracted from the whole aligned face [4,16], but nowadays the salient areas are used in hand-crafted extraction [3,13]. Aiming to describe the different expressions more effectively, diverse feature extraction methods are used in facial expression recognition. Typical hand-crafted features include Local Binary Patterns (LBP) [4], the Histogram of Oriented Gradient (HOG) [12], the Scale Invariant Feature Transform (SIFT) [20], and fusions of these features [11]. According to the literature, fusion features contain more information about these expressions and achieve better results; that is the reason why we chose to extract LBP and HOG from the faces. Although fusion features can improve the recognition rate, it is hard to fuse these features well. Before different features are fused together, normalization methods are applied to them. Although this normalization can improve the recognition result, the identities of the different kinds of features cannot mix well. Aiming to use more information from the LBP features and normalize them, the gamma feature correction method is applied to LBP features for the first time in our algorithm framework.

In this paper, a straightforward but effective algorithm framework is proposed to recognize the six basic expressions from static images. The framework is shown in Figure 1. In order to define and obtain the salient areas, the faces and facial landmarks are detected in the first step. The faces are then separated into several patches and, by comparing neutral faces with these expressions, the salient areas are defined. The salient areas are then separated from the faces according to the landmarks. To extract more effective features from these salient areas, the idea of normalizing the salient areas is proposed for the first time to overcome salient-area misalignment. After that, LBP and HOG features are extracted from the salient areas. The gamma correction method is applied to the LBP features for the first time, so the classifier can use more information from them. The Z-score method is used to normalize the LBP and HOG features for fusion. Before applying different classifiers to classify the six expressions, Principal Component Analysis (PCA) is utilized to reduce the dimensions. Finally, different classifiers are applied to evaluate the effect of our framework, which achieves a better grade than the deep networks [2,14].

Figure 1. Framework of the proposed algorithm.


2. Methodology

In this section, the proposed algorithm is explained in detail. It introduces the salient facial areas definitude principle and shows the salient areas normalization and feature fusion methods. The LBP feature correction method is then introduced and applied in our algorithm.

2.1. Face Alignment and Salient Facial Areas Definitude

Automated face and facial landmark detection is the first step in our method. Facial landmark detection is an important basis for facial expression classification. The method of [18] is chosen to mark 68 landmarks on the faces in our research. These landmarks mark the shape of the eyebrows, eyes, nose, mouth and the whole face, so these specific areas can be located by the landmarks. The 68 landmarks on the face and the normalized face are shown in Figure 1. According to the average length and width of the faces and their proportion, the faces in the CK+ database are normalized to 240 × 200.
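To make this step concrete, the following sketch detects the 68 landmarks and produces the normalized face. The paper uses the mixture-of-trees model of Zhu and Ramanan [18]; dlib's 68-point shape predictor is substituted here as a widely available stand-in, so the detector choice and the model file name are assumptions, not the paper's implementation.

```python
# Hedged sketch: 68-point landmark detection and 240 x 200 face normalization.
# dlib's predictor stands in for the mixture-of-trees detector of [18]; the
# model file "shape_predictor_68_face_landmarks.dat" must be obtained separately.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks_and_normalized_face(gray):
    """Return the (68, 2) landmark array and the face crop resized to 240 x 200."""
    rect = detector(gray, 1)[0]                 # assume one face per image
    shape = predictor(gray, rect)
    pts = np.array([[p.x, p.y] for p in shape.parts()])
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    face = gray[y0:y1, x0:x1]
    return pts, cv2.resize(face, (200, 240))    # 240 rows x 200 columns
```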

As we all know, the six fundamental expressions have different salient areas. In this paper, an algorithm is proposed to find the salient areas of these expressions. For the purpose of extracting more effective features from the faces, people have applied different methods to calculate the salient areas. Zhong et al. [3] explained that discovering the common patches across all the expressions is actually equivalent to learning the shared discriminative patches for all the expressions. Since multi-task sparse learning (MTSL) can learn common representations among multiple related tasks [21], they transferred the problem into an MTSL problem, and this method gained a good result. In order to learn expression-specific features automatically, Liu et al. [2] proposed an AU-inspired deep network (AUDN). They used Micro-Action-Pattern Representation in the AUDN and built four criteria to construct the receptive fields. This gained a better result and accomplished feature extraction at the same time.

These methods all found salient areas on the aligned faces and extracted features from these areas. In our algorithm, the areas which differ most from their own neutral faces are found first. As described in the last paragraph, the 68 landmarks have been found and the faces normalized to 240 × 200. The areas of the six basic expressions are compared to their neutral faces at the same locations. If an area does not move during the expression, it must have higher correlation with the corresponding area on the neutral face. Using this principle, and in preference to the other correlation coefficient methods in [19], the Karl Pearson correlation coefficient is applied to evaluate the correlation between the neutral faces and the faces expressing the different expressions; it is computed between the corresponding patch matrices. To find the precise locations of the salient areas, the faces are separated into 750 (30 × 25) patches, and every patch is 8 × 8 pixels. These 8 × 8 patches are matrices, and by comparing the small patches from neutral faces and specific expressions, the salient areas can be precisely found. The Karl Pearson correlation coefficient formula is shown next.

\gamma_{ij}^{k} = \frac{\sum_{m}(E_m - \bar{E})(N_m - \bar{N})}{\sqrt{\left(\sum_{m}(E_m - \bar{E})^2\right)\left(\sum_{m}(N_m - \bar{N})^2\right)}}    (1)

where γ^k_ij is the correlation coefficient of the (i, j)th patch for a specific expression k, so i ranges from 1 to 30 and j from 1 to 25. E_m is the pixel value of one subject's image of the specific expression, with m ranging from 1 to 64, while N_m is the corresponding pixel value of that subject's neutral face. Ē is the mean of the small patch from the specific expression, and N̄ is the mean of the corresponding patch of the neutral face.

R_{ij} = \sum_{k} \rho_k \, \gamma_{ij}^{k}    (2)

In order to find the salient areas across all six basic expressions, a formula is defined to evaluate the final correlation coefficient. R_ij is the final correlation coefficient of the (i, j)th location on the face. By changing ρ_k, the proportion of the kth expression can be adjusted; the sum of the ρ_k must equal 1.

\sum_{k=1}^{6} \rho_k = 1    (3)

The results for the six expressions are shown in Figure 2; the areas found in this section will be used in the next section. From Figure 2, we can see that different expressions have different salient areas. Equation (1) is used to evaluate the salient areas of the six fundamental expressions. In Equation (3), ρ_k expresses the proportion of the specific expression in the final result; the value of ρ_k can be adjusted according to the numbers of images available for the different expressions.
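A minimal sketch of this patch-correlation search (Equations (1)–(3)) is given below, assuming the faces are already normalized 240 × 200 grayscale arrays; all function and variable names are illustrative.

```python
# Sketch of the salient-area search: per-patch Pearson correlation (Eq. (1))
# between a peak-expression face and the subject's neutral face, then a
# weighted fusion over the six expressions (Eqs. (2) and (3)).
import numpy as np

def patch_correlations(expr_face, neutral_face, ph=8, pw=8):
    rows, cols = expr_face.shape[0] // ph, expr_face.shape[1] // pw  # 30 x 25
    gamma = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            e = expr_face[i*ph:(i+1)*ph, j*pw:(j+1)*pw].ravel().astype(float)
            n = neutral_face[i*ph:(i+1)*ph, j*pw:(j+1)*pw].ravel().astype(float)
            num = np.sum((e - e.mean()) * (n - n.mean()))
            den = np.sqrt(np.sum((e - e.mean())**2) * np.sum((n - n.mean())**2))
            gamma[i, j] = num / den if den > 0 else 1.0  # constant patch: fully correlated
    return gamma

def fused_map(gammas, rho):
    """R_ij = sum_k rho_k * gamma^k_ij, with the weights summing to 1."""
    assert abs(sum(rho) - 1.0) < 1e-9
    return sum(r * g for r, g in zip(rho, gammas))   # low R_ij marks salient patches
```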

Figure 2. The salient areas of the six expressions. (a) Salient areas of all expressions and neutral; (b) Salient areas of anger; (c) Salient areas of disgust; (d) Salient areas of fear; (e) Salient areas of happy; (f) Salient areas of sad; (g) Salient areas of surprise.

2.2. Salient Areas Normalization and Features Extraction

In this section, the idea of normalizing the salient areas rather than the whole faces is proposed and applied. Furthermore, Local Binary Pattern (LBP) and Histogram of Oriented Gradient (HOG) features are both extracted from the salient areas. Compared with extracting features from the whole faces, extracting features from the salient areas reduces the dimensions, lowers noise impact and avoids overfitting.

2.2.1. Salient Areas Normalization

In the last section, the salient areas were determined. Our research has a similar result to that in [3], but the performance differs in the eye areas. In [3,13], the researchers used patches which come from aligned faces. Normalizing the whole faces is a good idea before more landmarks are marked on the faces; then, more landmarks can be marked, which makes it easier to extract the salient areas. There are two main reasons for choosing to normalize the salient areas instead.

Firstly, aligning the faces may result in salient areas being misaligned. In order to demonstrate the alignment effects, the faces are aligned and then all the faces of one specific expression are added to gain the average face. Each pixel X_ij is the average over all images of the specific expression.

The salient mouth parts are separated from the faces and the average salient mouth areas are calculated for comparison. Figure 3 shows the average faces, the average salient areas and the mouth parts of the average faces. The mouth parts of the average faces are compared with the mouth parts obtained by salient-area alignment. From the figure, it is clear that the mouth parts of the average faces have weaker contrast than the aligned mouth parts, which shows that aligning the whole faces leads to salient areas being misaligned; in contrast, by aligning the salient mouth areas, the mouths have a clear outline. Moreover, aligned faces have salient areas of different sizes. Different faces have different sizes, so differently sized salient areas are extracted when we just use the landmarks to cut them out, and in the end LBP and HOG features of different dimensions are extracted from these salient areas. Using such features to classify the expressions can lead to a worse recognition result. The reason why different LBP and HOG dimensions are extracted from differently sized salient areas follows from the principles of LBP and HOG, which will be introduced in the next part. Aligning only the whole faces may also produce different features from subjects showing the same expression, because they use differently sized areas to express it; this negatively influences the feature training in our algorithm framework. These are the reasons why the salient areas are normalized in our algorithm. Because our salient areas normalization method overcomes these shortcomings, our algorithm can gain a better result than methods that only align the faces. In the experimental section, comparative experiments are designed to compare the salient areas alignment method with the traditional face alignment methods.

X_{ij} = \frac{1}{n} \sum x_{ij}    (4)
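As a small worked illustration of Equation (4), the pixel-wise average below can be applied to whole aligned faces and, separately, to normalized mouth areas to reproduce the comparison of Figure 3; the variable names are placeholders.

```python
# Sketch of Equation (4): X_ij = (1/n) * sum over images of x_ij.
import numpy as np

def average_image(images):
    """Pixel-wise mean of a list of equally sized grayscale images."""
    return np.stack([img.astype(float) for img in images]).mean(axis=0)

# avg_face  = average_image(aligned_faces)            # blurred mouth outline
# avg_mouth = average_image(normalized_mouth_areas)   # sharper mouth outline
```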

Figure 3. Average faces and the average salient areas. (a) Average faces of the six expressions; (b) Mouth areas of the average faces; (c) Average salient mouth areas.

2.2.2. Features Extraction

• Local Binary Patterns (LBP)

Texture information is an important descriptor for pattern analysis of images, and Local Binary Patterns (LBP) were presented to gain texture information from images. LBP was first described in 1994 [22,23] and has since been found to be a powerful feature for texture representation. As for facial expressions, actions of the muscles on the faces generate different textures. LBP features can describe the texture information of the images, and this is the reason why LBP features are extracted from the salient areas. The calculation process of the original LBP value is shown in Figure 4a. A useful extension to the original operator is the so-called uniform pattern [24], which can be used to reduce the length of the feature vector and implement a simple rotation-invariant descriptor. In our research, a uniform-pattern LBP descriptor is applied to gain features from the salient areas, and the salient areas are all separated into small patches. LBP features are gained from these patches respectively and concatenated as the final LBP features. The length of the feature vector for a single cell can be reduced from 256 to 59 by using uniform patterns. This is very important, because there are many small patches in our algorithm. For example, the size of the mouth area is 40 × 60 and the small patches' size is 10 × 15, so the mouth area is divided into 16 (4 × 4) patches. The uniform LBP features are extracted from each small patch and mapped to a 59-dimensional histogram. The salient areas are all separated into several small patches, as shown in Figure 5, and the patch numbers are given in Table 1. The dimension of the final LBP features, found by adding the numbers in Table 1, is 3068.

Figure 4. (a) Calculation process of the original Local Binary Patterns (LBP) value; (b) Mouth area with pixels of 40 × 60; (c) Mouth area with former pixels of 28 × 42, enlarged from the former image; (d) Display models of the mouths in (b,c).

Figure 5. Small patches of salient areas. (a) Mouth areas: anger, fear, happy; (b) Forehead areas: anger, fear; (c) Cheek areas: left cheek fear, right cheek fear, left cheek anger, left cheek happy, right cheek happy.

Table 1. Patch numbers and LBP dimensions of the salient areas.

Salient Areas        Forehead   Mouth     Left Cheek   Right Cheek
Pixels               20 × 90    40 × 60   60 × 30      60 × 30
Small patch number   12         16        12           12
LBP dimension        708        944       708          708
Total                3068
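A sketch of this per-patch uniform-LBP extraction follows, assuming scikit-image; its 'nri_uniform' method is the 59-bin (non-rotation-invariant) uniform code for P = 8 neighbors. For a 40 × 60 mouth area with 10 × 15 patches it reproduces the 944 dimensions of Table 1.

```python
# Sketch: 59-bin uniform-LBP histograms per small patch, concatenated per area.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature(area, ph=10, pw=15):
    codes = local_binary_pattern(area, P=8, R=1, method="nri_uniform")
    hists = []
    for i in range(area.shape[0] // ph):
        for j in range(area.shape[1] // pw):
            patch = codes[i*ph:(i+1)*ph, j*pw:(j+1)*pw]
            h, _ = np.histogram(patch, bins=59, range=(0, 59))
            hists.append(h)
    return np.concatenate(hists)   # mouth: 16 patches x 59 bins = 944 dims
```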

Because different features will be extracted from differently sized salient areas, these salient areas should be aligned. In order to demonstrate the difference, differently sized mouth areas are cut from one face and normalized to the same size. These mouth areas are shown in Figure 4. LBP features are extracted from these normalized areas and their distributions are also shown in Figure 4. From the figure, we can see that the values in (d) and (e) behave differently, and a conclusion can be drawn that differently sized images have different LBP features. Besides, HOG features are also extracted from these salient areas, with a similar result to the LBP features.

• Histogram of Oriented Gradient (HOG)

The Histogram of Oriented Gradients (HOG) is a feature descriptor used in computer vision and image processing [25]. The technique counts occurrences of gradient orientations in localized portions of an image. HOG descriptors were first described in 2005 [26], where the authors used HOG for pedestrian detection in static images. During HOG feature extraction, the image is divided into several blocks and the histograms of different edges are concatenated as a shape descriptor. HOG is invariant to geometric and photometric transformations, except for object orientation. Because the images in these databases have different lighting conditions, and different expressions have different orientations of the eyes, nose and lip corners, HOG is selected as a powerful descriptor in our algorithm. For extracting HOG features in our paper, every cell is 5 × 5 pixels and 4 (2 × 2) cells make up a patch. The mouth area is 60 × 40 pixels and every cell has 9 features, so the HOG dimension of the mouth area is 2772. The dimensions of the four salient areas are shown in Table 2. The HOG descriptors are shown in Figure 6, which shows that the mouth areas of different expressions have different HOG descriptors.

Table 2. Patch numbers and Histogram of Oriented Gradient (HOG) dimensions of the salient areas.

Salient Areas        Forehead   Mouth     Left Cheek   Right Cheek
Pixels               20 × 90    40 × 60   60 × 30      60 × 30
Small patch number   51         77        55           55
HOG dimension        1836       2772      1980         1980
Total                8568
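The patch counts in Table 2 imply that adjacent patches (blocks) overlap by one cell; under that assumption, the dimensions can be checked with scikit-image's hog, whose blocks overlap this way by default. This is a verification sketch, not the paper's implementation.

```python
# Sketch: verify the HOG dimensions of Table 2 (9 orientations, 5 x 5 cells,
# 2 x 2 cells per block, one-cell block stride).
import numpy as np
from skimage.feature import hog

areas = {"forehead": (20, 90), "mouth": (40, 60),
         "left cheek": (60, 30), "right cheek": (60, 30)}
for name, (h, w) in areas.items():
    f = hog(np.zeros((h, w)), orientations=9,
            pixels_per_cell=(5, 5), cells_per_block=(2, 2))
    print(name, f.size)   # 1836, 2772, 1980, 1980 -> total 8568
```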

Figure 6. HOG descriptors of the mouths. (a) Happy mouth areas; (b) Fear mouth areas.

2.3. Features Correction and Features Fusion

2.3.1. LBP Correction

The LBP feature is a very effective descriptor of the texture information of images, and many researchers [3,13] have applied LBP to describe the different expressions. In our algorithm, LBP features are extracted and processed before they are sent to the classifiers. In most papers, researchers extract features from images and normalize the data to 0–1, e.g., [27]. In our algorithm framework, the image data are also normalized to 0–1, and by using this method a better recognition rate can be gained. In order to normalize the LBP features of every subject, the method in Equation (5) is applied to normalize the LBP features to 0–1.

L_m = \frac{l_m}{\max(l_m)}    (5)

where m indexes the dimensions of every salient area and l_m is the value of the LBP feature.

Aiming to utilize the specific characteristic of every area, the four salient areas are normalized respectively. The distributions of the LBP features extracted from the four salient areas are displayed in Figure 7. The figure shows that the distribution of the LBP features follows a power law. In image processing, gamma correction redistributes native camera tonal levels into ones which are more perceptually uniform, thereby making the most efficient use of a given bit depth. In our algorithm, the distribution of the LBP features is a power law distribution, and therefore most information is concentrated in a minority of LBP features. According to our experiment, although the LBP feature distributions of all subjects follow the power law, the specific values differ slightly; for example, the number of LBP values between 0 and 0.5 differs between subjects. This can be changed by the gamma correction method, which spreads the information over more LBP features. To use more information from the LBP features and to make it easy to fuse the LBP and HOG features, gamma correction is used to correct the LBP feature data.

L_m = L_m^{1/\lambda}    (6)

where λ is the gamma correction number. As is well known, most values of the gamma number come from experimental experience. The parameter σ is proposed in our algorithm to help find the proper λ; its mathematical expression is defined in Equation (7). From Equation (7) we know that σ has a similar meaning to variance; zero values are excluded from the computation because they carry no meaning. The gamma correction method is proposed to use more information from the initial LBP features, and the value of the variance shows the fluctuation of the data. For a power law distribution, few values contain most of the information. When the gamma correction method is applied to the LBP features, more LBP features carry information and therefore the fluctuation of the LBP features becomes bigger. That is the reason why the parameter σ is proposed and applied, and the relationship between σ and the gamma number λ supports the correctness of our method.

Every salient area is processed respectively, so four σ values are obtained. The relationships between the four salient areas' σ and the gamma number λ are shown in Figure 8. According to the experimental data, all four salient areas have their maximum σ value around λ = 2. Therefore, the four salient areas' σ values are added to find the final sum and the related λ.

\sigma = \frac{1}{np} \sum_{n} \sum_{p} (L_{np} - U_n)^2    (7)

U_n = \frac{1}{p} \sum_{p} L_{np}    (8)

where n is the number of images and p is the number of nonzero values in the LBP features of a salient area; therefore, p is smaller than the LBP feature dimension of that salient area. L_np is the value of the LBP feature and U_n is the mean value of the LBP features of the specific subject. In our algorithm, the relationship between σ, the gamma number λ and the recognition rate has been found; it is shown in Figure 9.
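A compact sketch of the gamma search (Equations (6)–(8)) is given below, assuming L is an (images × features) array holding one salient area's LBP features scaled to 0–1; zero entries are masked out, as in Equation (7), and the names are illustrative.

```python
# Sketch: choose the gamma number lambda that maximizes sigma (Eqs. (6)-(8)).
import numpy as np

def sigma(L):
    """Fluctuation of the nonzero LBP values: Eq. (7) with Eq. (8) as the mean."""
    masked = np.where(L > 0, L, np.nan)              # zeros carry no information
    U = np.nanmean(masked, axis=1, keepdims=True)    # per-image mean, Eq. (8)
    return np.nanmean((masked - U) ** 2)             # Eq. (7)

def best_lambda(L, lambdas=np.arange(0.1, 3.01, 0.1)):
    scores = [sigma(L ** (1.0 / lam)) for lam in lambdas]   # Eq. (6)
    return lambdas[int(np.argmax(scores))]           # the paper finds ~1.8-2
```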


Figure 7. Distributions of the LBP feature values (horizontal axis: values of LBP features from 0 to 1; vertical axis: numbers of the values). (a) Display model of the mouth's LBP features; (b) Display model of the left cheek's LBP features; (c) Display model of the forehead's LBP features; (d) Display model of the right cheek's LBP features.

Figure 8. (a) Relationship between the mouth's λ and σ; (b) Relationship between the left cheek's λ and σ; (c) Relationship between the forehead's λ and σ; (d) Relationship between the right cheek's λ and σ.

2.3.2. HOG Processing and Features Fusion

Different features describe different characteristics of the images; therefore, researchers have merged features together to take advantage of the superiority of all of them [13,27]. In our algorithm, the LBP and HOG descriptors are applied to utilize the texture and orientation information of these expressions. Proper fusion methods are very important for recognition work, and unsuitable methods can make the recognition result worse. The recognition rates of the individual features and the fusion features will be shown in the experimental section. Zhang et al. [27] applied a structured regularization (SR) method to enforce and learn the modality-specific sparsity and density of each modality, respectively. In our algorithm, the single features are first processed to their best performance and then normalized to the same scale. In the experimental section, different experiments are presented to report the results of the single features and the fusion feature.

The gamma correction method is applied to ensure the LBP features achieve their best performance. Different features must be brought to the same scale when they are fused together. The Z-score method is used to process the LBP and HOG features; after applying it, the average is 0 while the variance is 1.

\sigma = \sqrt{\frac{1}{J} \sum_{j} (f_j - \mu)^2}    (9)

\mu = \frac{1}{J} \sum_{j} f_j    (10)

\tilde{f}_j = \frac{K (f_j - \mu)}{\sigma} + C    (11)

where f_j is the original LBP or HOG feature value and f̃_j is the feature value after processing. As for the LBP features, although the display model has changed, the data are brought to the same scale and a better result can be gained. Because f̃_j would otherwise be too small, it is multiplied by the number K; in our experiment, K equals 100.
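A sketch of this fusion step (Equations (9)–(11)) follows, under the assumption that the Z-score statistics are computed per feature dimension across samples (a per-sample variant is analogous); K = 100 as in the paper and C = 0 here.

```python
# Sketch: z-score each feature block (Eqs. (9)-(11)), scale by K, concatenate.
import numpy as np

def zscore(F, K=100.0, C=0.0, eps=1e-8):
    """Column-wise z-score of an (n_samples, n_features) block."""
    return K * (F - F.mean(axis=0)) / (F.std(axis=0) + eps) + C

def fuse(lbp, hog):
    return np.hstack([zscore(lbp), zscore(hog)])   # 3068 + 8568 = 11,636 dims
```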

2.4. Principal Component Analysis (PCA) and Classifiers

Principal Component Analysis (PCA) was invented in 1901 by Karl Pearson [28] as an analog of the principal axis theorem in mechanics; it was later independently developed by Harold Hotelling in the 1930s [29]. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components, and it is an effective method to reduce the feature dimension; many researchers use it for this purpose. In our algorithm, the fusion features' dimension is 11,636, which is very large, so PCA is applied to reduce it. The relationship between the recognition rate and the number of dimensions under the softmax classifier is shown in Figure 11. According to the experiment, the most appropriate dimension is 80.

Many kinds of classifiers are applied in facial expression recognition research, and researchers apply different classifiers to evaluate their algorithms. These classifiers include SVMs with polynomial [4,13], linear [2,4,13] and RBF [4,13] kernels; softmax [10,15] is also utilized in this work. In order to evaluate the effectiveness of our proposed algorithm and compare it with similar work in other literature, several classifiers are applied in our experimental part. In our algorithm, the different classifiers operate on the fusion features whose dimensions have been reduced by PCA.
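A minimal scikit-learn sketch of this final stage is shown below: PCA to 80 dimensions followed by a softmax (multinomial logistic regression) or a linear SVM. The hyperparameters are illustrative, not the paper's exact settings (the paper's SVMs use LIBSVM [33]).

```python
# Sketch: PCA to 80 dimensions, then softmax or SVM classification.
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

softmax_clf = make_pipeline(PCA(n_components=80),
                            LogisticRegression(max_iter=1000))  # softmax classifier
svm_clf = make_pipeline(PCA(n_components=80), SVC(kernel="linear"))
# softmax_clf.fit(X_train, y_train); softmax_clf.score(X_test, y_test)
```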

3. Database Processing

3.1. CK+ Database

The CK+ database [5] is an extended version of the CK database [30] and contains both male and female subjects. There are 593 sequences from 123 subjects in the CK+ database, but only 327 sequences are assigned to one of 7 labels: anger (45), contempt (18), disgust (59), fear (25), happy (69), sad (28) and surprise (83). The sequences show the variation from neutral to the peak of the expression, and the different expressions have different numbers of sequences. In particular, the images added in 2010 come in two resolutions, 640 × 490 and 640 × 480. In order to compare with other methods [2,3,14,27], our experiments use the 309 of the 327 sequences that remain after excluding contempt. Similar to the method used in [2,3], the first image (the neutral face) and the last three peak frames of each sequence are chosen for training and testing. The ten-fold cross-validation method is applied in the experiments, with the subjects separated into 10 parts according to their IDs. There are 106 subjects in the chosen database, so the subjects are distributed into 10 parts which have roughly equal image and subject numbers.
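The subject-independent split can be sketched with scikit-learn's GroupKFold, which groups by subject ID so that no subject appears in both training and test folds; for JAFFE, LeaveOneGroupOut would give the leave-one-person-out protocol used later. The helper below is illustrative.

```python
# Sketch: subject-independent ten-fold cross-validation by subject ID.
import numpy as np
from sklearn.model_selection import GroupKFold

def ten_fold_score(clf, X, y, subject_ids):
    scores = []
    for tr, te in GroupKFold(n_splits=10).split(X, y, groups=subject_ids):
        clf.fit(X[tr], y[tr])
        scores.append(clf.score(X[te], y[te]))
    return float(np.mean(scores))
```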

3.2. JAFFE Database

The JAFFE database [31,32] consists of 213 images from 10 Japanese female subjects. Every subject has 3 or 4 examples of each of the six basic expressions and also a sample of the neutral expression. In our experiment, 183 images are used to evaluate our algorithm.

4. Experiments

In this section the experimental settings and the details of our paper are described. All comparison experiment ideas come from the second section of our paper, and these experiments are applied to evaluate our methods and verify our algorithm's correctness. Our experiments are executed on the CK+ and JAFFE databases, and the results are compared with the recognition rates in the related literature.

4.1. Salient Areas Definitude Method Validation

The reasons why the salient areas rather than the whole faces are chosen in our algorithm were introduced in Section 2. Experiments are designed to evaluate the performance of our salient areas definitude method. In addition, LBP features are extracted both from whole aligned faces and from the aligned salient areas to gain contrast recognition rates; these results are shown in Table 3. The salient areas are separated from the raw images rather than from aligned faces, according to specific landmarks among the 68 landmarks on the face. In Table 3, the 10-fold cross-validation method is used to evaluate the performance of our method, and in this case only the LBP features are used in the recognition experiment.

Table 3. Recognition rate on CK+ under different salient areas definitude methods.

Method                            Classifier   Recognition Rate
Zhong 2012 [3] (MTSL)             SVM          89.9
Liu 2015 [2] (LBP)                SVM          92.67
Liu 2015 [2] (AUDN-GSL)           SVM          95.78
Proposed method (without gamma)   SVM          95.5
Proposed method (with gamma)      SVM          96.6

To determine whether using the salient areas is more effective than the whole-face alignment methods, the mouth areas are normalized to 60 × 30, the cheek areas to 30 × 60 and the eye areas to 20 × 90. LBP features are extracted from small patches of size 15 × 10 and then concatenated. The LBP features are used to evaluate the performance of our algorithm and compare it with other methods; the results are shown in Table 3. Several comparison experiments are designed, and SVM classifiers are applied to evaluate our algorithm. Comparing the LBP features extracted from the normalized salient areas with features extracted from salient areas on aligned faces and from whole aligned faces, better recognition rates are gained by our algorithm using the SVM classifier. Polynomial, linear and RBF kernel SVMs are used in our experiment, with the SVM implementation by Chih-Chung Chang and Chih-Jen Lin [33]. The gamma correction method is used to process the LBP features in this experiment. Compared with the experiments of Zhong et al. [3] and Liu et al. [2], the results in Table 3 show that our algorithm has a more precise recognition rate.

4.2. Gamma Correction of LBP Features

In Section 2.3, a method was proposed to process the LBP features, and the relationship between σ and the gamma number λ was found. In order to evaluate the effects of gamma correction and verify the relationship between σ and λ, some comparison experiments are designed in our paper on the CK+ and JAFFE datasets. In our experiments, λ ranges from 0.1 to 3 and all the results are recorded. In order to clearly show the relationship between σ, the gamma correction number λ and the recognition rate, figures have been drawn to display the trends of the recognition rate and σ. In Figure 9, σ is the sum of the four salient areas' σ on the CK+ database. When λ equals 1 there is no gamma correction, and from the figure one can see that the biggest recognition rate and the biggest σ value both occur for λ near 1.8. The relationship between σ and λ on the JAFFE database is shown in Figure 10. These two figures show that our LBP correction method has good universality. Because there are fewer images in the JAFFE database, the curves in Figure 10 are not smooth enough, but their overall trends correspond with the relationship in Figure 9. The performances of these experiments are shown in Table 4, where different classifiers are used to evaluate the universality of the gamma correction method. Table 4 shows that gamma correction significantly improves the recognition rate, which proves that our method of using σ to find the proper λ can be applied in facial expression recognition work. In Table 4, 10-fold cross-validation is applied on the CK+ database and the leave-one-person-out validation method is used on the JAFFE database.

Figure 9. (a) Relationship between λ and σ on the CK+ database; (b) Relationship between λ and the recognition rates of different classifiers (polynomial, RBF and linear SVM, and softmax) on the CK+ database.


Figure 10. (a) Relationship between λ and σ on the JAFFE database; (b) Relationship between λ and the recognition rates of different classifiers (polynomial, RBF and linear SVM, and softmax) on the JAFFE database.

Table 4. Recognition rates on different classifiers with and without gamma correction.

                     CK+                      JAFFE
                     Gamma-LBP   LBP          Gamma-LBP   LBP
SVM (polynomial)     96.6        95.5         62.8        62.3
SVM (linear)         96.6        95.6         63.4        60.8
SVM (RBF)            96.0        87.1         62.8        61.2
Softmax              97.0        95.6         61.7        59.6

In order to compare our algorithm with other research, we apply the same classifier and validation method as in the literature. Compared with the literature in Table 5, our experiment on the CK+ database has better results, which shows that our salient areas definitude method and LBP correction method perform well.

Table 5. Recognition rate on CK+ under LBP features in different literature.

Methods              Zhong 2012 [3]   Shan 2009 [4]   Proposed Method
Classifier           SVM              SVM             SVM
Validation Setting   10-Fold          10-Fold         10-Fold
Performance          89.9             95.1            96.6


4.3. Features Fusion, PCA and Recognition Rate Comparison

In our algorithm, LBP and HOG features extracted from the salient areas are used to train the SVM and softmax classifiers. In order to gain a better result, the LBP and HOG features are fused in our research. Using only the HOG features, we obtain a 96.7 recognition rate; using the fusion method, we obtain a better result, namely 98.3, on the CK+ database, and a similar result is obtained on the JAFFE database. Because fusion features lead to a better recognition result, they are used in our algorithm.

The full dimension of the fusion features is 11,636, which is a very large number; such a huge feature dimension can introduce noise and lead to overfitting. As for the PCA method, if the feature dimension is bigger than the number of images, the number of principal components is limited by the number of images. In order to find the most appropriate number, the dimension is varied from 10 to 1000 and the PCA dimension is chosen according to the recognition rate. Because the softmax classifier obtains a better recognition rate than the other classifiers, we use softmax to show the effects of PCA. The relationship between the PCA dimension and the recognition rate is displayed in Figure 11. The most appropriate PCA dimension is chosen according to the recognition rate; the dimension of the features put into softmax is 80 on the CK+ database, and a similar curve is obtained on JAFFE.

Figure 11. Relationship between Principal Component Analysis (PCA) dimension and recognition rate.

With this step, the best recognition rate of 98.3 is gained under the 10-fold cross-validation method on the CK+ database. To our knowledge, compared with the other methods in the literature, this is a state-of-the-art result. Our result is compared with other methods in the literature in Table 6. Those four experiments all used deep networks, while hand-crafted features are used in our algorithm. This shows that our algorithm has fine recognition ability through extracting features from the salient areas, correcting the LBP features and fusing these features. In order to evaluate the adaptability of our algorithm, it is also applied on the JAFFE database; the results from other literature and from our algorithm are shown in Table 7. The experiment shows that our algorithm has quite good adaptability. Compared with [13], our method recognizes about 5 more images on average. In addition, compared with [4], our features' dimension on CK+ and JAFFE is 11,636, much less than 16,640, which illustrates that our algorithm needs less time and memory to train and predict.


Table 6. Recognition rate on CK+.

Literature           Liu 2014 [14]   Liu 2015 [2]   Jung 2015 [10]   Khorrami 2015 [15]   Proposed Algorithm
Method               BDBN            AUDN           DTAGN            Zero-bias CNN+AD     LBP+HOG
Validation Setting   10-Fold         10-Fold        10-Fold          10-Fold              10-Fold
Accuracy             96.7            95.785         96.94            98.3                 98.3

Table 7. Recognition rate on JAFFE.

Literature           Shan 2009 [4]   Happy 2015 [13]   Proposed Algorithm   Proposed Algorithm   Proposed Algorithm
Classifier           SVM (RBF)       SVM (Linear)      SVM (Linear)         SVM (Linear)         Softmax
Validation Setting   10-Fold         5-Fold            5-Fold               10-Fold              10-Fold
Accuracy             81.0            87.43             87.6                 89.6                 90.0

5. Discussion

More information about salient areas definitude methods is needed. Zhong et al. [3] designed a two-stage multi-task sparse learning algorithm to find the common and specific salient areas. LBP features rather than raw data are used in their method, i.e., the LBP features represent the information from the images. Liu et al. [2] built a convolutional layer and a max-pooling layer to learn the Micro-Action-Pattern representation, which can depict local appearance variations caused by facial expressions; in addition, feature grouping was applied to find the salient areas. Compared with these two methods, our algorithm only uses the raw image data and there is no training procedure, but neutral faces are needed. For facial expressions, only partial areas of the face change, so the neutral faces can be used to calculate the correlation between neutral faces and specific expressions to localize the changed areas. Besides, the localization result is more accurate when the changes are smaller. Furthermore, LBP features extracted from the small areas can also be used to compute the correlation. That is to say, by using these descriptors' properties, our algorithm can be used to localize the changed areas.

Making the features extracted from different subjects of one class have similar values is the main reason why gamma correction improves our recognition rate. On the surface, gamma correction changes the display of the LBP features, but in fact it changes their values. Different subjects present the same expression in different ways, so their LBP features differ somewhat. According to our experiments, although their LBP feature values differ, their basic properties are similar. For instance, LBP features from the happy class have different values, but these differences are smaller than the differences from the other classes. However, some subjects from different classes have similar LBP features, and this is why these algorithms cannot recognize the expressions. Our gamma correction method shortens the within-class distance and thereby improves the recognition rate. Since applying gamma correction to the LBP features has had a positive effect on the recognition result, similar correction methods could also be applied to other features to shorten the within-class distance and obtain better recognition.

Although our algorithm has achieved a state-of-the-art recognition rate, there are some weaknesses in our method. Our algorithm selects the salient areas according to the landmarks on the faces, and if the landmarks are not accurate our recognition result will be influenced. Furthermore, if there is not enough image data, our gamma correction cannot improve the recognition much; the performance of gamma correction on the JAFFE database shows this weakness. These are the main weaknesses of our algorithm.


6. Conclusions

The main contributions of this paper are summarized as follows: (1) A salient areas definitude method is proposed and the areas salient with respect to neutral faces are found; (2) The idea of normalizing the salient areas to align the specific areas which express the different expressions is proposed for the first time, which makes the salient areas of different subjects the same size; (3) The gamma feature correction method is applied to LBP features for the first time, and this significantly improves the recognition result in our algorithm framework; (4) Fusion features are used in our framework, and normalizing these features to the same scale significantly improves our recognition rate. By applying our algorithm framework, a state-of-the-art performance on the CK+ database under the 10-fold validation method using hand-crafted features has been achieved. In addition, a good result on the JAFFE database has also been obtained.

In the future, video data processing will be the focus of our research work and we will try to recognize facial expressions from real-time videos.

Acknowledgments: This work was supported in part by the National High Technology Research and Development Program 863, China (2015AA042307), and the Shandong Province Science and Technology Major Projects, China (2015ZDXX0801A02).

Author Contributions: Rui Song is the corresponding author, who designed the algorithm and revised the paper. Yanpeng Liu conceived of, designed and performed the experiments, analyzed the data and wrote this paper. Yibin Li provided comments and suggestions and also revised the paper. Xin Ma provided suggestions and comments for the performance improvement of the recognition algorithm.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Izard, C.E. The Face of Emotion; Appleton-Century-Crofts: New York, NY, USA, 1971.
2. Liu, M.; Li, S.; Shan, S.; Chen, X. AU-inspired deep networks for facial expression feature learning. Neurocomputing 2015, 159, 126–136.
3. Zhong, L.; Liu, Q.; Yang, P.; Liu, B.; Huang, J.; Metaxas, D.N. Learning active facial patches for expression analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 2562–2569.
4. Shan, C.; Gong, S.; McOwan, P.W. Facial expression recognition based on local binary patterns: A comprehensive study. Image Vis. Comput. 2009, 27, 803–816.
5. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), San Francisco, CA, USA, 13–18 June 2010; pp. 94–101.
6. Wang, J.; Yin, L.; Wei, X.; Sun, Y. 3D facial expression recognition based on primitive surface feature distribution. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1399–1406.
7. Lei, Y.; Bennamoun, M.; Hayat, M.; Guo, Y. An efficient 3D face recognition approach using local geometrical signatures. Pattern Recognit. 2014, 47, 509–524.
8. Vezzetti, E.; Marcolin, F.; Fracastoro, G. 3D face recognition: An automatic strategy based on geometrical descriptors and landmarks. Robot. Auton. Syst. 2014, 62, 1768–1776.
9. Vezzetti, E.; Marcolin, F. 3D landmarking in multiexpression face analysis: A preliminary study on eyebrows and mouth. Aesthet. Plast. Surg. 2014, 38, 796–811.
10. Jung, H.; Lee, S.; Park, S.; Lee, I.; Ahn, C.; Kim, J. Deep Temporal Appearance-Geometry Network for Facial Expression Recognition. arXiv 2015, arXiv:1503.01532.
11. Zavaschi, T.H.; Britto, A.S.; Oliveira, L.E.; Koerich, A.L. Fusion of feature sets and classifiers for facial expression recognition. Expert Syst. Appl. 2013, 40, 646–655.
12. Hu, Y.; Zeng, Z.; Yin, L.; Wei, X.; Zhou, X.; Huang, T.S. Multi-view facial expression recognition. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, Amsterdam, The Netherlands, 17–19 September 2008; pp. 1–6.
13. Happy, S.; Routray, A. Robust facial expression classification using shape and appearance features. In Proceedings of the Eighth International Conference on Advances in Pattern Recognition (ICAPR), Kolkata, India, 4–7 January 2015; pp. 1–5.
14. Liu, P.; Han, S.; Meng, Z.; Tong, Y. Facial expression recognition via a boosted deep belief network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1805–1812.
15. Khorrami, P.; Paine, T.; Huang, T. Do deep neural networks learn facial action units when doing expression recognition? In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 19–27.
16. Tian, Y. Evaluation of face resolution for expression analysis. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop, Washington, DC, USA, 27 June–2 July 2004; p. 82.
17. Tzimiropoulos, G.; Pantic, M. Optimization problems for fast AAM fitting in-the-wild. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 593–600.
18. Zhu, X.; Ramanan, D. Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 2879–2886.
19. Lee Rodgers, J.; Nicewander, W.A. Thirteen ways to look at the correlation coefficient. Am. Stat. 1988, 42, 59–66.
20. Tariq, U.; Lin, K.H.; Li, Z.; Zhou, X.; Wang, Z.; Le, V.; Huang, T.S.; Lv, X.; Han, T.X. Emotion recognition from an ensemble of features. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition and Workshops (FG 2011), Santa Barbara, CA, USA, 21–25 March 2011; pp. 872–877.
21. Evgeniou, A.; Pontil, M. Multi-task feature learning. Adv. Neural Inf. Process. Syst. 2007, 19, 41.
22. Ojala, T.; Pietikainen, M.; Harwood, D. Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Conference A: Computer Vision & Image Processing, Jerusalem, Israel, 9–13 October 1994; Volume 1, pp. 582–585.
23. Ojala, T.; Pietikäinen, M.; Harwood, D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 1996, 29, 51–59.
24. Heikkila, M.; Pietikainen, M. A texture-based method for modeling the background and detecting moving objects. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 657–662.
25. Wikipedia. Histogram of Oriented Gradients. Available online: https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients (accessed on 23 January 2017).
26. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
27. Zhang, W.; Zhang, Y.; Ma, L.; Guan, J.; Gong, S. Multimodal learning for facial expression recognition. Pattern Recognit. 2015, 48, 3191–3202.
28. Pearson, K. On lines and planes of closest fit to systems of points in space. Philos. Mag. 1901, 2, 559–572.
29. Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417.
30. Kanade, T.; Cohn, J.F.; Tian, Y. Comprehensive database for facial expression analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, 26–30 March 2000; pp. 46–53.
31. Lyons, M.; Akamatsu, S.; Kamachi, M.; Gyoba, J. Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 200–205.
32. Lyons, M.J.; Budynek, J.; Akamatsu, S. Automatic classification of single facial images. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 1357–1362.
33. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).