
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Facial Expression Recognition: Towards Meaningful Prior Knowledge

in Deep Neural Networks

Filipe Martins Marques

MASTER’S THESIS

Integrated Master in Bioengineering

Supervisor: Jaime dos Santos Cardoso

Second Supervisor: Pedro Miguel Martins Ferreira

June 12, 2018

© Filipe Marques, 2018

Resumo

Facial expressions are, by definition, associated with how emotions are expressed and play a central role in communication. This makes facial expression an interdisciplinary domain spanning several sciences, such as behavioral science, neurology and artificial intelligence. Facial expression is documented as the most informative means of communication for humans, which is why computer vision applications such as human-computer interaction or sign language recognition need an efficient facial expression recognition system.

Facial expression recognition methods have been studied and explored, demonstrating impressive performances in the detection of discrete emotions. However, these approaches are only remarkably effective in controlled environments, i.e., environments where illumination and pose are monitored. Facial expression recognition systems in computer vision applications, besides having room for improvement in controlled scenarios, need to be effective in real-world scenarios, yet the most recent methods have not reached a desirable performance in such environments.

Convolutional neural networks have been widely used in several computer vision and object recognition tasks. Recently, convolutional neural networks have also been applied to facial expression recognition. However, these methods have not yet reached their full potential in facial expression recognition, since training complex models on small databases, such as those available for facial expression recognition, usually results in overfitting. With this in mind, the study of new neural network methods involving innovative training strategies is necessary.

In this dissertation, a new model is proposed in which different sources of domain knowledge are integrated. The proposed method aims to include information extracted from networks pre-trained on tasks from the same domain (image or object recognition), together with morphological and physiological information about facial expression. This inclusion of information is achieved by regressing relevance maps that highlight key regions for facial expression recognition. It was studied to what extent the refinement of the relevance maps of facial expressions and the use of features from other networks lead to better results in expression classification.

The proposed method achieved the best result when compared with the implemented state-of-the-art methods, thus showing the ability to learn expression-specific features. In addition, the model is simpler (fewer parameters to be trained) and requires fewer computational resources. This demonstrates that an effective inclusion of domain information yields more efficient models in tasks where the respective databases are limited.


Abstract

Facial expressions are, by definition, associated with how emotions are expressed. This makes facial expression an interdisciplinary domain transversal to behavioral science, neurology and artificial intelligence. Facial expression is documented as the most informative means of communication for humans, which is why computer vision applications such as natural human-computer interaction or sign language recognition need an efficient facial expression recognition system.

Facial expression recognition (FER) methods have been deeply studied and have achieved impressive performances on the detection of discrete emotions. However, these approaches are only remarkably effective in controlled environments, i.e., environments where illumination and pose are monitored. FER systems integrated into computer vision applications need to be efficient in real-world scenarios, yet current state-of-the-art methods do not reach accurate expression recognition in such environments.

Deep convolutional neural networks have been widely used in several computer vision tasks involving object recognition. Recently, deep learning methods have also been applied to facial expression recognition. Nonetheless, these methods have not reached their full potential in the FER task, as training high-capacity models on small datasets, such as the ones available in the FER field, usually results in overfitting. In this regard, further research on novel deep learning models and training strategies is of crucial significance.

In this dissertation, a novel neural network that integrates different sources of domain knowledge is proposed. The proposed method integrates knowledge transferred from networks pre-trained on similar recognition tasks with prior knowledge of facial expression. The prior knowledge integration is achieved by means of a regressed map that provides meaningful spatial features to the model. Further experiments and studies were performed to assess whether refined regressed maps of facial landmarks and features transferred from other networks can lead to better results.

The proposed method outperforms the implemented state-of-the-art methods and shows the ability to learn expression-specific features. Besides, the network is simpler and requires fewer computational resources. Thus, it is demonstrated that an effective use of prior knowledge can lead to more efficient models in tasks where large datasets are not available.


Acknowledgments

First of all, I would like to thank the Faculty of Engineering of the University of Porto and all the people that I met in university, from teachers to my colleagues and friends, for all the education, the support and the strength to complete this course in these five years and, most importantly, for making me discover my potential.

To my supervisor, Professor Jaime Cardoso, for all the guidance, support and experience. To INESC-TEC, for the facilities, kindness and networking provided. To my second supervisor, Pedro Ferreira, for all the patience, the availability to help me, the experience, dedication and motivation when I needed it.

To Tiago, for bringing out the dedicated worker that was hidden in me. Thank you as well for all the motivation, time, patience and joy. To Inês, for being my voice of reason. To Joana, for bringing out my free spirit. To Rita, for growing up with me and helping me build my personality.

To my parents, for being the best they can be every day, for all the things that I cannot enumerate here and, especially, for all the love.

To my siblings, for annoying me since forever, but most importantly, for being here for me all the time as well.

To my family, for teaching me moral values and lessons that I will carry all my life. To my grandfather, who is looking down on me from somewhere.

To all my friends who put up with me and let me be just as I am.

Filipe Marques


Contents

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Goals
  1.4 Contributions
  1.5 Dissertation Outline

2 Background
  2.1 Facial Expressions
  2.2 Pre-processing
  2.3 Feature Descriptors for Facial Expression Recognition
    2.3.1 Local-Binary Patterns (LBP)
    2.3.2 Gabor Filters
  2.4 Learning and classification
    2.4.1 Support Vector Machine (SVM)
    2.4.2 Deep Convolutional Neural Networks (DCNNs)
  2.5 Model Selection

3 State-of-the-Art
  3.1 Face detection (from Viola & Jones to DCNNs)
  3.2 Face Registration
  3.3 Feature Extraction
    3.3.1 Traditional (geometric and appearance)
    3.3.2 Deep Convolutional Neural Networks
  3.4 Expression Recognition
  3.5 Summary

4 Implemented Reference Methodologies
  4.1 Hand-crafted based approaches
  4.2 Conventional CNN
    4.2.1 Architecture
    4.2.2 Learning
    4.2.3 Regularization
  4.3 Transfer Learning Based Approaches
    4.3.1 VGG16
    4.3.2 FaceNet
  4.4 Physiological regularization
    4.4.1 Loss Function
    4.4.2 Supervised term
    4.4.3 Unsupervised Term

5 Proposed Method
  5.1 Architecture
    5.1.1 Representation Module
    5.1.2 Facial Module
    5.1.3 Classification Module
  5.2 Loss Function
  5.3 Iterative refinement

6 Results and Discussion
  6.1 Implementation Details
  6.2 Relevance Maps
  6.3 Results on CK+
  6.4 Results on SFEW

7 Conclusions

References

List of Figures

2.1 Study of FEs by electrically stimulating facial muscles.
2.2 Examples of AUs on the FACS.
2.3 Illustration of the neutral expression and the six basic emotions.
2.4 A typical framework of an HOG-based face detection method.
2.5 Example of LBP calculation.
2.6 Example of an SVM classifier applied to features extracted from faces.
2.7 Architecture of a deep network for FER.
2.8 Visualization of the activation maps for different layers.
2.9 Batch-Normalization applied to the activations x over a mini-batch.
2.10 Dropout Neural Network Model.
2.11 Common pipeline for model selection.
3.1 Multiple face detection in uncontrolled scenarios.
3.2 Architecture of the Multi-Task CNN for face detection.
3.3 Detection of AUs based on geometric features.
3.4 Spatial representations of the main approaches for feature extraction.
3.5 Framework of the curriculum learning method.
4.1 Illustration of the pre-processing.
4.2 Illustration of the implemented geometric feature computation.
4.3 Architecture of the conventional deep network used.
4.4 Examples of the implemented data augmentation process.
4.5 Network configurations of VGG.
4.6 Original image followed by density maps obtained by a superposition of Gaussians at the location of each facial landmark, with an increasing value of σ.
5.1 Architecture of the proposed network. The relevance maps are produced by regression from the facial component module, which is composed of an encoder-decoder. The maps x are operated (⊗) with the feature representations (f) that are output by the representation module and then fed to the classification module, predicting the class probabilities (y).
5.2 Pipeline for feature extraction from Facenet. Only the layers before pooling operations are represented. GAP - Global Average Pooling.
5.3 Facial Module architecture.
5.4 Illustrative examples of the facial landmarks computation for the SFEW dataset.
5.5 Architecture of the proposed network for iterative refinement.
6.1 Samples from CK+ and from SFEW database.
6.2 Examples of predicted relevance maps for different methods used.
6.3 Frame-by-frame analysis of the relevance maps.
6.4 Class Distribution on CK+ database.
6.5 Confusion Matrix of CK+ database.
6.6 Class distribution for SFEW database.
6.7 Confusion Matrix of SFEW database.

List of Tables

6.1 Hyperparameters sets.
6.2 Performance achieved by the traditional baseline methods on CK+.
6.3 CK+ experimental results.
6.4 SFEW experimental results.


Abbreviations

AAM Active Appearance Model
AFEW Acted Facial Expression In The Wild
AU Action Unit
BN Batch-Normalization
CNN Convolutional Neural Network
CK Cohn-Kanade
FACS Facial Action Coding System
FE Facial Expression
FER Facial Expression Recognition
GPU Graphics Processing Unit
HCI Human–Computer Interaction
HOG Histogram of Oriented Gradients
kNN k-Nearest Neighbors
LBP Local Binary Patterns
MTCNN Multi-Task Convolutional Neural Network
NMS Non-Maximum Suppression
P-NET Proposal Network
PCA Principal Component Analysis
R-NET Refine Network
ReLU Rectified Linear Units
SFEW Static Facial Expressions in the Wild
SVM Support Vector Machine


Chapter 1

Introduction

1.1 Context

In psychology, emotion refers to the conscious and subjective experience that is characterized

by mental states, biological reactions and psychological or physiologic expressions, i.e., Facial

Expressions (FE). It is common to relate FE to affect, as it can be defined as the experience of

emotion, and is associated with how the emotion is expressed. Together with voice, language,

hands and posture of the body, FE form a fundamental communication system between humans in

social contexts.

Facial expressions were introduced as a research field by Charles Darwin in his book "The Ex-

pression of the Emotions in Man and Animals" [1]. Darwin questioned whether facial expressions

had some instrumental purpose in the evolutionary history. For example, lifting the eyebrows

might have helped our ancestors respond to unexpected environmental events by widening the

visual field and therefore enabling them to see more. Even though their instrumental function

may have been lost, the facial expression remains in humans as part of our biological endowment

and therefore we still lift our eyebrows when something surprising happens in the environment

whether seeing more is of any value or not. Since then, FEs were established as one of the most

important features of human emotion recognition.

1.2 Motivation

Expression recognition is a task that human beings perform daily and effortlessly, but it is not

yet easily performed by computers. By definition Facial Expression Recognition (FER) involves

identification of cognitive activity, deformation of facial features and facial movements. Computa-

tionally speaking, the FER task is done using static images or their sequences. The purpose is to

categorize them into different abstract classes based on the visual facts only.

In the last few years, automated FER has attracted much attention in the research community

due to its wide range of applications. In technology and robotic systems, several robots
with social skills, such as Sony's AIBO and ATR's Robovie, have been developed [2]. In education,


FER can also play a role in detecting students' frustration, thereby improving e-learning experiences [3].

The game industry is already investing in expanding the gaming experience by adapting the difficulty,
music, characters or missions according to the player's emotional responses [4, 5].

In the medical field, emotional assessment can be used in multiple conditions. For instance,

pain detection is used for monitoring patient progress in clinical settings, and depression recog-

nition from FEs is a very important application for the analysis of psychological distress [6, 7].

Facial Expression also plays a significant role in several diseases. In autism, for instance, emo-

tions are not expressed the same way and the understanding of how basic emotions work and how

they are conveyed in autism could lead to therapy improvement [8]. Deafness also leads to an

adaptation of communication, with sign language being the common means of communication. In sign

language, FE can play a significant role. Facial and head movements are used in sign languages

at all levels of linguistic structure. At the phonological level, some signs have an obligatory fa-

cial component in their citation form. Facial actions mark relative clauses, content questions and

conditionals, amongst others [9]. Therefore, an integration of automated FER is essential for an

efficient automated sign language recognition system.

Several automated FER methods have been proposed and demonstrated remarkable perfor-

mances in highly controlled environments (i.e., high-resolution frontal faces with uniform back-

grounds). However, the automatic FER in real-world scenarios is still a very challenging task.

Those challenges are mainly related to inter-individual variability in facial expressiveness and

to different acquisition conditions.

Most machine learning methods are task-specific, in which the representation (fea-

ture) is first extracted and, then, a classifier is learned from it. Deep learning can be seen as a part

of machine learning methods that are able to jointly learn the classification and representation

of data. Deep learning approaches learn data representations with multiple levels of abstraction,

leading to features that traditional methods could not extract. The recent success of deep networks

relies on the current availability of large labeled datasets and advances in GPU technology. In

some computer vision tasks the availability of diverse and large datasets is scarce. To overcome

this, deep training strategies are needed. State-of-the-art methods use strategies such as data
augmentation, dropout and ReLU, and reach optimal results in most object recognition tasks.

FER is one of the cases where only small datasets are available. Current state of the art strate-

gies for deep neural networks achieved satisfactory results in controlled environments but when

applied to expressions in the wild the performance decays abruptly. Therefore, novel strategies for

regularization in deep neural networks are needed in order to develop a robust system that is able

to recognize emotions in natural environments.

1.3 Goals

The purpose of this dissertation is the development of fundamental work on FER to propose a novel

method mainly based on deep neural networks. In particular, the main goal of this dissertation is

the proposal and development of a deep learning model for FER that explicitly models the facial


key-points information along with the expression classification. The underlying idea is to increase

the discriminative ability of the learned features by regularizing the entire learning process and,

hence, improve the generalization capability of deep models to the small datasets of FER.

1.4 Contributions

In this dissertation, our efforts were targeted towards the development and analysis of different

deep learning architectures and training strategies to deal with the problem of training deep models

in small datasets. In this regard, the main contributions of this work can be summarized as follows:

• The implementation of several baseline and state of the art methods for FER, in order to

provide a fair comparison and evaluation of different approaches. The implemented methods

include traditional methods based on hand-crafted features and state of the art methods based

on deep neural networks, such as transfer learning approaches and methods that intend to

integrate physiological knowledge on FER.

• Development of a novel deep neural network that, by integrating different sources of prior

knowledge, achieves state-of-the-art performances. The proposed method integrates knowl-

edge transferred from pre-trained networks jointly with physiological knowledge on facial

expression.

1.5 Dissertation Outline

This dissertation will cover the historic overview on expression recognition followed by the ex-

position of the pipeline for FER in Chapter 2. Chapter 3 looks over the state-of-the-art on FER

in which the relevant works proposed for each step of the FER pipeline are presented (ranging

from face detection to expression recognition). Chapter 4 details the methodology followed for

the implementation of baseline and state-of-the-art methods. Chapter 5 focuses on the proposed
method. Chapter 6 describes the databases and the implementation details, followed by the
results and a discussion of the findings. As the last chapter of the dissertation, Chapter 7 draws the
main conclusions of the performed study and discusses future work.


Chapter 2

Background

Human faces generally reflect the inner feelings/emotions and hence facial expressions are sus-

ceptible to changes in the environment. Expression recognition assists in interpreting the states of

mind and distinguishes between various facial gestures. In fact, FE and FER are interdisciplinary

domains standing at the crossing of behavioral science, neurology, and artificial intelligence.

For instance, in early psychology, Mehrabian [10] has found that only 7% of the whole in-

formation that a human expresses is conveyed through language, 38% through speech, and 55%

through facial expression. FER aims to develop an automatic, efficient and accurate system to dis-

tinguish facial expressions of human beings, so that human emotions can be understood through

facial expression, such as happiness, sadness, anger, fear, surprise, disgust, etc. The develop-

ments in FER hold potential for computer vision applications, such as natural human computer

interaction (HCI), human emotion analysis and interactive video.

Section 2.1 starts with a historic overview on facial expressions followed by how human emo-

tion can be described. The default pipeline of FER is detailed after, beginning with pre-processing

in section 2.2. A technical explanation on how the main feature descriptors work is then presented

in section 2.3. These descriptors are then fed into a classifier for learning purposes: section 2.4

covers how the learning is processed. The Chapter ends with an overview on a model selection

strategy in section 2.5.

2.1 Facial Expressions

Duchenne de Boulogne believed that the human face worked as a map whose features could be

codified into universal taxonomies of mental states. This led him to conduct one of the first

studies on how FEs are produced by electrically stimulating facial muscles (Figure 2.1)[11]. At

the same time, Charles Darwin also studied FEs and hypothesized that they must have had some

instrumental purpose in the evolutionary history. For instance, constricting the nostrils in disgust

served to reduce inhalation of noxious or harmful substances [12].


Following these works, Paul Ekman claimed that there is a set of facial expressions that are

innate, and they mean that the person making that face is experiencing an emotion [13], defend-

ing the universality of facial expression. Further studies support that there is a high degree of

consistency in the facial musculature among peoples of the world. The muscles necessary to ex-

press primary emotions are found universally and homologous muscles have been documented in

non-human primates [14] [15].

Figure 2.1: Study of FEs by electrically stimulating facial muscles [11].

Physiological specificity is also documented. Heart-rate and skin temperature vary with basic

emotions. For instance, in anger, blood flow to the hands increases to prepare for a fight. Left-

frontal asymmetry is greater during enjoyment while right frontal asymmetry is greater during

disgust. This evidence supports the argument that emotion expressions reliably signal action

tendencies [16] [17].

Facial expression signals emotion, communicative intent, individual differences in personal-

ity, psychiatric and medical status and helps to regulate social interaction. With the advent of

automated methods of FER, new discoveries and improvements became possible.

The description of human expressions and emotions can be divided into two main categories:

categorical and dimensional description.

It is common to classify emotions into distinct classes essentially due to Darwin and Ekman

studies [13]. Affect recognition systems aim at recognizing the appearance of facial actions or the

emotions conveyed by the actions. The former set of systems usually relies on the Facial Action

Coding System (FACS) [18]. FACS consists of facial Action Units (AUs), which are codes that

describe facial configurations. Some examples of AUs are presented in Figure 2.2.

The temporal evolution of an expression is typically modeled with four temporal segments:

neutral, onset, apex and offset. Neutral is the expressionless phase with no signs of muscular

activity. Onset corresponds to the period during which muscular contraction begins and increases

in intensity. Apex is a plateau where the intensity usually reaches a stable level; whereas offset

is the phase of muscular action relaxation [18]. Usually, the order of these phases is: neutral-

onset-apex-offset. The analysis and comprehension of AUs and temporal segments are studied in


psychology and their recognition enables the analysis of sophisticated emotional states such as
pain and helps distinguishing between genuine and posed behavior [19].

Figure 2.2: Examples of AUs on the FACS [18].

Systems and models that recognize emotions can recognize basic or non-basic emotions. Ba-

sic emotions come from the affect model developed by Paul Ekman that describes six basic and

universal emotions: happiness, sadness, surprise, fear, anger and disgust (see Figure 2.3).

Figure 2.3: Illustration of the neutral expression and the six basic emotions. The images are extracted from the JAFFE database [20].

Basic emotions are believed to be limited in their ability to represent the broad range of every-

day emotions [19]. More recently, researchers considered non-basic emotion recognition using a

variety of alternatives for modeling non-basic emotions. One approach is to define an extra set of

emotion classes, for instance, relief or contempt [21]. In fact, the Cohn-Kanade database, a popular
database for FE, includes contempt as an emotion label.

Another approach, which represents a wider range of emotions, is the continuous modeling

using affect dimensions [22]. These dimensions include how pleasant or unpleasant a feeling is,

how likely the person is to take action under the emotional state, and the sense of control over the

emotion. Due to the higher dimensionality of such descriptions they can potentially describe more

complex and subtle emotions. Nonetheless, the richness of the space is more difficult to use for


automatic recognition systems because it can be challenging to link such a described emotion to a

FE [12].

For automatic classification systems, it is common to simplify the problem and adopt a categorical description of affect by dividing the space into a limited set of categories defined by Paul

Ekman. This will be the approach followed in this dissertation.

2.2 Pre-processing

The default pipeline of a FER system includes face detection and alignment as a first step. This is

considered a pre-processing of the original image and will be covered in this section. Face de-

tection and posterior alignment can be achieved using classical approaches, such as Viola&Jones

algorithm and HOG descriptors, or by deep learning approaches.

The Viola&Jones object detection framework [23] was proposed by Paul Viola and Michael

Jones in 2001 as the first framework to give competitive object detection rates. It can be used

for detecting objects in real time, but it is mainly applied for face detection. Besides processing

the images quickly, another advantage of the Viola&Jones algorithm is the low false positive rate.

The main goal is to distinguish faces from non-faces. The main steps of this algorithm can be

summarized as follows:

(1) Haar Feature Selection: Human faces have similar properties (e.g., the eyes region is

darker than the nose bridge regions). These properties can be matched using Haar features, also

known as digital image features based upon Haar basis functions. A Haar-like feature considers

adjacent rectangular regions at a specific location in a detection window, sums up the pixel inten-

sities in each region and calculates the difference between these sums. This difference is then used

to categorize subsections of an image.

(2) Creating an Integral Image: The integral image computes a value at each pixel (x,y)

that is the sum of the pixel values above and to the left of (x,y) inclusive. This image representa-

tion allows computing rectangular features such as Haar-like features, speeding up the extraction

process. As each feature’s rectangular area is always adjacent to at least another rectangle, any

two-rectangle feature can be computed just in six array references.

(3) Adaboost Training: The Adaboost is a classification scheme that works by combining

weak learners into a more accurate ensemble classifier. The training procedure consists of multiple

boosting rounds. During each boosting round, the goal is to find a weak learner that achieves the

lowest weighted training error. Then, the weights of the misclassified training samples are raised.

At the end of the training process, the final classifier is given by a linear combination of all weak

learners. The weight of each learner is directly proportional to its accuracy.

(4) Cascading Classifiers: The attentional cascade starts with simple classifiers that are able

to reject many of the negative (i.e., non-face) sub-windows, while keeping almost all positive (i.e.,

face) sub-windows. That is, a positive response from the first classifier triggers the evaluation of

a second and more complex classifier and so on. A negative outcome at any point leads to the


immediate rejection of the sub-window.
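As an illustration of steps (1) and (2), the following Python sketch computes an integral image with NumPy and uses it to evaluate a simple two-rectangle Haar-like feature. The function names, window size and feature placement are illustrative choices, not part of the original Viola&Jones implementation.

import numpy as np

def integral_image(img):
    # ii[y, x] holds the sum of all pixels above and to the left of (x, y), inclusive.
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    # Sum of the pixels inside a rectangle, using only four references to the integral image.
    def at(y, x):
        return ii[y, x] if y >= 0 and x >= 0 else 0
    bottom, right = top + height - 1, left + width - 1
    return at(bottom, right) - at(top - 1, right) - at(bottom, left - 1) + at(top - 1, left - 1)

def haar_two_rectangle(ii, top, left, height, width):
    # Horizontal two-rectangle feature: difference between the right and left halves.
    half = width // 2
    return (rect_sum(ii, top, left + half, height, half)
            - rect_sum(ii, top, left, height, half))

# Toy usage on a random 24x24 grayscale patch (the canonical Viola&Jones window size).
patch = np.random.randint(0, 256, (24, 24)).astype(np.int64)
ii = integral_image(patch)
print(haar_two_rectangle(ii, top=4, left=4, height=8, width=12))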

Another common method for face detection is the extraction of HOG descriptors to be fed

into a Support Vector Machine (SVM) classifier. The basic idea is that local object appearance and

shape can often be characterized rather well by the distribution of local intensity gradients or edge

directions, even without precise knowledge of the corresponding gradient or edge positions. The

HOG representation has several advantages. It captures the edge or gradient structure that is very

characteristic of local shape, and it does so in a local representation with an easily controllable

degree of invariance to local geometric and photometric transformations: translations or rotations

make little difference if they are much smaller than the local spatial or orientation bin size [24].

In practice, this is implemented by dividing the image frame into cells and, for each cell,

a local 1-D histogram of gradient directions or edge orientations, over the pixels of the cell, is

created. The combined histogram entries form the representation [24]. The feature vector is then

fed into an SVM classifier to find whether there is a face in the image or not. A representation of

this framework can be found in Figure 2.4.

Figure 2.4: A typical framework of an HOG-based face detection method [24].
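The sketch below illustrates this cell-wise histogram computation with scikit-image's hog function and a linear SVM from scikit-learn; the window size, HOG parameters and random placeholder data are illustrative assumptions rather than the settings used in [24].

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Placeholder data: grayscale candidate windows labeled as face (1) or non-face (0).
windows = np.random.rand(20, 64, 64)
labels = np.random.randint(0, 2, 20)

# One HOG descriptor per window: 9-bin orientation histograms over 8x8-pixel cells,
# normalized over 2x2 blocks of cells.
features = np.array([
    hog(w, orientations=9, pixels_per_cell=(8, 8),
        cells_per_block=(2, 2), block_norm='L2-Hys')
    for w in windows
])

# A linear SVM then separates face descriptors from non-face descriptors.
clf = LinearSVC().fit(features, labels)
print(clf.predict(features[:3]))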

More recently, deep learning methods have shown efficiency in most computer vision tasks

and hold the state of the art in face detection as well. A detailed description on the fundamentals

of deep learning can be found in section 2.4.2 and the sate of the art in deep learning methods for

face detection in section 3.1.

2.3 Feature Descriptors for Facial Expression Recognition

After face detection, the facial changes caused by facial expressions have to be extracted. This

subsection presents two of the most widely used feature descriptors, namely Local Binary Patterns

(LBP) and Gabor filters.

2.3.1 Local-Binary Patterns (LBP)

Local Binary Patterns (LBP) were first presented in [25] to be used in texture description. The

basic method labels each pixel with decimal values called LBPs or LBP codes, to describe the local

structure around each pixel. As illustrated in Figure 2.5, the value of the center pixel is subtracted

from the 8-neighbor pixels’ values; if the result is negative the binary value is 0, otherwise 1.

The calculation starts from the pixel at the top left corner of the 8-neighborhood and continues in

clockwise direction. After calculating with all neighbors, an eight digit binary value is produced.

When this binary value is converted to decimal, the LBP code of the pixel is generated, and placed

at the coordinates of that pixel in the LBP matrix.


Figure 2.5: Example of LBP calculation, extracted from [26].
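A minimal NumPy sketch of the basic 8-neighbor LBP code described above; it ignores border pixels and is written for clarity rather than speed.

import numpy as np

def lbp_code(img, y, x):
    # Threshold the 8 neighbors against the center pixel, reading them clockwise
    # from the top-left corner; negative differences give 0, the rest give 1.
    center = img[y, x]
    neighbors = [img[y - 1, x - 1], img[y - 1, x], img[y - 1, x + 1],
                 img[y, x + 1], img[y + 1, x + 1], img[y + 1, x],
                 img[y + 1, x - 1], img[y, x - 1]]
    bits = [1 if n >= center else 0 for n in neighbors]
    # The eight bits form a binary number whose decimal value is the LBP code.
    return sum(bit << (7 - i) for i, bit in enumerate(bits))

def lbp_image(img):
    # LBP matrix for all interior pixels of a grayscale image.
    out = np.zeros_like(img, dtype=np.uint8)
    for y in range(1, img.shape[0] - 1):
        for x in range(1, img.shape[1] - 1):
            out[y, x] = lbp_code(img, y, x)
    return out

face = np.random.randint(0, 256, (48, 48))
print(lbp_image(face)[1:4, 1:4])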

2.3.2 Gabor Filters

Gabor filter is one of the most popular approaches for texture description. Gabor filter-based

feature extraction consists of the application of a Gabor filter bank to the input image, defined by
its parameters, including frequency (f), orientation (θ) and the smoothing parameter of the Gaussian
envelope (σ). This makes the approach invariant to illumination, rotation, scale and translation.

Gabor filters are based on the following function [27]:

Ψ(u, v) = e^{−(π²/f²)(γ²(u′ − f)² + η²v′²)}    (2.1)

u′ = u cos θ + v sin θ    (2.2)

v′ = −u sin θ + v cos θ    (2.3)

In the frequency domain (Eq. 2.1, 2.2, 2.3) the function is a single real-valued Gaussian
centered at f. γ is the sharpness (bandwidth) along the Gaussian major axis and η is the sharpness
along the minor axis (perpendicular to the wave). In the given form, the aspect ratio of the Gaussian
is η/γ. Gabor features, also referred as Gabor jet, Gabor bank or multi-resolution Gabor features,

are constructed from responses of Gabor filters by using multiple filters with several frequencies

and orientations. Scales of a filter bank are selected from exponential spacing and orientations

from linear spacing. These filters are then convolved with the image, in order to obtain different

representations of the image to be used as descriptors.
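A short sketch of such a filter bank using OpenCV's getGaborKernel, convolving the kernels with the image and stacking the responses as descriptors; the kernel size, scales and orientations below are illustrative choices, not the parameters of the cited work.

import numpy as np
import cv2

def gabor_bank_features(gray, n_orientations=8, wavelengths=(4, 8, 16)):
    # Convolve the image with Gabor kernels at several orientations and scales
    # and return the stack of filter responses.
    responses = []
    for lambd in wavelengths:                 # exponentially spaced scales
        for k in range(n_orientations):       # linearly spaced orientations
            theta = k * np.pi / n_orientations
            kernel = cv2.getGaborKernel(ksize=(31, 31), sigma=0.5 * lambd,
                                        theta=theta, lambd=lambd, gamma=0.5, psi=0)
            responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    return np.stack(responses, axis=-1)

face = (np.random.rand(96, 96) * 255).astype(np.uint8)
print(gabor_bank_features(face).shape)   # (96, 96, 24): one response map per filter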

2.4 Learning and classification

Given a collection of extracted features, it is necessary to build a model capable of correctly
separating and classifying the expressions. Traditional FER systems use a three-stage training procedure:

(i) feature extraction/learning, (ii) feature selection, and (iii) classifier construction. On the other

hand, FER systems based on deep learning techniques combine these three steps into one single


step. This section presents an overview of one of the most widely used traditional classifiers, the

Support Vector Machines (SVMs) as well as one of the most relevant deep learning approaches,

the Convolutional Neural Networks (CNNs).

2.4.1 Support Vector Machine (SVM)

Support Vector Machine [28] performs an implicit mapping of data into a higher (potentially

infinite) dimensional feature space, and then finds a linear separating hyperplane with the maximal

margin to separate data in this higher dimensional space. Given a training set of labeled examples

a new test example x is classified by the following function:

f(x) = sgn(∑_{i=1}^{l} α_i y_i K(x_i, x) + b),    (2.4)

where αi are Lagrange multipliers of a dual optimization problem that describe the separating

hyperplane, K is a kernel function, and b is the threshold parameter of the hyperplane. The training

sample xi with αi > 0 is called support vector, and SVM finds the hyperplane that maximizes the

distance between the support vectors and the hyperplane. SVM allows domain-specific selection

of the kernel function. Though new kernels are being proposed, the most frequently used kernel

functions are the linear, polynomial, and Radial Basis Function (RBF) kernels. SVM makes binary

decisions, so the multi-class classification is accomplished by using, for instance, the one-against-

rest technique, which trains binary classifiers to discriminate one expression from all others, and

outputs the class with the largest output of binary classification. The selection of the SVM hyper-

parameters can be optimized through a k-fold cross-validation scheme. The parameter setting

producing the best cross-validation accuracy is picked [29].

In general, SVMs exhibit good classification accuracy even when only a modest amount of

training data is available, making them particularly suitable to expression recognition [30]. Figure

2.6 represents a possible pipeline for FER using feature descriptors along with SVM classifier.

Figure 2.6: Example of an SVM classifier applied to features extracted from faces [30].
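A minimal scikit-learn sketch of this setup: an RBF-kernel SVM in a one-against-rest configuration, with C and the kernel width selected by 5-fold cross-validation. The feature matrix, labels and parameter grid are placeholders.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder data: one feature vector (e.g., LBP or Gabor descriptors) per face,
# labeled with one of six basic expressions.
X = np.random.rand(120, 256)
y = np.random.randint(0, 6, 120)

# One binary RBF-SVM per expression class (one-against-rest).
ovr_svm = OneVsRestClassifier(SVC(kernel='rbf'))

# Hyper-parameters of the base SVM, optimized with 5-fold cross-validation.
param_grid = {'estimator__C': [1, 10, 100], 'estimator__gamma': [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(ovr_svm, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.predict(X[:5]))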


2.4.2 Deep Convolutional Neural Networks (DCNNs)

Recently, deep learning methods have been shown to be efficient on many computer vision tasks like

pattern recognition problems, character recognition, object recognition or autonomous robot driv-

ing for instance. Deep learning models are composed of consecutive processing layers that learn

representations of data with multiple levels of abstraction, capturing features that traditional meth-

ods could not compute. One of the factors that allows the computation of complex features is the

back-propagation algorithm that indicates how a machine should change its internal parameters to

compute new representations of input data [31].

The emergent success of CNNs on recognition and segmentation tasks can be explained by 3

factors: (1) The availability of large labeled training sets; (2) The recent advances in GPU tech-

nology, which allows training large CNNs in a reasonable computation time; (3) The introduc-

tion of effective regularization strategies that greatly improve the model generalization capacity.

However, in the FER context, the availability of large training sets is scarce, raising the need for

strategies to improve the models.

CNNs learn to extract the features directly from the training database using iterative algorithms

like gradient descent. An ordinary CNN learns its weights using the back-propagation algorithm.

A CNN has two main components, namely, local receptive fields and shared weights. In local

receptive fields, each neuron is connected to a local group of the input space. The size of this

group of the input space is equal to the filter size where the input space can be either pixels

from the input image or features from the previous layer. In a CNN the same weights and biases

are used over all local receptive fields, which significantly decreases the number of parameters of

the model. However, the increased complexity and depth of a typical CNN architecture make it
prone to overfit [32]. CNNs can have multiple architectures, but the standard is having a series

of convolutional layers that produce a certain amount of feature maps given by the number of

filters defined for the convolutions, leading to different image representations. This is followed

by pooling layers. Max pooling, the most common pooling layer, applies a max filter to (usually) non-

overlapping subregions of the initial representation, reducing the dimensionality of the current

representation. Then, these representations are fed into fully-connected layers that can be seen as

a multilayer perceptron that aims to map the activation volume, from the combination of previous

different layers, into a class probability distribution. The network is followed by an affine layer

that computes the scores [33] [34].

Figure 2.7 represents a possible network for facial expression recognition applying regulariza-

tion methods.
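The sketch below assembles such a convolution/pooling/fully-connected stack in Keras for 48×48 grayscale faces and seven expression classes; the layer sizes are illustrative and do not reproduce the architecture of [34] or the network used later in this dissertation.

from tensorflow.keras import layers, models

def build_fer_cnn(input_shape=(48, 48, 1), n_classes=7):
    # Small CNN: stacked convolutions and max-pooling, followed by a
    # fully-connected classifier with a softmax output.
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', padding='same',
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),          # 48x48 -> 24x24
        layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),          # 24x24 -> 12x12
        layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),          # 12x12 -> 6x6
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),                  # regularization, see section 2.4.2.2
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_fer_cnn()
model.summary()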

It is useful to understand the features that are being extracted by the network in order to

understand how the training and classification is performed. Figure 2.8 shows the visualization

of the activation maps of different layers. It can be seen that, the deeper the layers are, the more

sparse and localized the activation maps become [34].


Figure 2.7: The architecture of the deep network proposed in [34] for FER.

Figure 2.8: Visualization of the activation maps for different layers from [34].

2.4.2.1 Activation Functions

An activation function is a non-linear transformation that defines the output of a specific node

given a set of inputs. Activation functions decide whether a neuron should be activated so they

assume an important role in the network design. The commonly used activation functions are

presented as follows:

• Linear Activation: The activation is proportional to the input. The input x, will be trans-

formed to ax. This can be applied to various neurons and multiple neurons can be activated

at the same time. The issue with a linear activation function is that the whole network is

equivalent to a single layer with linear activation.

• Sigmoid function: In general, a sigmoid function is real-valued, monotonic, and differen-

tiable having a non-negative first derivative which is bell shaped. The function ranges from

0 to 1 and has an S shape. This means that small changes in x bring large changes in the out-

put, Y . This is desired when performing a classification task since it pushes the predictions

to extreme values. The sigmoid function can be written as follows:

Y = 1 / (1 + e^{−x})    (2.5)

• ReLU: ReLU is the most widely used activation function since it is known for having better

fitting abilities than the sigmoid function [35]. The ReLU function is non-linear, so it back-

propagates the error. ReLU can be written as:

Y = max(0,x) (2.6)

It gives an output equal to x if x is positive and 0 otherwise. Only specific neurons are

activated, making the network sparse and efficient for computation.

• Softmax: For classification problems, the output commonly corresponds to a multi-class
problem. The sigmoid function can only handle two classes, so softmax is used for outputting the

probabilities of each class. The softmax function converts the outputs of each unit to values


between 0 and 1, just like a sigmoid function, but it also divides each output such that the

total sum of the outputs is equal to 1. The output of the softmax function is equivalent to a

categorical probability distribution. Mathematically, the softmax function is shown below:

σ(z)_j = e^{z_j} / ∑_{k=1}^{K} e^{z_k},    (2.7)

where z is a vector of the inputs to the output layer and j indexes the output units, so j = 1,

2, ..., K.
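All of the activations above are simple element-wise (or, for softmax, vector-wise) functions; a NumPy sketch:

import numpy as np

def linear(x, a=1.0):
    return a * x                      # output proportional to the input

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # S-shaped, squashes values into (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # keeps positive values, zeroes the rest

def softmax(z):
    # Subtracting the maximum is a standard numerical-stability trick;
    # the result still matches Eq. (2.7) and sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
print(relu(np.array([-1.0, 3.0])), sigmoid(0.0))
print(softmax(z), softmax(z).sum())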

2.4.2.2 Regularization

As mentioned previously, CNNs can easily overfit. To avoid overfitting, regularization methods can

be applied. Regularization techniques can be seen as an imposition of certain prior distributions

on model parameters.

Batch-Normalization is a method known for reducing internal covariate shift in neural

networks [36].

To increase the stability of a neural network, batch normalization normalizes the output of a

previous activation layer by subtracting the batch mean and dividing by the batch standard devia-

tion. In figure 2.9 a representation of the Batch-Normalization transform is presented.

Figure 2.9: Batch-Normalization applied to the activations x over a mini-batch. Extracted from [36].

In the notation y = BN_{γ,β}(x), the parameters γ and β have to be initialized and will be learned.
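A NumPy sketch of the batch-normalization forward pass over a mini-batch, following the transform in Figure 2.9 (ε is a small constant added for numerical stability):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then scale and shift.
    # x has shape (batch_size, n_features).
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # y = BN_{gamma,beta}(x)

x = np.random.randn(32, 4)
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature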

Dropout is widely used to train deep neural networks. Unlike other regularization techniques

that modify the cost function, dropout modifies the architecture of the model since it forces the

network to drop different neurons across iterations. Dropout can be used between convolutional

layers or only in the classification module. A dropped neuron's contribution to the activation of
downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the

neuron on the backward pass [37].


Dropout reduces complex co-adaptations of neurons. Since a neuron cannot rely on the presence
of particular other neurons, it is forced to learn more robust features that are useful in conjunction
with many different random subsets of the other neurons (see Figure 2.10).

Figure 2.10: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Extracted from [37].

Data-augmentation is one of the most common methods to reduce overfit-
ting on image data by artificially enlarging the dataset using label-preserving transformations.

The main techniques are classified as data warping, which is an approach which seeks to directly

augment the input data to the model in the data space [38]. The generic practice is to perform ge-

ometric and color augmentation. For each input image, a new image is generated that is shifted,
zoomed in/out, rotated, flipped, distorted, or shaded with a hue. Both the original image and its duplicate are fed

into the neural net.
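A minimal Keras sketch of such label-preserving geometric augmentation; the transformation ranges are illustrative and are not the ones used later in this dissertation.

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations, shifts, zooms and horizontal flips applied on the fly,
# so every epoch sees slightly different versions of each face.
augmenter = ImageDataGenerator(rotation_range=10,        # degrees
                               width_shift_range=0.1,    # fraction of image width
                               height_shift_range=0.1,
                               zoom_range=0.1,
                               horizontal_flip=True)

faces = np.random.rand(16, 48, 48, 1)              # placeholder batch of face crops
labels = np.eye(7)[np.random.randint(0, 7, 16)]    # one-hot expression labels

x_aug, y_aug = next(augmenter.flow(faces, labels, batch_size=8))
print(x_aug.shape, y_aug.shape)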

L1 and L2 regularization are traditional regularization strategies that consist in adding a

penalty term to the objective function and control the model complexity using that penalty term.

L1 and L2 are common regularization techniques not only in deep neural networks but in general

machine learning algorithms. The first, L1 regularization, uses a penalty term which encourages the

sum of the absolute values of the parameters to be minimum. It has frequently been observed that

L1 regularization in many models forces parameters to equal zero, so that the parameter vector is

sparse. This makes it a natural candidate for feature selection.

L2 regularization can be seen as adaptive minimization of the squared error with a penalization
term that penalizes in such a way that less influential features, i.e., features that have very little in-
fluence on the dependent variable, undergo more penalization. A high penalization term can lead to
underfitting. Therefore, this term needs an optimal value to prevent both overfitting and underfitting

[39].
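In a deep learning framework, such penalty terms are typically attached per layer; a minimal Keras sketch with illustrative penalty weights:

from tensorflow.keras import layers, regularizers

# L2 weight decay on a convolutional layer and an L1 penalty on a dense layer:
# both penalties are added to the training loss and shrink the layer weights.
conv = layers.Conv2D(64, (3, 3), activation='relu',
                     kernel_regularizer=regularizers.l2(1e-4))
dense = layers.Dense(128, activation='relu',
                     kernel_regularizer=regularizers.l1(1e-5))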

Early-Stopping is a strategy that, by monitoring the validation set metrics, decides when the

model stops training. An indicator that the network is overfitting to the training data is when the


loss of the validation set is not improving for a certain number of epochs. To avoid this, early-

stopping is implemented: the network will stop the training when it reaches a certain number of

epochs without improvement on the validation set.
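In Keras, for instance, this corresponds to a callback such as the following; the patience value is an illustrative choice:

from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the validation loss has not improved for 10 consecutive
# epochs, and roll back to the weights of the best epoch seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop])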

2.5 Model Selection

Model selection is the task of selecting a model from a set of candidate models. Commonly, model

selection strategies are based on validation. Validation consists of partitioning a set of the training
data and using this sub-set to validate the predictions from the training. It intends to assess how the

results of a model will generalize to an independent data set as it is illustrated in Figure 2.11.

Figure 2.11: Common pipeline for model selection. Extracted from [40].

Different pipelines can be designed using validation for model selection. Commonly, a set of

data is split into three sub-sets: the train set, where the model will be trained, the validation set, in which
the model will be validated, and the test set, where the model performance is assessed. Different models

or models with different hyper-parameters are validated in the validation set. The hyper-parameter

optimization is typically conveyed by a grid-search approach. Grid-search is exhaustive searching

through a manually specified subset of the hyper-parameter space. A step-by-step description for

grid-search pipeline as model selection is presented as follows:

1. The data set is split randomly, with user-independence between the sets, P times in three

sub-sets: train-set, validation-set and test-set.

2. Sets of hyper-parameters to be optimized are defined. With A and B being two hyper-parameter
sets to be optimized, each value of set A is defined as ai (i = 1, ..., I) and each value
of hyper-parameter B as bj (j = 1, ..., J).

3. The Cartesian product of the two sets, A and B, is performed, returning a set of pairs (ai, bj),
for each of which a model will be trained. In the end, I × J models are trained.

4. Each model is evaluated on the validation set, returning a specific metric value.

5. The models are ordered by their performance on the validation set. The set of hyper-
parameters that produces the best model is selected.


6. The model with the selected hyper-parameters is evaluated on the test set of split p (with

p = 1, ...,P splits).

7. The performance of the algorithm corresponds to the average value of the performance of

the selected model on the P splits.
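A schematic Python sketch of this split/grid-search/evaluate loop; the split sizes, hyper-parameter sets and the dummy scoring function are placeholders for an actual model and dataset.

import itertools
import numpy as np

rng = np.random.default_rng(0)

def make_splits(n_samples, n_splits=5):
    # Step 1: P random train/validation/test partitions (here simple random index
    # splits; in practice the splits must also be user-independent).
    for _ in range(n_splits):
        idx = rng.permutation(n_samples)
        yield idx[:60], idx[60:80], idx[80:]

def train_and_score(train_idx, eval_idx, params):
    # Placeholder: train a model with `params` on the training indices and return
    # its metric on the evaluation indices.
    return rng.random()

grid_a = [1e-2, 1e-3, 1e-4]   # hyper-parameter set A (e.g., learning rate)
grid_b = [32, 64]             # hyper-parameter set B (e.g., batch size)

test_scores = []
for train, val, test in make_splits(n_samples=100):
    # Steps 2-4: train one model per (a, b) pair and evaluate it on the validation set.
    scored = {p: train_and_score(train, val, p)
              for p in itertools.product(grid_a, grid_b)}
    best_params = max(scored, key=scored.get)                       # step 5
    test_scores.append(train_and_score(train, test, best_params))   # step 6
print(np.mean(test_scores))                                         # step 7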

Chapter 3

State-of-the-Art

Automatic Facial Expression Recognition (FER) can be summarized in four steps: Face Detection,

Face Registration, Feature Extraction and Expression Recognition. These four steps encompass

methods and techniques that will be covered in the next sections.

3.1 Face detection (from Viola & Jones to DCNNs)

Face detection is usually the first step for all automated facial analysis systems, for instance face

modeling, face relighting, expression recognition or face authentication systems. Given an image,

the goal is to detect faces in the image and return their location in order to process these images.

There are some factors that may compromise and determine the success of face detection. Beyond

image conditions and acquisition protocols, different camera-face poses can lead to different views

from a face. Furthermore, structural components such as beards or glasses introduce variability

in the detection, leading to occlusions [41].

For RGB images, the algorithm of Viola & Jones [23] is still one of the most used face detec-

tion methods. It was proposed in 2001 as the first object detection framework to provide compet-

itive face detection in real time. Since it is essentially a 2D face detector, it can only generalize

within the pose limits of its training set and large occlusions will impair its accuracy.

Some methods overcome these weaknesses by building different detectors for different views

of the face [42], by introducing robustness to luminance variation [43], or by improving the weak

classifiers. Bo Wu et al. [44] proposed the utilization of a single Haar-like feature, in order to

compute an equally binned histogram that is then used in a RealBoost learning algorithm. In [45] a

new weak classifier, the Bayesian stump, is proposed. Features such as LBP can also be used to improve

invariance to image conditions. Hongliang Jin et al. [46] apply LBP on a Bayesian framework

and Zhang et al. [47] combine LBP with a boosting algorithm that uses a multi-branch regression

tree as its weak classifier. Another feature set can be found in [24] that applies SVM over grids of

histograms of oriented gradient (HOG) descriptors.


Convolutional Neural Networks (CNNs) have been widely used in image segmentation or clas-

sification tasks as well as for face localization. Convolutional networks are specifically designed

to learn invariant representations of images as they can easily learn the type of shift-invariant local

features that are relevant to face detection and pose estimation. Therefore, CNN-based face de-

tectors outperform the traditional approaches, especially in unconstrained scenarios, in which there

is a large variability of face-poses, viewing angles, occlusions and illumination conditions. Some

examples of face detection in unconstrained scenarios can be found in Figure 3.1. They can also

be replicated in large images at a small computational cost when compared with the traditional

methods mentioned before [48].

Figure 3.1: Multiple face detection in uncontrolled scenarios using the CNN-based method proposed in [49].

In [48] a CNN detects and estimates pose by minimizing an energy function with respect to the

face/non-face binary variable and the continuous pose parameters. This way, the trained algorithm

is capable of handling a wide range of poses without retraining, outperforming traditional methods.

The work of Haoxiang Li et al. [49] also takes advantage of the CNN discriminative capacity

proposing a cascade of CNNs. The cascade operates at different resolutions in order to quickly

discard obvious non-faces and evaluate carefully the small number of strong candidates. Besides

achieving state-of-the-art performances, the algorithm is capable of fast face detection.

Another state of the art approach is the work of Kaipeng Zhang et al. [50] in which a deep

cascaded multi-task framework (MTCNN) is designed to detect face and facial landmarks. The

method consists of three stages: in the first stage, it produces candidate windows. Then, it refines

the windows by rejecting a large number of non-faces windows through a more complex CNN.

Finally, it uses a more powerful CNN to refine the result again and output five facial landmarks

positions. The schema of the MTCNN is illustrated in Figure 3.2. In particular, the input image is

resized to different scales, being the input to a three-stage cascaded framework:

Stage 1: First, a fully convolutional network, called Proposal Network (P-Net), is imple-

mented to obtain the candidate facial windows and their bounding box regression vectors. Then,

candidates are calibrated based on the estimated bounding box regression vectors. Non-maximum

suppression (NMS) is performed to merge highly overlapped candidates.

Stage 2: All candidates are fed to another CNN, called Refine Network (R-Net), which further

rejects a large number of false candidates, performs calibration with bounding box regression, and

conducts NMS.


Stage 3: This stage is similar to the second stage, but the purpose of this stage is to identify

face regions with more supervision. In particular, the network will output five facial landmarks’

positions.

Figure 3.2: Architecture of the Multi-Task CNN for face detection [50].

In this network there are three main tasks to be trained: face/non-face classification, bounding

box regression, and facial landmark localization. For the face/non-face classification task, the goal

of training is to minimize the traditional loss function for classification problems, the categorical

cross-entropy. The other two tasks (i.e., the bounding box and landmark localization) are treated

as a regression problem, in which the Euclidean loss has to be minimized [50].
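To make the two kinds of training objectives concrete, the following NumPy sketch illustrates how the classification and regression loss terms used by the MTCNN can be computed; the function names and task weights are illustrative, not the original implementation.

import numpy as np

def face_classification_loss(y_true, y_pred, eps=1e-8):
    # Categorical cross-entropy for the face/non-face task.
    # y_true: one-hot labels (N, 2); y_pred: softmax outputs (N, 2).
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

def euclidean_loss(t_true, t_pred):
    # Squared Euclidean loss for bounding-box regression (N, 4)
    # or for the five landmark positions (N, 10).
    return np.mean(np.sum((t_true - t_pred) ** 2, axis=1))

def multitask_loss(cls_t, cls_p, box_t, box_p, lmk_t, lmk_p,
                   w_cls=1.0, w_box=0.5, w_lmk=0.5):
    # Illustrative weighted combination of the three tasks; each sub-network
    # of the cascade weighs the tasks differently in the original method.
    return (w_cls * face_classification_loss(cls_t, cls_p)
            + w_box * euclidean_loss(box_t, box_p)
            + w_lmk * euclidean_loss(lmk_t, lmk_p))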

Since this method holds the state of the art for face detection, the MTCNN is used as face

detector in this dissertation.

3.2 Face Registration

Once the face is detected, many FER methods require a face registration step for face alignment.

During the registration, fiducial points (or landmarks) are detected, allowing the alignment of the

face to different poses and deformations. These facial key-points can also be used to compute

localized features. Interest points combined with local descriptors provide reliable and repeatable

measurements from images for a wide range of applications, capturing the essence of a scene

without the need for semantic-level interpretation [51]. Landmark localization is then an essential step to take,

as these fiducial points can be used for face alignment and to compute meaningful features for

FER [12]. Key-points are mainly located around facial components such as eyes, mouth, nose and

chin. These key-points can be computed either using scale invariant feature transform (SIFT) [52]

[53] or through a CNN where facial landmark localization is taken as a regression problem [50].

3.3 Feature Extraction

Facial features can be extracted using different approaches and techniques that will be covered in

this sub-section. Feature extraction approaches can be broadly classified into two main groups:


hand-crafted features and learned features, which can be applied locally or to the global im-

age. Concerning the temporal information, algorithms can also be further divided into static or

dynamic.

3.3.1 Traditional (geometric and appearance)

Hand-crafted features can be divided into appearance or geometric. Geometric features describe

faces using distances and shapes of fiducial points (landmarks). Many geometric-based FER meth-

ods recognize expressions by first detecting AUs and then decoding a specific expression from

them. As an example, [54] looks over the recognition of facial actions through landmarks dis-

tance, taking as prior the fact that facial actions involved in spontaneous emotional expressions

are more symmetrical, involving both the left and the right side of the face. Figure 3.3 represents

the recognition of one AU.

Figure 3.3: Detection of AUs based on geometric features used in [54].

Geometric features can provide useful information when tracked on temporal axis. Such ap-

proach can be found in [55], in which a model for dynamic facial expression recognition based

on landmark localization is proposed. Geometric features can also be used to build an active

appearance model (AAM), the generalization of a statistical model of the shape and gray-level ap-

pearance of the object of interest [56]. AAM is often used for deriving representations of faces for

facial action recognition [57] [58]. Local geometric feature extraction approaches aim to describe

deformations on motions or localized regions of the face. An example is the work proposed


by Stefano Berretti et al. [59] that describes local deformations (given by the key-points) through

SIFT descriptors. Dynamic descriptors for local geometric features are based on landmark dis-

placements coded with motion units [60], [61] or deformation of facial elements as eyes, mouth

or eyebrows [62][63].

From the literature it is clear that geometric features are effective on the description of facial

expressions. However, effective geometric feature extraction highly depends upon the accurate

facial key-points detection and tracking. In addition, geometric features are not able to encode

relevant information caused by skin texture changes and expression wrinkles.

Appearance features are based on image filters applied to the image to extract the appearance

changes on the face. Global appearance methods began to focus on Gabor-wavelet representations

[64] [65]. However, the most popular global appearance features are Gabor filters and LBPs.

In [66] and [67], the input (face images) are convolved with a bank of Gabor filters to extract

multi-scale and multi-orientational coefficients that are invariant to illumination, rotation, scale

and translation. LBPs are widely used for feature extraction in facial expression recognition for their tolerance to illumination changes and their computational simplicity. Caifeng Shan et al. [29] implement LBPs as feature descriptors, using AdaBoost to learn the most discriminative LBP features and an SVM as classifier. In general, such methods have limitations

on generalization to other datasets. Other global appearance methods are based on the Bag of

Words (BoW) approach. Karan Sikka et al. [68] explores BoW for an appearance-based dis-

criminative feature extraction, combining highly discriminative Multi-Scale Dense SIFT (MSDF)

features with spatial pyramid matching (SPM).

Dynamic global appearance features are an extension to the temporal domain. In [69] local

binary pattern histograms from three orthogonal planes (LBP-TOP) are proposed. Bo Sun et al.

[70] use a combination of LBP-TOP and local phase quantization from three orthogonal planes

(LPQ-TOP), a descriptor similar to LBP-TOP but more robust to blur. Often a combination of

different descriptors is used to form hybrid models. An example is the work proposed in [71], where LPQ-TOP is used along with local Gabor binary patterns from three orthogonal planes

(LGBP-TOP). Figure 3.4 shows the most commonly used appearance-based feature extraction

approaches.

Local appearance features require previous knowledge of regions of interest such as mouth,

eyes or eyebrows. Consequently, their performance is dependent on the localization and tracking

of these regions. In [72], the appearance of gray-scale frames is described by spreading an array

of cells across the mouth and extracting the mean intensity from each cell. The features are then

modeled using an SVM. A Gray Level Co-occurrence Matrix is used in [73] as feature descriptor

of specific regions of interest.

There is no definitive answer as to which feature extraction method is better; it depends on the problem and/or AUs to detect. Therefore, it is common to combine both types of appearance-based methods, as this is usually associated with an increase in performance [74] [75]. FER methods based on these traditional descriptors achieve remarkable results, but mainly in controlled scenarios. Their performance decreases dramatically in unconstrained



Figure 3.4: Spatial representations of main approaches for feature extraction: (1) Facial Points, (2) LBP histograms, (3) LPQ histograms, (4) Gabor representation, (5) SIFT descriptors, (6) dense Bag of Words [19].

environments where face images cover complex and large intra-personal variations such as pose,

illumination, expression and occlusion. The challenge is then to find an ideal facial representation

which is robust for facial expression recognition in unconstrained environments. As described

in the following subsection, the recent success of deep learning approaches, especially those using

CNNs, has been extended to the FER problem.

3.3.2 Deep Convolutional Neural Networks

The state of the art for FER is mostly composed of deep learning-based methods. For instance, the following works [76][77][78] are some implementations of CNNs for expression recognition, holding state-of-the-art performance on a public dataset of uncontrolled environments (SFEW).

Zhiding Yu et al. [77] use ensembles of CNNs, a commonly used strategy to reduce the model's variance and, hence, improve performance. Ensembles of networks can be seen as multiple networks initialized with different weights, leading to different responses. The outputs in this case are averaged, but they can be merged by other means, such as majority voting.
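As a minimal illustration of this averaging scheme, and assuming a list of already trained Keras models with identical output layouts, the ensemble prediction can be sketched as follows (the function name is illustrative):

import numpy as np

def ensemble_predict(models, x):
    # Average the softmax outputs of several independently initialized networks
    # and take the class with the highest averaged probability.
    probs = np.mean([m.predict(x) for m in models], axis=0)
    return probs.argmax(axis=1)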

Another commonly used technique is transfer learning: a CNN is first trained, usually on some related dataset, and then fine-tuned on the target dataset. Mao Xu et al. [79] propose a

facial expression recognition model based on transfer features from CNNs for face identification.

The facial expression recognition model transfers high-level features from face identification to


classify them into one of seven discrete emotions with the multi-class SVM classifier. In [80],

another transfer learning method is proposed. It uses AlexNet [35] as pre-trained network, a well-known network trained on a large-scale database. Since the target dataset consists of 2D gray-scale images, the authors perform image transformations to convert the 2D gray-scale images to 3D values.

Another strategy to work around limited datasets is the use of artificial data. Generative Ad-

versarial Networks (GANs) [81] generate artificial examples that must be identified by a discrim-

inator. GANs have already been used in face recognition. The method proposed in [82] generates

meaningful artifacts on state of the art face recognition algorithms, for example glasses, leading

to more data and preventing real noise from influencing the network's final prediction. Such approaches

can possibly be extended to FER as well.

To solve the problem of model generalization on FER methods, curriculum learning [83] can

be a viable option. Initially, the weights favor “easier” examples, or examples illustrating the sim-

plest concepts, that can be learned most easily. The next training criterion involves a slight change

in the weighting of examples that increases the probability of sampling slightly more difficult ex-

amples. At the end of the sequence, the re-weighting of the examples is uniform and the model is trained on

the target training set. In [84] a meta-dataset is created with values of complexity of the task of

emotion recognition. This meta-dataset is generated through a complexity function that measures

the complexity of image samples and ranks the original training dataset. The dataset is split in

different batches based on complexity rank. The deep network is then trained with easier batches

and progressively fine-tuned with the harder ones until the network is trained with all the data (see

Figure 3.5).

Figure 3.5: Framework of the curriculum learning method proposed in [84]. Faces are extracted and aligned and then ranked into different subsets based on curriculum. A succession of deep convolution networks is applied, and then the weights of the fully connected layers are fine-tuned.

Regularization as mentioned before is also one of the factors that contribute to better deep

models. More recently, new methods for regularization were introduced, such as dropout [37],

drop-connect [85], max pooling dropout [86], stochastic pooling [87] and, to some degree, batch

normalization [36]. H. Ding et al. [88] proposes a probabilistic distribution function to model the

high level neuron response based on an already fine tuned face network, leading to regularization

on feature level, achieving state of the art performance for SFEW database.

Besides regularization and transfer learning approaches, the inclusion of domain knowledge is also a common strategy, since it helps discriminate the feature space. The

work of Liu et al. [78] explores the psychological theory that FEs can be decomposed into multiple


action units and use this domain knowledge to build a domain-oriented network for FER. The network can be divided into three stages: first, a convolutional layer and a max-pooling layer are built to

learn the Micro-Action-Pattern (MAP) representation, extracting domain knowledge from the data,

which can explicitly depict local appearance variations caused by facial expressions. Then, feature

grouping is applied to simulate larger receptive fields by combining correlated MAPs adaptively,

aiming to generate more abstract mid-level semantics. As last stage, a multi-layer learning process

is employed in each receptive field respectively to construct group-wise sub-networks. Likewise,

the method proposed by Ferreira et al. [89] builds a network inspired by the physiological evidence that FEs are the result of the motions of facial muscles. The proposed network is an end-

to-end deep neural network along with a well-designed loss function that forces the model to

learn expression-specific features. The model can be divided in three main components where the

representation component is a regular encoding that computes a feature space. This feature space

is then filtered with a relevance map of facial expressions computed by the facial-part component.

The new feature space is fed to the classification component and classified in discrete classes.

With this approach, the network increases the ability to compute discriminative features for FER.

3.4 Expression Recognition

Within a categorical classification approach, there are several works that tackle the categorical sys-

tem in different ways: most works use feature descriptors followed by a classifier to categorically

classify expressions/emotions. Principal Component Analysis (PCA) can be performed before

classification in order to reduce feature dimensionality. This classification can be achieved by us-

ing SVMs [90] [59], Random Forests [91] or k-nearest neighbors (kNN) [67]. More recently, deep networks have been used and, as mentioned in the previous subsection, perform feature extraction and recognition jointly; it is also possible to stop the network at the feature extraction stage, output the extracted features, and then proceed to classification with a separate classifier [76][77][78].

3.5 Summary

Facial expression recognition has a standard pipeline, starting with face detection/alignment, followed by feature extraction and expression recognition/classification. Feature extraction plays a crucial role in the performance of a FER system. It can be performed using traditional methods that take into account geometric or appearance features, or it can be more complex, using convolutional neural networks as feature descriptors. CNNs learn representations of data with levels of abstraction that traditional methods cannot, leading to new features. However, CNN performance depends on the dataset size and on the GPU technology that determines training speed. For FER, the available datasets are scarce, raising the need for methods that regularize, augment and generalize over the available data in order to improve performance. Novel FER methods aim to improve generalization and mitigate dataset scarcity by incorporating prior knowledge into the networks. Prior knowledge can consist of features transferred from other networks and domains but it can also be


domain knowledge that improves discrimination in the feature space. Either way, the works that include prior knowledge indicate that this approach can achieve state-of-the-art results and can help with generalization and with the use of small datasets.


Chapter 4

Implemented Reference Methodologies

Several approaches, from traditional methods to state of the art methods, were implemented. The

methodology followed for each approach will be covered in this chapter.

The implemented traditional approaches include hand-crafted based methods (geometric and

appearance) as well as a conventional CNN trained from scratch. These methods were implemented as baselines for the proposed FER method. In addition, several state-of-the-art methods were also

implemented (i.e., transfer learning and physiological inspired networks) to serve as a starting

point for the proposed method.

Figure 4.1: Illustration of the pre-processing where (a) and (b) are instances from an unconstrained dataset (SFEW [92]) and (c) and (d) are instances from a controlled public dataset (CK+ [93]). The original images are fed to the face detector (a MTCNN framework [50]) and the faces detected will be used as input to the implemented methods.

As a pre-processing step, all methods are preceded by face-detection and alignment. Then, the

images are normalized, cropped and resized. To jointly perform face detection and alignment, the MTCNN is used as face detector [50]. Some examples of pre-processed images are presented

in Figure 4.1.
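A minimal sketch of this pre-processing step, assuming the publicly available mtcnn Python package as detector (the same cascaded framework, though not necessarily the exact implementation used here) and the 120 by 120 crop size adopted later in the implementation details:

import cv2
from mtcnn import MTCNN

detector = MTCNN()

def detect_and_crop(image_bgr, out_size=120):
    # Detect the most confident face, crop it and resize it to out_size x out_size;
    # the five detected landmarks are also returned for the alignment step.
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    detections = detector.detect_faces(image_rgb)
    if not detections:
        return None
    best = max(detections, key=lambda d: d['confidence'])
    x, y, w, h = best['box']
    x, y = max(x, 0), max(y, 0)
    face = cv2.resize(image_rgb[y:y + h, x:x + w], (out_size, out_size))
    return face / 255.0, best['keypoints']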

4.1 Hand-crafted based approaches

Concerning hand-crafted methods, several appearance-based methods as well as a geometric-

based approach were implemented. The geometric approach is based on features computed from

facial key-points (see Figure 4.2), such as:

1. Distances x and y of each key-point to the center point;

2. Euclidean distance of each key-point to the center point;

3. Relative angle of each key-point to the center point corrected by nose angle offset.

These features are then concatenated to form the geometric feature descriptor. Finally, this feature

descriptor is fed into a multi-class SVM for expression classification.
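A minimal sketch of this geometric descriptor, assuming key-points given as (x, y) pixel coordinates and a nose key-point index used for the angle offset (the index is an assumption):

import numpy as np
from sklearn.svm import SVC

def geometric_descriptor(keypoints, nose_idx=0):
    # keypoints: array of shape (K, 2) with facial landmark coordinates.
    center = keypoints.mean(axis=0)
    diffs = keypoints - center                                   # per-point (dx, dy) to the centre
    dists = np.linalg.norm(diffs, axis=1)                        # Euclidean distances to the centre
    nose_angle = np.arctan2(diffs[nose_idx, 1], diffs[nose_idx, 0])
    angles = np.arctan2(diffs[:, 1], diffs[:, 0]) - nose_angle   # angles corrected by the nose offset
    return np.concatenate([diffs.ravel(), dists, angles])

# clf = SVC(kernel='linear', decision_function_shape='ovr')
# clf.fit(np.stack([geometric_descriptor(k) for k in train_keypoints]), train_labels)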

Figure 4.2: Illustration of the implemented geometric feature computation.

The implemented appearance-based FER methods are based on two commonly used tech-

niques for texture classification, namely Gabor filter banks and LBP. Regarding the Gabor filters

approach, a bank of Gabor filters with different orientations, frequencies and standard deviations

was first created. Afterwards, the input images are convolved (or filtered) with the different Ga-

bor filter kernels, resulting in several image representations of the original image. The mean and


variance of the filtered images (image representation) are then used as descriptors for classifica-

tion. In particular, different Gabor-descriptors were extracted, according to the degree of local

information:

• Gabor-global: This feature descriptor consists in the concatenation of global mean and

variance values of each Gabor representation.

• Gabor-local: The Gabor representations are divided into a grid of cells. The mean and

variance of each cell are computed and, then, concatenated to form the feature vector.

• Gabor-kpts: It requires the information of the facial key-points coordinates. In particular,

the mean and variance of the Gabor representations are computed locally in a neighborhood

of each facial key-point.

Then, these feature descriptors are fed into an SVM for expression classification.
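A minimal sketch of the Gabor-global descriptor, assuming the gabor filter of scikit-image and the filter-bank parameters reported in the implementation details (Section 6.1); the local and key-point variants compute the same statistics over grid cells or landmark neighbourhoods:

import numpy as np
from itertools import product
from skimage.filters import gabor

SIGMAS, THETAS, FREQS = (1, 3), (0, np.pi/4, np.pi/2, 3*np.pi/4), (0.05, 0.25)

def gabor_global_descriptor(gray_face):
    # Convolve the face with each filter of the bank and keep the mean and
    # variance of the (real) response: 16 filters x 2 statistics = 32 features.
    feats = []
    for sigma, theta, freq in product(SIGMAS, THETAS, FREQS):
        real, _ = gabor(gray_face, frequency=freq, theta=theta,
                        sigma_x=sigma, sigma_y=sigma)
        feats.extend([real.mean(), real.var()])
    return np.asarray(feats)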

Regarding the implemented LBP-based approach, the LBP representation of the input image

is first computed and then this LBP representation is used to build a histogram of the LBP pat-

terns. For an extra level of rotation and luminance invariance, only the uniform LBP patterns [94]

were extracted. Similarly to the Gabor-based approach, different LBP feature descriptors were

computed:

• LBP-global: This feature vector consists of the global histograms of the LBP pat-

terns.

• LBP-local: The LBP representations are divided into a grid of cells. Then, the histograms

of the LBP representations of each cell are computed and, then, concatenated to form the

feature vector.

• LBP-kpts: The histograms of the LBP representations of the region around each facial

key-point are concatenated to form the feature vector.

For classification, these LBP-based feature descriptors are also used to train a multi-class SVM.
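A minimal sketch of the LBP-global and LBP-local descriptors, assuming the local_binary_pattern function of scikit-image with the uniform mapping and the neighbourhood, radius and cell size reported in the implementation details:

import numpy as np
from skimage.feature import local_binary_pattern

P, R = 8, 8   # number of neighbours and radius

def lbp_global_descriptor(gray_face):
    # Histogram of uniform LBP patterns computed over the whole face.
    lbp = local_binary_pattern(gray_face, P, R, method='uniform')
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def lbp_local_descriptor(gray_face, cell=10):
    # Concatenation of the per-cell histograms over a grid of cell x cell pixels.
    lbp = local_binary_pattern(gray_face, P, R, method='uniform')
    h, w = lbp.shape
    hists = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            block = lbp[i:i + cell, j:j + cell]
            hist, _ = np.histogram(block, bins=P + 2, range=(0, P + 2), density=True)
            hists.append(hist)
    return np.concatenate(hists)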

Moreover, different combinations of these methods were also evaluated. That is, LBPs were applied to Gabor representations, and geometric features were concatenated with LBP features.

4.2 Conventional CNN

Hand-crafted methods have been widely used in image recognition problems; however, broader representations of the image can be extracted using deep neural networks. Convolutional layers

take image or feature maps as the input, and convolve these inputs with a set of filter banks in a

sliding-window manner to output feature maps that represent a spatial arrangement of the facial

image. The weights of convolutional filters within a feature map are shared, and the inputs of the


feature map layer are locally connected. Second, sub-sampling layers lower the spatial resolution

of the representation by averaging or max-pooling the given input feature maps to reduce their

dimensions and thereby ignore variations in small shifts and geometric distortions [95]. A deep

neural network was implemented and trained from scratch. The architecture, regularization and

learning strategies of the implemented CNN from scratch are described in sections 4.2.1, 4.2.2 and

4.2.3, respectively.

4.2.1 Architecture

The architecture of the implemented model is presented in Figure 4.3. The architecture can be

divided into two main parts: representation computation, Xr, and classification, Xc.

Figure 4.3: Architecture of the conventional deep network used.

As the schema suggests, the representation module corresponds to Xr and the classification module to Xc. Xr is a functional block that takes as input the pre-processed images x and computes new representations of the data. It consists of sequences of two consecutive 3x3 convolutional layers, with rectified linear units (ReLU) as non-linearities, followed by a 2x2 max-pooling operation for down-sampling. Regularization layers, covered next, can be included between the convolutional layers. The classification module, Xc, consists of a sequence of fully connected layers, where the last layer is a softmax layer that outputs the probabilities for each class label, y. Between the fully connected layers, the inclusion of regularization layers was assessed, as presented in Section 6.1.
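A minimal Keras sketch of this architecture, assuming a single-channel 120 by 120 input and illustrative filter counts; the exact depth, number of units and regularization layers are the hyperparameters searched in Table 6.1:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def build_scratch_cnn(input_shape=(120, 120, 1), n_blocks=3, fc_units=512, n_classes=8):
    model = Sequential()
    filters = 32
    for b in range(n_blocks):
        # Xr: two consecutive 3x3 convolutions with ReLU, then 2x2 max-pooling.
        if b == 0:
            model.add(Conv2D(filters, (3, 3), activation='relu', padding='same',
                             input_shape=input_shape))
        else:
            model.add(Conv2D(filters, (3, 3), activation='relu', padding='same'))
        model.add(Conv2D(filters, (3, 3), activation='relu', padding='same'))
        model.add(MaxPooling2D((2, 2)))
        filters *= 2
    # Xc: fully connected layers ending in a softmax over the expression classes.
    model.add(Flatten())
    model.add(Dense(fc_units, activation='relu'))
    model.add(Dense(fc_units, activation='relu'))
    model.add(Dense(n_classes, activation='softmax'))
    return model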

4.2.2 Learning

The model is trained in order to return the predictions y on class labels. The goal of training is to

minimize the following loss function:

L_{classification} = -\sum_{i=1}^{N} y_i^{T} \log(\hat{y}_i),   (4.1)

where y_i is a column vector with the one-hot encoding of the class label for input i and \hat{y}_i are the softmax predictions of the model. The Adaptive Moment Estimation (Adam) algorithm is used as optimizer. It

computes adaptive learning rates for each parameter, keeping an exponentially decaying average of past squared gradients and an exponentially decaying average of past gradients. The learning rate


(Lr) is optimized by means of a grid-search procedure over the range of values presented in Table 6.1.

4.2.3 Regularization

Due to the high representational capacity and the large number of parameters estimated by deep models, overfitting is a common problem. Regularization techniques penalize the weight matrices of the nodes. As described in the following subsections, a wide range of regulariza-

tion strategies were applied to the implemented CNN. These regularization techniques were also

applied to the remaining networks that will be presented in this dissertation.

4.2.3.1 Data-Augmentation

The simplest way to reduce overfitting is to increase the size of the training data; however, in some domains this is not always possible, so data augmentation is needed. Data augmentation consists in synthesizing an arbitrary number of artificial training samples through different image transformations and noise addition. Here, a randomized data augmentation scheme based on geometric transformations is applied during the training step. The purpose of data augmentation is to increase the robustness of the model by training on a wider range of face positions, poses and viewing angles. When performing data augmentation, the transformations applied to the image have to be chosen so as not to corrupt the corresponding label. For instance, vertical flips are not performed, since some images would lose their assigned label and would corrupt the classification system. The data augmentation process is applied in an online fashion, within every iteration, to all the images of each mini-batch. The following equation represents the geometric transformations used to augment the

training data:

\begin{bmatrix} x' \\ y' \end{bmatrix} =
\begin{bmatrix} s & 0 \\ 0 & s \end{bmatrix}
\begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}
\begin{bmatrix} 1-p & p \\ p & 1-p \end{bmatrix}
\begin{bmatrix} x - t_1 \\ y - t_2 \end{bmatrix},   (4.2)

where θ is the rotation angle, t_1 and t_2 define the translation parameters, s defines the scale factor and p is a binary variable for the horizontal flip. Pixels mapped outside the original image are assigned the value of the closest existing pixel.

The parameters for each transformation are presented in Section 6.1, and the range for each transformation is chosen in order to ensure that the image is never corrupted by abrupt transformations. Some instances of the augmented data are presented in Figure 4.4.
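A minimal sketch of this online augmentation using the Keras ImageDataGenerator utility; the ranges mirror the ones reported in Section 6.1 and the fill mode reproduces the closest-pixel rule above:

from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=5,          # rotation angle up to 5 degrees
    width_shift_range=0.05,    # horizontal translation up to 5% of the width
    height_shift_range=0.05,   # vertical translation up to 5% of the height
    zoom_range=0.05,           # random zoom in [0.95, 1.05]
    horizontal_flip=True,      # vertical flips are deliberately not used
    fill_mode='nearest')       # out-of-image pixels take the closest pixel value

# model.fit_generator(augmenter.flow(x_train, y_train, batch_size=64), ...)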

4.2.3.2 Dropout

Dropout is commonly used in the fully connected layers and not on the convolutional layers since

it affects the number of parameters of the network and this number increases in the fully connected


Figure 4.4: Examples of the implemented data augmentation process. For each pair, the left image corresponds to the original image and the right to the respective transformation: (a) Horizontal Flip; (b) Rotation; (c) Zooming; (d) Width Shift; (e) and (f) Height Shifts.

layers. Given this, whether dropout is applied in the representation module or only in the classification module is controlled by a binary variable, d. Its magnitude, D, is also searched and is defined in Table 6.1, Chapter 6.

4.2.3.3 L2

L2 regularization is applied to each new feature computation, i.e., to each convolutional layer. The penalization factor is optimized, and a detailed description is presented in the implementation details (Table 6.1, Chapter 6).

4.2.3.4 Early-Stopping

In order to apply early stopping during training, it is necessary to define the patience, p, a hyper-parameter of the network that denotes the number of epochs with no further improvement after which training is stopped (see Table 6.1 in Chapter 6).

4.2.3.5 Batch-Normalization

This layer has a defined momentum and is parameterized by γ and β. Batch normalization, like dropout, can be applied between the convolutional layers or only between the fully connected layers, as defined by a binary variable, B.
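A minimal Keras sketch combining these regularization strategies on the classification (dense) part; the magnitudes are taken from the hyperparameter sets in Table 6.1 and the helper name is illustrative:

from keras import regularizers
from keras.layers import Dense, Dropout, BatchNormalization
from keras.callbacks import EarlyStopping

def dense_block(units=512, drop=0.4, l2_factor=1e-4, use_bn=True):
    # One fully connected layer with L2 weight penalty, optionally followed by
    # batch normalization (gamma initialized to ones, beta to zeros) and dropout.
    layers = [Dense(units, activation='relu',
                    kernel_regularizer=regularizers.l2(l2_factor))]
    if use_bn:
        layers.append(BatchNormalization())
    layers.append(Dropout(drop))
    return layers

early_stop = EarlyStopping(monitor='val_loss', patience=45)  # patience used during training
# model.fit(..., epochs=500, callbacks=[early_stop], validation_data=(x_val, y_val))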

4.3 Transfer Learning Based Approaches

It is rare to train an entire convolutional network from scratch with random weight initialization, because data is scarce and generic features can be reused across different models and problems. When performing transfer learning, a model is first trained on a base network, dataset and task; then, the learned features are transferred to the desired target and trained on the new dataset.

The success of the method highly depends on the generalization of the features extracted and how


similar the first task is to the target task. Two different networks were used as pre-trained models: VGG16 and FaceNet. They hold state-of-the-art performance in most object recognition problems, and the datasets on which they were trained belong to a similar domain; therefore, potentially common features will be used for classification.

4.3.1 VGG16

VGG16 is a pre-trained network on the ImageNet Large Scale Visual Recognition Challenge

dataset. The original dataset contains 1.2 million training images with another 50,000 images

for validation and 100,000 images for testing. The goal of this image classification challenge is to

train a model that can correctly classify an input image into 1,000 separate object categories. The

categories correspond to common object classes as dogs, cats, houses, vehicles and so on [96].

VGG16 uses only 3 x 3 convolutional layers stacked on top of each other in increasing depth.

Reducing volume size is handled by max pooling. Two fully-connected layers, each with 4,096

nodes are then followed by a softmax classifier. As the name suggests, VGG16 has 16 weight

layers. A representation of VGG16 architecture is presented in Figure 4.5.

Figure 4.5: Network configurations of VGG, from [96]. The pre-trained network used corresponds to configuration D (VGG16).

There are two steps in the training of the pre-trained network. First, VGG16 is used as a fixed feature extractor: the fully connected layers are removed and replaced by fully connected layers adapted to our dataset. The hyper-parameters of the new dense layers (number of units, regularization and number of layers) are optimized by means of grid search (see their range of values in Table 6.1, Chapter 6). In the second step, the convolutional layers fixed before are trained on the FER dataset in order to fine-tune the weights. All the layers are trained in the second step.
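A minimal sketch of this two-step strategy, assuming the ImageNet VGG16 weights shipped with keras.applications; the dense head, dropout magnitude and learning rates are illustrative values taken from the ranges in Table 6.1:

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense, Dropout
from keras.optimizers import Adam

n_classes = 7  # e.g., seven expressions for SFEW

# Step 1: frozen convolutional base with a new dense head adapted to FER.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False
x = Flatten()(base.output)
x = Dense(1024, activation='relu')(x)
x = Dropout(0.4)(x)
out = Dense(n_classes, activation='softmax')(x)
model = Model(base.input, out)
model.compile(optimizer=Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(...)  # train only the new dense layers

# Step 2: unfreeze the base and fine-tune all layers on the FER dataset.
for layer in base.layers:
    layer.trainable = True
model.compile(optimizer=Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(...)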


4.3.2 FaceNet

Facenet learns a mapping from face images to a compact Euclidean Space where distances directly

correspond to a measure of face similarity. Once this is done, tasks such as face recognition, veri-

fication, and clustering are easy to do using standard techniques (using the FaceNet embeddings as

features). The training is done using triplets: one image of a face (‘anchor’), another image of that

same face (‘positive exemplar’), and an image of a different face (‘negative exemplar’) [97]. The

dataset consists of 100 to 200 million face thumbnails of about 8 million identities. The training strategy adopted for FaceNet is the same as for VGG16: the convolutional layers are fixed for a certain number of epochs, until the network converges; afterwards, all the layers of the network are fine-tuned. As in the VGG16 strategy, the classification part is also composed of fully connected layers and regularization layers, whose hyperparameters are also optimized using a grid-search approach (as presented in Table 6.1, Chapter 6).

4.4 Physiological regularization

Transfer learning is commonly used for additional feature computation but the benefits are highly

dependent on the source-target domain similarity and the number of parameters to be trained is

increased as well. The base of transfer learning is inductive transfer where the allowed hypothesis

domain is shrunken and the features used are more selective. This selection of the feature space can

also be imposed by using domain knowledge. In fact, in FER, it is known that FEs are the result

of the motions of facial muscles [13].

The physiologically based neural network proposed in [89] is composed of three well-designed

modules: the facial-parts module, the representation module and the classification module.

The purpose of the facial-parts module is to learn an encoding-decoding function that maps

from an input image to a relevance map, x, representing the probability of each pixel being relevant for recognition. This task is trained using a supervised learning approach when the annotation of facial key-points exists. Otherwise, unsupervised learning is performed, where the

loss function enforces sparsity and spatial contiguity on the activations of x.

The representation module is a series of convolutions trained from scratch with random

weight initialization. The representation module aims to learn an embedding function that maps from an input image to a feature space f. The relevance map x, learned in the facial-parts module, is then used to filter the learned representations f into a new feature space, f ′, enforcing them to respond strongly only to the most relevant facial parts.

The classification module is the same as the module presented in Figure 4.3 and it consists of

a sequence of fully connected layers followed by regularization, returning a vector of probabilities

for each class, y.


4.4.1 Loss Function

The goal of the network is to explicitly model relevant local facial regions and expression recog-

nition. Given this, the network has two tasks to be trained: regression of relevance maps, x, and

expression recognition labels, y. The class labels are trained by defining a categorical cross en-

tropy cost function as defined in Eq. 4.1. The regression task is trained using supervised learning

when annotations of key-points exist and unsupervised strategies when only class labels exist.

4.4.2 Supervised term

The supervised learning requires annotation of the true coordinates of key-points located over im-

portant facial components, such as the eyes, nose, mouth and eyebrows. In this scenario, a target

relevance map for each training image is created. As illustrated in Fig. 4.6, for a given training

image, each facial landmark is represented by a Gaussian, with mean at the key-point coordinates

and a predefined standard deviation. The target relevance map is formed by the mixture of the

Gaussians of each facial landmark. The standard deviation should be set to control the neighbor-

hood size around the facial landmarks and is also a hyperparameter of this model. The relevance

Figure 4.6: Original image followed by density maps obtained by a superposition of Gaussians at the location of each facial landmark, with an increasing value of σ.

map is trained with a dedicated loss, L_{relevanceMap}. The goal is to minimize the mean squared error between the target and the predicted relevance maps, such that:

L_{relevanceMap} = \frac{1}{N} \sum_{i=1}^{N} (X_i^{target} - x_i)^2,   (4.3)

where X_i^{target} is the map created by the superposition of the Gaussians of each key-point and x_i is the map predicted by the model for training sample i. Therefore, this loss term encourages the

relevance map x to take high values in the neighborhood of the most important facial components.
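A minimal sketch of how such a target map can be generated, assuming key-points given as (x, y) pixel coordinates, a 120 by 120 map and the standard deviation of 21 reported later in the implementation details; the final normalization is an assumption:

import numpy as np

def target_relevance_map(keypoints, shape=(120, 120), sigma=21.0):
    # Superposition of one isotropic Gaussian per facial landmark.
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    target = np.zeros(shape, dtype=np.float32)
    for kx, ky in keypoints:
        target += np.exp(-((xx - kx) ** 2 + (yy - ky) ** 2) / (2.0 * sigma ** 2))
    return target / target.max()   # scaled so the strongest activation is 1 (assumption)

def relevance_map_mse(target, predicted):
    # Mean squared error between target and predicted maps (cf. Eq. 4.3).
    return np.mean((target - predicted) ** 2)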

4.4.3 Unsupervised Term

Contrary to the supervised learning strategy, this training strategy does not require the availability

of the key-point annotations. In this scenario, the facial-module loss, L_{facial\_module}, is defined to

regularize the activations of the relevance map x by imposing sparsity and spatial contiguity as


follows:

L_{facial\_module} = \sum_{i=1}^{N} L_{contiguity}(x_i) + \alpha \sum_{i=1}^{N} L_{sparsity}(x_i),   (4.4)

where α controls the dominance of each component. Sparsity assures that small and disjoint facial

regions are relevant for the recognition and corresponds to L1 regularization. The sparsity term is

defined as follows:

L_{sparsity}(x) = \frac{1}{m \times n} \sum_{m,n} |x_{m,n}|,   (4.5)

where m and n denote the resolution of the relevance map x. The contiguity term enforces the

activations of x to be smooth and spatially localized. Contiguity corresponds to the total variation

regularization and is defined by:

L_{contiguity}(x) = \frac{1}{m \times n} \sum_{m,n} \left( |x_{m+1,n} - x_{m,n}| + |x_{m,n+1} - x_{m,n}| \right)   (4.6)
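A minimal sketch of these weak-supervision terms, written with the Keras backend and assuming relevance maps of shape (batch, m, n, 1) on a TensorFlow backend:

from keras import backend as K

def sparsity_loss(x):
    # L1 penalty on the activations of the relevance map (cf. Eq. 4.5).
    return K.mean(K.abs(x))

def contiguity_loss(x):
    # Total-variation penalty enforcing smooth, spatially contiguous maps (cf. Eq. 4.6).
    dv = K.abs(x[:, 1:, :, :] - x[:, :-1, :, :])
    dh = K.abs(x[:, :, 1:, :] - x[:, :, :-1, :])
    return K.mean(dv) + K.mean(dh)

def facial_module_loss(x, alpha=1e-3):
    # Weakly supervised facial-module loss (cf. Eq. 4.4).
    return contiguity_loss(x) + alpha * sparsity_loss(x)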

Chapter 5

Proposed Method

Inspired by the state-of-the-art methods presented before and by the idea that prior knowledge plays a crucial role in FER, the proposed method is a deep neural network architecture with an encoding that corresponds to a pre-trained network, followed by a stage with a loss function that jointly learns the most relevant facial parts along with expression recognition.

The proposed network is divided into three main modules: the representation module, the facial module and the classification module. The regression maps of the facial module are obtained from a balance between map supervision and contiguity and sparsity impositions, contrary to the physiologically based network, where only one learning approach was applied at a time. The representation module also differs from the previous networks, since the feature space produced is a merge operation between the relevance maps and the features transferred from a pre-trained network. Transfer learning approaches are limited by the domain similarity, but a hybrid approach in which only low-level feature spaces are transferred can, in fact, induce the network to compute additional features. The proposed method intends to obtain representations from early stages of pre-trained networks and then transform the transferred features into a wider feature space that highlights relevant facial regions for FER.

Concerning the facial module, the approach will be similar to the physiological regularization

network seen before but including both supervised and unsupervised learning of the relevance

maps. Models that impose domain knowledge are desired since they induce a bias in the network

to compute relevant features in the problem to be solved. The facial module presented before

introduces domain knowledge but the knowledge introduced has to be computed carefully since

it will affect all the activations from the representation module. A supervised approach tells the

network the exact areas but it can "over-fit" in a sense that the model ignores potential relevant

features or gives the same weights to regions that contribute differently to the expression. For

instance, wrinkles or dimples can be decisive for the classification but they are not included in

the relevance map, and ideally the network should compute features capable of considering these cues. These cues correspond to small and disjoint facial regions and can be valued by

unsupervised approaches that enforce sparse and contiguous features. The proposed network uses

a facial module that will produce relevance maps learned through supervised maps and through


mathematical impositions such as sparsity and contiguity. Therefore, the facial module will have

regions of reference for relevant features but freedom to compute additional sparse features that

are not included in the targeted relevance maps.

5.1 Architecture

As presented in Figure 5.1, the network is composed by three main modules: the representation

module, the facial module and the classification module. Each module will be covered in detail in

the following subsections.

Figure 5.1: Architecture of the proposed network. The relevance maps are produced by regression from the facial component module, which is composed by an encoder-decoder. The maps x are operated (⊗) with the feature representations (f) that are output by the representation module and then fed to the classification module, predicting the class probabilities (y).

5.1.1 Representation Module

The representation module contains a series of layers that will encode a set of high-level features

to be used by the proposed network. With x denoting the input images and Xr the representation module, the high-level feature representation f can be written as f = Xr(x).

The proposed network will evaluate the representation module by computing the encoded

features f from an encoder designed from scratch or by importing the features f from a pre-trained network. When evaluating the scratch encoder, the representation module corresponds only to the encoding block presented in Figure 5.1 and consists of a series of convolutional layers. The encod-

ing block has the same architecture as the Xr block in the scratch network. The hyperparameters

used for this encoding block (number of layers, depth, regularization performed, etc.) correspond

to the best hyperparameters of the scratch network and are presented in Table 6.1.

When evaluating the representation module with a pre-trained network the encoding block

corresponds to a set of convolutional layers taken from the pre-trained network. The main

idea is to extract the features from a specific layer and then use this feature representation to be

refined by the relevance maps. The pipeline for feature extraction from the pre-trained network is

presented in Figure 5.2.

When the representation module is based on a pre-trained network, the feature space extracted

is fed into a series of additional convolutional layers, returning the feature space ( f ) that will be


Figure 5.2: Pipeline for feature extraction from Facenet. Only the layers before pooling operations are represented. GAP - Global Average Pooling.

used in the merge operation. The additional convolutions assure the computation of more complex

and high level features since the extracted features came from the pre-trained network in an early

stage. The resulting feature space, f , is then operated with the relevance maps obtained from the

facial module. The merge operation ⊗ between the feature space f and the relevance map x has

two possible approaches that will be evaluated, an element-wise product or a concatenation:

f ′ = f ⊗ x, (5.1)

where f are the activations of the representation module (learned features) , with N feature maps,

that is merged with the relevance map x. The merge operation can be an element-wise product

that returns a new set of features f ′ with N feature maps. Alternatively, the merge operation can

represent a concatenation between the feature maps from representation module and the relevance

map from facial module. In this scenario, the output will be a concatenated set of features with

the previous N feature maps plus the map of relevance (N+1 feature maps). It is necessary that the

terms operated have the same dimensions; therefore, the relevance map undergoes a pooling operation to match the dimensions of the feature space f. Due to the need for a similar semantic level between the operated terms, the features extracted from the pre-trained network have to come from an intermediate layer where the feature maps are sized 17 by 17.
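A minimal Keras sketch of this merge operation; the pooling factor that brings the relevance map to the 17 by 17 resolution of the features is an assumption, and both merge variants are shown:

from keras.layers import AveragePooling2D, Concatenate, Lambda

def merge_features(f, relevance_map, mode='product', pool_size=7):
    # Pool the single-channel relevance map so that it matches the spatial size of f.
    x_small = AveragePooling2D(pool_size=(pool_size, pool_size))(relevance_map)
    if mode == 'product':
        # Element-wise product: the map is broadcast over the N feature maps of f.
        return Lambda(lambda t: t[0] * t[1])([f, x_small])
    # Concatenation: the output holds the previous N feature maps plus the map (N + 1).
    return Concatenate(axis=-1)([f, x_small])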

5.1.2 Facial Module

The facial module can be seen as an encoder-decoder where a convolutional path is followed by a

deconvolution path, in such a way that it is possible to learn a mapping from an input image, x, to a relevance map x. A scheme of the facial module can be found in Figure 5.3.

The convolutional path follows the typical architecture of a fully convolutional network sim-

ilar to the scratch network presented. It comprises several sequences of two consecutive 3x3

convolutional layers, with rectified linear units (ReLUs) as non-linearities and L2 regularization,

followed by a 2x2 max-pooling operation for down-sampling. The number of convolutional filters

is doubled at each max-pooling operation. The sequences of pooling and transpose operations are

represented by the Xe and Xd respectively and are repeated according to the desired depth of the


Figure 5.3: Facial Module architecture. A regression map of facial components x is obtained after sequences of convolutions (Xe) and deconvolutions (Xd).

network. For each pooling-transpose operation a skip-connection is implemented. This way, sub-

sequent layers can re-use middle representations, maintaining more information which can lead to

better performances.

Every step in the deconvolution path comprises a 2x2 transpose convolution and two 3x3 con-

volutions, each one followed by a ReLU and regularized by L2. The transpose convolution is

applied for up-sampling and densification of the incoming feature maps. At the final layer, a 3x3 convolution with an activation function (the activation that produces the best maps, sigmoid or linear, is evaluated) is used to map the activations into a probability relevance map x.

5.1.3 Classification Module

The classification module architecture has the same structure as the Xc module of the scratch neural network in Figure 4.3. It consists of a sequence of fully connected layers followed by regularization, ending in a vector of probabilities for each class, y.

5.2 Loss Function

There are two main tasks performed by the network: the regression of relevance maps and the

classification task. The goal of the model is to minimize a loss function composed of the loss of each task, as shown in the following equation:

L = L_{classification} + L_{facial\_module}   (5.2)

L_{classification} is the categorical cross-entropy defined in Equation 4.1 for the scratch neural network. The facial module is learned through an interaction between three terms: a term responsible for the supervised learning of the relevance maps, L_{supervised}, which corresponds to the mean squared error between the produced map, x, and the targeted map, x^{target} (see Eq. 4.3).


It should be noted that the proposed method requires annotation of facial landmarks in order to

compute the targeted map. Some datasets, such as CK+ [93], provide facial landmark annotations but, for instance, the SFEW database [92] is weakly annotated: only emotions are annotated. To solve this, a key-point detector is applied, automatically generating facial landmarks for each training image. The key-point detector is the framework presented by Bulat et al. [98]. Some instances of the application of the key-point detector over the training set of SFEW are presented in Figure 5.4.

Figure 5.4: Illustrative examples of the facial landmarks computation for the SFEW dataset using the framework proposed in [98]. Each pair of images contains the original image with the facial landmarks superimposed (left side) and the corresponding target density map (right side).

The terms that integrate the weak supervision of the maps impose sparsity and contiguity (see Eq. 4.4).

L_{facial\_module} = \gamma L_{supervised} + \lambda L_{sparsity} + \alpha L_{contiguity}   (5.3)

The factors γ , λ and α are positive values and will balance the weight of each member (see Table

6.1). The task of regression of the maps now favors a balance between sparse and contiguous

representations and expression-specific regions. Since this task is optimized intermediately in the

network, the classification task will depend on the relevance maps.

5.3 Iterative refinement

Some state of the art approaches for recognition tasks apply iterative strategies where some density

map needs to be refined. For instance, Cao [99] proposes a network for pose estimation through

part affinity fields estimated recursively. For the presented network, the relevance maps obtained

assume a crucial role in the classification, since they will be merged with the computed features. The refinement of these relevance maps may improve the model. To implement this strategy, the task of map regression is defined as a stage of the network and each stage is an iterative

prediction architecture, following Wei et al. [100], which refines the predictions over successive

stages with intermediate supervision at each stage.


Figure 5.5: Architecture of the proposed network for iterative refinement. The maps (x) produced by regression from the facial component module, composed by an encoder-decoder, are operated (⊗)

with the feature representations ( f ) that are outputted by the representation module Xr. Theresultant features can be fed to an additional stage that computes a new relevance map x. Thefinal feature representations f ′ are then fed to the classification module Xc, predicting the classesprobabilities y.

When the network is implemented with a recursive approach, the feature space returned by

the merge operation, f ′, will be the input for a new stage. For stage >=1, the feature space from

the first encoding, f , is supplied to each merge operation, allowing the classifier to freely combine

contextual information by picking the most predictive features (see Figure 5.5). It is expected that

a new and more refined map is generated in each stage. The number of stages, nstg, is defined when

designing the network and can be found in table 6.1. When the network reaches the last stage, the

final feature space is then fed into the classification module, Xc, returning the class probabilities,

y.

Chapter 6

Results and Discussion

The experimental evaluation of the implemented methods was performed using two publicly available databases in the FER research field: the Extended Cohn-Kanade (CK+) database [93] and the Static Facial Expressions in the Wild (SFEW) database [92]. Datasets used in FER can be grouped by the nature of the environment: controlled environments, where illumination and pose are defined, and uncontrolled environments, where external conditions and pose are not controlled. CK+ images are acquired in a controlled environment and annotated with 8 expression labels (6 basic plus neutral

and contempt). It has limited gender, age and ethnic diversity and contains only frontal views with

homogeneous illumination.

The other dataset used, SFEW, is targeted for unconstrained FER. It is the first database that

depicts real-world or simulated real-world conditions for expression recognition. The images are

all extracted from movies and labeled with the six primary emotions plus neutral expression [101].

Therefore, there is a wide range of poses, viewing angles, occlusions, illumination conditions and,

hence, the recognition is much more challenging. Samples from each database used can be found in Figure 6.1.

Figure 6.1: (1) - Samples from the CK+ dataset, where images were acquired under controlled environments [102]. (2) - Samples from the SFEW dataset: images of spontaneous expressions acquired in uncontrolled environments [92].


6.1 Implementation Details

As a common pre-processing across all methods, the multi-task CNN face detector [50] is used for

face detection and the images were normalized, cropped and resized to 120 by 120 pixels except

for methods that are based on pre-trained networks. When using FaceNet as a pre-trained network, the images were resized to 160 by 160 pixels, and to 224 by 224 pixels when using VGG16.

Regarding the traditional approaches, the grid cell size is 10x10, the window of Gabor-kpts

and LBP-kpts is 16x16. For LBP, a neighborhood and a radius of 8 are used. The Gabor filter

bank comprises 16 filters with different values of σ: {1, 3}, θ: {0, π/4, π/2, 3π/4} and f: {0.05, 0.25}.

The data augmentation performed consisted in geometric transformations where the rotation

angle θ is randomly sampled up to 5π/180 rad (5 degrees). The scale factor s, that defines random zoom over the

image is a random value from the interval [0.95,1.05]. The translation parameters t1 and t2 are

randomly sampled fraction values up to 5 % of the image height and width. For inputs with 120

by 120, the translational parameters assume integer values from the interval [0,6]. The horizontal

flip, p, is a boolean variable, assuming True or False value.

The hyperparameters of the models are optimized by means of grid search, using a validation set taken from the training set. The hyperparameter sets used can be found in Table 6.1. The methods imple-

mented can be categorized in three main approaches: CNN from scratch, Physiological inspired

Network and the proposed network. For each approach there are multiple sets of hyperparameters

that were optimized. The parameters include common parameters across the approaches such as

dropout magnitude on the dense layers, D, dimension of fully-connected layers, FCu, learning rate,

Lr, magnitude of L2 and a boolean that determines the use of batch normalization, Bd , between

fully-connected layers. The number of dense layers was set to 3. Concerning batch-normalization,

the weights of β were initialized with zeros and the weights of γ with ones. For all experiments,

500 epochs were defined to train each network, setting the patience of Early-Stopping to 45 epochs.

The gaussians used to form the relevance maps were obtained using a standard deviation of 21.

For a fair comparison with other methods, the architecture of the CNN from scratch was opti-

mized. The number of functional blocks defined previously as Xr defines the depth of the network

and is also a hyperparameter. Regularization between convolutional layers is also optimized.

For the physiological inspired network the best activation function that returns the relevance

map x is searched, as well as the coefficients that control the interaction between relevance map

regression and the classification task (λ when only full supervision is performed, and λ and γ when only contiguity and sparsity are imposed).

The proposed network is optimized by searching the best parameters that create an accurate

relevance map for classification (α , γ and λ define the interaction between supervision, contiguity,


Table 6.1: Hyperparameter sets.

Common
  Dropout Dense Magnitude               D      {0.3; 0.4}
  Batch Normalization Dense             Bd     {True; False}
  Dense Units                           FCu    {1024; 512}
  Learning Rate                         Lr     {1e-4; 1e-5}
  L2 Regularizer Factor                 L2     {1e-3; 1e-4}

Scratch
  Architecture Blocks                   Xr     {3; 4}
  Batch Normalization in Conv Layers    Br     {True; False}
  Dropout in Conv Layers                dr     {True; False}

Physiological Inspired Nets
  Maps Activation Function              Ac     {Linear; Sigmoid}
  Fully Supervision                     λ      {1; 2; 5}
  Weakly Supervision                    λ      {1e-3; 1e-4; 1e-5}
                                        γ      {1e-3; 1e-4; 1e-5}

Proposed Network
  Supervision Factor                    λ      {1; 2; 5}
  Weakly Supervision Factor             α      {1e-3; 1e-4}
                                        γ      {1e-3; 1e-4}
  Maps Activation Function              Ac     {Linear; Sigmoid}
  Merge Operation                       ⊗      {Concat; Product}
  Number of Stages                      nstg   {1; 3}
  Representation Module                 Xr     {Scratch; Facenet}

sparsity and classification in the loss function). Besides the activation function, the merge opera-

tion between the relevance map x and the computed features is searched between a concatenation

or a point-wise product. For evaluating the iterative refinement strategy over the map refinement,

the appropriate number of stages, nstg, is searched. The representation module, Xr, which produces the features to be merged with the maps, has two possible sources: a scratch implementation or a pre-trained FaceNet network.

All deep models are implemented in Keras with TensorFlow as backend. All models are trained with the Adam optimization algorithm using a batch size of 64 samples. No learning rate decay was used.

6.2 Relevance Maps

The maps computed by the methods that intend to use prior knowledge to capture semantic fea-

tures related to facial expression were output and analyzed. The predicted relevance maps, x, can be found in Figure 6.2. The map predicted by the pro-

posed method (column 5) is placed next to the maps predicted by networks that only use weak supervision (column 3) or only full supervision of the maps (column 4). The activations of these

maps are strong around relevant facial components in the three learning schemes and introduce ad-

ditional discriminative representations. As expected, the maps from the fully supervised learning

approach are the most similar to the target map (column 2) since they were trained to minimize the


Figure 6.2: Examples of predicted relevance maps for the different methods used. (1) Original samples from the CK+ dataset. (2) Target relevance map. (3) Predicted map using only weak supervision from the physiological inspired net. (4) Predicted relevance map from the fully supervised scheme of the physiological inspired net. (5) Predicted relevance map from the proposed network using both types of supervision.

mean square error between these two maps. Therefore, strong information around facial landmarks is given to the model, but peculiar regions that are not encoded as facial landmarks are not activated (e.g., wrinkles), forcing the network to ignore potential regions that encode some

expression. On the other hand, although the weakly supervised learning does not use any informa-

tion about facial landmark location, it creates maps that are sparse and spatially localized around

important facial components as well as expression wrinkles. Since no supervision is performed in

this learning scheme, an exhaustive optimization of hyper-parameters is needed to obtain a suit-

able relevance map. Due to the high number of hyper-parameters and models computed, it was

not possible to do a proper optimization of hyper-parameters for the weakly supervision scheme.

As it presented in table 6.1, only two values for each hyper-parameter were tested. For this reason,

as it shows in Figure 6.2, column 3, the produced maps are weakly highlighted expression specific

regions. An extensive optimization would be needed in order to produce accurate relevance maps

with more contrast between activations.


The proposed method includes the two types of learning, allowing for an interaction between the different terms. The coefficients γ, α and λ define the degree of freedom of the activated features. The higher λ is, the closer the maps will be to the target map and the fewer additional features will come from the weakly supervised terms. Column 5 presents the maps produced by the proposed method. Almost all of the areas surrounding facial landmarks are encoded in these maps, along with other regions that were only present in the maps from the weak supervision approach. The chin is highlighted in the three samples, and the wrinkles and dimples from the first row are also present.

Within the highlighted regions, the activations have different magnitudes, being more specific in encoding key structures for expression recognition. For instance, Figure 6.3 illustrates the maps generated for the same expression of the same subject at different moments in time. Although all frames show an anger expression, it is clear that the expression is more intense in the last frames. Since the target maps are generated taking into account only the facial landmarks, they are similar in all frames and do not represent the different intensities of the anger expression. With the proposed model this dimension of the expression is covered: wrinkles around the eyebrows and the nose are highlighted more strongly the more intense the expression is. This property of the relevance maps forces a better discrimination of the generated feature space.

Figure 6.3: Frame-by-frame analysis of the relevance maps. The first row corresponds to the original images, the second row presents the target maps and the third row presents the relevance maps generated by the proposed model. The first column represents a neutral expression while the remaining columns represent the anger expression.


6.3 Results on CK+

CK+ contains 327 annotated image sequences with 8 expression labels: the 6 basic emotions plus the neutral and contempt ones. Each video starts with a neutral expression and reaches the expression peak in the last frame. Similar to other works [103, 89], the first frame and the last three frames of each video were extracted, resulting in a subset of 1308 images. Figure 6.4 shows the class distribution for the CK+ dataset. All the splits are stratified and therefore maintain the original class distribution.

Figure 6.4: Class distribution on CK+ [93].

For model selection and evaluation, the data is stratified and randomly split three times into training and test sets. In each split, 80% of the original set corresponds to the training set and 20% to the test set. Each training set is further divided, also with subject independence and in a stratified way, into 80% for training and 20% for validation. For each split, the validation set is used to validate the training and the selected model is evaluated on the test set. The performance is assessed by computing the average accuracy and loss over the three test sets.
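A minimal sketch of how such subject-independent splits can be generated with scikit-learn is shown below. It assumes NumPy arrays of class labels and subject identifiers are available; GroupShuffleSplit keeps subjects disjoint between sets but does not strictly enforce stratification, so this is an approximation of the protocol rather than the exact code used in this work.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def make_splits(labels, subjects, n_splits=3, seed=0):
    """Yield three (train, val, test) index triples with subject independence."""
    outer = GroupShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    for train_val_idx, test_idx in outer.split(labels, labels, groups=subjects):
        # Split the remaining 80% again into 80% train / 20% validation,
        # still grouping by subject so no identity appears in two sets.
        inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
        tr, val = next(inner.split(train_val_idx, groups=subjects[train_val_idx]))
        yield train_val_idx[tr], train_val_idx[val], test_idx
```

Each yielded triple can then be used to train, validate and test one model, with the reported performance being the average accuracy and loss over the three test folds.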

The experiments on CK+ are presented in Table 6.2 and in Table 6.3. Table 6.2 presents the results on CK+ using traditional methods based on hand-crafted features. Geometric features outperform appearance-based methods, holding the best performance among the hand-crafted methods. This shows the significance of facial landmarks for FER, since these features alone reach an accuracy of almost 80%. Within the appearance methods, LBP and Gabor show similar performances. All the traditional approaches are outperformed by convolutional neural networks by a significant margin: the best traditional approach (geometric features + LBP around keypoints) differs by almost 10% from the weakest method using convolutional neural networks (the CNN from scratch).


Table 6.2: Performance achieved by the traditional baseline methods on CK+.

Hand-crafted Features                                              Loss   Acc (%)
Geometric                 Geometric                                0.74   79.76
Appearance                LBP Global                               1.59   46.35
                          Gabor Global                             1.62   43.11
                          LBP Local                                0.81   67.82
                          Gabor Local                              0.93   65.53
                          LBP kpts                                 0.91   72.41
                          Gabor kpts                               0.93   70.13
                          Gabor Global + LBP Global                1.64   42.25
                          Gabor Local + LBP Local                  0.82   69.12
                          Gabor + LBP kpts                         0.67   77.26
Geometric + Appearance    Geometric + LBP kpts                     0.55   79.76

Table 6.3 evaluates the methods based on convolutional neural networks. The proposed method is compared with a CNN trained from scratch, which works as a baseline, and with methods that hold the state of the art in FER. Although it has the lowest score, the CNN from scratch is a strong baseline, since its result is close to that of state-of-the-art methods: it has strong regularization applied and its representation module has the same architecture as in the proposed method and in the fully and weakly supervised learning approaches. Among the pre-trained networks, Facenet beats VGG16, as expected, since the domain of the database on which Facenet was trained is similar to the databases used here. As stated in [89], the inclusion of physiological knowledge is better than a CNN from scratch, showing that domain knowledge can improve the model. Comparing the two approaches for knowledge inclusion, imposing sparsity and contiguity gives better results than a supervised approach with maps of facial landmark regions. This can be explained by the fact that weak supervision allows the activation of regions that are not present in the target maps, such as wrinkles.

The proposed method pre-trained on Facenet beats all the other approaches, both in average accuracy and in average loss. The proposed method, which includes both approaches to learning the maps, also outperforms all networks except the one pre-trained on Facenet when its representation module comes from the scratch CNN. Besides outperforming the other networks trained from scratch, it outperforms a pre-trained approach (VGG16) that was trained on millions of images with a more complex network. This shows that a simpler network, with fewer parameters to train and with domain knowledge included, can beat heavier networks such as VGG16. It is also surprising that the proposed method based on a feature space from Facenet beats the network pre-trained on Facenet, since the features extracted from the original network belong to intermediate layers and Facenet is more complex and deeper. These observations can be explained by the capability of the network to use a larger number of features with a low semantic level and then discriminate them with prior knowledge (facial landmark regions).

Table 6.3: CK+ experimental results.

Method                                                                Average Accuracy (%)   Average Loss
CNN from Scratch                                                      88.6                   0.58
Pre-trained Facenet                                                   93.75                  0.28
Pre-trained VGG16                                                     91.67                  0.42
Fully Supervised                                                      88.79                  0.47
Weakly Supervised                                                     89.78                  0.43
Proposed Method with CNN from scratch                                 91.11                  0.51
Proposed Method with pre-trained Facenet                              94.21                  0.20
Proposed Method with pre-trained Facenet with iterative refinement    93.85                  0.23

The iterative refinement strategy tested consisted in iteratively repeating the stage responsible for relevance map estimation over the pre-computed features. The underlying idea is to refine the map over consecutive stages and inspect whether these refined maps lead to a better classification. As Table 6.3 shows, the proposed method with this iterative refinement approach does not introduce performance gains, yielding a similar result. In fact, it was observed that the loss value of each generated map was maintained or decreased across stages, indicating that the maps were already near optimal and that an iterative refinement strategy would not lead to more accurate maps in the presented case.
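As a purely illustrative sketch of what such an iterative scheme can look like in Keras, the snippet below repeats a map-estimation stage n_stages times, concatenating each intermediate map back with the shared features and exposing every map as an output for intermediate supervision. The layer choices, shapes and names are assumptions, not the exact architecture used in this work.

```python
from tensorflow.keras import layers, models, Input

def build_iterative_refinement(feature_shape=(28, 28, 256), n_stages=3, n_classes=8):
    features = Input(shape=feature_shape)   # output of the representation module
    maps = []                                # one supervised map per stage
    x = features
    for s in range(n_stages):
        # Each stage re-estimates the relevance map from the shared features
        # (concatenated with the previous map after the first stage).
        m = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        m = layers.Conv2D(1, 1, activation="sigmoid", name=f"map_stage_{s}")(m)
        maps.append(m)                       # intermediate supervision target
        x = layers.Concatenate(axis=-1)([features, m])
    # The final map filters the features before classification.
    filtered = layers.Multiply()([features, maps[-1]])
    h = layers.GlobalAveragePooling2D()(filtered)
    out = layers.Dense(n_classes, activation="softmax", name="expression")(h)
    return models.Model(features, [out] + maps)
```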

Figure 6.5: Confusion matrix for the CK+ database.

The confusion matrix illustrated in Figure 6.5 shows the performance of the proposed method using a pre-trained network. Anger, contempt and fear are the most difficult expressions to classify. This can be explained by the fact that the frequency of these classes is lower than that of most other classes (see Figure 6.4).
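Per-class behaviour of this kind can be inspected with a row-normalized confusion matrix, as in the short scikit-learn sketch below (y_true, y_pred and the label list are hypothetical placeholders).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_accuracy(y_true, y_pred, labels):
    """Row-normalized confusion matrix: each row sums to 1 over predicted classes."""
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)     # normalize by class frequency
    return dict(zip(labels, np.diag(cm)))   # diagonal = recall per expression
```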

6.4 Results on SFEW

The other dataset used, SFEW, targets unconstrained FER. SFEW was created as part of the Emotion Recognition in the Wild (EmotiW) 2015 Grand Challenge [104] and has a strict evaluation protocol with predefined training, validation, and test sets. In particular, the training set comprises a total of 891 images. Since it was not possible to obtain the test set, the results are reported on the validation data, which contains 431 images. It has 7 classes: the 6 basic emotions plus the neutral expression. The class distribution for the SFEW set used for testing is presented in Figure 6.6.

Figure 6.6: Class distribution for the SFEW database [92].

SFEW is known for being one of the most challenging FER datasets. For instance, the challenge baseline performance is 35.96% [104], and the state-of-the-art performance for SFEW is held by Yu Zhiding et al. [77] with an accuracy of 52.29%. However, the method proposed by Yu Zhiding et al., like most top state-of-the-art methods for SFEW, uses an ensemble of multiple networks to boost performance and uses other databases for training beforehand. The implemented methods were applied directly to SFEW, and the results can be found in Table 6.4. As observed in the CK+ results, the proposed method outperforms both the CNN from scratch and the pre-trained network. The consistency of the results on both datasets shows that the model, besides being simpler with fewer parameters to train, also performs better with the inclusion of features from pre-trained networks and the integration of domain knowledge.


Table 6.4: SFEW experimental results.

Method                                        Average Accuracy (%)   Loss
CNN from scratch                              36.01                  1.88
Pre-trained Facenet                           46.02                  1.57
Proposed method (with pre-trained Facenet)    47.26                  1.79

It should be pointed out that, when using just a pre-trained network, the reported accuracies were only achieved when all the Facenet layers were trained on our dataset. When training only the fully-connected layers, the performance was similar to that of the CNN implemented from scratch.

The proposed method uses a fixed feature extractor from Facenet by transferring the feature space of a specific layer, without training the Facenet layers on our dataset. An improvement to the proposed method could therefore be the fine-tuning of all Facenet layers on SFEW.
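The two transfer regimes discussed above are sketched below in Keras. Since Facenet is not available in keras.applications, MobileNetV2 is used here only as a stand-in backbone; the input size, learning rate and classification head are likewise illustrative assumptions, not the setup used in this work.

```python
import tensorflow as tf

# Stand-in backbone for illustration only (the thesis uses a pre-trained Facenet model).
backbone = tf.keras.applications.MobileNetV2(input_shape=(96, 96, 3),
                                             include_top=False, weights="imagenet")

# Regime used here: keep the backbone frozen as a fixed feature extractor.
backbone.trainable = False

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(7, activation="softmax"),  # the 7 SFEW classes
])

# Possible improvement: fine-tune all backbone layers on SFEW with a small
# learning rate so the pre-trained weights are only gently adjusted.
backbone.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```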

The confusion matrix corresponding to the best performance (proposed method with pre-trained Facenet) is presented in Figure 6.7.

Figure 6.7: Confusion matrix for the SFEW database.

The recognition accuracy for fear is the lowest among all the classes. The fear expression is mostly confused with the anger and neutral expressions. This observation is also documented in other works [77, 89].

Chapter 7

Conclusions

Facial expressions can assist in the interpretation of different states of mind and are part of the fundamental communication system in humans. Their automatic recognition would open new strategies and improvements in different fields that involve human-computer interaction (HCI) or systems where expressions carry a crucial semantic meaning.

Several FER methods have been evaluated, from methods based on traditional feature extraction, such as LBP and Gabor filters, to different approaches based on deep convolutional neural networks. It is clear that deep networks perform better than methods based on hand-crafted features due to their ability to compute a whole new set of data representations. Within deep neural approaches, several methods have been proposed and some state-of-the-art methods achieve remarkable results in object recognition tasks. However, large datasets are scarce and some state-of-the-art methods design heavy networks that demand high computational resources, which is not viable or efficient in some cases. Some studies examine domain knowledge and its role in deep neural networks. In most cases, a correct inclusion of prior knowledge can lead to better feature discrimination and, therefore, better results.

In order to study the role of prior knowledge in deep neural networks, several state-of-the-art methods were implemented and a novel method was presented. The proposed method is a deep neural network architecture with an encoding that corresponds to a pre-trained network and a posterior stage with a loss function that jointly learns the most relevant facial parts, through different sources of learning, along with the expression recognition. The result is a model that is able to learn expression-specific features, demonstrating better performance than the state-of-the-art methods implemented. The proposed method is composed of three main modules: (1) Representation Module, (2) Facial Module, and (3) Classification Module. The facial module aims to regress a relevance map that highlights regions around facial landmarks. The training of this task is governed by an interaction between supervised learning and unsupervised learning that imposes sparsity and contiguity. The output of this module is a relevance map with the regions crucial for FER activated. The relevance map filters the feature space returned by the representation module. This representation module can be an encoding implemented from scratch or an encoding transferred from a pre-trained network, Facenet. Finally, the classification module is trained on these filtered features and returns a vector with the predicted classes.

The experimental results on the two databases used, CK+ (controlled conditions) and SFEW (natural conditions), demonstrate that the proposed method outperforms the state-of-the-art methods implemented, showing the potential of integrating different sources of prior knowledge: domain knowledge coming from facial landmarks and expression morphology, and prior representations transferred from other networks trained on image recognition tasks. The studies and experiments performed using an encoding from scratch as the representation module also reveal that a simpler network architecture with robust regularization and rich prior knowledge can beat some pre-trained networks that have more complex and deeper architectures and, therefore, more parameters to be tuned. Concerning only the facial module, it is clear that a balance between supervision of the regressed maps and mathematical constraints such as sparsity and contiguity can lead to refined relevance maps, since facial landmark regions are encoded along with small and disjoint regions such as wrinkles. Since the relevance maps play a crucial role in discriminating the feature space, approaches that lead to refined maps can also lead to better results. Given this, a recursive strategy was implemented in which the facial module was repeated consecutively to refine the predictions over successive stages, with intermediate supervision at each stage. The refinement over the successive maps was not evident and the performance was similar, which can be explained by the accurate computation of these maps in the first stage, with no relevant gains in the following ones.

As future work, the proposed network and its training strategies could be applied to more datasets and to other domains. The proposed method could also be extended to video, for instance by combining it with Long Short-Term Memory (LSTM) networks or optical flow stream networks.

References

[1] Charles Darwin and Phillip Prodger. The expression of the emotions in man and animals.Oxford University Press, USA, 1998.

[2] Tijn Kooijmans, Takayuki Kanda, Christoph Bartneck, Hiroshi Ishiguro, and NorihiroHagita. Interaction debugging: an integral approach to analyze human-robot interaction.In Proceedings of the 1st ACM SIGCHI/SIGART conference on Human-robot interaction,pages 64–71. ACM, 2006.

[3] Ashish Kapoor, Winslow Burleson, and Rosalind W Picard. Automatic prediction of frus-tration. International journal of human-computer studies, 65(8):724–736, 2007.

[4] Chek Tien Tan, Daniel Rosser, Sander Bakkes, and Yusuf Pisan. A feasibility study inusing facial expressions analysis to evaluate player experiences. In Proceedings of The 8thAustralasian Conference on Interactive Entertainment: Playing the System, page 5. ACM,2012.

[5] Sander Bakkes, Chek Tien Tan, and Yusuf Pisan. Personalised gaming: a motivation andoverview of literature. In Proceedings of the 8th Australasian Conference on InteractiveEntertainment: Playing the System, page 4. ACM, 2012.

[6] Jeffrey M Girard, Jeffrey F Cohn, Mohammad H Mahoor, S Mohammad Mavadati, ZakiaHammal, and Dean P Rosenwald. Nonverbal social withdrawal in depression: Evidencefrom manual and automatic analyses. Image and vision computing, 32(10):641–647, 2014.

[7] Stefan Scherer, Giota Stratou, Marwa Mahmoud, Jill Boberg, Jonathan Gratch, AlbertRizzo, and Louis-Philippe Morency. Automatic behavior descriptors for psychological dis-order analysis. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE Interna-tional Conference and Workshops on, pages 1–8. IEEE, 2013.

[8] Sarah Griffiths, Christopher Jarrold, Ian S Penton-Voak, Andy T Woods, Andy L Skinner,and Marcus R Munafò. Impaired recognition of basic emotions from facial expressionsin young people with autism spectrum disorder: Assessing the importance of expressionintensity. Journal of autism and developmental disorders, pages 1–11, 2017.

[9] Eeva A Elliott and Arthur M Jacobs. Facial expressions, emotions, and sign languages.Frontiers in psychology, 4, 2013.

[10] Albert Mehrabian. Communication without words. Communication theory, pages 193–200,2008.

[11] Guillaume-Benjamin Duchenne. The mechanism of human facial expression. Cambridgeuniversity press, 1990.


[12] Ciprian Adrian Corneanu, Marc Oliu Simon, Jeffrey F Cohn, and Sergio Escalera Guerrero.Survey on rgb, 3d, thermal, and multimodal approaches for facial expression recognition:History, trends, and affect-related applications. IEEE transactions on pattern analysis andmachine intelligence, 38(8):1548–1568, 2016.

[13] Paul Ekman. An argument for basic emotions. Cognition & emotion, 6(3-4):169–200, 1992.

[14] Karen L Schmidt and Jeffrey F Cohn. Human facial expressions as adaptations: Evolu-tionary questions in facial expression research. American journal of physical anthropology,116(S33):3–24, 2001.

[15] David Matsumoto, Dacher Keltner, Michelle N Shiota, MAUREEN O’Sullivan, and MarkFrank. Facial expressions of emotion. Handbook of emotions, 3:211–234, 2008.

[16] Robert W Levenson, Paul Ekman, and Wallace V Friesen. Voluntary facial action generatesemotion-specific autonomic nervous system activity. Psychophysiology, 27(4):363–384,1990.

[17] Nico H Frijda, Anna Tcherkassof, et al. Facial expressions as modes of action readiness.The psychology of facial expression, pages 78–102, 1997.

[18] Paul Ekman and Wallace V Friesen. Facial action coding system. 1977.

[19] Evangelos Sariyanidi, Hatice Gunes, and Andrea Cavallaro. Automatic analysis of facialaffect: A survey of registration, representation, and recognition. IEEE transactions onpattern analysis and machine intelligence, 37(6):1113–1133, 2015.

[20] Frank Y Shih, Chao-Fa Chuang, and Patrick SP Wang. Performance comparisons of facialexpression recognition in jaffe database. International Journal of Pattern Recognition andArtificial Intelligence, 22(03):445–459, 2008.

[21] Tanja Bänziger, Marcello Mortillaro, and Klaus R Scherer. Introducing the geneva mul-timodal expression corpus for experimental research on emotion perception. Emotion,12(5):1161, 2012.

[22] Hatice Gunes and Björn Schuller. Categorical and dimensional affect analysis in continuousinput: Current trends and future directions. Image and Vision Computing, 31(2):120–136,2013.

[23] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simplefeatures. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings ofthe 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.

[24] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. InComputer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Con-ference on, volume 1, pages 886–893. IEEE, 2005.

[25] Timo Ojala, Matti Pietikainen, and David Harwood. Performance evaluation of texturemeasures with classification based on kullback discrimination of distributions. In PatternRecognition, 1994. Vol. 1-Conference A: Computer Vision & Image Processing., Proceed-ings of the 12th IAPR International Conference on, volume 1, pages 582–585. IEEE, 1994.


[26] Abdenour Hadid. The local binary pattern approach and its applications to face analysis. InImage Processing Theory, Tools and Applications, 2008. IPTA 2008. First Workshops on,pages 1–9. IEEE, 2008.

[27] Joni-Kristian Kamarainen. Gabor features in image analysis. In Image Processing Theory,Tools and Applications (IPTA), 2012 3rd International Conference on, pages 13–14. IEEE,2012.

[28] Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines andother kernel-based learning methods. Cambridge university press, 2000.

[29] Caifeng Shan, Shaogang Gong, and Peter W McOwan. Facial expression recognition basedon local binary patterns: A comprehensive study. Image and Vision Computing, 27(6):803–816, 2009.

[30] Philipp Michel and Rana El Kaliouby. Facial expression recognition using support vectormachines. In The 10th International Conference on Human-Computer Interaction, Crete,Greece, 2005.

[31] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[32] Michael Goh. Facial expression recognition using a hybrid cnn–sift aggregator. In Multi-disciplinary Trends in Artificial Intelligence: 11th International Workshop, MIWAI 2017,Gadong, Brunei, November 20-22, 2017, Proceedings, volume 10607, page 139. Springer,2017.

[33] Arushi Raghuvanshi and Vivek Choksi. Facial expression recognition with convolutionalneural networks. CS231n Course Projects, 2016.

[34] Shima Alizadeh and Azar Fazel. Convolutional neural networks for facial expression recog-nition. arXiv preprint arXiv:1704.06756, 2017.

[35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification withdeep convolutional neural networks. In Advances in neural information processing sys-tems, pages 1097–1105, 2012.

[36] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network train-ing by reducing internal covariate shift. In International Conference on Machine Learning,pages 448–456, 2015.

[37] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and RuslanSalakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journalof machine learning research, 15(1):1929–1958, 2014.

[38] Jason Wang and Luis Perez. The effectiveness of data augmentation in image classificationusing deep learning. Technical report, Technical report, 2017.

[39] Andrew Y Ng. Feature selection, l 1 vs. l 2 regularization, and rotational invariance. InProceedings of the twenty-first international conference on Machine learning, page 78.ACM, 2004.

[40] Evaluating machine learning models - o’reilly media. https://www.oreilly.com/ideas/evaluating-machine-learning-models. (Accessed on 06/11/2018).


[41] Ming-Hsuan Yang, David J Kriegman, and Narendra Ahuja. Detecting faces in images:A survey. IEEE Transactions on pattern analysis and machine intelligence, 24(1):34–58,2002.

[42] Michael Jones and Paul Viola. Fast multi-view face detection. Mitsubishi Electric ResearchLab TR-20003-96, 3:14, 2003.

[43] Bernhard Froba and Andreas Ernst. Face detection with the modified census transform.In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE InternationalConference on, pages 91–96. IEEE, 2004.

[44] Bo Wu, Haizhou Ai, Chang Huang, and Shihong Lao. Fast rotation invariant multi-viewface detection based on real adaboost. In Automatic Face and Gesture Recognition, 2004.Proceedings. Sixth IEEE International Conference on, pages 79–84. IEEE, 2004.

[45] Rong Xiao, Huaiyi Zhu, He Sun, and Xiaoou Tang. Dynamic cascades for face detection.In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8.IEEE, 2007.

[46] Hongliang Jin, Qingshan Liu, Hanqing Lu, and Xiaofeng Tong. Face detection using im-proved lbp under bayesian framework. In Image and Graphics (ICIG’04), Third Interna-tional Conference on, pages 306–309. IEEE, 2004.

[47] Lun Zhang, Rufeng Chu, Shiming Xiang, Shengcai Liao, and Stan Z Li. Face detectionbased on multi-block lbp representation. In International Conference on Biometrics, pages11–18. Springer, 2007.

[48] Margarita Osadchy, Yann Le Cun, and Matthew L Miller. Synergistic face detectionand pose estimation with energy-based models. Journal of Machine Learning Research,8(May):1197–1215, 2007.

[49] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutionalneural network cascade for face detection. In Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition, pages 5325–5334, 2015.

[50] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and align-ment using multitask cascaded convolutional networks. IEEE Signal Processing Letters,23(10):1499–1503, 2016.

[51] Tinne Tuytelaars, Krystian Mikolajczyk, et al. Local invariant feature detectors: a survey.Foundations and trends R© in computer graphics and vision, 3(3):177–280, 2008.

[52] Jimei Yang, Shengcai Liao, and Stan Z Li. Automatic partial face alignment in nir videosequences. In International Conference on Biometrics, pages 249–258. Springer, 2009.

[53] David G Lowe. Object recognition from local scale-invariant features. In Computer vision,1999. The proceedings of the seventh IEEE international conference on, volume 2, pages1150–1157. Ieee, 1999.

[54] Maja Pantic and Ioannis Patras. Dynamics of facial expression: recognition of facial actionsand their temporal segments from face profile image sequences. IEEE Transactions onSystems, Man, and Cybernetics, Part B (Cybernetics), 36(2):433–449, 2006.


[55] Robert Walecki, Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Variable-state latentconditional random fields for facial expression recognition and action unit detection. InAutomatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conferenceand Workshops on, volume 1, pages 1–8. IEEE, 2015.

[56] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance mod-els. IEEE Transactions on pattern analysis and machine intelligence, 23(6):681–685, 2001.

[57] Simon Lucey, Ahmed Bilal Ashraf, and Jeffrey F Cohn. Investigating spontaneous facialaction recognition through aam representations of the face. In Face recognition. InTech,2007.

[58] Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Multi-output laplacian dynamic or-dinal regression for facial expression recognition and intensity estimation. In Computer Vi-sion and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2634–2641. IEEE,2012.

[59] Stefano Berretti, Boulbaba Ben Amor, Mohamed Daoudi, and Alberto Del Bimbo. 3d facialexpression recognition using sift descriptors of automatically detected keypoints. The VisualComputer, 27(11):1021, 2011.

[60] Ira Cohen, Nicu Sebe, Ashutosh Garg, Lawrence S Chen, and Thomas S Huang. Facialexpression recognition from video sequences: temporal and static modeling. ComputerVision and image understanding, 91(1):160–187, 2003.

[61] Ira Cohen, Nicu Sebe, FG Gozman, Marcelo Cesar Cirelo, and Thomas S Huang. Learningbayesian network classifiers for facial expression recognition both labeled and unlabeleddata. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Com-puter Society Conference on, volume 1, pages I–I. IEEE, 2003.

[62] Petar S Aleksic and Aggelos K Katsaggelos. Automatic facial expression recognition usingfacial animation parameters and multistream hmms. IEEE Transactions on InformationForensics and Security, 1(1):3–11, 2006.

[63] Montse Pardàs and Antonio Bonafonte. Facial animation parameters extraction and expres-sion recognition using hidden markov models. Signal Processing: Image Communication,17(9):675–688, 2002.

[64] Zhengyou Zhang, Michael Lyons, Michael Schuster, and Shigeru Akamatsu. Compari-son between geometry-based and gabor-wavelets-based facial expression recognition usingmulti-layer perceptron. In Automatic Face and Gesture Recognition, 1998. Proceedings.Third IEEE International Conference on, pages 454–459. IEEE, 1998.

[65] Marian Stewart Bartlett, Gwen Littlewort, Mark Frank, Claudia Lainscsek, Ian Fasel, andJavier Movellan. Recognizing facial expression: machine learning and application to spon-taneous behavior. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEEComputer Society Conference on, volume 2, pages 568–573. IEEE, 2005.

[66] Michael J Lyons, Julien Budynek, and Shigeru Akamatsu. Automatic classification ofsingle facial images. IEEE transactions on pattern analysis and machine intelligence,21(12):1357–1362, 1999.


[67] Wenfei Gu, Cheng Xiang, YV Venkatesh, Dong Huang, and Hai Lin. Facial expressionrecognition using radial encoding of local gabor features and classifier synthesis. PatternRecognition, 45(1):80–91, 2012.

[68] Karan Sikka, Tingfan Wu, Josh Susskind, and Marian Bartlett. Exploring bag of wordsarchitectures in the facial expression domain. In Computer Vision–ECCV 2012. Workshopsand Demonstrations, pages 250–259. Springer, 2012.

[69] Guoying Zhao and Matti Pietikainen. Dynamic texture recognition using local binary pat-terns with an application to facial expressions. IEEE transactions on pattern analysis andmachine intelligence, 29(6):915–928, 2007.

[70] Bo Sun, Liandong Li, Tian Zuo, Ying Chen, Guoyan Zhou, and Xuewen Wu. Combiningmultimodal features with hierarchical classifier fusion for emotion recognition in the wild.In Proceedings of the 16th International Conference on Multimodal Interaction, pages 481–486. ACM, 2014.

[71] Lang He, Dongmei Jiang, Le Yang, Ercheng Pei, Peng Wu, and Hichem Sahli. Multimodalaffective dimension prediction using deep bidirectional long short-term memory recurrentneural networks. In Proceedings of the 5th International Workshop on Audio/Visual Emo-tion Challenge, pages 73–80. ACM, 2015.

[72] A Geetha, Vennila Ramalingam, S Palanivel, and B Palaniappan. Facial expressionrecognition–a real time approach. Expert Systems with Applications, 36(1):303–308, 2009.

[73] Benjamín Hernández, Gustavo Olague, Riad Hammoud, Leonardo Trujillo, and EvaRomero. Visual learning of texture descriptors for facial expression recognition in ther-mal imagery. Computer Vision and Image Understanding, 106(2):258–269, 2007.

[74] Sander Koelstra, Maja Pantic, and Ioannis Patras. A dynamic texture-based approach torecognition of facial actions and their temporal models. IEEE transactions on pattern anal-ysis and machine intelligence, 32(11):1940–1954, 2010.

[75] Maja Pantic and Marian Stewart Bartlett. Machine analysis of facial expressions. In Facerecognition. InTech, 2007.

[76] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation fromscratch. arXiv preprint arXiv:1411.7923, 2014.

[77] Zhiding Yu and Cha Zhang. Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 435–442. ACM, 2015.

[78] Mengyi Liu, Shaoxin Li, Shiguang Shan, and Xilin Chen. Au-aware deep networks forfacial expression recognition. In Automatic Face and Gesture Recognition (FG), 2013 10thIEEE International Conference and Workshops on, pages 1–6. IEEE, 2013.

[79] Mao Xu, Wei Cheng, Qian Zhao, Li Ma, and Fang Xu. Facial expression recognition basedon transfer learning from deep convolutional networks. In Natural Computation (ICNC),2015 11th International Conference on, pages 702–708. IEEE, 2015.

[80] Tian Xia, Yifeng Zhang, and Yuan Liu. Expression recognition in the wild with transferlearning.


[81] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, SherjilOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances inneural information processing systems, pages 2672–2680, 2014.

[82] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Adversarial gen-erative nets: Neural network attacks on state-of-the-art face recognition. arXiv preprintarXiv:1801.00349, 2017.

[83] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learn-ing. In Proceedings of the 26th annual international conference on machine learning, pages41–48. ACM, 2009.

[84] Liangke Gui, Tadas Baltrušaitis, and Louis-Philippe Morency. Curriculum learning forfacial expression recognition. In Automatic Face & Gesture Recognition (FG 2017), 201712th IEEE International Conference on, pages 505–511. IEEE, 2017.

[85] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization ofneural networks using dropconnect. In International Conference on Machine Learning,pages 1058–1066, 2013.

[86] Haibing Wu and Xiaodong Gu. Towards dropout training for convolutional neural networks.Neural Networks, 71:1–10, 2015.

[87] Matthew D Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolu-tional neural networks. arXiv preprint arXiv:1301.3557, 2013.

[88] Hui Ding, Shaohua Kevin Zhou, and Rama Chellappa. Facenet2expnet: Regularizing a deepface recognition net for expression recognition. In Automatic Face & Gesture Recognition(FG 2017), 2017 12th IEEE International Conference on, pages 118–126. IEEE, 2017.

[89] Pedro Ferreira, Jaime Cardoso, and Ana Rebelo. Physiological inspired deep neural networks for emotion recognition. https://drive.google.com/open?id=11HI3sEF4V0U06F-30-LLwWFO99IuCCH4, 2018.

[90] Irene Kotsia and Ioannis Pitas. Facial expression recognition in image sequences usinggeometric deformation features and support vector machines. IEEE transactions on imageprocessing, 16(1):172–187, 2007.

[91] Arnaud Dapogny, Kevin Bailly, and Séverine Dubuisson. Dynamic facial expression recog-nition by joint static and multi-time gap transition classification. In Automatic Face andGesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on,volume 1, pages 1–6. IEEE, 2015.

[92] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 2106–2112. IEEE, 2011.

[93] Takeo Kanade, Jeffrey F Cohn, and Yingli Tian. Comprehensive database for facial expression analysis. In Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on, pages 46–53. IEEE, 2000.


[94] Shengcai Liao, Xiangxin Zhu, Zhen Lei, Lun Zhang, and Stan Z Li. Learning multi-scaleblock local binary patterns for face recognition. In International Conference on Biometrics,pages 828–837. Springer, 2007.

[95] Byoung Chul Ko. A brief review of facial emotion recognition based on visual information.sensors, 18(2):401, 2018.

[96] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scaleimage recognition. arXiv preprint arXiv:1409.1556, 2014.

[97] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embeddingfor face recognition and clustering. In Proceedings of the IEEE conference on computervision and pattern recognition, pages 815–823, 2015.

[98] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3dface alignment problem? (and a dataset of 230,000 3d facial landmarks). In InternationalConference on Computer Vision, 2017.

[99] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d poseestimation using part affinity fields. In CVPR, volume 1, page 7, 2017.

[100] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional posemachines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-nition, pages 4724–4732, 2016.

[101] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Acted facial expressionsin the wild database. Australian National University, Canberra, Australia, Technical ReportTR-CS-11, 2, 2011.

[102] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and IainMatthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unitand emotion-specified expression. In Computer Vision and Pattern Recognition Workshops(CVPRW), 2010 IEEE Computer Society Conference on, pages 94–101. IEEE, 2010.

[103] Mengyi Liu, Shiguang Shan, Ruiping Wang, and Xilin Chen. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1749–1756. IEEE, 2014.

[104] Emotion recognition in the wild challenge 2015. https://cs.anu.edu.au/few/emotiw2015.html. (Accessed on 06/10/2018).