TRANSCRIPT
Image based Static Facial Expression Recognition with Multiple Deep Network Learning
Zhiding Yu, Carnegie Mellon University
Cha Zhang, Microsoft Research
Nov 9th, 2015
Motivation
• Helps computers better understand humans
• Helps computers interact with humans more naturally
• Wide array of practical applications
• Current machine emotional intelligence is limited (considerable room to improve)
Example applications:
• Affect-aware personal assistant/companion devices
• Autism intervention
• Honest signals
• Affect-aware game development
Datasets
FER 2013 Dataset
• Web-crawled images + human labeling
• 48x48 image resolution
• 28709 training samples
• 3589 validation samples
• 3590 testing samples
• Noisy data: Inconsistent face cropping + non-face images + observed labeling errors
EmotiW-SFEW Challenge 2015
• Frames from movies (requires face detection)
• Wild (spontaneous) setting
• Limited training data (958 Train + 436 Val + 372 Test)
• Unbalanced class sizes
Face Detection
The Face Detection Cascade
[Figure: the face detection cascade. Input images are passed to JDA [1] first; images without detected faces go to DCNN [2]; images still without detected faces go to MoT [3]; images that fall through all three stages are treated as not containing a face.]
[1] D. Chen, S. Ren, Y. Wei, X. Cao and J. Sun. Joint cascade face detection and alignment. ECCV 2014.
[2] C. Zhang and Z. Zhang. Improving multiview face detection with multi-task deep convolutional neural networks. WACV 2014.
[3] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. CVPR 2012.
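In code, the cascade is a simple fall-through over detectors ordered by precision. A minimal sketch; detect_jda, detect_dcnn and detect_mot are hypothetical wrappers around the three cited detectors, not something the slides provide:

```python
def detect_face_cascade(image, detectors):
    """Run face detectors in order of precision; stop at the first that
    fires. `detectors` is a list of callables, each returning a (possibly
    empty) list of face bounding boxes."""
    for detect in detectors:
        boxes = detect(image)
        if boxes:              # this stage found at least one face
            return boxes
    return []                  # fell through every stage: treat as non-face

# Usage (detect_jda, detect_dcnn, detect_mot are hypothetical wrappers
# around the detectors of [1], [2] and [3]):
#   faces = detect_face_cascade(img, [detect_jda, detect_dcnn, detect_mot])
```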
Examples of JDA and DCNN Detections
Red: JDA Blue: DCNN
Detection Results on SFEW Test (372 Faces)
Detector             JDA    DCNN   MoT    JDA+DCNN   JDA+DCNN+MoT
Correct detections   333    358    352    363        371
[Figure: faces missed by JDA but found by DCNN; faces missed by JDA+DCNN but found by MoT; a false positive.]
Recognition System
The Basic CNN Architecture
[Figure: the basic CNN architecture. 48×48 input; a 5×5, 64-filter convolution and 3×3, 128-filter convolutions, each stage followed by 3×3 stochastic pooling, producing feature maps of 64@24×24, 128@12×12 and 128@6×6; then dense layers of 1024, 1024 and 7 units.]
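Reading the architecture off the figure, a minimal PyTorch sketch might look as follows. Max pooling stands in for the stochastic pooling of the slides, and the activations and padding are assumptions, not stated there:

```python
import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    """Sketch of the slide's basic CNN as best it can be read off the
    figure: 48x48 grayscale input, three conv + pooling stages giving
    64@24x24, 128@12x12 and 128@6x6 feature maps, then 1024-1024-7
    dense layers."""

    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),      # 64 @ 24x24
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),      # 128 @ 12x12
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),      # 128 @ 6x6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_classes),              # 7 expression classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```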
Improvement I: Image Perturbation
With image perturbation, we can:
• Augment the training data by randomly perturbing the training images (Data Aug)
• Make prediction more robust by voting over perturbed copies of the test images (Voting)
Perturbation with Parameterized Warping
Translation + Rotation + Skewing + Scaling
[Figure: example faces before and after warping.]
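The perturbation is a single parameterized affine warp. A minimal sketch with NumPy and OpenCV; the parameter ranges here are illustrative assumptions, not the authors' values:

```python
import numpy as np
import cv2

def random_warp(img, max_shift=3, max_rot=15, max_skew=0.1,
                scale_jitter=0.1, rng=np.random):
    """Apply a random warp combining translation, rotation, skewing and
    scaling, as named in the slides."""
    h, w = img.shape[:2]
    angle = np.deg2rad(rng.uniform(-max_rot, max_rot))
    scale = 1.0 + rng.uniform(-scale_jitter, scale_jitter)
    skew = rng.uniform(-max_skew, max_skew)
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)

    c, s = np.cos(angle) * scale, np.sin(angle) * scale
    # 2x3 affine matrix: rotation+scale with a shear term, composed about
    # the image centre so the face stays roughly in frame, plus a shift.
    cx, cy = w / 2.0, h / 2.0
    A = np.array([[c, -s + skew, 0.0],
                  [s,  c,        0.0]])
    A[:, 2] = [cx - A[0, 0] * cx - A[0, 1] * cy + tx,
               cy - A[1, 0] * cx - A[1, 1] * cy + ty]
    return cv2.warpAffine(img, A, (w, h), borderMode=cv2.BORDER_REPLICATE)
```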
CNN Architecture with Data Aug. & Voting
[Figure: the same CNN applied to multiple perturbed copies of the input image; the 7-way responses are combined with averaged weights to produce the final prediction.]
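Voting at test time then amounts to averaging the network's responses over several perturbed copies. A minimal sketch, reusing random_warp() from above and assuming plain unweighted averaging of softmax outputs (the figure also shows a weighted variant):

```python
import torch
import torch.nn.functional as F

def vote_predict(model, img, n_votes=10):
    """Classify several randomly perturbed copies of a (preprocessed,
    48x48 float) test face and average the softmax responses."""
    model.eval()
    copies = [random_warp(img) for _ in range(n_votes)]
    batch = torch.stack([torch.as_tensor(c, dtype=torch.float32).unsqueeze(0)
                         for c in copies])          # N x 1 x 48 x 48
    with torch.no_grad():
        probs = F.softmax(model(batch), dim=1)      # N x 7
    return probs.mean(dim=0).argmax().item()        # class with most mass
```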
Improvement II: Multiple Network Learning (MNL)
The MNL Algorithm Diagram:
[Figure: K independently trained CNNs (CNN #1 ... CNN #K) each produce a training response; the responses are summed with learned weights w1, w2, ..., wK to form a combined training response, which is compared against the desired training response through a cost function.]
Proposed MNL Cost Functions
Hinge Loss (HL):
Log Likelihood Loss (LL):
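The loss expressions themselves did not survive the transcript. A plausible reconstruction, consistent with the weighted-sum diagram above but not confirmed by the slides: write the combined response as

\tilde{r}(x) = \sum_{k=1}^{K} w_k \, r_k(x), \qquad w_k \ge 0,

where r_k is the k-th network's response; then

\mathrm{HL:}\quad \min_{w} \sum_{n} \sum_{c \ne y_n} \max\!\left(0,\; 1 + \tilde{r}_c(x_n) - \tilde{r}_{y_n}(x_n)\right)

\mathrm{LL:}\quad \min_{w} \; -\sum_{n} \log \tilde{r}_{y_n}(x_n)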
Experimental Results
FER 2013
• Preprocessing: Hist Eq + Plane Fitting + Unit Norm (see the sketch after this list)
• Training: CNN(s) trained on the FER training set
• Validation: Select the optimal training epoch by maximizing accuracy on the FER validation set
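A minimal sketch of this preprocessing chain, assuming "plane fitting" means subtracting a least-squares intensity plane (a common illumination correction; the slides do not define the term):

```python
import numpy as np
import cv2

def preprocess(face):
    """Histogram equalization, removal of a best-fit intensity plane,
    and scaling to unit L2 norm, applied to an 8-bit grayscale face."""
    img = cv2.equalizeHist(face.astype(np.uint8)).astype(np.float64)
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Least-squares fit of a plane a*x + b*y + c to the pixel intensities,
    # then subtract it to flatten illumination gradients.
    A = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, img.ravel(), rcond=None)
    img -= (A @ coeffs).reshape(h, w)
    return img / (np.linalg.norm(img) + 1e-8)    # unit norm
```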
FER 2013
• Human label consistency against the majority-voted GT labels: 65-68%
• Basic CNN: 65.07%
• CNN + Data Aug: 68.6%
• CNN + Data Aug + Voting: 70.33%
• FER 2013 Winner: 71.162%
• MNL (Log Like Loss): 72.05%
• MNL (Hinge Loss): 72.08%
FER 2013
[Figure: example FER 2013 "fear" faces. One group is correctly predicted as fear; the others are wrongly predicted, with labels such as surprise, sad, angry, happy, neutral and disgust.]
EmotiW-SFEW 2015
• Preprocessing: Hist Eq + Plane Fitting + Unit Norm (as for FER 2013)
• Training:
  • Pre-train on the FER combined set (Train + Val + Test)
  • Fix the network parameters at the bottom layers (only the last two dense layers are updated)
  • Fine-tune on the SFEW training set (domain adaptation; see the sketch after this list)
• Validation: Select the optimal fine-tuning epoch by maximizing accuracy on the SFEW validation set
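A minimal sketch of the freeze-and-fine-tune step, in terms of the ExpressionCNN sketch given earlier; the optimizer choice and learning rate are assumptions:

```python
import torch

def prepare_for_finetune(model, lr=1e-4):
    """Freeze everything except the last two dense layers of a
    FER-pretrained ExpressionCNN, then return an optimizer over the
    remaining trainable parameters for fine-tuning on SFEW."""
    for p in model.parameters():
        p.requires_grad = False                   # freeze bottom layers
    # classifier[-3] and classifier[-1] are the last two Linear layers
    # (1024 -> 1024 and 1024 -> 7) in the sketch above.
    for layer in (model.classifier[-3], model.classifier[-1]):
        for p in layer.parameters():
            p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=0.9)
```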
EmotiW-SFEW 2015
[Figure: validation and test accuracy for the Single network and the Average1, Average2, SVM, LogLike and HingeLoss ensembles, ranging roughly from 0.46 to 0.62; challenge baselines are 35.96% (validation) and 39.13% (test). A companion plot shows the learned network ensemble weights.]
EmotiW-SFEW 2015
[Figure: comparison of the Log Likelihood and Hinge Loss variants.]
Conclusions
• CNNs are arguably the most powerful tool to date for emotion recognition tasks
• Fine-tuning plays the role of domain adaptation
• Image perturbation and voting-based prediction are key to improving performance
• A weighted committee of multiple networks further improves classification performance
Thank You!