TRANSCRIPT
Image based Static Facial Expression Recognition with Multiple Deep Network Learning
Zhiding Yu, Carnegie Mellon University
Cha Zhang, Microsoft Research
Nov 9th, 2015
Motivation
• Helps computers better understand humans
• Helps computers interact with humans more naturally
• Wide array of practical applications
• Current machine emotional intelligence is limited (considerable room to improve)
Example applications:
• Affect-aware personal assistant/companion devices
• Autism intervention
• Honest signals
• Affect-aware game development
Datasets
FER 2013 Dataset
• Web-crawled images + human labeling
• 48x48 image resolution
• 28709 training samples
• 3589 validation samples
• 3590 testing samples
• Noisy data: Inconsistent face cropping + non-face images + observed labeling errors
EmotiW-SFEW Challenge 2015
• Frames from movies (requires face detection)
• Wild (spontaneous) setting
• Limited training data (958 Train + 436 Val + 372 Test)
• Unbalanced class sizes
Face Detection
The Face Detection Cascade
[Figure: the face detection cascade. Input images are passed to JDA [1] first; images without detected faces go to DCNN [2]; images still without detected faces go to MoT [3]; images that fall through all three stages are treated as not containing a face.]
[1] D. Chen, S. Ren, Y. Wei, X. Cao and J. Sun. Joint cascade face detection and alignment. ECCV 2014.
[2] C. Zhang and Z. Zhang. Improving multiview face detection with multi-task deep convolutional neural networks. WACV 2014.
[3] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. CVPR 2012.
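In code, the cascade is a simple fall-through over detectors ordered by precision. A minimal sketch; detect_jda, detect_dcnn and detect_mot are hypothetical wrappers around the three cited detectors, not something the slides provide:

```python
def detect_face_cascade(image, detectors):
    """Run face detectors in order of precision; stop at the first that
    fires. `detectors` is a list of callables, each returning a (possibly
    empty) list of face bounding boxes."""
    for detect in detectors:
        boxes = detect(image)
        if boxes:              # this stage found at least one face
            return boxes
    return []                  # fell through every stage: treat as non-face

# Usage (detect_jda, detect_dcnn, detect_mot are hypothetical wrappers
# around the detectors of [1], [2] and [3]):
#   faces = detect_face_cascade(img, [detect_jda, detect_dcnn, detect_mot])
```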
Examples of JDA and DCNN Detections
Red: JDA Blue: DCNN
Detection Results on SFEW Test (372 Faces)
Detector             JDA    DCNN   MoT    JDA+DCNN   JDA+DCNN+MoT
Correct detections   333    358    352    363        371
[Figure: faces missed by JDA but found by DCNN; faces missed by JDA+DCNN but found by MoT; a false positive.]
Recognition System
The Basic CNN Architecture
[Figure: the basic CNN architecture. 48×48 input; a 5×5, 64-filter convolution and 3×3, 128-filter convolutions, each stage followed by 3×3 stochastic pooling, producing feature maps of 64@24×24, 128@12×12 and 128@6×6; then dense layers of 1024, 1024 and 7 units.]
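Reading the architecture off the figure, a minimal PyTorch sketch might look as follows. Max pooling stands in for the stochastic pooling of the slides, and the activations and padding are assumptions, not stated there:

```python
import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    """Sketch of the slide's basic CNN as best it can be read off the
    figure: 48x48 grayscale input, three conv + pooling stages giving
    64@24x24, 128@12x12 and 128@6x6 feature maps, then 1024-1024-7
    dense layers."""

    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),      # 64 @ 24x24
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),      # 128 @ 12x12
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),      # 128 @ 6x6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_classes),              # 7 expression classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```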
Improvement I: Image Perturbation
With image perturbation, we can:
• Augment the training data by randomly perturbing the training images (Data Aug)
• Make prediction more robust by voting over perturbed copies of the test images (Voting)
Perturbation with Parameterized Warping
Translation + Rotation + Skewing + Scaling
[Figure: example faces before and after warping.]
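The perturbation is a single parameterized affine warp. A minimal sketch with NumPy and OpenCV; the parameter ranges here are illustrative assumptions, not the authors' values:

```python
import numpy as np
import cv2

def random_warp(img, max_shift=3, max_rot=15, max_skew=0.1,
                scale_jitter=0.1, rng=np.random):
    """Apply a random warp combining translation, rotation, skewing and
    scaling, as named in the slides."""
    h, w = img.shape[:2]
    angle = np.deg2rad(rng.uniform(-max_rot, max_rot))
    scale = 1.0 + rng.uniform(-scale_jitter, scale_jitter)
    skew = rng.uniform(-max_skew, max_skew)
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)

    c, s = np.cos(angle) * scale, np.sin(angle) * scale
    # 2x3 affine matrix: rotation+scale with a shear term, composed about
    # the image centre so the face stays roughly in frame, plus a shift.
    cx, cy = w / 2.0, h / 2.0
    A = np.array([[c, -s + skew, 0.0],
                  [s,  c,        0.0]])
    A[:, 2] = [cx - A[0, 0] * cx - A[0, 1] * cy + tx,
               cy - A[1, 0] * cx - A[1, 1] * cy + ty]
    return cv2.warpAffine(img, A, (w, h), borderMode=cv2.BORDER_REPLICATE)
```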
CNN Architecture with Data Aug. & Voting
[Figure: the same CNN applied to multiple perturbed copies of the input image; the 7-way responses are combined with averaged weights to produce the final prediction.]
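Voting at test time then amounts to averaging the network's responses over several perturbed copies. A minimal sketch, reusing random_warp() from above and assuming plain unweighted averaging of softmax outputs (the figure also shows a weighted variant):

```python
import torch
import torch.nn.functional as F

def vote_predict(model, img, n_votes=10):
    """Classify several randomly perturbed copies of a (preprocessed,
    48x48 float) test face and average the softmax responses."""
    model.eval()
    copies = [random_warp(img) for _ in range(n_votes)]
    batch = torch.stack([torch.as_tensor(c, dtype=torch.float32).unsqueeze(0)
                         for c in copies])          # N x 1 x 48 x 48
    with torch.no_grad():
        probs = F.softmax(model(batch), dim=1)      # N x 7
    return probs.mean(dim=0).argmax().item()        # class with most mass
```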
Improvement II: Multiple Network Learning (MNL)
The MNL Algorithm Diagram:
[Figure: K independently trained CNNs (CNN #1 ... CNN #K) each produce a training response; the responses are summed with learned weights w1, w2, ..., wK to form a combined training response, which is compared against the desired training response through a cost function.]
Proposed MNL Cost Functions
Hinge Loss (HL):
Log Likelihood Loss (LL):
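The loss expressions themselves did not survive the transcript. A plausible reconstruction, consistent with the weighted-sum diagram above but not confirmed by the slides: write the combined response as

\tilde{r}(x) = \sum_{k=1}^{K} w_k \, r_k(x), \qquad w_k \ge 0,

where r_k is the k-th network's response; then

\mathrm{HL:}\quad \min_{w} \sum_{n} \sum_{c \ne y_n} \max\!\left(0,\; 1 + \tilde{r}_c(x_n) - \tilde{r}_{y_n}(x_n)\right)

\mathrm{LL:}\quad \min_{w} \; -\sum_{n} \log \tilde{r}_{y_n}(x_n)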
Experimental Results
FER 2013
• Preprocessing: Hist Eq + Plane Fitting + Unit Norm (see the sketch after this list)
• Training: CNN(s) trained on the FER training set
• Validation: Select the optimal training epoch by maximizing accuracy on the FER validation set
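A minimal sketch of this preprocessing chain, assuming "plane fitting" means subtracting a least-squares intensity plane (a common illumination correction; the slides do not define the term):

```python
import numpy as np
import cv2

def preprocess(face):
    """Histogram equalization, removal of a best-fit intensity plane,
    and scaling to unit L2 norm, applied to an 8-bit grayscale face."""
    img = cv2.equalizeHist(face.astype(np.uint8)).astype(np.float64)
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Least-squares fit of a plane a*x + b*y + c to the pixel intensities,
    # then subtract it to flatten illumination gradients.
    A = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, img.ravel(), rcond=None)
    img -= (A @ coeffs).reshape(h, w)
    return img / (np.linalg.norm(img) + 1e-8)    # unit norm
```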
FER 2013
• Human label consistency against the majority-voted GT labels: 65-68%
• Basic CNN: 65.07%
• CNN + Data Aug: 68.6%
• CNN + Data Aug + Voting: 70.33%
• FER 2013 Winner: 71.162%
• MNL (Log Like Loss): 72.05%
• MNL (Hinge Loss): 72.08%
FER 2013
[Figure: example FER 2013 "fear" faces. One group is correctly predicted as fear; the others are wrongly predicted, with labels such as surprise, sad, angry, happy, neutral and disgust.]
EmotiW-SFEW 2015
• Preprocessing: Hist Eq + Plane Fitting + Unit Norm (as for FER 2013)
• Training:
  • Pre-train on the FER combined set (Train + Val + Test)
  • Fix the network parameters at the bottom layers (only the last two dense layers are updated)
  • Fine-tune on the SFEW training set (domain adaptation; see the sketch after this list)
• Validation: Select the optimal fine-tuning epoch by maximizing accuracy on the SFEW validation set
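A minimal sketch of the freeze-and-fine-tune step, in terms of the ExpressionCNN sketch given earlier; the optimizer choice and learning rate are assumptions:

```python
import torch

def prepare_for_finetune(model, lr=1e-4):
    """Freeze everything except the last two dense layers of a
    FER-pretrained ExpressionCNN, then return an optimizer over the
    remaining trainable parameters for fine-tuning on SFEW."""
    for p in model.parameters():
        p.requires_grad = False                   # freeze bottom layers
    # classifier[-3] and classifier[-1] are the last two Linear layers
    # (1024 -> 1024 and 1024 -> 7) in the sketch above.
    for layer in (model.classifier[-3], model.classifier[-1]):
        for p in layer.parameters():
            p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=0.9)
```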
EmotiW-SFEW 2015
[Figure: validation and test accuracy for the Single network and the Average1, Average2, SVM, LogLike and HingeLoss ensembles, ranging roughly from 0.46 to 0.62; challenge baselines are 35.96% (validation) and 39.13% (test). A companion plot shows the learned network ensemble weights.]
EmotiW-SFEW 2015
[Figure: comparison of the Log Likelihood and Hinge Loss variants.]
Conclusions
• CNNs are arguably the most powerful tool to date for emotion recognition tasks
• Fine-tuning plays the role of domain adaptation
• Image perturbation and voting-based prediction are key to improving performance
• A weighted committee of multiple networks further improves classification performance
Thank You!