margin-mix: semi{supervised learning for face expression ...expression recognition in the main...

7
Margin-Mix: Semi–Supervised Learning for Face Expression Recognition Corneliu Florea 1[0000-0001-9754-6795] , Mihai Badea 1 , Laura Florea 1 , Andrei Racoviteanu 1 , and Constantin Vertan 1 Image Processing and Analysis Laboratory, Unversity Politehnica of Bucharest {corneliu.florea; mihai sorin.badea; laura.florea; andrei.racoviteanu; constantin.vertan }@upb.ro 1 Domain comparison: object recognition vs face expression recognition In the main paper, we emphasized that face expression recognition (FER) do- main is more compact than the object recognition domain. More precisely, FER has smaller inter class variance. Given the original dimensionality of the two domains (32 × 32 × 3 = 3072 for CIFAR images and respectively 48 × 48 = 2304 for FER+ images), to our best knowledge there is no universally accepted technique to visualize the distribution of the points in the original space. In this material, we propose some approaches widely used to visualize the original spaces and to provide an intuition about our proposal. 1.1 Low dimensionality in the CNN embedding space In this case, for the convolutional neural network (CNN), we use the smaller AlexNet [2]. There, on the last layer before the prediction, we impose a di- mensionality of 2 instead of the standard 4096. Instead of the WideResNet, the AlexNet type of network is able to accept such a modification with a dra- matic change in performance (which nevertheless is smaller). In this scenario, the network was trained in purely supervised manner with the large margin loss. The distribution of the training points from the two domains, illustrated by the FER+ and CIFAR examples can be seen in figure 1. One may easily note that data in the FER domain is more tightly clustered and in the first epochs, the clusters are overlapping no matter the classes. As epoch progress and the networks are adjusted for better separation, one may see that CIFAR corresponding become more distinct, while the FER are less so. Thus we argue that the difference in the distribution of points requires different strategy. 1.2 Visualization using t-SNE Another popular technique for visualization high dimensional data is t-SNE [4]. In this case, we have considered the original architecture with full 4096 dimen- sionality and employed t-SNE for visualization of data structure with respect

Upload: others

Post on 31-Jan-2021

9 views

Category:

Documents


0 download

TRANSCRIPT

  • Margin-Mix: Semi–Supervised Learning for FaceExpression Recognition

    Corneliu Florea1[0000−0001−9754−6795], Mihai Badea1, Laura Florea1,Andrei Racoviteanu1, and Constantin Vertan1

    Image Processing and Analysis Laboratory, Unversity Politehnica of Bucharest{corneliu.florea; mihai sorin.badea; laura.florea;andrei.racoviteanu; constantin.vertan }@upb.ro

    1 Domain comparison: object recognition vs faceexpression recognition

    In the main paper, we emphasized that face expression recognition (FER) do-main is more compact than the object recognition domain. More precisely, FERhas smaller inter class variance.

    Given the original dimensionality of the two domains (32× 32× 3 = 3072 forCIFAR images and respectively 48 × 48 = 2304 for FER+ images), to our bestknowledge there is no universally accepted technique to visualize the distributionof the points in the original space.

    In this material, we propose some approaches widely used to visualize theoriginal spaces and to provide an intuition about our proposal.

    1.1 Low dimensionality in the CNN embedding space

    In this case, for the convolutional neural network (CNN), we use the smallerAlexNet [2]. There, on the last layer before the prediction, we impose a di-mensionality of 2 instead of the standard 4096. Instead of the WideResNet,the AlexNet type of network is able to accept such a modification with a dra-matic change in performance (which nevertheless is smaller). In this scenario,the network was trained in purely supervised manner with the large margin loss.The distribution of the training points from the two domains, illustrated by theFER+ and CIFAR examples can be seen in figure 1.

    One may easily note that data in the FER domain is more tightly clusteredand in the first epochs, the clusters are overlapping no matter the classes. Asepoch progress and the networks are adjusted for better separation, one maysee that CIFAR corresponding become more distinct, while the FER are less so.Thus we argue that the difference in the distribution of points requires differentstrategy.

    1.2 Visualization using t-SNE

    Another popular technique for visualization high dimensional data is t-SNE [4].In this case, we have considered the original architecture with full 4096 dimen-sionality and employed t-SNE for visualization of data structure with respect

  • 2 C. Florea et al.

    to the number of epochs. The plots may be seen in figure 2. This visualizationtechnique also suggests more overlapped clusters in the case of FER problemthan for object recognition (CIFAR).

    2 Separated vs overlapping clusters

    In the main paper, we compare our proposal with the MixMatch solution [1].The main difference lies in the self-labeling step: our proposal, MarginMix, usesdistances from each unlabeled data to the class centroids and those distancesare turned into soft labels (i.e. probabilities / membership values). In contrastMixMatch uses PseudoLabels [3]. The latter bases the self-labels on the bordersdefined by the annotated data. Un-annotated data is thus hard labeled. Weshow an intuitive graphical comparison between these two strategies in figure 3for a situation where class clusters are well separated and respectively in figure4 for overlapping clusters.

    The main idea is that PseudoLabels is very conservative on the original cho-sen border as it imposes that unlabeled data to have sharp labels (i.e only (1, 0)or (0, 1)). Its advantage lies in better definition of the class borders in areaswhere there are too few points. However, in the case where the clusters corre-sponding to points from different classes are overlapping, the initial border isnot well defined, as it depends on chance, and PseudoLabels does not encouragestrong modifications from that initial case.

    In contrast, our proposal uses self labeling based on large margins (dis-tance to class centroids). It produces soft values for the labels (i.e (0.34, 0.66) or(0.72, 0.28), etc). It also behaves correctly in the case of separated clusters, butis being far more malleable in the case of overlapping class clusters and thus itsperformance is more robust.

  • Margin-Mix: SSL for Face Expression Recognition 3

    CIFAR-10 FER+

    Ep

    och

    1E

    poch

    5E

    poch

    10

    Ep

    och

    20

    Fig. 1. Examples of the two problems approached. The training set is represented onthe 2 neurons from the last layer fully connected before the decision one in an AlexNet.On the left column is represented the distribution of points in CIFAR-10 database whileon the right column is FER+. One notes the degree to which the clusters correspondingto different classes are overlapped/separable in the two situations

  • 4 C. Florea et al.

    CIFAR-10 FER+

    Ep

    och

    1E

    poch

    5E

    poch

    10

    Ep

    och

    128

    Fig. 2. Examples of the two problems approached. The training set is represented onthe 4096 dimensional layer and visualized with t-SNE. On the left column is repre-sented the distribution of points in CIFAR-10 database while on the right column isFER+. One notes the degree to which the clusters corresponding to different classesare overlapped/separable in the two situations

  • Margin-Mix: SSL for Face Expression Recognition 5

    Supervised case

    Semi-supervised case:

    some data lost the labels.

    Border is based on labeled data

    Self-labeling with PseudoLabels (MixMatch):

    border is learnt from supervised data

    and it hard labels un-annotated data with respect to the border

    After self-labeling, the border is rather preserved

    Our proposal self-annotated unlabeled data

    with respect to the distance to the centroids of labeled examples

    After self-labeling, the border computed

    using the semi-supervised approach (in MarginMix)

    is very similar with the supervised one

    Fig. 3. Comparison between the proposed algorithm based on self labeling using dis-tances to centroids and MixMatch that uses entropy regularization (PseudoLabels) onclusters that are well separated. Self-labeled data points color is proportional to thelabel value. Both algorithms are successful.

  • 6 C. Florea et al.

    Supervised caseSemi-supervised case:

    some of the data have lost the labels

    First the border is computed based on supervised data.

    The new border (green) is different from the supervised baseline.

    Self-Labeling: The initial border becomes more firm

    after self labeling with PseudoLabels

    After self-labeling, the border computed using the Margin-Mix

    approach is more similar with the supervised one

    Fig. 4. Comparison between the proposed algorithm based on self labeling using dis-tances to centroids and using entropy regularization (PseudoLabels) on overlappingclusters. Note the strength of the labels in the case PseudoLabels and in case of dis-tances to centroids. Only our algorithm is successful in this case.

  • Margin-Mix: SSL for Face Expression Recognition 7

    References

    1. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.:Mixmatch: A holistic approach to semi-supervised learning. In: NIPS. pp. 5050–5060 (2019) 2

    2. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-volutional neural networks. In: Advances in neural information processing systems.pp. 1097–1105 (2012) 1

    3. Lee, D.H.: Pseudo-label: The simple and efficient semi-supervised learning methodfor deep neural networks. In: ICML Workshops (2013) 2

    4. Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of machine learningresearch 9(Nov), 2579–2605 (2008) 1