Deep Learning Face Representation from Predicting 10,000 Classes
Yi Sun (1), Xiaogang Wang (2), Xiaoou Tang (1,3)
(1) Department of Information Engineering, The Chinese University of Hong Kong
(2) Department of Electronic Engineering, The Chinese University of Hong Kong
(3) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Abstract

This paper proposes to learn a set of high-level feature representations through deep learning, referred to as Deep hidden IDentity features (DeepID), for face verification. We argue that DeepID can be effectively learned through challenging multi-class face identification tasks, whilst they can be generalized to other tasks (such as verification) and new identities unseen in the training set. Moreover, the generalization capability of DeepID increases as more face classes are to be predicted at training. DeepID features are taken from the last hidden layer neuron activations of deep convolutional networks (ConvNets). When learned as classifiers to recognize about 10,000 face identities in the training set and configured to keep reducing the neuron numbers along the feature extraction hierarchy, these deep ConvNets gradually form compact identity-related features in the top layers with only a small number of hidden neurons. The proposed features are extracted from various face regions to form complementary and over-complete representations. Any state-of-the-art classifiers can be learned based on these high-level representations for face verification. 97.45% verification accuracy on LFW is achieved with only weakly aligned faces.
1. Introduction

Face verification in unconstrained conditions has been studied extensively in recent years [21, 15, 7, 34, 17, 26, 18, 8, 2, 9, 3, 29, 6] due to its practical applications and the publishing of LFW, an extensively reported dataset for face verification algorithms. The current best-performing face verification algorithms typically represent faces with over-complete low-level features, followed by shallow models [9, 29, 6]. Recently, deep models such as ConvNets have been proved effective for extracting high-level visual features [11, 20, 14] and are used for face verification [18, 5, 31, 32, 36]. Huang et al. learned a generative deep model without supervision. Cai et al. learned deep nonlinear metrics. In other work, the deep models are supervised by the binary face verification target. Differently, in this paper we propose to learn high-level face identity features with deep models through face identification, i.e., classifying a training image into one of n identities (n ≈ 10,000 in this work). This high-dimensional prediction task is much more challenging than face verification; however, it leads to good generalization of the learned feature representations. Although learned through identification, these features are shown to be effective for face verification and for new faces unseen in the training set.

Figure 1. An illustration of the feature extraction process. Arrows indicate forward propagation directions. The number of neurons in each layer of the multiple deep ConvNets is labeled beside each layer. The DeepID features are taken from the last hidden layer of each ConvNet and predict a large number of identity classes. Feature numbers continue to reduce along the feature extraction cascade till the DeepID layer.
We propose an effective way to learn high-level over-complete features with deep ConvNets. A high-level illustration of our feature extraction process is shown in Figure 1. The ConvNets are learned to classify all the faces available for training by their identities, with the last hidden layer neuron activations as features (referred to as Deep hidden IDentity features, or DeepID). Each ConvNet takes a face patch as input and extracts local low-level features in the bottom layers. Feature numbers continue to reduce along the feature extraction cascade, while gradually more global and high-level features are formed in the top layers. A highly compact 160-dimensional DeepID is acquired at the end of the cascade that contains rich identity information and directly predicts a much larger number (e.g., 10,000) of identity classes. Classifying all the identities simultaneously, instead of training binary classifiers as in [21, 2, 3], is based on two considerations. First, it is much more difficult to predict a training sample into one of many classes than to perform binary classification. This challenging task can make full use of the large learning capacity of neural networks to extract effective features for face recognition. Second, it implicitly adds a strong regularization to the ConvNets, which helps to form shared hidden representations that can classify all the identities well. Therefore, the learned high-level features have good generalization ability and do not over-fit to a small subset of training faces. We constrain the number of DeepID features to be significantly smaller than the number of identity classes they predict, which is key to learning highly compact and discriminative features. We further concatenate the DeepID extracted from various face regions to form complementary and over-complete representations. The learned features generalize well to new identities in test, which are not seen in training, and can be readily integrated with any state-of-the-art face classifiers (e.g., Joint Bayesian) for face verification.
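The bottleneck idea above, a last hidden layer far narrower than the identity output it must predict, can be sketched as follows. This is a hypothetical NumPy illustration, not the authors' ConvNet: the random weights and the assumed 512-d input stand in for the real network, and only the 160-d feature size and 10,000-class output follow the text.

```python
# Hypothetical sketch (not the authors' code): a last hidden layer that
# serves as the 160-d DeepID bottleneck feeding a 10,000-way softmax.
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 10_000   # identities predicted at training time
FEAT_DIM = 160       # DeepID dimension (last hidden layer)
IN_DIM = 512         # assumed size of the preceding layer's output

# Random stand-in weights for the last hidden layer and the output layer.
W_hid = rng.standard_normal((IN_DIM, FEAT_DIM)) * 0.01
W_out = rng.standard_normal((FEAT_DIM, N_CLASSES)) * 0.01

def deepid_features(x):
    """Return last-hidden-layer activations: these are the features."""
    return np.maximum(x @ W_hid, 0.0)  # ReLU hidden layer

def identity_probs(x):
    """Softmax over the identity classes (used only as a training target)."""
    logits = deepid_features(x) @ W_out
    logits -= logits.max()             # numerical stability
    e = np.exp(logits)
    return e / e.sum()

x = rng.standard_normal(IN_DIM)
f = deepid_features(x)   # compact 160-d feature, kept for verification
p = identity_probs(x)    # 10,000-way prediction, discarded after training
print(f.shape, p.shape)  # (160,) (10000,)
```

At verification time only `deepid_features` would be kept; the wide softmax layer exists solely to supervise the bottleneck during training.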
Our method achieves 97.45% face verification accuracy on LFW using only weakly aligned faces, which is almost as good as the human performance of 97.53%. We also observe that the verification performance steadily improves as the number of training identities increases. Although the prediction task at the training stage becomes more challenging, the discrimination and generalization ability of the learned features increases. This leaves the door wide open for future accuracy improvements with more training data.
2. Related work

Many face verification methods represent faces by high-dimensional over-complete face descriptors, followed by shallow models. Cao et al. encoded each face image into 26K learning-based (LE) descriptors, and then calculated the L2 distance between the LE descriptors after PCA. Chen et al. extracted 100K LBP descriptors at dense facial landmarks with multiple scales and used Joint Bayesian for verification after PCA. Simonyan et al. computed 1.7M SIFT descriptors densely in scale and space, encoded the dense SIFT features into Fisher vectors, and learned linear projection for discriminative dimensionality reduction. Huang et al. combined 1.2M CMD and SLBP descriptors, and learned sparse Mahalanobis metrics for
face verification.

Some previous studies have further learned identity-related features based on low-level features. Kumar et al. trained attribute and simile classifiers to detect facial attributes and measure face similarities to a set of reference people. Berg and Belhumeur [2, 3] trained classifiers to distinguish the faces of two different people. Features are the outputs of the learned classifiers. They used SVM classifiers, which are shallow structures, and their learned features are still relatively low-level. In contrast, we classify all the identities from the training set simultaneously. Moreover, we use the last hidden layer activations as features instead of the classifier outputs. In our ConvNets, the neuron number of the last hidden layer is much smaller than that of the output, which forces the last hidden layer to learn shared hidden representations for faces of different people in order to classify all of them well, resulting in highly discriminative and compact features with good generalization ability.
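The descriptor pipelines surveyed above share a common shape: extract a high-dimensional over-complete descriptor, reduce it with PCA, then compare two faces with a simple distance. A minimal sketch of that shape, with random stand-in descriptors; all dimensions here are assumptions for illustration, not the numbers from the cited papers.

```python
# Illustrative "over-complete descriptor -> PCA -> L2 distance" pipeline.
# The descriptors are random stand-ins; dimensions are assumptions.
import numpy as np

rng = np.random.default_rng(1)

# 200 faces, each with an assumed 5,000-d LE/LBP-like descriptor.
descriptors = rng.standard_normal((200, 5_000))

# PCA via SVD on mean-centered data.
mean = descriptors.mean(axis=0)
_, _, Vt = np.linalg.svd(descriptors - mean, full_matrices=False)
K = 100                    # retained principal components (assumed)
components = Vt[:K]

def project(d):
    """Reduce a raw descriptor to its K leading PCA coefficients."""
    return (d - mean) @ components.T

def l2_distance(d_a, d_b):
    """Verification score: L2 distance between PCA-reduced descriptors."""
    return float(np.linalg.norm(project(d_a) - project(d_b)))

a, b = descriptors[0], descriptors[1]
print(l2_distance(a, a))        # identical descriptors -> 0.0
print(l2_distance(a, b) > 0.0)  # distinct descriptors -> positive
```

A learned metric (Joint Bayesian, a Mahalanobis metric) would replace the plain L2 distance in the stronger variants of this pipeline.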
A few deep models have been used for face verification or identification. Chopra et al. used a Siamese network for deep metric learning. The Siamese network extracts features separately from the two compared inputs with two identical sub-networks, taking the distance between the outputs of the two sub-networks as the dissimilarity. Later work used deep ConvNets as the sub-networks. In contrast to the Siamese network, in which feature extraction and recognition are jointly learned with the face verification target, we conduct feature extraction and recognition in two steps, with the first feature extraction step learned with the target of face identification, which is a much stronger supervision signal than verification. Huang et al. generatively learned features with CDBNs, then used ITML and a linear SVM for face verification. Cai et al. also learned deep metrics under the Siamese network framework, but used a two-level ISA network as the sub-networks instead. Zhu et al. [35, 36] learned deep neural networks to transform faces in arbitrary poses and illumination to frontal faces with normal illumination, and then used the last hidden layer features or the transformed faces for face recognition. Sun et al. used multiple deep ConvNets to learn high-level face similarity features and trained a classification RBM for face verification. Their features are jointly extracted from a pair of faces instead of from a single face.
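The Siamese arrangement described above, one shared sub-network applied to both faces with the distance between its two outputs taken as the dissimilarity, can be sketched minimally. The "sub-network" here is a single random linear map with a ReLU, purely for illustration of the weight sharing; it is not any of the cited architectures.

```python
# Minimal sketch of a Siamese arrangement: both faces pass through the
# SAME weights (weight sharing), and the output distance is the score.
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((1024, 160)) * 0.01  # shared sub-network weights (assumed sizes)

def extract(face):
    # Both inputs go through the same map; there is one set of weights.
    return np.maximum(face @ W, 0.0)

def dissimilarity(face_a, face_b):
    """Distance between the two sub-network outputs."""
    return float(np.linalg.norm(extract(face_a) - extract(face_b)))

f1 = rng.standard_normal(1024)
f2 = rng.standard_normal(1024)
d = dissimilarity(f1, f2)  # low for the same person, high otherwise
```

In the Siamese setting this distance is trained directly against the verification label; the two-step approach in this paper instead trains `extract` against an identification target and learns the verification classifier separately.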
3. Learning DeepID for face verification
3.1. Deep ConvNets
Our deep ConvNets contain four convolutional layers (with max-pooling) to extract features hierarchically, followed by the fully-connected DeepID layer and the softmax output layer indicating identity classes. The input is 39
Figure 2. ConvNet structure. The length, width, and height of each cuboid denote the map number and the dimension of each map for all input, convolutional, and max-pooling layers. The inside small cuboids and squares denote the 3D convolution kernel sizes and the 2D pooling region sizes of the convolutional and max-pooling layers, respectively. Neuron numbers of the last two fully-connected layers are marked beside each layer.
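The shrinking map sizes along such a conv/pool cascade follow the standard output-size arithmetic: out = (in - kernel) / stride + 1 for a valid convolution, and likewise for pooling. A hedged sketch, where the kernel and pooling sizes are assumptions for illustration rather than the paper's exact configuration:

```python
# Illustrative output-size arithmetic for a conv/pool cascade.
# Kernel, pooling, and input sizes below are assumed, not the paper's.
def conv_out(size, kernel, stride=1):
    """Spatial size after a 'valid' convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, window, stride=None):
    """Spatial size after max-pooling (stride defaults to the window)."""
    stride = stride or window
    return (size - window) // stride + 1

size = 39                        # assumed input height
for i in range(1, 4):            # three conv + max-pool stages
    size = pool_out(conv_out(size, kernel=4), window=2)
    print(f"after conv{i}+pool{i}: {size}")
size = conv_out(size, kernel=2)  # a final conv layer without pooling
print(f"after conv4: {size}")
```

The same arithmetic applies per spatial axis, so a non-square input simply runs the two calculations independently for height and width.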