
arXiv:1911.03567v1 [cs.CV] 8 Nov 2019

Face Detection in Camera Captured Images of Identity Documents under Challenging Conditions

Souhail Bakkali, Zuheng Ming, Muhammad Muzzamil Luqman, Jean-Christophe Burie
L3i, La Rochelle University, France

{souhail.bakkali, zuheng.ming, jcburie, mluqma01}@univ-lr.fr

Abstract—Benefiting from advances in deep convolutional neural networks (CNNs), many face detection algorithms have achieved state-of-the-art accuracy and very high speed in unconstrained applications. However, face detection on identity documents in unconstrained environments has not been sufficiently studied, owing to the lack of public datasets and to the varying orientation, complex backgrounds, defocus and changing illumination of camera-captured images. To address this problem, we survey three state-of-the-art face detection methods developed for general images, i.e. Cascade-CNN, MTCNN and PCN, on camera-captured images of identity documents of varying image quality. For this, the MIDV-500 dataset, the largest and most challenging dataset of identity documents, is used to evaluate the three methods. The evaluation results show the performance and the limitations of current face detection methods on identity documents in complex, in-the-wild environments. These results show that face detection in camera-captured images of identity documents remains challenging, leaving considerable room for improvement in future work.

Keywords-Face detection, Identity Document, Deep CNNs

I. INTRODUCTION

Much research has been dedicated to identity document analysis and recognition on mobile devices, with the aim of facilitating the data input process [1]. A practical and common approach to this problem involves detecting and comparing an individual's live face to the face image found in his/her identity document. However, face detection and recognition on identity documents pose numerous challenges that differ from general face detection and recognition on ordinary visual images. The main challenges stem from the conditions under which camera-captured images of identity documents are taken, such as occlusion, defocus, complex backgrounds, the varying orientation of the documents and the changing illumination caused by reflections. Figure 1 shows the challenging conditions of camera-captured images of identity documents from the MIDV-500 dataset [2]. Besides, collecting information from identity documents (such as faces, signatures and text) remains difficult because of strict privacy laws and regulations. Since identity documents contain sensitive personal data, people are unwilling to risk the leak of personal information

that may lead to complications. Thus, it becomes difficult to evaluate and compare identity document analysis methods against each other, since they have been tested on private, industry-oriented, locked-down data [3], [4], collected from customers and unavailable to the public for reasons of security and market advantage. The lack of public datasets of identity document images therefore also hinders research in this sensitive field. Thanks to the availability of large face datasets and progress in deep neural network architectures [5], [6], [7], face detection and recognition in general face images has made tremendous strides over the past decade. Recently, there has been a need for models that accurately separate faces from backgrounds under challenging conditions while keeping real-time performance. A deep cascaded multi-task architecture built on deep convolutional neural networks (CNNs) employs a coarse-to-fine strategy for face detection, adopting successive stages of CNNs to predict the face progressively [8], [9], [10]. Benefiting from the simplicity of the overall networks and the progressively increasing depth of the CNNs, cascade-based CNN face detectors achieve state-of-the-art performance in terms of accuracy and speed. Specifically, Cascade-CNN [10] first proposed the deep cascaded CNN architecture for the face detection task. MTCNN [9], based on the deep cascaded multi-task framework, also achieved state-of-the-art performance and very high speed for joint face detection and alignment. Furthermore, PCN [8], likewise based on deep cascaded CNNs, tackles the detection of rotated faces in a coarse-to-fine manner. It can accurately detect faces with arbitrary rotation-in-plane angles, which are common in camera-captured images of identity documents.
In this work, these three state-of-the-art methods, Cascade-CNN, MTCNN and PCN, which are among the most widely used for face detection and offer complementary advantages, are adopted to detect faces in identity document images. The recently published MIDV-500 dataset, a Mobile Identity Document Video dataset consisting of 500 video clips of 50 different identity document types, including 17 types of ID cards, is used to evaluate the three face detection frameworks.

In summary, the main contributions of this paper are:



Figure 1. Frame examples from the MIDV-500 dataset. Different conditions are presented, such as varied orientation, complex backgrounds from different scenarios and variation in illumination.

• Evaluating the three state-of-the-art methods Cascade-CNN, MTCNN and PCN for face detection in camera-captured images of identity documents under challenging environments.

• The bounding box coordinates of face regions in arbitrary frames of the MIDV-500 dataset's videos have been manually annotated for the evaluation.

• We demonstrate that, under challenging conditions with varied document orientation, complex backgrounds and changing illumination, the MTCNN model outperforms the other methods evaluated on the MIDV-500 dataset.

The rest of the paper is organized as follows: in Section II, we briefly review existing studies in the area of identity document analysis and recognition; Section III describes the frameworks of the evaluated methods; in Section IV, we present the Mobile Identity Document Video dataset (MIDV-500) [2] and the experimental evaluation results; the final Section V draws conclusions and presents future work.

II. RELATED WORK

Lately, numerous studies have been devoted to document analysis and recognition, analyzing and identifying different identity fields: document numbers, document holder name components, machine-readable zones, signatures and faces, using state-of-the-art visual recognition approaches [11], [12], [13]. In related work on [2], a text-field OCR study was proposed to perform text field extraction, recognition and identity document data extraction from video clips.

Face detection has long been an important task in computer vision systems. It aims to extract information from face images, with respect to head position and occlusion, while detecting faces quickly and accurately in input images. Previous face detection systems were mostly based on hand-crafted features obtained through feature engineering and fed to classification methods. A number of such methods and local descriptors have been proposed for face detection and recognition, such as rapid object detection using a boosted cascade of simple features [14], LBP [15], HOG [16], Gabor-LBP [17] and SIFT [18]. Recent research in this area focuses more on the uncontrolled face detection problem, where varied poses, complex lighting, occlusion, exaggerated expressions and full rotation-in-plane angles can lead to large visual variations and significant divergence in face appearance. Detecting faces at full rotation-in-plane angles can greatly advance the performance of both face alignment and face detection and recognition.

The first study on ID document face photos is attributed to Starovoitov et al. [19], [20], which assumed, as most methods do, that all face images are frontal and without large expression variations. The algorithm is similar to a general constrained face matcher, except that it was developed for a document photo dataset. The popularity of deep neural networks can partly be attributed to the transferability of low-level image features: they are not limited to a particular task but are applicable to many image analysis tasks. Given this property, one can train a domain-specific neural network by transfer learning on a relatively small dataset, as proposed by Y. Shi and Anil K. Jain [21]. A recent face detection study was performed in [2] using open-source libraries [22], [23] with default frontal face detectors. It was conducted both on original frames and on projectively restored document images based on ground-truth document coordinates. The ground-truth coordinates were projectively transformed from template coordinates to frame coordinates (according to the ground-truth document boundaries). The same process was applied to face detection results in the cropped document detection mode.

III. FACE DETECTION FRAMEWORKS

The frameworks were chosen for their high accuracy, speed and low time cost at run-time on the multi-oriented FDDB [24] and WiderFace [25] datasets, and for their specific architectures based on deep cascaded multi-task convolutional neural networks (CNNs). High accuracy is achieved with deep networks: having three networks, each with multiple layers, allows for higher precision, as each network can refine the results of the previous one. The training procedure relies on so-called hard sample mining. Training only the first stage is not sufficient, because it would accept some windows containing no face as positive samples (window candidates assumed to contain at least one face). These windows are false positives, so they are fed into the training of the second stage, whose role is to refine bounding box edges and reduce false positives. This helps the second stage target the first network's weaknesses and improve accuracy. The same principle is applied to the third stage.
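The hard sample mining step described above can be sketched as follows. This is a minimal illustration, not code from any of the three detectors: the function name, the toy scoring stage and the 0.5 acceptance threshold are all assumptions made for the example.

```python
def mine_hard_negatives(stage, candidate_windows, score_threshold=0.5):
    """Collect windows that a trained stage wrongly accepts as faces.

    `stage` is any callable returning a face-confidence score in [0, 1];
    windows passing the threshold but labeled non-face are false positives,
    which become extra negative training samples for the next stage.
    """
    hard_negatives = []
    for window, is_face in candidate_windows:
        if not is_face and stage(window) >= score_threshold:
            hard_negatives.append(window)
    return hard_negatives

# Toy stage: "detects" a face whenever the window's mean intensity is high.
toy_stage = lambda w: sum(w) / len(w)
windows = [([0.9, 0.8], False),   # bright non-face -> false positive
           ([0.1, 0.2], False),   # dark non-face  -> correctly rejected
           ([0.7, 0.9], True)]    # true face      -> not a negative
print(mine_hard_negatives(toy_stage, windows))  # [[0.9, 0.8]]
```

The mined windows are simply appended to the negative set used to train the following stage, which is why each stage specializes in the previous stage's mistakes.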

Page 3: arXiv:1911.03567v1 [cs.CV] 8 Nov 2019Souhail Bakkali Zuheng Ming Muhammad Muzzamil Luqman Jean-Christophe Burie L3i, La Rochelle University, France fsouhail.bakkali, zuheng.ming, jcburie,

Since different tasks are employed in each CNN, different types of training images are used in the learning process, such as faces, non-faces and partially aligned faces. Instead of defining a single loss function for all tasks, one loss function is defined per task; the losses are weighted with different ratios to balance them according to task importance.
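The balancing of per-task losses can be sketched as a weighted sum; the specific weights below are illustrative assumptions, not values reported by the paper:

```python
def multitask_loss(losses, weights):
    """Weighted sum of per-task losses (e.g. face/non-face classification,
    bounding box regression, landmark localization). Cascaded multi-task
    detectors change the weight ratios per stage according to task
    importance."""
    return sum(w * l for l, w in zip(losses, weights))

# e.g. an early stage emphasizing face/non-face classification
# (the (cls, bbox, landmark) ratios here are assumed for illustration):
stage1_weights = (1.0, 0.5, 0.5)
print(multitask_loss((0.2, 0.4, 0.6), stage1_weights))  # approximately 0.7
```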

As for the training data used to train the models, three kinds of images are employed: positive samples, negative samples and suspected (part) samples. Positive samples are windows with an IoU ratio above 0.7 (or 0.65, depending on the method) with some ground-truth face; negative samples are windows with an IoU ratio below 0.3 with every ground-truth face; and suspected/part samples are windows with an IoU between 0.4 and 0.7 (or 0.65). Positive and negative samples are used for the face/non-face classification task, while positive and suspected/part samples contribute to the training of bounding box regression and calibration.
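The IoU-based labeling just described can be sketched as follows, using the 0.65 positive threshold (one of the two variants mentioned above); function and label names are illustrative:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_window(window, ground_truths, pos=0.65, neg=0.3, part=0.4):
    """Assign a training label from the best IoU against any ground-truth
    face, following the thresholds described in the text."""
    best = max((iou(window, gt) for gt in ground_truths), default=0.0)
    if best >= pos:
        return "positive"
    if best < neg:
        return "negative"
    if best >= part:
        return "part"
    return "ignored"   # windows with IoU in [0.3, 0.4) are discarded

gt = [(0, 0, 100, 100)]
print(label_window((0, 0, 100, 100), gt))    # positive
print(label_window((0, 0, 100, 50), gt))     # part (IoU = 0.5)
print(label_window((90, 90, 200, 200), gt))  # negative
```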

A. Cascade-CNN face detector

The Cascade-CNN face detector [10], unlike face detection systems that learn classifiers from hand-crafted features, evaluates the input image to quickly reject non-face regions and distinguishes faces at higher resolution in the regions that remain. A calibration network then processes the surviving detection windows to adjust their size and location toward a potential face. The proposed cascade face detector is a three-stage deep convolutional network. After each stage, non-maximum suppression is applied to eliminate highly overlapped detection windows, i.e. those whose intersection-over-union ratio exceeds a set threshold, so as to retain a more accurate detection window at the correct scale. The remaining calibrated bounding boxes are the outputs of the model.
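The non-maximum suppression step used after each stage can be sketched as the standard greedy procedure; this generic version (names and the 0.5 threshold are illustrative) is not the detector's exact implementation:

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    overlapping it above the IoU threshold, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if _iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```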

Experiments on the challenging Face Detection Dataset and Benchmark (FDDB) show that the Cascade-CNN detector outperforms state-of-the-art methods in the continuous score evaluation, and is comparable to state-of-the-art methods on Annotated Faces in the Wild (AFW) [26].

B. Joint Face Detection and Alignment (MTCNN)

The MTCNN framework [9] uses multi-task cascaded convolutional networks with three stages to predict face and landmark locations in a coarse-to-fine manner. The cascaded convolutional network aims to reduce the computational expense of the cascade face detector proposed by Viola and Jones [14], which uses Haar features within a boosted cascade framework to detect frontal faces but is relatively weak in uncontrolled settings where faces appear in varied poses and under complex, unexpected lighting. Convolutional networks were needed to achieve remarkable performance across a variety of computer vision tasks. The contribution of this method is to combine face alignment and face detection while retaining real-time performance.

The overall framework consists of obtaining images at different scales via a sliding window and an image pyramid; each candidate window then goes through the detector stage by stage. In each stage, the detector rejects candidates with low confidence and regresses the bounding boxes of the remaining faces, which are updated to the newly regressed boxes. After each stage, non-maximum suppression is performed to merge highly overlapped face candidates. The last stage outputs the bounding boxes of the remaining calibrated face candidates together with facial landmark positions, leaving a single bounding box per face in the image.
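The image pyramid construction implied by this scheme can be sketched as below. The minimum face size (20 px), 12-pixel first-stage input and 0.709 scaling factor are commonly quoted MTCNN defaults, assumed here rather than taken from the paper:

```python
def pyramid_scales(width, height, min_face=20, net_input=12, factor=0.709):
    """Scales at which the first stage scans the image, so that faces as
    small as `min_face` pixels map onto the stage's `net_input`-pixel
    window. Scales shrink geometrically until the image is smaller than
    the network input."""
    scale = net_input / min_face
    scales = []
    while min(width, height) * scale >= net_input:
        scales.append(scale)
        scale *= factor
    return scales

# A 1920x1080 MIDV-500 frame needs 12 pyramid levels with these defaults.
print(len(pyramid_scales(1920, 1080)))  # 12
```

The high frame resolution of MIDV-500 directly increases the number of pyramid levels and windows to score, which is relevant to the speed results reported later.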

The performance of the model was evaluated on the FDDB and WIDER FACE datasets, outperforming state-of-the-art methods by a large margin on both benchmarks.

C. Progressive Calibration Networks (PCN)

The progressive calibration network (PCN) [8] is a real-time, accurate face detector that progressively calibrates the rotation-in-plane angle of each face candidate to upright within a three-stage multi-task deep convolutional network. More specifically, the calibration of the rotation-in-plane angle is divided into several progressive steps, with only a coarse orientation predicted at each stage. Given an input image, a sliding window and an image pyramid are applied to obtain face candidates of all sizes within the image. Each candidate is passed through the detector stage by stage. The detector gathers face candidates and their bounding box coordinates, parses the stage output to obtain a confidence level for each bounding box, and rejects most candidates with low face confidence, i.e. boxes that the network is not confident contain a face. Simultaneously, the estimated bounding boxes of the remaining face candidates are regressed and calibrated according to the predicted coarse rotation-in-plane angles. After each stage, non-maximum suppression is conducted to eliminate redundant boxes. The calibration based on the coarse rotation-in-plane prediction adds almost no time cost, leading to accurate and fast calibration.
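The progressive, coarse-to-fine calibration can be illustrated as an accumulation of per-stage rotations; the stage split below (up/down, then quarter turns, then a fine residual) follows the spirit of PCN, but the function and values are an illustrative sketch:

```python
def progressive_calibration(stage_predictions):
    """Accumulate the coarse rotation predicted at each stage into a full
    rotation-in-plane angle: stage 1 decides up vs. down (0 or 180
    degrees), stage 2 refines by a quarter turn (-90, 0 or 90), stage 3
    predicts a fine residual angle."""
    return sum(stage_predictions) % 360

# A face rotated 200 degrees: stage 1 flips it by 180, stage 2 adds 0,
# stage 3 predicts the remaining 20.
print(progressive_calibration([180, 0, 20]))  # 200
```

Because each stage only chooses among a few coarse options (plus one fine regression at the end), the calibration adds almost no computation, which is the efficiency argument made above.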

Experiments on the multi-oriented FDDB and rotation WIDER FACE datasets show that PCN performs considerably better than the baseline cascade, with almost no extra time cost thanks to the efficient calibration process.

IV. EXPERIMENTAL BASELINES

In the following, we first describe the MIDV-500 dataset. We then present the evaluation results on this dataset to demonstrate the effectiveness and the limits of the methods described above.


A. Dataset Structure

MIDV-500 is a Mobile Identity Document Video dataset consisting of 500 video clips of 50 different identity document types with ground truth, including 17 types of ID cards, 14 types of passports, 13 types of driving licences and 6 other identity documents from various countries. The video clips were recorded under 5 different conditions:

CA, CS: "Clutter" - the document lies on a table with many objects in the background.
HA, HS: "Hand" - the document is held in the hand.
KA, KS: "Keyboard" - the document lies on a keyboard.
PA, PS: "Partial" - the document is partially or completely hidden off-screen.
TA, TS: "Table" - the document is presented on a table.

The dataset comprises (50 documents) x (5 conditions) x (2 devices) video clips. Each 3-second video was sampled at 10 frames per second, yielding 30 images per video, i.e. (30 images) x (2 devices) for every condition. In total, there are 15,000 annotated frame images in the whole dataset.

For each extracted video frame, a bounding box was manually annotated for each face. In total, 48 of the 50 identity document types contain an ID face. The MIDV-500 dataset enables studies of object detection and information extraction algorithms, measuring their accuracy and robustness against various distortions. For our study, we use the face detection experiment to evaluate the state-of-the-art frameworks described above, considering only the part of the dataset in which a face is present on the document. Since each condition contributes 30 frames over 3 seconds, most consecutive frames are nearly identical, so we took only 2 frames per condition to evaluate the robustness of the three models. This way, 1,000 frames, i.e. (50 documents) x (5 conditions) x (2 devices) x (2 frames), out of 15,000 were selected for the experiment.
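The frame counts above can be checked with a line of arithmetic:

```python
# Full dataset: 50 documents x 5 conditions x 2 devices x 30 frames.
total_frames = 50 * 5 * 2 * 30
# Evaluation subset: 2 frames kept per (document, condition, device) triple.
subset_frames = 50 * 5 * 2 * 2
print(total_frames, subset_frames)  # 15000 1000
```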

B. Results on MIDV-500 Dataset

Face detection was performed using Cascade-CNN, MTCNN and PCN. Detection was conducted on the original frames, in which the faces present on the document images were annotated manually. We did not conduct face detection on projective restorations of the document images based on ground-truth document coordinates, since our objective is to assess the robustness of the three models to varied poses and face appearances at different angles. Following the face detector evaluation protocol proposed by [24], we evaluate the three face detectors in terms of ROC curves on the MIDV-500 dataset, shown in Figure 2. On the ROC curve, 20 to 200 false positives is usually a sensible operating range in existing work [27]. We can see that MTCNN (red curve) is the best face detector on the MIDV-500 dataset, followed by PCN (blue curve), while Cascade-CNN (green curve) performs worst. Even though the PCN face detector is designed for rotation-invariant face detection, it does not show superiority on the MIDV-500 dataset. Like PCN, Cascade-CNN can also detect rotated face images, using four small CNNs to handle the four rotation directions, i.e. up, down, right and left. However, for each direction the CNN used for detection is small, and its performance is inferior to the other detectors. Overall, MTCNN shows good generalization ability on this new challenging dataset and tolerates moderate rotation, although it is not designed for rotated faces. We also evaluate the speed of the three face detectors on the MIDV-500 dataset, as shown in Table I. The speed evaluation was run on a CPU (Intel(R) Core(TM) i7-6500 CPU @ 2.50GHz) and a GPU (Nvidia Titan X), respectively. The recall rates at 20, 50, 100, 200 and 500 false positives on MIDV-500 are also shown in Table I. Figure 3 shows examples of the detection results obtained by PCN, MTCNN and Cascade-CNN, respectively. From the detected results, we can see that MTCNN obtains the best detection results overall. The detection conditions are quite challenging, varying in background, complex illumination (such as reflected light) and image distortion introduced by pose variation in 3D space. Compared to the other two methods, Cascade-CNN makes the most mistakes, with many more false positive detections and poorly located bounding box coordinates. PCN is much better than Cascade-CNN, but in terms of bounding box localization accuracy it is inferior to MTCNN. It should be noted that these examples do not include extreme rotation cases, so for upright images MTCNN shows better performance than PCN.
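The recall-at-N-false-positives metric used in Table I can be sketched as a sweep over score-ranked detections; the function and data below are illustrative, not the actual evaluation code:

```python
def recall_at_fp(detections, num_gt, fp_budgets):
    """Recall reached by the time the first N false positives have been
    accepted, sweeping detections in decreasing-score order (the ROC
    protocol). `detections` is a list of (score, is_true_positive)."""
    det = sorted(detections, key=lambda d: d[0], reverse=True)
    recalls, tp, fp = {}, 0, 0
    for score, is_tp in det:
        if is_tp:
            tp += 1
        else:
            fp += 1
        for budget in sorted(fp_budgets):
            if fp <= budget:
                recalls[budget] = tp / num_gt
    return recalls

# Toy run: 4 ground-truth faces, 5 detections of decreasing score.
dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
print(recall_at_fp(dets, num_gt=4, fp_budgets=[1, 2]))  # {1: 0.5, 2: 0.75}
```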
Compared to the low resolution of images in the multi-oriented FDDB and WiderFace datasets (from 40x40 up to about 400x400), the high resolution (1920x1080) of MIDV-500 images is the main reason for the low speed results.

C. Discussion

Compared to the performance of these three methods on the classic multi-oriented FDDB and WiderFace datasets, both the recall rate and the speed of PCN, MTCNN and Cascade-CNN on the MIDV-500 dataset deteriorate greatly. This is partly due to the very challenging conditions of the MIDV-500 dataset, and also to heterogeneous face recognition problems. The methods evaluated in this work were mainly trained on datasets such as FDDB or WiderFace, which consist mostly of ordinary photos or selfies, whereas the faces in the MIDV-500 dataset come from ID cards, passports, driving licences and other identity documents. These are quite different from the training images of other datasets: in our case, we may encounter black-and-white photos and relatively poor image


Table I
Speed and accuracy comparison between the three methods: recall rate on MIDV-500 at 20, 50, 100, 200 and 500 false positives, and speed in frames per second (FPS) on CPU and GPU.

Method       FP=20   FP=50   FP=100  FP=200  FP=500  CPU FPS  GPU FPS
PCN          0.0027  0.0106  0.0301  0.0744  0.2533  1.30     1.236
MTCNN        0.0213  0.1550  0.2383  -       -       1.344    2.71
CascadeCNN   0.0053  0.0062  0.0089  0.0133  0.0221  2.518    8.62

Figure 2. ROC curves of the three face detectors PCN, MTCNN and Cascade-CNN on MIDV-500. The horizontal axis is the number of false positives over the whole dataset and the vertical axis is the true positive rate, i.e. the recall rate. Usually, 20 to 200 false positives is a sensible operating range in existing work [27].

quality. However, generalization ability remains a problem for deep learning based, data-driven detection methods, which depend strongly on the data used to train the model. In particular, for the detection task, differences in how faces are labeled can also affect performance. For instance, some datasets include more of the head so as to cover the face completely, making the resulting detected bounding boxes larger than in others; this causes discrepancies when computing the intersection-over-union used to evaluate the model. Figure 4 shows how the difference in bounding box annotation between datasets affects the detection. The green bounding boxes are results detected by PCN with very high confidence (above 0.99), indicating that the model believes it has detected the faces very well. However, the detected bounding boxes do not really fit the ground truth (their IoU with the ground-truth bounding box is below 0.5). This divergence is probably caused by the different bounding box annotation conventions of MIDV-500 and the training dataset.

V. CONCLUSION

In this paper, we evaluated three state-of-the-art face detection methods, PCN, MTCNN and Cascade-CNN, on the new challenging MIDV-500 dataset. Since face detection is a necessary step for subsequent identity document analysis and recognition, it is worth surveying how current face detection methods perform on ID card-like documents such as passports, citizen cards and ID cards. The evaluation results show the performance and the limitations of the current methods for face detection on the new challenging MIDV-500 dataset. There remains much room for improvement in future work on face detection under challenging conditions.

ACKNOWLEDGMENT

This work was supported by the MOBIDEM project, part of the "Systematic Paris-Region" and "Images & Network" clusters, funded by the French government.

REFERENCES

[1] L. De Koker, "Money laundering compliance—the challenges of technology," in Financial Crimes: Psychological, Technological, and Ethical Issues. Springer, 2016, pp. 329–347.

[2] V. V. Arlazarov, K. Bulatov, T. Chernov, and V. L. Arlazarov, "MIDV-500: A dataset for identity documents analysis and recognition on mobile devices in video stream," arXiv preprint arXiv:1807.05786, 2018.

[3] N. Skoryukina, D. P. Nikolaev, A. Sheshkus, and D. Polevoy, "Real time rectangular document detection on mobile devices," in Seventh International Conference on Machine Vision (ICMV 2014), vol. 9445. International Society for Optics and Photonics, 2015, p. 94452A.

[4] S. Usilin, D. Nikolaev, V. Postnikov, and G. Schaefer, "Visual appearance based document image classification," in 2010 IEEE International Conference on Image Processing. IEEE, 2010, pp. 2133–2136.

[5] P. Hu and D. Ramanan, "Finding tiny faces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 951–959.

[6] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, "SSH: Single stage headless face detector," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4875–4884.

[7] S. Yang, P. Luo, C.-C. Loy, and X. Tang, "From facial parts responses to face detection: A deep learning approach," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3676–3684.

[8] X. Shi, S. Shan, M. Kan, S. Wu, and X. Chen, “Real-timerotation-invariant face detection with progressive calibrationnetworks,” in Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition, 2018, pp. 2295–2303.

Figure 3. Some examples of the faces detected by PCN, MTCNN and CascadeCNN, respectively. The blue rectangle (i.e. bounding box) is the ground truth and the green rectangle is the detected bounding box.

Figure 4. The examples show how the annotation divergence between different datasets affects the detection results. The detected bounding boxes (in green) have high confidence (larger than 0.99), but their IoU with the ground-truth bounding box (in blue) is less than 0.5.
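The failure mode in Figure 4 follows from the standard evaluation criterion: a detection counts as correct only when its Intersection over Union (IoU) with the ground-truth box reaches 0.5, regardless of the detector's confidence. As a minimal illustrative sketch (the function name and box coordinates are ours, not from the evaluated implementations), the criterion can be computed as:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A box detected with confidence > 0.99 can still fail the IoU >= 0.5 test
# when annotation conventions differ between datasets:
ground_truth = (10, 10, 60, 60)
detected = (40, 40, 100, 100)
print(iou(ground_truth, detected))  # well below the 0.5 threshold
```

This makes the annotation-divergence effect concrete: a slightly shifted or differently cropped box can be a confident, visually plausible detection and still be scored as a miss.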

[9] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.

[10] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334.

[11] A. M. Awal, N. Ghanmi, R. Sicre, and T. Furon, “Complex document classification and localization application on identity document images,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 426–431.

[12] M. Simon, E. Rodner, and J. Denzler, “Fine-grained classification of identity document types with only one example,” in 2015 14th IAPR International Conference on Machine Vision Applications (MVA). IEEE, 2015, pp. 126–129.

[13] L.-P. De las Heras, O. R. Terrades, J. Llados, D. Fernandez-Mota, and C. Canero, “Use case visual bag-of-words techniques for camera based identity document classification,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 721–725.

[14] P. Viola, M. Jones et al., “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2001, pp. 511–518.

[15] R. G. Cinbis, J. Verbeek, and C. Schmid, “Unsupervised metric learning for face identification in TV video,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 1559–1566.

[16] L. Wolf, T. Hassner, and I. Maoz, Face recognition in unconstrained videos with matched background similarity. IEEE, 2011.

[17] J. Sivic, M. Everingham, and A. Zisserman, “Person spotting: video shot retrieval for face sets,” in International Conference on Image and Video Retrieval. Springer, 2005, pp. 226–236.

[18] C. Lu and X. Tang, “Surpassing human-level face verification performance on LFW with GaussianFace,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[19] V. Starovoitov, D. Samal, and D. Briliuk, “Three approaches for face recognition,” in The 6th International Conference on Pattern Recognition and Image Analysis, October 2002, pp. 21–26.

[20] V. Starovoitov, D. Samal, and B. Sankur, “Matching of faces in camera images and document photographs,” in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), vol. 4. IEEE, 2000, pp. 2349–2352.

[21] Y. Shi and A. K. Jain, “DocFace: Matching ID document photos to selfies,” in 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE, 2018, pp. 1–8.

[22] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1755–1758, 2009.

[23] G. Bradski, “The OpenCV library,” Dr. Dobb’s Journal of Software Tools, 2000.

[24] V. Jain and E. Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” 2010.

[25] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection benchmark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.

[26] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2879–2886.

[27] M. Opitz, G. Waltner, G. Poier, H. Possegger, and H. Bischof, “Grid loss: Detecting occluded faces,” in European Conference on Computer Vision. Springer, 2016, pp. 386–402.